Transfer Learning for NLP 08202019 - files.devnetwork.cloud
Transfer Learning for NLP
Joan Xiao
Principal Data Scientist
Linc Global
Oct 8, 2019
Linc Customer Care Automation combines data, commerce workflows and distributed AI to create an automated branded assistant that can serve, engage and sell to your customers across the channels they use every day.
Joan Xiao
Work Experience:
- Principal Data Scientist, Linc Global
- Lead Machine Learning Scientist, Figure Eight
- Sr. Manager, Data Science, H5
- Software Development Manager, HP
Education:
- PhD in Mathematics, University of Pennsylvania
- MS in Computer Science, University of Pennsylvania
Agenda
• What is Transfer Learning
• Transfer Learning for Computer Vision
• Transfer Learning for NLP (prior to 2018)
• Language models in 2018
• Language models in 2019
- Andrew Ng, NIPS 2016
ImageNet Large Scale Visual Recognition Challenge
• Feature extractor: remove the last fully-connected layer, then treat the rest of the model as a fixed feature extractor for the new dataset.
• Fine-tune: replace and retrain the classifier on top of the model on the new dataset, and also fine-tune the weights of the pretrained network by continuing backpropagation. One can fine-tune all layers of the model, or only the higher-level portion of the network.
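The two strategies above can be contrasted with a framework-agnostic sketch. Each "layer" here is just a dict with a trainable flag; in a real library (e.g. PyTorch) you would toggle `requires_grad` on the pretrained parameters instead. All names are illustrative.

```python
def build_pretrained_model(n_layers=5):
    """A stand-in for a pretrained network: n_layers feature layers
    plus a final classifier layer."""
    layers = [{"name": f"feature_{i}", "trainable": True} for i in range(n_layers)]
    layers.append({"name": "classifier", "trainable": True})
    return layers

def as_feature_extractor(model):
    """Strategy 1: freeze everything except a freshly replaced classifier."""
    for layer in model[:-1]:
        layer["trainable"] = False
    model[-1] = {"name": "new_classifier", "trainable": True}
    return model

def for_fine_tuning(model, unfreeze_from=0):
    """Strategy 2: replace the classifier but keep (some of) the backbone
    trainable; unfreeze_from > 0 fine-tunes only the higher-level layers."""
    for i, layer in enumerate(model[:-1]):
        layer["trainable"] = i >= unfreeze_from
    model[-1] = {"name": "new_classifier", "trainable": True}
    return model

fx = as_feature_extractor(build_pretrained_model())
ft = for_fine_tuning(build_pretrained_model(), unfreeze_from=3)
print([l["name"] for l in fx if l["trainable"]])  # only the new classifier
print([l["name"] for l in ft if l["trainable"]])  # top layers + classifier
```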
§ Word2vec
§ GloVe
§ FastText
§ Etc.
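These methods all transfer knowledge the same way: a word becomes a fixed pretrained vector, and relatedness is measured by cosine similarity. A minimal sketch, with tiny made-up 3-d vectors standing in for real word2vec/GloVe/fastText vectors (which have 100-300 dimensions and are loaded from published files):

```python
import math

# Toy embedding table; values are illustrative, not real pretrained vectors.
embeddings = {
    "king":  [0.8, 0.65, 0.1],
    "queen": [0.78, 0.7, 0.12],
    "apple": [0.1, 0.05, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Related words end up close; unrelated words do not.
print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["apple"]))
```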
ELMo
• Embeddings from Language Models
• Contextual
• Deep
• Character based
• Trained on the One Billion Word Benchmark
• 2-layer bi-LSTM network with residual connections
• SOTA results on six NLP tasks, including question answering, textual entailment and sentiment analysis
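A downstream model consumes ELMo by mixing the biLM layer outputs with softmax-normalized scalar weights s_j and a scale gamma (ELMo_k = gamma * sum_j s_j * h_{k,j}). A minimal sketch of that combination step; the vectors and weights below are placeholders, not real ELMo activations:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mix_layers(layer_reps, raw_weights, gamma=1.0):
    """layer_reps: one vector per biLM layer for a single token.
    Returns gamma * sum_j softmax(raw_weights)[j] * layer_reps[j]."""
    s = softmax(raw_weights)
    dim = len(layer_reps[0])
    return [gamma * sum(s[j] * layer_reps[j][d] for j in range(len(layer_reps)))
            for d in range(dim)]

# Three layers (char-CNN output + 2 biLSTM layers), 4-d toy vectors.
token_layers = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
elmo_vec = mix_layers(token_layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
print(elmo_vec)  # equal weights of 1/3 per layer, scaled by gamma
```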
ULMFiT
• Universal Language Model Fine-tuning
• 3 layer LSTM network
• Trained on Wikitext-103
• Key fine-tuning techniques: discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing
• SOTA results on text classification on six datasets
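Of the fine-tuning techniques above, slanted triangular learning rates (STLR) are easy to show directly: the learning rate rises linearly for a short fraction of training, then decays linearly. A sketch of the schedule from the ULMFiT paper, using its default hyperparameters (cut_frac=0.1, ratio=32):

```python
def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t of T total iterations:
    linear warm-up until cut = T * cut_frac, then linear decay."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 100
schedule = [stlr(t, T) for t in range(T)]
peak = max(schedule)
print(peak, schedule.index(peak))  # peaks at lr_max, at the cut point
```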
OpenAI GPT
• Generative Pre-training
• 12-layer decoder-only transformer with masked self-attention heads
• Trained on BooksCorpus dataset
• SOTA results on nine NLP tasks, including question answering, commonsense reasoning and textual entailment
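The "masked self-attention" of a decoder-only Transformer means position i may attend only to positions <= i, enforced with a lower-triangular mask over attention scores. A minimal sketch of that mask:

```python
def causal_mask(n):
    """mask[i][j] is True where attention from position i to j is allowed:
    a token sees itself and everything before it, never the future."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
# Row i has i+1 allowed positions; in a real Transformer the disallowed
# entries are set to -inf before the softmax over attention scores.
```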
BERT
§ Bidirectional Encoder Representations from Transformers
§ Masked language model (MLM) pre-training objective
§ A “next sentence prediction” (NSP) task that jointly pretrains text-pair representations
§ Trained on BooksCorpus (800M words, same as OpenAI GPT) and Wikipedia (2,500M words) – total of 13GB text.
§ Achieved SOTA results on eleven tasks (GLUE, SQuAD 1.1 and 2.0, and SWAG).
§ Base model: L=12, H=768, A=12 (cased, uncased, multilingual cased, Chinese)
§ Large model: L=24, H=1024, A=16 (cased, uncased, cased with whole word masking, uncased with whole word masking)
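The MLM objective can be made concrete with a sketch of BERT's input corruption rule: 15% of tokens are chosen for prediction; of those, 80% become [MASK], 10% become a random token, and 10% keep the original word. This is illustrative only; real BERT operates on WordPiece ids, not whole words, and the vocabulary below is made up.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels[i] is the original token
    where a prediction is required, else None (no loss at that position)."""
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # model must recover the original
            r = rng.random()
            if r < 0.8:
                out.append("[MASK]")      # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: random token
            else:
                out.append(tok)           # 10%: keep, but still predicted
        else:
            out.append(tok)
            labels.append(None)
    return out, labels

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 20
masked, labels = mask_tokens(tokens, rng)
n_selected = sum(l is not None for l in labels)
print(n_selected, len(tokens))  # roughly 15% of positions selected
```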
CoNLL-2003 Named Entity Recognition results
Number of Computer Science Publications on arXiv, as of Oct 1, 2019
§ Contains “BERT” in title: 106
§ Contains “BERT” in abstract: 351
§ Contains “BERT” in paper: 448
§ Do ELMo and BERT encode English dependency trees in their contextual representations?
§ The authors propose a structural probe, which evaluates whether syntax trees are embedded in a linear transformation of a neural network’s word representation space.
§ The experiments show that ELMo and BERT representations embed parse trees with high consistency.
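The core of the structural probe is a learned linear map B such that the squared distance ||B(h_i - h_j)||^2 between transformed word vectors approximates the parse-tree distance between words i and j. A toy sketch of that distance computation; B and the word vectors here are hand-picked for illustration, not learned from a treebank:

```python
def probe_distance(B, hi, hj):
    """Squared L2 norm of B applied to the difference of two word vectors."""
    diff = [a - b for a, b in zip(hi, hj)]
    Bd = [sum(B[r][c] * diff[c] for c in range(len(diff))) for r in range(len(B))]
    return sum(x * x for x in Bd)

B = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # a rank-2 probe over 3-d representations
h = {"chef": [0.0, 0.0, 0.5],  # toy contextual vectors, not real activations
     "who":  [1.0, 0.0, 0.2],
     "ran":  [2.0, 1.0, 0.9]}

# Under this toy B, "chef"-"who" come out closer than "chef"-"ran",
# mimicking a shorter path between them in the parse tree.
print(probe_distance(B, h["chef"], h["who"]))
print(probe_distance(B, h["chef"], h["ran"]))
```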
§ The authors found evidence of syntactic representation in attention matrices, and provided a mathematical justification for the tree embedding found in the previous paper.
§ There is evidence for subspaces that represent semantic information.
§ The authors conjecture that the internal geometry of BERT may be broken into multiple linear subspaces, with separate spaces for different syntactic and semantic information.
What’s new in 2019
§ OpenAI GPT-2
§ XLNet
§ ERNIE
https://techcrunch.com/2019/02/17/openai-text-generator-dangerous/
https://openai.com/blog/better-language-models/
OpenAI GPT-2
§ Uses the Transformer architecture.
§ Trained on a dataset of 8 million web pages (40GB text)
§ 4 models, with 12, 24, 36, 48 layers respectively.
§ The largest model achieves state of the art results on 7 out of 8 tested language modeling datasets in zero-shot setting.
§ Initially only the smallest 12-layer model was released. The 24- and 36-layer models have been released over the last few months.
XLNet
§ Based on the Transformer-XL architecture, an extension of the Transformer with relative positional embeddings and segment-level recurrence.
§ Addresses two issues with BERT:
o The [MASK] token used in pre-training does not appear during fine-tuning
o BERT predicts masked tokens independently, in parallel
§ Introduces permutation language modeling: instead of predicting tokens in sequential order, it predicts tokens in some random order.
§ Trained on BooksCorpus + Wikipedia + Giga5 + ClueWeb + Common Crawl
§ Achieves state-of-the-art performance (beats BERT) across 18 tasks
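In permutation language modeling the model still sees tokens at their original positions, but the prediction order is a random permutation, so each token is predicted from a different subset of the others. A sketch that computes, for one sampled order, which positions are visible when predicting each token:

```python
import random

def visibility(perm):
    """perm[k] = position predicted at step k. A position may condition on
    every position that appears earlier in the permutation."""
    visible = {}
    seen = set()
    for pos in perm:
        visible[pos] = sorted(seen)
        seen.add(pos)
    return visible

rng = random.Random(42)
positions = list(range(5))
perm = positions[:]
rng.shuffle(perm)  # sample one factorization order
vis = visibility(perm)
print(perm)
for pos in sorted(vis):
    print(pos, "sees", vis[pos])
# The first position in the permutation sees nothing; the last sees all others.
```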
ERNIE 2.0
§ Enhanced Representation through kNowledge IntEgration
§ A continual pre-training framework that incrementally builds and learns a large variety of pre-training tasks through constant multi-task learning
§ Trained on BooksCorpus + Wikipedia + Reddit + Discovery Relation + Chinese corpus
§ Achieves significant improvements over BERT and XLNet on 16 tasks including English GLUE benchmarks and several Chinese tasks.
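The continual multi-task idea can be sketched as a training schedule: new pre-training tasks are added in stages, and each stage keeps interleaving steps from all tasks introduced so far rather than training the newest task alone. Task names and step counts below are illustrative placeholders, not ERNIE 2.0's actual task list.

```python
from itertools import cycle

def continual_schedule(task_stages, steps_per_stage):
    """task_stages: tasks in the order they are introduced.
    In stage k, round-robin over the first k+1 tasks, so earlier
    tasks keep being revisited (mitigating forgetting)."""
    log = []
    for stage in range(len(task_stages)):
        active = task_stages[:stage + 1]
        rr = cycle(active)
        for _ in range(steps_per_stage):
            log.append((stage, next(rr)))
    return log

log = continual_schedule(["masked_lm", "sentence_reorder", "discourse_relation"], 6)
# Stage 0 trains only the first task; later stages revisit earlier tasks too.
stage2_tasks = {task for stage, task in log if stage == 2}
print(stage2_tasks)
```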
Summary
• What is Transfer Learning
• Transfer Learning for Computer Vision
• Transfer Learning for NLP
• State-of-the-art language models
THANK YOU!
https://www.linkedin.com/joanxiao
§ http://ruder.io/transfer-learning/
§ Xavier Giro-i-Nieto, Image classification on ImageNet (D1L4 2017 UPC Deep Learning for Computer Vision)
§ http://cs231n.github.io/transfer-learning/
§ https://www.tensorflow.org/tutorials/representation/word2vec
§ http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
§ Neural machine translation by jointly learning to align and translate
§ ELMo
§ ULMFiT
§ OpenAI GPT
§ BERT
§ bert-as-service
§ To Tune or Not to Tune?
§ What does BERT look at? An Analysis of BERT’s Attention
§ Are Sixteen Heads Really Better than One?
§ What do you learn from context? Probing for sentence structure in contextualized word representations
§ A Structural Probe for Finding Syntax in Word Representations
§ Visualizing and Measuring the Geometry of BERT
§ Language Models are Unsupervised Multitask Learners
§ XLNet: Generalized Autoregressive Pretraining for Language Understanding
§ ERNIE 2.0
§ Paper Dissected: BERT
§ Paper Dissected: XLNet