Machine Learning for NLP

Introduction session

Aurélie Herbelot

2018

Centre for Mind/Brain Sciences
University of Trento


Machine Learning: what is it?


Machines that learn...


Everybody wants to do it...

Forrester (Nov. 2016):

‘Insight-driven businesses’ will steal $1.2 trillion/year from competitors by 2020.

Why should machines be able to learn?

• “Good old AI” assumed it might be possible to program an intelligent machine by hand. It failed.

• The world is complex and ‘rules’ are not so easy to write down. Exercise: what is a chair?

• For AI purposes, it is essential that a machine has the flexibility to change in response to new data.

Difference with human learning

• There are parallels between human and machine learning: an incremental procedure involving interaction with the world (data), and possibly with human beings as well as fellow machines. But...

• Children grow in environments that are very different from what we can offer machines. They are born with sensory-motor capabilities that current machines do not have. They have innate knowledge.

• Machines, on the other hand, can be trained 24/7, on a lot more data (and are not particularly good with small data...).

Why learn language?


Applications

• ML for NLP is used in many real-life applications. Classic uses are:

• Information retrieval (search engines).
• Machine translation.
• Automatic essay grading.
• Recommendation systems.
• Spam filtering.
• (Not-so-clever) conversational agents...

Applications

• More recent developments:

• Automatic medical diagnoses (more today).
• Automatic court judgments.
• Cleverer conversational agents.
• Fixing the world: fact-checking, debiasing, etc. (????)

One of the most fundamental human abilities

• Language lets us communicate about things that are not here:

• Please sit down. (You haven’t yet.)
• Bring me the chair from the living room.
• If it rains tomorrow...
• Once upon a time, there was a unicorn...
• Let x be a variable...

One of the most fundamental human abilities

• Language allows us to speak about complex objects and processes:

• Look at that chair with the velvet back, the one with the flowery English pattern.

• Insulin is a peptide hormone produced by beta cells of the pancreatic islets.

• I’m jealous. It’s not that I want that car, but I don’t think he should have it either.

• Bring the curd to the boil, let it boil for exactly three minutes whilst gently stirring.

One of the most fundamental human abilities

• Language allows us to change the world as we know it:

• Let’s build a bridge.
• What if the Universe didn’t have 3, but 10 dimensions?
• We must change our political system.
• I love you.

Language and AI

• See A Roadmap towards Machine Intelligence (Mikolov et al. 2016).

• The last decades of research have focused on specific applications.

• Today, we have tremendous computational power and huge amounts of data. Can we go back to the goal of simulating general intelligence?

• One crucial characteristic of an intelligent machine is the ability to communicate. How do we get a machine to learn language?

Communication and language

• Language is the most powerful communication device at our disposal.

• A system mastering natural language can ‘teach itself’ through written material.

• It can also learn through different modes of interaction (see lecture vs reading group).

Communication and perception

• Natural language can convey non-linguistic information via descriptions of the environment and associated perceptions.

• ‘Hallucinate’ perception from language.

Odena et al (2017)

Several levels of question answering

• Q: What is the density of gold?
The machine searches the Internet for the answer: 19.3 g/cm³.

• Q: What is a good starting point to study reinforcement learning?
The machine searches several websites, gets an idea of their popularity, and matches them to the user’s learning style.

• Q: What is the most promising direction to cure cancer, and where should I start to meaningfully contribute?
1) The machine reads many research articles about the topic. 2) It finds out about the user’s perspective and current specialisation. 3) It may engage with other experts/machines to answer the question.

Language and cognition

• Input and output are linguistic, but no claim is made about internal representations (is there a ‘language of thought’?).

• The language of thought hypothesis (LOTH): Jerry Fodor.

• Thoughts have compositional structure, like language.

• Concepts combine to produce thoughts, following grammar rules.

Language and cognition

• There might not be a language of thought. And still, language and concepts are tightly related.

• See some results in vector-space semantics:

• Psycholinguistic validity:
• Vectors reproduce similarity judgements.
• They account for priming effects.
• Also for some aspects of language acquisition.

• Neurolinguistic validity:
• At least at a coarse level, vectors map onto brain activity.
• There is actually some evidence that ‘composition’ by vector addition reproduces brain imaging obtained for simple sentences.

What do we have to learn to learn language?


ML and complexity

• Deep learning as multiple feature learning stages, followed by a logistic regression.

• Typically implemented as several layers of a neural network, learning more and more fine-grained features of some input data.

• The last layer generally corresponds to some classification task.

Bojarski et al (2016)

A standard ‘NLP pipeline’

Example NLP pipeline for a Spoken Dialogue System.

http://www.nltk.org/book_1ed/ch01.html.


Learning morphology

• The meaning of a word is to some extent predictable from its parts.

• The predictable aspects are learnable, using e.g. distributional semantics techniques (see the toy sketch below).

Marelli & Baroni (2015)
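
To make this concrete, here is a toy sketch of treating an affix as a function on word vectors, loosely in the spirit of Marelli & Baroni (2015) but not their actual model; all vectors and the affix map below are simulated for illustration:

```python
# Toy sketch: learn an affix (e.g. "-er") as a linear map from stem
# vectors to derived-word vectors, fit by least squares. Random
# vectors stand in for real distributional vectors.

import numpy as np

rng = np.random.default_rng(0)
dim = 10
affix_true = rng.normal(size=(dim, dim))   # hidden affix transformation

stems = rng.normal(size=(50, dim))                             # "teach", "sing", ...
derived = stems @ affix_true + rng.normal(0, 0.01, (50, dim))  # "teacher", "singer", ...

# Least-squares estimate of the affix map from (stem, derived) pairs.
affix_hat, *_ = np.linalg.lstsq(stems, derived, rcond=None)

new_stem = rng.normal(size=dim)            # an unseen stem
prediction = new_stem @ affix_hat          # predicted derived-word vector
print(np.allclose(prediction, new_stem @ affix_true, atol=0.1))  # True
```

The point is only that the predictable part of an affix's contribution can be estimated from examples and then applied to unseen stems.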


Learning syntax (recap)

Probabilistic CKY

S → NP VP 1.0
VP → VP PP 0.7
VP → V NP 0.5
VP → eats 0.1
PP → P NP 0.8
NP → Det N 0.7
NP → NP PP 0.2
NP → she 0.1
NP → cake 0.1
V → eats 0.1
P → with 0.2
N → fork 0.1
Det → a 0.2

Chart cells for “she eats cake with a fork”, by span length:

[she] NP | [eats] V, VP | [cake] NP | [with] P | [a] Det | [fork] N
[she eats] S | [eats cake] VP | [a fork] NP
[she eats cake] S | [with a fork] PP
[eats cake with a fork] VP
[she eats cake with a fork] S

Parse (VP attachment): (she ((eats cake) (with (a fork))))
P(T) = 0.1 × 0.1 × 0.1 × 0.2 × 0.2 × 0.1 × 0.5 × 0.7 × 0.8 × 0.7 × 1.0 = 7.84 × 10⁻⁷

Learning syntax (recap)

Probabilistic CKY

(Same grammar as above.)

Chart cells for “she eats cake with a fork”, by span length:

[she] NP | [eats] V, VP | [cake] NP | [with] P | [a] Det | [fork] N
[she eats] S | [a fork] NP
[she eats cake] S | [with a fork] PP
[cake with a fork] NP
[eats cake with a fork] VP
[she eats cake with a fork] S

Parse (NP attachment): (she (eats (cake (with (a fork)))))
P(T) = 0.1 × 0.1 × 0.1 × 0.2 × 0.2 × 0.1 × 0.7 × 0.8 × 0.2 × 0.5 × 1.0 = 2.24 × 10⁻⁷
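
To make the dynamic programming above concrete, here is a minimal probabilistic CKY sketch in Python for this toy grammar (my own illustration, not course code): each chart cell keeps the best probability per non-terminal, so the higher-scoring VP attachment wins.

```python
# Minimal probabilistic CKY: chart[(i, j)] maps a non-terminal to the
# best probability of deriving words[i:j] from it.

from collections import defaultdict

lexical = {   # word -> list of (non-terminal, probability)
    "she": [("NP", 0.1)], "eats": [("V", 0.1), ("VP", 0.1)],
    "cake": [("NP", 0.1)], "with": [("P", 0.2)],
    "a": [("Det", 0.2)], "fork": [("N", 0.1)],
}
binary = [    # (lhs, rhs1, rhs2, probability)
    ("S", "NP", "VP", 1.0), ("VP", "VP", "PP", 0.7),
    ("VP", "V", "NP", 0.5), ("PP", "P", "NP", 0.8),
    ("NP", "Det", "N", 0.7), ("NP", "NP", "PP", 0.2),
]

def pcky(words):
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):                  # lexical cells
        for label, p in lexical[w]:
            chart[(i, i + 1)][label] = p
    for length in range(2, n + 1):                 # longer spans
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):              # split point
                for lhs, r1, r2, p in binary:
                    if r1 in chart[(i, k)] and r2 in chart[(k, j)]:
                        prob = p * chart[(i, k)][r1] * chart[(k, j)][r2]
                        if prob > chart[(i, j)].get(lhs, 0.0):
                            chart[(i, j)][lhs] = prob
    return chart[(0, n)].get("S", 0.0)

print(pcky("she eats cake with a fork".split()))   # ~7.84e-07
```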

Learning semantics

• The many facets of meaning:

• meaning is extension / reference: it ‘points’ at things in the world;

• meaning is intension: the Morning Star and the Evening Star point at the same object (their extension), but linguistically, they are not the same;

• meaning is conceptual: linguistic constituents activate reproducible cognitive processes involving extra-linguistic features;

• meaning is use: words that occur in similar contexts are semantically similar.
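
That last facet is the one distributional methods exploit. A minimal illustration (toy corpus, my own example): build co-occurrence count vectors and compare words by cosine similarity.

```python
# "Meaning is use": words sharing contexts get similar count vectors.

import numpy as np

corpus = [
    "she eats cake with a fork",
    "she eats pie with a spoon",
    "he drives a car on the road",
    "he drives a truck on the road",
]
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for sent in corpus:                      # window = whole sentence
    words = sent.split()
    for w in words:
        for c in words:
            if w != c:
                counts[idx[w], idx[c]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(counts[idx["cake"]], counts[idx["pie"]]))    # high
print(cosine(counts[idx["cake"]], counts[idx["truck"]]))  # low
```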


Learning semantics

• Meaning is probably all of those things...

Lazaridou et al (2017)


Learning pragmatics

• How does the broader context affect meaning? (E.g. situation of utterance, community of the speaker, etc.)

• Example 1: (In a reference letter for an academic job) “Mr Smith was always very punctual.”
Does the letter writer think much of Mr Smith?

• Example 2: how does a community contribute to the emergence and spread of meanings? (del Tredici & Fernández, 2017)

Integration with the world

A system that learns reference must be able to link language to the world, including to its perception of the world.

Productivity, creativity: learning to generalise

• A proficient speaker should understand that turning left and turning right share properties.

• Fregean compositionality: the meaning of the whole is given by the meaning of the parts.

• Also Frege’s ‘context principle’: parts have meaning in virtue of the whole.

Productivity, creativity: learning to generalise

• The described AI requires a long-term memory to store concepts and algorithms. The content of the long-term memory is extendable.

• It is essential for the agent to understand which primitives and composition processes it should store to be as efficient and flexible as possible.

• Hard question: to what extent are linguistic expressions decomposable? Are there semantic primitives? (Boleda and Erk 2015)

Is there a pipeline?

• Problem: it is not clear that ‘the NLP pipeline’ is so cleanly divided into task-specific modules.

• See e.g. Baayen et al (2015): language is not a formal calculus but an information-theoretic process over phonemes, producing so-called lexomes, which encode experience-dependent meaning.

• Fundamental question: when learning, what should we learn from, and what can we expect to learn?

Is there a pipeline?


ML and complexity

• Coming back to our deep learning car...

• If learning language is a collection of NN layers... what do the layers encode?

• Is language even a clean stack of linguistic skills? Probably not...

Bojarski et al (2016)

Oh, and check it works for all languages...

Languages by proportion of native speakers,

https://commons.wikimedia.org/w/index.php?curid=41715483


What do we have to do to learn?


Example: IBM Watson for oncology

• IBM Watson:

• Watson for Oncology (WFO): an AI expert for automatically diagnosing cancer and making appropriate treatment recommendations.

• The instance of WFO in this paper has learnt from 300 medical journals and textbooks, treatment guidelines, and actual breast cancer cases, including patient characteristics and laboratory findings.

• 93% concordance between medical experts and the system when recommending treatment for breast cancer.

A Twitter review of WFO

https://twitter.com/EnricoCoiera/status/971647548875186178


Comparing learning data and real-world data

Are we learning from the right data? Can the learning be transferred to the setting where the application will be deployed? (Say you learn to drive a car, does that mean you can drive anything? Or even any car? What if you learnt on an automatic?)

How to represent the data we are learning from?

What is important in the data? How are we going to present it to the learner? (Is the brand of your car an important factor in knowing how to drive? Perhaps, perhaps not.)

How to deal with human data used to train the system?

If there is manual intervention by humans, either in training or testing, what is human performance on the task?

How to learn?

What algorithm is used to learn?

How to evaluate the system?

What did we want to learn? Are we sure we learnt it? Does our evaluation measure strictly assess the behaviour we want to train?

Course overview


Goals

1. Understand core machine learning algorithms for NLP.

2. Be able to read and criticise related literature.

3. Acquire some fundamental computational skills to run ML code and interpret its output.

Session structure

• An introductory week, followed by 9 topics, each associated with 3 classes:

1. A lecture presenting the topic for that week.
2. A lecture (with audience participation!) presenting one or two papers using the presented algorithm(s) / metric(s).
3. A practical with a task and/or some code to play with.

Week 1: introduction

• Today’s lecture!

• Basic principles of statistical NLP.

• Run a simple authorship classification algorithm.
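
As a preview of the practical (which will come with its own data and instructions), a bare-bones authorship classifier can be as simple as bag-of-words counts plus naive Bayes; the snippets below are placeholders:

```python
# Toy authorship classification: bag-of-words + naive Bayes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "it is a truth universally acknowledged that a single man",  # author A
    "whenever i find myself growing grim about the mouth",       # author B
]
train_labels = ["A", "B"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

print(clf.predict(vec.transform(["a single man in possession of a fortune"])))
```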


Week 2: data preparation

• How to choose data. Inter-annotator agreement metrics.

• How to fool an image captioning system? (Hint: give it difficult data.) How to fool oneself? (Hint: by thinking one’s annotation scheme was detailed enough.)

• Hands-on intro to crowdsourcing. Annotate and calculate your inter-annotator agreement.
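
One common agreement metric you will compute is Cohen's kappa; a minimal sketch (toy labels, not the course's own notebook):

```python
# Cohen's kappa for two annotators: agreement corrected for chance.

from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in ca) / n**2    # chance agreement
    return (p_o - p_e) / (1 - p_e)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(cohen_kappa(ann1, ann2))   # ~0.33 (1 = perfect, 0 = chance level)
```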


Week 3: supervised learning

• Introduction to regression techniques.

• Using regression to understand the performance of a system for compositional morphology.

• Introducing regression in Python.
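
As a taste of what "regression in Python" looks like with scikit-learn (toy data; the practical will have its own materials):

```python
# Fit a line to noisy data and recover slope and intercept.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one predictor
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 100)    # noisy linear target

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)               # ~[3.0] and ~2.0
```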


Week 4: unsupervised learning

• Clustering and dimensionality reduction.

• Latent Semantic Analysis: “How do children learn as much as they do, given the little information they get?”

• Document clustering for information retrieval. We’ll be playing with the code for the PeARS search engine.
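
A compressed preview of the week in code (toy corpus, not the PeARS codebase): reduce a document-term matrix with truncated SVD (the core of LSA), then cluster documents in the latent space.

```python
# LSA-style pipeline: TF-IDF -> truncated SVD -> k-means clustering.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worry about the markets",
]
X = TfidfVectorizer().fit_transform(docs)
X_lsa = TruncatedSVD(n_components=2).fit_transform(X)   # latent space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)
print(labels)   # animal docs vs. finance docs, e.g. [0 0 1 1]
```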


Week 5: Support Vector Machines

• Introduction to kernel machines.

• Detection of semantic errors in the prose of non-English speakers with SVMs.

• Introduction to running SVMlight.
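
The practical uses SVMlight; as a quick preview in Python, here is an SVM classifier on synthetic data with scikit-learn:

```python
# Train an RBF-kernel SVM on toy data and report held-out accuracy.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)   # a kernel machine
print(clf.score(X_te, y_te))                     # held-out accuracy
```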


Week 6: intro to Neural Nets

• Basics of NNs and general AI concepts.

• What do NNs really have to do with neuroscience?

• Implement a Neural Net from scratch! http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch
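
A tiny taste of what "from scratch" means (my own toy sketch, not the tutorial's code): a one-hidden-layer network learning XOR with sigmoid units and hand-written backpropagation.

```python
# One hidden layer, sigmoid activations, full-batch gradient descent.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])               # XOR targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                     # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)          # backward pass
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2))   # should approach [[0], [1], [1], [0]]
```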


Week 7: RNNs and LSTMs

• Sequence learning with neural networks.

• How to generate text with RNNs.

• Implement an RNN in Theano (a framework-free sketch follows below).

https://colah.github.io/posts/2015-08-Understanding-LSTMs/
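
The practical implements an RNN in Theano; as a framework-free sketch, here is the core recurrence in numpy: a hidden state that summarises the sequence so far and predicts the next symbol (all sizes and weights below are toy values).

```python
# One step of a vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1}).

import numpy as np

vocab_size, hidden_size = 5, 8
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input -> hidden
Whh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden
Why = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden -> output

def step(x_onehot, h):
    h = np.tanh(Wxh @ x_onehot + Whh @ h)             # new hidden state
    logits = Why @ h
    probs = np.exp(logits) / np.exp(logits).sum()     # next-symbol distribution
    return probs, h

h = np.zeros(hidden_size)
for symbol in [0, 3, 1]:                              # toy symbol ids
    probs, h = step(np.eye(vocab_size)[symbol], h)
print(probs)   # distribution over the next symbol
```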


Week 8: Reinforcement learning

• Basics of RL.

• Multi-agent emergence of natural language.

• Try the OpenAI gym! A gym for artificial agents...
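
A first taste (assuming the classic gym interface of that era, where step() returns a 4-tuple): a random agent on CartPole, the baseline any learning agent should beat.

```python
# Random policy on CartPole with the classic (pre-0.26) gym API.

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()            # random action
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)   # episode return of the random baseline
```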


Week 9: ML and ethics

• Ethical issues with ML. Bias in distributional vectors.

• Literature on bias and on de-biasing.

• Visualisation of word embeddings for bias detection.
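
A minimal sketch of what such a bias probe can look like (the 3-d vectors below are made up for illustration; real experiments would load pre-trained embeddings):

```python
# Project words onto a "he - she" direction; the sign of the cosine
# shows which pole a word leans towards.

import numpy as np

emb = {   # hypothetical embeddings, for illustration only
    "he":       np.array([ 1.0, 0.2, 0.1]),
    "she":      np.array([-1.0, 0.2, 0.1]),
    "engineer": np.array([ 0.6, 0.8, 0.3]),
    "nurse":    np.array([-0.5, 0.7, 0.4]),
}
gender = emb["he"] - emb["she"]   # bias direction

def bias(word):
    v = emb[word]
    return (v @ gender) / (np.linalg.norm(v) * np.linalg.norm(gender))

for w in ["engineer", "nurse"]:
    print(w, round(bias(w), 2))   # positive leans "he", negative "she"
```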


Material

All material will be posted at:
http://aurelieherbelot.net/teaching/

Any question, worry, complaint... write to:
[email protected]
