NLP: a peek into a day of a computational linguist

Post on 13-Apr-2017



Mariana Romanyshyn, Grammarly, Inc.

Contents

1. NLP applications in our world
2. What computational linguists do
3. Language levels
4. A closer look at part-of-speech tagging
5. A closer look at syntactic parsing
6. Let’s build something: error correction


Disclaimer

1. NLP applications in our world


What NLP applications do you know?

Types of NLP Applications
• Analysis
• Transformation
• Misc

ANALYSIS
• Spam Filtering
• Search Engines
• Sentiment Analysis
• …

Sentiment maps

Sentiment Analysis

Easy cases:
• It tastes amazing!
• It tastes horrible!
• It tastes normal.
• ABC tastes much better than DEF.

Harder cases:
• It tastes like beer!
• It tastes interesting!
• It tastes like my mom said it would!
• If it was served with milk, it would taste great!
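A minimal lexicon-based scorer shows why the first set of examples is easy and the second is hard. This is a toy sketch, not Grammarly's approach; the tiny `LEXICON` below is made up for illustration.

```python
# A made-up word-polarity lexicon: +1 positive, -1 negative, 0 neutral.
LEXICON = {"amazing": 1, "great": 1, "better": 1, "horrible": -1, "normal": 0}

def sentiment(text):
    """Score a sentence by summing the polarity of its cue words."""
    words = [w.strip("!.").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return "unknown"   # e.g. "It tastes like beer!" has no cue words
    score = sum(hits)
    if score > 0:
        return "positive"
    return "negative" if score < 0 else "neutral"
```

The easy examples are handled, but "If it was served with milk, it would taste great!" comes out positive because the scorer ignores the conditional, and "It tastes like beer!" has no cue words at all, which is exactly why the harder cases are hard.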

Terminal cases

“That young girl is one of the least benightedly unintelligent organic life forms [that] it has been my profound lack of pleasure not to be able to avoid meeting.”
— Douglas Adams

ANALYSIS (continued)
• Sarcasm Detection (“Quite interesting.”)
• Essay Grading
• Good/Evil Characters
• …

TRANSFORMATION
• Machine Translation (transformations in MT)
• Error Correction (GEC should be smart)
• Speech to Text / Text to Speech
• Question Answering
• Text Summarization
• …

MISC
• News reports generation
• Conversational Agents

“I remember the first time we loaded these data sources into Siri. I typed “start over” into the system, and Siri came back saying, “Looking for businesses named ‘Over’ in Start, Louisiana.”
— Adam Cheyer

Siri

The story of Tay

MISC (continued)
• Language learning (e.g., Duolingo)
• Story Cloze Task
• …

Story Cloze

Tom and Sheryl have been together for two years. One day, they went to a carnival. Tom won Sheryl several stuffed bears. When they reached the Ferris wheel, he got down on one knee.

Which ending is more probable?
• Tom asked Sheryl to marry him.
• He wiped mud off of his boot.

2. What computational linguists do

Just FYI

3. Language levels

“Noam-enclature” and structural linguistics

Language Levels

1) Language has a structure

2) Language is a system of signs


Units of language levels

• Written text
• Paragraph
• Sentence
• Word
• Morpheme
• Letter

Splitting problems

How do we split...
• text into paragraphs? (bullet points, word wrapping)
• paragraphs into sentences? (Dr. Jones lectures at U.C.L.A.)
• sentences into words? (computer-aided, the d.t.s, San Francisco, $3B deal)
• words into morphemes? (misadventure, mislead, mistake - ?)
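The sentence-splitting problem above is easy to reproduce. Below is a sketch of a naive regex splitter and one common patch, an abbreviation list; the `ABBREVIATIONS` set is a toy stand-in for the much larger resources real tokenizers use.

```python
import re

# Toy abbreviation list; real segmenters use far larger resources.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}

def naive_sentences(text):
    # Break after '.', '!' or '?' followed by whitespace -- too naive.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def better_sentences(text):
    merged = []
    for part in naive_sentences(text):
        # Undo a break that fell right after a known abbreviation.
        if merged and merged[-1].split()[-1].lower() in ABBREVIATIONS:
            merged[-1] += " " + part
        else:
            merged.append(part)
    return merged
```

The naive splitter wrongly breaks "Dr. Jones lectures at U.C.L.A." after "Dr."; the abbreviation list repairs that, but "U.C.L.A." at a true sentence boundary still needs context to decide, which is why segmentation is usually learned, not hard-coded.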

Features

Quantitative features:
• number of sentences, words, words per sentence, etc.
• size and arrangement of paragraphs
• word length
• word position in a sentence
• number of syllables in a word
• ratio of vowels vs consonants
• depth of the word in the dependency tree of the sentence
• number of word senses
• ngrams

Ngrams

Sequences of elements and their frequencies:
• unigrams, bigrams, 3-grams, 4-grams, … n-grams
• at different language levels
  – token ngrams: ("handsome", "man"): 160,000; ("pretty", "man"): 5,000
  – character ngrams: "st": 14,000; "ct": 4,000; "str": 1,500; "ctr": 50; "stra": 400; "ctra": 0
• adding grammar
  – parts of speech: ("go", IN, "school"): 600,000; ("go", RB, "school"): 10
  – syntactic relations: ("go", nsubj, "kids"): 200,000; ("go", nsubj, "school"): 20,000
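Counts like ("handsome", "man"): 160,000 come from very large corpora, but the mechanics of collecting ngrams fit in a few lines. A sketch over a toy sentence; the same sliding window works at the token and the character level:

```python
from collections import Counter

def ngrams(seq, n):
    # All contiguous subsequences of length n.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Token ngrams over a toy sentence.
tokens = "the cat sat on the mat".split()
token_bigrams = Counter(ngrams(tokens, 2))

# Character ngrams of a single word.
char_trigrams = Counter(ngrams("handsome", 3))
```

In practice the counters are filled from millions of sentences, and the resulting frequencies become features or probability estimates.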

Features

Grammatical features:
• POS tag
• morphemes: affixes, roots, endings
• constituency spans
• dependency relations
• coreference
• grammatical characteristics of various parts of speech:
  – countability of nouns
  – tense of verbs
  – degree of comparison of adjectives
  – pronoun type
  – connector type

Spelling features:
• capitalized word?
• hyphenated word?
• compound word?

Lexical-semantic features:
• WordNet
• VerbNet
• dictionaries and thesauri
• word embeddings
• modality of verbs

4. A closer look at part-of-speech tagging

POS: recap

Goal: categorize words by their functions.

English:
• notional: noun, verb, adjective, adverb, pronoun (?), numeral (?)
• functional: determiner, preposition, conjunction, particle, and interjection

POS: practice

Wow, two hungry cats chased down the mouse to the corner and quickly ate it!

POS: more practice

All you need is love . Love is all at the way you love me all the time
. And never mind that noise you heard . fire and of things that will bite , yeah

було так давно , коли в руках тримаю цей ("it was so long ago, when I hold this in my hands")
Просто налийте трохи коли на пошкоджену ділянку . ("Just pour some cola on the damaged area.")
ударом . Я хочу мати всьо , і всьо на ("...I want to have everything, and everything...")
а на полі спозаранку мати жито жала , та ("...and in the field at dawn, mother was reaping rye...")

(The Ukrainian lines hinge on ambiguous words: "коли" is "when" or "cola"; "мати" is "to have" or "mother".)

POS: impossible cases

Time flies[Verb/Noun] like[Preposition/Verb] an arrow.
I saw her duck[Verb/Noun] with a telescope.
She is calculating[Verb/Adjective].
We watched an Indian[Adjective/Noun] dance.
They can[Modal Verb/Verb] fish[Verb/Noun].
More lies[Verb/Noun] ahead...

Це мало[Verb/Adverb] мало[Verb/Adverb] значення. ("This had little significance.")
Коло[Noun/Preposition] друзів та незнайомців. ("A circle of / near friends and strangers.")

POS: disputable cases

What POS should “gotta” be?

I gotta[modal verb] tell you something.
I’ve gotta[verb, 3rd form] fix that thingy for her, Jack.
So, she gotta[verb, 2nd form] this gorgeous dress.
So, she gotta[verb, 2nd form] gun.

If you don’t know, how would the machine know?

So, what do we do?

POS: tagsets

Penn Treebank tagset:
• noun: NN, NNS, NNP, NNPS
• verb: VB, VBP, VBZ, VBG, VBD, VBN, MD
• adjective: JJ, JJR, JJS
• adverb: RB, RBR, RBS
• preposition and sub. conjunction: IN
• pronoun: PRP, PRP$
• determiner: DT
• numeral: CD
• particle: RP, TO
• interjection: UH
• coord. conjunction: CC
• wh-words: WDT, WP, WP$, WRB
• more: PDT, POS, SYM, FW, EX, LS, $, |,|, |.|, |:|, |''|, |``|, -RRB-, -LRB-

POS: corpora

Very_RB peculiar_JJ retribution_NN indeed_RB seems_VBZ to_TO overtake_VB such_JJ jokers_NNS ._.
Have_VBP you_PRP ever_RB heard_VBN of_IN Thuggee_NNP ?_.
Sort_NN of_IN remorseless_JJ ,_, is_VBZ n't_RB it_PRP ?_.
In_IN short_JJ ,_, and_CC to_TO borrow_VB an_DT arboreal_JJ phrase_NN ,_, slash_VB timber_NN ._.
As_IN you_PRP can_MD count_VB on_IN me_PRP to_TO do_VB the_DT same_JJ ._.
Compassionately_RB yours_PRP ,_, S.J._NNP Perelman_NNP
We_PRP caught_VBD the_DT early_JJ train_NN to_IN New_NNP York_NNP ._.
Petite_JJ ,_, lovely_JJ Yvette_NNP Chadroe_NNP plays_VBZ the_DT nymphomaniac_NN engagingly_RB ._.
He_PRP looked_VBD so_RB comfortable_JJ being_VBG straight_JJ ._.
They_PRP wanted_VBD to_TO touch_VB the_DT mystery_NN ._.
...

POS: Classification

• Use a classifier to tag each word independently
• Features:
  – left/right context: words, POS tags, words + POS tags
  – probability of word + POS tag
  – additional:
    • possible tags for the word
    • morphological characteristics (tense, plurality, degree of comparison)
    • the word’s spelling (suffixes, capitalization, hyphenation)

Input: Chewie[NNP] ,[,] we[PRP] 're[VBP] home[NN/RB - ?] .[.]
Output: RB
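The feature list above translates directly into a feature-extraction function for a per-word classifier. A toy sketch (the feature names and the exact set are illustrative, not Grammarly's):

```python
def pos_features(tokens, i, prev_tags):
    """Features for classifying tokens[i]; prev_tags holds the tags
    already assigned to tokens[0..i-1] (left-to-right tagging)."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",
        "prev_tag": prev_tags[i - 1] if i > 0 else "<S>",
        "suffix3": word[-3:].lower(),        # spelling: suffix
        "is_capitalized": word[0].isupper(), # spelling: capitalization
        "is_hyphenated": "-" in word,        # spelling: hyphenation
    }
```

Such a dictionary would then be fed to any off-the-shelf classifier (logistic regression, perceptron, etc.) trained on a tagged corpus.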

POS: Sequence Labelling

• Map the sentence to the most probable POS tag sequence
• Features:
  – left/right context: words, POS tags, words + POS tags
  – probability of word + POS tag
  – additional:
    • possible tags for the word
    • morphological characteristics (tense, plurality, degree of comparison)
    • the word’s spelling (suffixes, capitalization, hyphenation)

Input: Chewie , we 're home .
Output: NNP , PRP VBP RB .

Hidden Markov Models

Notation:
• V - vocabulary
• T - POS tags
• x - sentence (observation)
• y - tag sequence (state)
• S - all sentence/tag-sequence pairs {x1 . . . xn, y1 . . . yn}
  – n > 0
  – xi ∈ V
  – yi ∈ T

S - all sentence/tag-sequence pairs {x1 . . . xn, y1 . . . yn}

x: Chewie , we 're home .
y: NNP , PRP VBP RB .
   NN , PRP VBP RB .
   NNP , PRP VBP NN .
   NN , PRP VBP NN .
   ...

Aim: find {x1 . . . xn, y1 . . . yn} with the highest probability.

HMM: assumptions

• Markov Assumption: "The future is independent of the past given the present."
  – Trigram HMM: each state depends only on the previous two states in the sequence
• Independence assumption:
  – the observation xi depends only on the state yi, independently of the previous observations and states


Trigram HMM: parameters

• q(s|u, v) - the probability of tag s after the tags (u, v)
  – s, u, v ∈ T
• e(x|s) - the probability of observation x paired with state s
  – x ∈ V, s ∈ T
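Both parameters are usually estimated from a tagged corpus by maximum likelihood, i.e. as ratios of counts. A minimal sketch (toy helper names, not the linked repo's code):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """MLE estimates of q(s|u,v) and e(x|s) from a tagged corpus,
    where each sentence is a list of (word, tag) pairs."""
    trigrams, bigrams = Counter(), Counter()
    emissions, tag_counts = Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<S>", "<S>"] + [t for _, t in sent] + ["</S>"]
        for u, v, s in zip(tags, tags[1:], tags[2:]):
            trigrams[(u, v, s)] += 1
            bigrams[(u, v)] += 1
        for word, tag in sent:
            emissions[(word, tag)] += 1
            tag_counts[tag] += 1
    q = lambda s, u, v: trigrams[(u, v, s)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    e = lambda x, s: emissions[(x, s)] / tag_counts[s] if tag_counts[s] else 0.0
    return q, e
```

With these two functions, p(x, y) is just the product of the q terms over consecutive tag triples and the e terms over word/tag pairs, which is exactly the count-ratio computation worked through on the next slides.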

For example

x: Chewie , we 're home .
y: NNP , PRP VBP RB .

How do we get p(x, y)?

p(x, y) = c(NNP,|,|,PRP)/c(NNP,|,|)
        * c(|,|,PRP,VBP)/c(|,|,PRP)
        * c(PRP,VBP,RB)/c(PRP,VBP)
        * c(VBP,RB,|.|)/c(VBP,RB)
        * c(NNP->Chewie)/c(NNP)
        * c(|,|->,)/c(|,|)
        * c(PRP->we)/c(PRP)
        * c(VBP->’re)/c(VBP)
        * c(RB->home)/c(RB)
        * c(|.|->.)/c(|.|)

One thing missing

x: Chewie , we 're home .
y: <S> <S> NNP , PRP VBP RB . </S>

p(x, y) = c(<S>,<S>,NNP)/c(<S>,<S>)
        * c(<S>,NNP,|,|)/c(<S>,NNP)
        * c(NNP,|,|,PRP)/c(NNP,|,|)
        * c(|,|,PRP,VBP)/c(|,|,PRP)
        * c(PRP,VBP,RB)/c(PRP,VBP)
        * c(VBP,RB,|.|)/c(VBP,RB)
        * c(RB,|.|,</S>)/c(RB,|.|)
        * c(NNP->Chewie)/c(NNP)
        * c(|,|->,)/c(|,|)
        * c(PRP->we)/c(PRP)
        * c(VBP->’re)/c(VBP)
        * c(RB->home)/c(RB)
        * c(|.|->.)/c(|.|)

HMM: problem 1

Enumerating all possible tag sequences is not feasible — there are T^n of them.
E.g.: 44 tags, 6-token sentence → 44^6 = 7,256,313,856 tag sequences.

Ideas:
• use dynamic programming (the Viterbi algorithm)
• limit the number of candidates with a dictionary

HMM: the Viterbi algorithm

Idea: remember decisions on the way — n*T^3 operations.

x: Chewie , we 're home .
(figure: a lattice of candidate tags for every token, e.g. NN/NNP/NNS/NNPS/JJ/... under "Chewie")
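The Viterbi idea can be sketched in a few dozen lines. This is a toy implementation of the standard trigram-HMM dynamic program in the q/e notation from the previous slides (not the code from the linked repo): pi[(k, u, v)] stores the best probability of any tag sequence for the first k words that ends in tags (u, v), and bp stores the backpointers.

```python
def viterbi(tokens, tags, q, e):
    """Most probable tag sequence for tokens under a trigram HMM (q, e)."""
    pi = {(0, "<S>", "<S>"): 1.0}   # pi[(k, u, v)]: best prob ending in u, v
    bp = {}                         # backpointers for recovering the sequence

    def K(k):                       # allowed tags at position k
        return ["<S>"] if k <= 0 else tags

    for k in range(1, len(tokens) + 1):
        word = tokens[k - 1]
        for u in K(k - 1):
            for v in tags:
                best_p, best_w = 0.0, None
                for w in K(k - 2):
                    p = pi.get((k - 1, w, u), 0.0) * q(v, w, u) * e(word, v)
                    if p > best_p:
                        best_p, best_w = p, w
                if best_w is not None:
                    pi[(k, u, v)] = best_p
                    bp[(k, u, v)] = best_w

    # Pick the best final tag pair, including the transition to </S>.
    n = len(tokens)
    best_p, u_n, v_n = 0.0, None, None
    for u in K(n - 1):
        for v in tags:
            p = pi.get((n, u, v), 0.0) * q("</S>", u, v)
            if p > best_p:
                best_p, u_n, v_n = p, u, v

    # Follow the backpointers from the end of the sentence to the start.
    seq = [u_n, v_n]
    for k in range(n, 2, -1):
        seq.insert(0, bp[(k, seq[0], seq[1])])
    return seq[1:] if n == 1 else seq
```

Each position considers every tag triple once, which gives the n*T^3 cost mentioned above instead of the T^n of brute-force enumeration.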

HMM: with dictionary

Idea: use a dictionary of possible tags per word — roughly n*8^3. (The worst case is still n*T^3.)

x: Chewie , we 're home .
(figure: the lattice shrinks to the few dictionary tags per token, e.g. Chewie: NNP/NN, home: RB/VB/NN)

HMM: problem 2

Zero probabilities can occur because of OOV (out-of-vocabulary) or rare words.

Idea: use smoothing!
• add-1: pretend you saw each word one more time
  (P.S. It’s usually a horrible choice, but we’ll use it today. Don’t tell anyone.)
• Good-Turing: reallocate the probability of n-grams that occur r+1 times to the n-grams that occur r times
• Kneser-Ney: when the bigram count is near 0, rely on the unigram
• ...
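Add-1 (Laplace) smoothing for the emission probabilities is a one-liner on top of the counts. A sketch, reusing the same count tables as before (the helper name is made up):

```python
from collections import Counter

def add_one_emission(emissions, tag_counts, vocab):
    """e(x|s) with add-1 smoothing: pretend every (word, tag) pair was
    seen once more.  Usually a poor choice, but simple to demonstrate."""
    V = len(vocab) + 1  # +1 slot for any OOV pseudo-word

    def e(x, s):
        return (emissions[(x, s)] + 1) / (tag_counts[s] + V)

    return e

emissions = Counter({("home", "RB"): 3, ("home", "NN"): 1})
tag_counts = Counter({"RB": 10, "NN": 20})
vocab = {"home", "Chewie", "we"}
e = add_one_emission(emissions, tag_counts, vocab)
```

The payoff: an unseen word such as "Chewbacca" now gets a small nonzero probability under every tag instead of zeroing out the whole sequence.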

81

Implementationhttps://github.com/mariana-scorp/one-day-with-cling

Conclusion

82

“Data is ten times more powerful than algorithms.”— Peter NorvigThe Unreasonable Effectiveness of Datahttp://youtu.be/yvDCzhbjYWs

5. A closer look at syntactic parsing

Syntax: recap

Goal: categorize sentence parts by their functions and define dependencies.

Sentence:
• main clause
• subordinate clause

Clause:
• subject
• predicate
• direct/indirect/prepositional object
• modifier
• complement

Syntax: practice

Sentence:
If you want to receive e-mails about my upcoming shows, then please give me money so I can buy a computer.

Clauses:
• [[you] want [to receive [e-mails about my upcoming shows]]]
• [please give [me] [money]]
• [[I] can buy [a computer]]

Syntax: the subject

Identify the subject:

• The walrus and the carpenter were walking close at hand.

• The greatest trick the devil ever pulled was convincing the world he didn't exist.

• What we've got here is a failure to communicate.

• Actually being funny is mostly telling the truth about things.

• To be idle is a short road to death, and to be diligent is a way of life.

• Sitting in a tree at the bottom of the garden was a huge black bird with long blue tail feathers.

Syntax: the infinitives

Identify the role of the infinitive:

• The two politicians failed [to communicate].
• What we've got here is a failure [to communicate].
• [To be idle] is a short road to death, and [to be diligent] is a way of life.
• [To become extroverted], you need to go out and socialize.
• You have [to be able [to actually quote the line]] for it [to be a memorable quote].

How do we formalize the syntactic structure?

Syntactic Trees (or Parse Trees)

Types:
• constituency tree
  – every token is a part of some phrase constituent (parent node)
  – includes terminal and non-terminal nodes
  – shows relations among the constituents
• dependency tree
  – for every token, there is one node
  – includes only terminal nodes
  – shows relations among words

Constituency Tree

If you want to receive e-mails about my upcoming shows, then please give me money so I can buy a computer.

(figure: constituency parse of the sentence)

Constituency Treebank

(TOP (S (NP (ADJP (RB Very) (JJ peculiar)) (NN retribution)) (ADVP (RB indeed)) (VP (VBZ seems) (S (VP (TO to) (VP (VB overtake) (NP (JJ such) (NNS jokers)))))) (. .)))
(TOP (SQ (VBP Have) (NP (PRP you)) (ADVP (RB ever)) (VP (VBN heard) (PP (IN of) (NP (NNP Thuggee)))) (. ?)))
(TOP (UCP (ADJP (ADVP (NN Sort) (IN of)) (JJ remorseless)) (, ,) (SQ (VBZ is) (RB n't) (NP (PRP it))) (. ?)))
(TOP (SBAR (IN As) (S (NP (PRP you)) (VP (MD can) (VP (VB count) (PP (IN on) (NP (PRP me))) (S (VP (TO to) (VP (VB do) (NP (DT the) (JJ same)))))))) (. .)))
(TOP (FRAG (ADJP (RB Compassionately) (PRP yours)) (, ,) (NP (NNP S.J.) (NNP Perelman))))
(TOP (S (NP (PRP We)) (VP (VBD caught) (NP (NP (DT the) (JJ early) (NN train)) (PP (IN to) (NP (NNP New) (NNP York))))) (. .)))
(TOP (S (NP (JJ Petite) (, ,) (JJ lovely) (NNP Yvette) (NNP Chadroe)) (VP (VBZ plays) (NP (DT the) (NN nymphomaniac)) (ADVP (RB engagingly))) (. .)))
...

Constituency Labels

Penn Treebank tagset:
• top level: TOP
• sentence: S, SBAR, SQ, SBARQ, SINV
• fragment: FRAG
• noun phrase: NP
• verb phrase: VP
• prepositional phrase: PP
• adjectival phrase: ADJP
• adverbial phrase: ADVP
• compound conjunction: CONJP
• wh-phrases: WHNP, WHPP, WHADJP, WHADVP
• more: LST, PRT, INTJ, NAC, PRN, QP, RRC, UCP, X

Constituency Parsing

• Algorithms:
  – top-down
  – chart
  – bottom-up
• Features include:
  – grammar (a.k.a. transitions)
  – spans of nodes
  – labels
  – right/left/right-and-left context
  – split point, etc.
• Weights are trained on the treebank.

Shift-reduce constituency parsing

• Data
  – queue: the words of the sentence
  – stack: partially completed trees
• Actions
  – shift: move the word from the queue onto the stack
  – reduce: add a new label on top of the first n constituents on the stack
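The shift/reduce mechanics can be sketched as a tiny driver that replays a given action sequence. In a real parser a trained model predicts the next action at each step; here the actions are supplied by hand, and the tree representation (label, children) is an illustrative choice.

```python
def shift_reduce(words, actions):
    """Replay shift/reduce actions over a sentence (toy driver)."""
    queue = list(words)
    stack = []
    for act in actions:
        if act == "shift":
            stack.append(queue.pop(0))       # word: queue -> stack
        else:
            _, label, n = act                # ("reduce", label, n)
            children = stack[-n:]            # top n constituents...
            del stack[-n:]
            stack.append((label, children))  # ...become one labeled node
    return stack

tree = shift_reduce(
    ["we", "eat", "pizza"],
    ["shift", ("reduce", "NP", 1),
     "shift", "shift", ("reduce", "NP", 1), ("reduce", "VP", 2),
     ("reduce", "S", 2)],
)
```

A correct action sequence leaves exactly one finished tree on the stack; the parsing problem is precisely choosing that sequence.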

Syntax: impossible cases

• Most cats and dogs with fleas live in the neighbourhood.
• Wanted: a nurse for a baby about twenty years old.
• I shot an elephant in my pajamas.
• I once saw a deer riding my bicycle.
• I’m glad I’m a man, and so is Lola.


Dependency Relations

Universal dependencies:
• subject: NSUBJ, NSUBJPASS, CSUBJ, CSUBJPASS
• object: DATIVE, DOBJ, AGENT, OPRD
• complement: ACOMP, CCOMP, XCOMP, PCOMP
• auxiliary: AUX, AUXPASS
• clausal modifier: ACL, ADVCL, RELCL
• different modifiers: ADVMOD, NPADVMOD, AMOD, COMPOUND, NEG, NUMMOD, QUANTMOD
• determiner: DET, PREDET
• apposition: APPOS
• coordinating conjunction and conjunct: CC, CONJ
• prepositional modifier and its object: PREP, POBJ
• more: POSS, CASE, DEP, EXPL, INTJ, MARK, PRECONJ, PRT, PUNCT, PARATAXIS

Dependency Tree

If you want to receive e-mails about my upcoming shows, then please give me money so I can buy a computer.

(figure: dependency parse of the sentence)

Algorithms

• Graph-Based Parsing
  – find the highest-scoring tree in a complete graph
  – slow, but performs better on long-distance dependencies
  – e.g., MSTParser
• Transition-Based Parsing
  – apply transition actions one by one
  – faster, but performs better on short-distance dependencies
  – e.g., MaltParser, the Stanford Parser, ZPAR

Graph-Based Parsing

Transition-Based Parsing

• Data
  – queue: the words of the sentence
  – stack: partially completed trees
• Actions:
  – shift: move the word from the queue onto the stack
  – reduce: pop the stack, removing only its top item, as long as that item has a head
  – right-arc: create a right dependency arc between the word on top of the stack and the next token in the queue
  – left-arc: create a left dependency arc between the word on top of the stack and the next token in the queue
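The four actions above (the arc-eager transition system) can also be sketched as a replay driver. As with the constituency sketch, a real parser predicts the actions with a trained model; here they are given, and words are used directly instead of token indices, which a real implementation would need for sentences with repeated words.

```python
def arc_eager(words, actions):
    """Replay arc-eager transitions; return dependency arcs (head, dep)."""
    queue = list(words)
    stack = []
    arcs = []
    has_head = set()
    for act in actions:
        if act == "shift":
            stack.append(queue.pop(0))
        elif act == "reduce":
            assert stack[-1] in has_head  # only pop words that have a head
            stack.pop()
        elif act == "left-arc":           # queue front governs stack top
            dep = stack.pop()
            arcs.append((queue[0], dep))
            has_head.add(dep)
        elif act == "right-arc":          # stack top governs queue front
            dep = queue.pop(0)
            arcs.append((stack[-1], dep))
            has_head.add(dep)
            stack.append(dep)
    return arcs

# "We eat pizza": eat -> We (subject), eat -> pizza (object)
arcs = arc_eager(["We", "eat", "pizza"],
                 ["shift", "left-arc", "shift", "right-arc"])
```

Since each token is shifted and reduced at most once, the parse runs in time linear in the sentence length, which is why transition-based parsers are fast.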

Features

Implementation
https://github.com/mariana-scorp/one-day-with-cling

Conclusion

Syntax: impossible cases

We eat pizza with anchovy.

Насильство твій макіяж не приховає! ("Violence your makeup won't hide!": Ukrainian word order leaves open whether the violence or the makeup is doing the hiding)

6. Let’s build something: error correction

Subject-verb disagreement

We likes pizza with anchovy.
Children like and cherishes her kindness and cooking skills.
Some is watching the way she knits and loving it.
Colorless green ideas sleeps furiously.
Barry and Mary, whom I met at the New Year's party, is just the cutest people.
There is two cats and a dog.

Rule-based Toy Solution

Text processing: tokenization, POS tagging, syntactic parsing, etc.

Detection: find a VBZ.

Rules: if the verb has an nsubj relation and the subject does not have a conjunct, we should correct it…

Correction: use a dictionary of transformations.
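The detect/rule/correct pipeline can be sketched over pre-parsed tokens. The token fields (i, word, tag, dep, head) and the tiny transformation dictionary are assumptions for this sketch, not the data structures of the linked repo:

```python
# Toy dictionary of VBZ -> plural-form transformations.
PLURAL_FORMS = {"likes": "like", "cherishes": "cherish",
                "sleeps": "sleep", "is": "are"}
PLURAL_PRONOUNS = {"we", "they", "you"}

def correct_sva(tokens):
    """Toy rule-based subject-verb agreement corrector over parsed tokens."""
    words = [t["word"] for t in tokens]
    for t in tokens:
        if t["tag"] != "VBZ":
            continue                                   # detection: find a VBZ
        for subj in tokens:
            if subj["head"] != t["i"] or subj["dep"] != "nsubj":
                continue
            # Rule: a plural subject (NNS/NNPS, plural pronoun,
            # or a conjoined NP) requires a plural verb form.
            conjoined = any(c["dep"] == "conj" and c["head"] == subj["i"]
                            for c in tokens)
            plural = (subj["tag"] in ("NNS", "NNPS")
                      or subj["word"].lower() in PLURAL_PRONOUNS
                      or conjoined)
            if plural:                                 # correction step
                words[t["i"]] = PLURAL_FORMS.get(t["word"], t["word"])
    return " ".join(words)
```

Run on a hand-parsed "We likes pizza", the rule finds the VBZ "likes", sees the plural subject "We", and rewrites the verb; everything upstream (tokenization, tagging, parsing) is assumed to be done already.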

ML-based Toy Solution

Text processing: tokenization, POS tagging, syntactic parsing, etc.

Detection: find a VBZ.

Classifier + features: POS tag of the subject, whether the subject has a conjunct...

Correction: use a dictionary of transformations.

Implementation
github.com/mariana-scorp/one-day-with-cling

Contact us

Presenter:
Mariana Romanyshyn (mariana.romanyshyn@grammarly.com)

With the help of:
Oksana Kunikevych (oksana.kunikevych@grammarly.com)
Khrystyna Skopyk (khrystyna.skopyk@grammarly.com)
Tetiana Myronivska (tetiana.myronivska@grammarly.com)
Tetiana Turchyn (tetiana.turchyn@grammarly.com)