Natural Language Processing and Machine Learning

Natural Language Processing and Machine Learning Aarhus Universitet, 2016 Leon Derczynski


Page 1

Natural Language Processing and Machine Learning

Aarhus Universitet, 2016

Leon Derczynski

Page 2

Course structure

● Four parts:
  – Starting NLP and information retrieval
  – Machine learning and sentiment analysis
  – Structured prediction, feature extraction
  – Entity extraction, social media, unsupervised
● Weeks 6, 7, 9, 11

Page 3

Course structure

● Two lectures per week
  – Wednesday afternoon: Theory
  – Friday: Practical exercises
● Assessment via hand-ins
  – One per week; two weeks to complete
  – Mixture of coding and analysis
  – Final one is free choice
  – Submission via e-mail
● Also a brief oral exam on a topic of your choice

Page 4

Course Goals

● Understand a natural language processing pipeline
● Build a small search engine
● Code, use and evaluate a statistical machine learning tool
● Describe many types of machine learning
● Describe biases present in ML approaches
● Choose the right approach for an NLP problem
● Do fundamental NLP tasks
● Program Python & NLTK

Page 5

Natural Language Processing

“Human knowledge is expressed in language. So computational linguistics is very important.”

● Mark Steedman

Page 6

Natural Language Processing

● Basic AI task
  – Language presumed unique
  – Still a sign of intellect
● Replicating language comprehension and production is difficult

Page 7

Natural Language Processing

● What is language?

– Physiological

– Vocal apparatus: velar

– Arose in humans 2M-300k years ago

Page 8

Natural Language Processing

● Written language much newer

– ~5000 years old

– Not every language has it

– Transience of spoken vs. permanence of written

– Allows communication without speaker's presence

Page 9

Natural Language Processing

● Huge scale of machines and also written data
● Writing is a way of mapping speech to text
  – Not necessarily a rational mapping
  – And speech isn't rational either!
    ● E.g. kamsteg
    ● Thought, though, through, tough ... ouch!

Page 10

Grand Challenges

● Machine translation
  – Sequence to sequence
    ● I like your hat
    ● J'aime ton chapeau
    ● Jeg kan godt lide din hat
    ● Wǒ xǐhuān nǐ de màozi
  – Humans seem able to teach languages to each other reasonably well


Page 12

Grand Challenges

● Machine translation
  – Old test of AI: John Searle
  – Problematic: Are Chinese people still intelligent if they also speak English?
  – Point: symbol manipulation ≠ consciousness

Page 13

Grand Challenges

● Machine translation
  – When will we finish?
    ● 1950: Two years!
    ● 1960: Five years!
    ● 1970: Twenty years!
    ● 1985: Never!
    ● 2015: “NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.”
      – Neil Lawrence (U. Sheffield) @ ICML panel
  – “Every time I fire a linguist, accuracy goes up!”
    ● Fred Jelinek @ IBM (allegedly)

Page 14

Grand Challenges

● Dialogue
  – This is how we prefer to communicate
  – Easy, intuitive mode to engage in


Page 17

Grand Challenges

● Dialogue
  – Early systems: SHRDLU, ELIZA

– ELIZA: Please tell me what’s been bothering you.

– ME: Oh, hello, ELIZA. It’s been a long time.

– ELIZA: Hi. What seems to be your problem ?

– ME: Lately, I’ve been feeling a little depressed.

– ELIZA: And how do you feel right now ?

– ME: I feel sad, but relieved to be writing.

– ELIZA: Of what does feeling sad remind you ?

– ME: Well, I’m an alcoholic, and I’ve been sober for a long time, but I drank yesterday. And then I decided to stop drinking again. But still; that makes me feel sad.

– ELIZA: Is it because you are an alcoholic that you came to me ?

– ME: I suppose that that’s one of the reasons.

– ELIZA: Can you elaborate on that ?

– ME: I mean, it’s not the only reason, but alcoholism is a primary thing that has messed up my life.

– http://www.masswerk.at/elizabot/eliza.html

Page 18

Grand Challenges

● Dialogue
  – Many nuances
    ● Timing, context
  – Turing test
    ● Tough problem: trick questions
      – What does your wife do?
      – How do you prefer your steaks?
      – Which route did you take to get here?
    ● Silly hacks
      – Eugene Goostman: pose as young & foreign
      – Lowers audience expectations

Page 19

Grand Challenges

● Semantic extraction
  – Textual entailment
  – “Frames” (Minsky)
    ● World knowledge problem
    ● Break down world into scripts
  – Legal understanding
  – Inference
    ● All women are humans
    ● If X is a human, are they a woman?

Page 20

Grand Challenges

● Language Generation
  – Summarisation
  – Descriptions

Page 21

Grand Challenges

● Language Generation
  – Journalism

[Figure: annotated news article. Labels: similar / related past event; article creation timestamp; precise sub-event mentions; event summary description; prior context]

Page 22

Grand Challenges

● Language Generation
  – Information filtering: what to include
    ● Not all information is relevant
    ● “5 months, 6 days, 11 hours, 2 minutes-”
  – Legal issues
    ● Who owns the content?
    ● Who's responsible?
      – “Russia launched a nuclear missile at Chicago”
      – “Lars Løkke is not very good at chess”

Page 23

Grand Challenges

● Question Answering
  – Take: question; return: answer
  – Natural interrogation mode
    ● Where are you going?
    ● How tall is the Eiffel tower?
    ● What... is the air-speed velocity of an unladen swallow?
  – Knowledge-base population

Page 24

Grand Challenges

● Google is getting better!
● Problems:
  – Searching habits
  – How to capture questions?

Page 25

Big problem

● Hard to tackle
● Some cases can be cast in computational terms
  – MT: sequence-to-sequence
  – Hard to represent entire world knowledge
● Language provides theoretical background
● Decompose!
  – Let's build pipelines

Page 26

Smaller challenges

● Letters in words
● Words
● Words in a sentence

Page 27

Phonology and morphology

● Letters in words can describe their sounds
● Grouped to form phonemes
  – p/b in pit / bit
● Letters also group to form semantic units
● Smallest groups are morphemes
  – [disagree]/d/ment/s
  – [run]ning/ner/s

Page 28

Parts of speech

● Categories of word
  – Verb, noun, adjective
● Each word can belong to more than one category
  – The river bank
  – Bank this money
  – I work at Google
  – I google at work

Page 29

Grammar

● Chunking
  – Find sequences of words that represent a concept
  – I [ran quickly] towards [the blue bus]
  – Useful for getting key phrases
● Parsing
  – Build tree structure of sentence

Page 30

But first…

● We need words!

Page 31

First step: tokenisation

● Text commonly comes as a sequence of bytes
  – Difficult to process and design processes for!
● Convert to a sequence of tokens
  – Mostly, these are like words
● Tokenisation: converting bytes to words
  – Simple: split by spaces
  – [“Simple:”, “split”, “by”, “spaces”]
  – Oh dear...
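The space-splitting tokeniser on this slide can be sketched in a couple of lines of Python. This is only an illustration of the naive approach; the variable names are made up for the example.

```python
# Naive tokenisation: split the text on single spaces.
text = "Simple: split by spaces"
tokens = text.split(" ")

# The colon stays glued to the first word, which is the "Oh dear..." problem.
print(tokens)  # ['Simple:', 'split', 'by', 'spaces']
```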

Page 32

Tokenisation edge cases

● Pretty good, in the general case, isn't it?
● What problems can you think of?
  – Punctuation: doesn't → doesn, ', t
  – Abbreviations: Mr. Gates → Mr, ., Gates
  – There's always a long tail effect:
    ● More effort, less reward
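One possible way to handle the two edge cases above is a regular expression plus a small abbreviation list. The sketch below uses only Python's standard `re` module; the `ABBREVIATIONS` set, the pattern and the `tokenise` function are illustrative assumptions, not the tokeniser used in the course or in NLTK.

```python
import re

# Hypothetical abbreviation list; a real one would be far longer (the long tail).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e."}

# A word, optionally with internal apostrophes ("doesn't"), or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenise(text):
    """Split on whitespace, keep known abbreviations whole, then split off punctuation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(tokenise("Mr. Gates doesn't agree."))
# ['Mr.', 'Gates', "doesn't", 'agree', '.']
```

Each new rule fixes a few cases and misses others (a typographic apostrophe, or "Mr.," followed by a comma, would still be split), which is the long-tail effect in practice.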


Page 34

What if there are no spaces?

● Whatiftherearenospaces
● #nowthatcherisdead
● Or.. what if we're speaking Chinese?
  – 848M native speakers
  – 1197M speakers total
  – cf. 1162M people in USA + Europe

Page 35

What if there are no spaces?

  – I like your hat
  – Wǒ xǐhuān nǐ de màozi
  – 我喜欢你的帽子
● Sentences can become long!
  – 丹麦后卫处理以保持英国在欧盟
  – Denmark backs deal to keep Britain in the EU
● How can we deal with this?
  – Denmark = 丹麦
  – Britain = 英国
  – EU = 欧洲联盟 or 欧盟


Page 37

Gazetteer lookup

● Gazetteer is a list of words
  – A dictionary is a big one
● Its entries may help segment the sentence
● Some long words could be many other words
  – Hit: dǎ 打
  – Fire: huǒ 火
  – Machine: jī 机
  – Lighter: 打火机
● How can we handle this?
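To make the ambiguity concrete, the sketch below lists every way a string can be split into gazetteer entries. It is a plain recursive enumeration written for illustration; the function name `segmentations` and the four-entry gazetteer are assumptions built from the examples on this slide.

```python
def segmentations(text, gazetteer):
    """Return every way of splitting text into a sequence of gazetteer entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in gazetteer:
            # Keep this prefix and segment whatever is left.
            for rest in segmentations(text[end:], gazetteer):
                results.append([prefix] + rest)
    return results

gazetteer = {"打", "火", "机", "打火机"}
print(segmentations("打火机", gazetteer))
# [['打', '火', '机'], ['打火机']]
```

Both readings are valid according to the gazetteer, so some extra rule is needed to choose between "hit, fire, machine" and "lighter".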

Page 38

Greedy methods

● Find the longest match first
  – This is a typical “greedy” search method

#nowthatcherisdead
  – #
  – # now
  – # now thatcher
  – # now thatcher is
  – # now thatcher is dead
  – # nowt
  – # nowt hat ?

Page 39

Greedy methods

● What problems can you think of?
  – OOV: out-of-vocabulary words
    ● 扑热息痛可能有毒 (paracetamol can be poisonous)
    ● 扑热息痛 – 可能 – 有毒
    ● New words are guaranteed to arise
  – Misalignment
    ● Now that vs. nowt hat
● What recovery strategies?
  – One word at a time:
    ● 扑 – 热 – 息 – 痛 – 可能 – 有毒
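The greedy strategy and its two failure modes can be reproduced with a short longest-match-first segmenter. This is a minimal sketch, assuming a plain Python set as the vocabulary and a single-character fallback when nothing matches; the function name and the toy vocabularies are made up for the example.

```python
def greedy_segment(text, vocabulary):
    """Segment left to right, always taking the longest vocabulary match."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(text), start, -1):
            if text[start:end] in vocabulary:
                break
        else:
            # Out-of-vocabulary: fall back to emitting a single character.
            end = start + 1
        tokens.append(text[start:end])
        start = end
    return tokens

# Without "nowt" in the vocabulary the hashtag segments as intended.
print(greedy_segment("nowthatcherisdead", {"now", "that", "thatcher", "is", "dead"}))
# ['now', 'thatcher', 'is', 'dead']

# Adding "nowt" and "hat" misaligns everything that follows.
print(greedy_segment("nowthatcherisdead",
                     {"now", "nowt", "hat", "that", "thatcher", "is", "dead"}))
# ['nowt', 'hat', 'c', 'h', 'e', 'r', 'is', 'dead']

# An unknown word (paracetamol) falls apart into single characters.
print(greedy_segment("扑热息痛可能有毒", {"可能", "有毒"}))
# ['扑', '热', '息', '痛', '可能', '有毒']
```

The greedy choice is never revisited, so one early mistake ("nowt") cannot be repaired later; a search over alternative segmentations would be needed for that.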

Page 40

Now we have words!

● What do they mean?
● Many senses per word:
  – Bank
● How to separate these?

Page 41

Word Sense Disambiguation

● Goal: to determine a word's intended sense
● WordNet

Overview of noun bank

The noun bank has 10 senses (first 4 from tagged texts)

1. (25) bank -- (sloping land (especially the slope beside a body of water); "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents")

2. (20) depository financial institution, bank, banking concern, banking company -- (a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home")

3. (2) bank -- (a long ridge or pile; "a huge bank of earth")

4. (1) bank -- (an arrangement of similar objects in a row or in tiers; "he operated a bank of switches")

The verb bank has 8 senses (first 2 from tagged texts)

● Polysemy is everywhere!

Page 42

Sense and Distributionality

● “You shall know a word by the company it keeps” (J. R. Firth)
● Context gives clues to meaning
  – The words before or after help determine the sense
  – This context is distributionality: i.e., where the item is distributed.
● Allows adding semantics to unknown words
  – Wow, that Blart is faster than my Volvo and half the price!
  – One way of addressing language acquisition problem
● Not always super-simple
  – Arabic has free word order (FWO)
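One simple way to use context for disambiguation is to count the words shared between the context and each sense's dictionary gloss, in the spirit of the simplified Lesk algorithm (a technique not named on the slides). The sketch below is an assumption-laden illustration: the two glosses are abridged from the WordNet entries for "bank" quoted earlier, and the stopword list, sense labels and function names are made up.

```python
# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"a", "an", "the", "of", "on", "at", "and", "that", "into", "he", "to"}

# Glosses abridged from the WordNet senses of the noun "bank".
SENSES = {
    "bank (river)": "sloping land especially the slope beside a body of water",
    "bank (finance)": "a financial institution that accepts deposits and "
                      "channels the money into lending activities",
}

def content_words(text):
    """Lower-case, split on whitespace and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(context, senses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_set = content_words(context)
    return max(senses, key=lambda s: len(context_set & content_words(senses[s])))

print(disambiguate("he sat beside the water on the bank", SENSES))
# bank (river)
print(disambiguate("the bank accepts deposits of money", SENSES))
# bank (finance)
```

Overlap counting only works when the context happens to reuse gloss words, so it is a weak baseline rather than a full solution.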

Page 43

Information Retrieval

● How can we satisfy information needs?
● Related to NLP:
  – Questions expressed in language
  – Information stored linguistically

Page 44

History

● We do retrieval every day
  – Finding oranges
  – Finding a clean pair of socks
● Finding information is harder
  – Reading a book every time a question arises
  – Linear search!
● Libraries
  – Rely on indexing..

Page 45

Explosion

● Now, we do advanced IR dozens of times daily
  – First Altavista, Google et al.
  – Now ubiquitous
    ● In your Facebook search box, on your phone, ..

Page 46

Performance

● Simple method:
  – Store every document we need to search over
  – When we get a query, look through the documents and pick out useful ones
● Problems?
  – Slow! Bigger collections take longer to search

Page 47

Invert it!

● Typical research approach
● Instead of looking up documents one by one..
● Build list of words and the documents that contain them
● Faster access: what drives the speed of lookup?
  – Not number of documents
  – Number of words seen

Page 48

Invert it!

● Toy index – six documents
  – ham: [1,4,6]
  – cheap: [2,3,4]
  – Aarhus: [1,5,6]
● Sample queries
  – cheap ham ?
    ● Document 4
  – Aarhus ham ?
    ● Documents 1 & 6
● Will this scale for the user?
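The toy index can be built and queried with a dictionary of posting lists. In the sketch below the six document texts are invented so that they reproduce the postings on the slide; `build_index` and `search` are illustrative names, and a query is answered by intersecting the posting lists of its terms.

```python
from collections import defaultdict

# Hypothetical documents, chosen to reproduce the toy index above.
documents = {
    1: "ham from Aarhus",
    2: "cheap flights",
    3: "cheap socks",
    4: "cheap ham",
    5: "Aarhus harbour",
    6: "Aarhus ham festival",
}

def build_index(docs):
    """Map each lower-cased token to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token in set(docs[doc_id].lower().split()):
            index[token].append(doc_id)
    return index

def search(index, query):
    """Return the documents that contain every term in the query."""
    postings = [set(index.get(token, [])) for token in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = build_index(documents)
print(index["ham"])                 # [1, 4, 6]
print(search(index, "cheap ham"))   # [4]
print(search(index, "Aarhus ham"))  # [1, 6]
```

Lookup cost now depends on the length of the posting lists for the query terms rather than on scanning every document.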

Page 49

Not all words are created equal

● Very common words useless
  – e.g. search for “the”
  – Stopwords: words we stop looking for
  – Metric: DF – how many documents contain a term
  – Any problems?
● Mentioning a term more than once is important
  – A passing mention suggests less relevance than regular mentions
  – Metric: TF – how many times a term is mentioned
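Both metrics are simple counts. The sketch below computes them with `collections.Counter` over three invented documents; treating any term that appears in every document as a stopword candidate is an assumption made for the example, not a rule from the slides.

```python
from collections import Counter

# Hypothetical mini-collection.
docs = [
    "the ham in Aarhus is the best ham",
    "the cheap ham",
    "the road to Aarhus",
]
tokenised = [d.lower().split() for d in docs]

# DF: number of documents containing each term (a term counts once per document).
df = Counter(term for tokens in tokenised for term in set(tokens))

# TF: number of times each term occurs in a single document (here the first one).
tf_doc0 = Counter(tokenised[0])

# Terms present in every document carry no discriminating power.
stopword_candidates = {term for term, n in df.items() if n == len(docs)}

print(df["the"], df["ham"])   # 3 2
print(tf_doc0["ham"])         # 2
print(stopword_candidates)    # {'the'}
```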

Page 50

TFIDF

● Term Frequency · Inverse Document Frequency
● Basic ranking metric
  – Rewards terms frequent in a document
  – Rewards terms that are rare in the dataset
● Definition:
  – D: set of documents d; t: search term (token)
  – TF(t,d) = number of times t occurs in d
  – IDF(t,D) = |D| / DF(t,D), where DF(t,D) = |{d ∈ D : t ∈ d}|
● Variants: +1 smoothing; log normalisation

Page 51

Retrieval with TFIDF

Term          TF      DF     |D|     IDF   TF.IDF
the          312   28799   30000   0.018     5.54
in           179   26452   30000   0.055     9.78
cheap        136     179   30000   2.224   302.50
ham          131     231   30000   2.114   276.87
Aarhus        63      98   30000   2.486   156.61
vegetarian    45     142   30000   2.325   104.62
heaven        37     227   30000   2.121    78.48

For the term “the”:
IDF(the) = log10(30000 / 28799) = 0.018
TF(the) = 312
TF.IDF = 312 × 0.01774 ≈ 5.54 (using the unrounded IDF; the rounded 0.018 would give 5.62)
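The figures in the table can be reproduced with the log-normalised variant, IDF = log10(|D| / DF). The sketch below is an illustration only: the function name `idf` and the dictionary layout are assumptions, while the TF and DF values are copied from the table.

```python
import math

N_DOCS = 30000  # |D|, the collection size used in the table

def idf(df, n_docs):
    """Log-normalised inverse document frequency: log10(|D| / DF)."""
    return math.log10(n_docs / df)

# term: (TF, DF), values taken from the table above
table = {
    "the": (312, 28799),
    "in": (179, 26452),
    "cheap": (136, 179),
    "ham": (131, 231),
    "Aarhus": (63, 98),
    "vegetarian": (45, 142),
    "heaven": (37, 227),
}

for term, (tf, df) in table.items():
    weight = tf * idf(df, N_DOCS)
    print(f"{term:<12}{idf(df, N_DOCS):7.3f}{weight:9.2f}")
```

Although "the" has by far the highest TF, its near-zero IDF pushes it to the bottom of the ranking, which is the intended effect.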

Page 52

TFIDF hacks

● Short, focused document vs. long general description
  – Who wins?
  – Long document; TF is higher
● What's the speed like?
  – Slow: lots of computation per-term for every document
● Will an article on goldfish rank well for a fish query?
  – Not unless it mentions fish as well as goldfish!
  – TFIDF is agnostic to semantics

Page 53

Vector Space Model

● VSM:
  – Each word is a dimension
  – Plot document according to word frequencies in that dimension
  – Cosine similarity, or a distance metric such as Euclidean distance, can give a similarity measure
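A minimal sketch of the vector space model, assuming raw word counts as coordinates and cosine similarity as the measure. The helper names `vectorise` and `cosine` and the example strings are made up; sparse `Counter` objects stand in for the vectors.

```python
import math
from collections import Counter

def vectorise(text):
    """Represent a text as word counts: one dimension per distinct word."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is (numerically) 1.
print(cosine(vectorise("cheap ham"), vectorise("cheap ham cheap ham")))

# One shared word out of two in each text: similarity is about 0.5.
print(cosine(vectorise("cheap ham"), vectorise("Aarhus ham")))

# "bird" and "birds" are separate dimensions, so the similarity is 0.
print(cosine(vectorise("bird"), vectorise("birds")))
```

Because cosine compares direction rather than length, repeating a document does not change its similarity to a query, whereas Euclidean distance would.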

Page 54

Vector Space Model

● Imagine dimensions: bird, birds, house, already
● Problems?
  ● Not all concepts/words are orthogonal;
  ● Poor representation of underlying semantics
● Advantage: fast

Page 55

Summary

● NLP: definition
● Challenges
● Pipeline, tokenisation and segmentation
● WSD
● First application of tokens: IR
● Indexing and term weighting

Page 56

Practical requirements

● Software: Python 3, NLTK, Scikit-learn
● Reading:
  – Jurafsky & Martin: “Weighted automata and segmentation” section
  – Manning & Schütze: “Corpus based work” first two subsections
  – Salton et al.: “Extended Boolean Information Retrieval”
  – Brin & Page: “The anatomy of a large-scale hypertextual web search engine”
    (this is just an old rejected SIGIR paper from the 90s)

http://www.deepsky.com/~merovech/voynich/voynich_manchu_reference_materials/PDFs/jurafsky_martin.pdf
http://ics.upjs.sk/~pero/web/documents/pillar/Manning_Schuetze_StatisticalNLP.pdf