Natural Language Processing...Approaches to NLP •Rule based approach – circa from 1950...

Natural Language Processing 10 April 2017


Natural Language Processing

10 April 2017


Who are we?

Jiri, Tom, Rado, Martin

Infrastructure

● Keboola project invitation
● Python 3+ (preferably Anaconda) installed
● cmder.net on Windows (Mac and Linux should be fine)


NLP – Why do we care?

Problem

Huge amount of text, growing faster and faster

Computers process mostly structured data

Businesses are forced to ignore crucial data


We had a great time in Royal Plaza. We had to wait a while at hte reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!

Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.

Problem


We had a great time in Royal Plaza. We had to wait a while at the reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!

Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.

Royal Plaza: great
Room service: excellent
Reception: waiting
Bathrooms: rusty
Speed Taxi: expensive!

Retention: I’m leaving
Billing: last invoice was wrong
Technical support: bad signal at home
Customer support: battery lasts only few hours

Solution = analyze text automatically


Example – demo.geneea.com


Example – Customer feedback – Relations


Just text analysis is not enough

Connection with structured data:
• Time
• Location
• Popularity of text (likes/dislikes, retweets)
• Financial data
• Age, gender of the author of the text

What else?

• Machine translation
• Information retrieval
• Finding similar documents (plagiarism)
• Summarization
• Dictation, IVR, automatic closed-captioning, text-to-speech
• Email/ticket routing
• Grammar/spelling checking
• Recognition of people, companies …; relations between them
• Detection of sentiment, uncertainty
• Intent detection
• much more

Low level tasks

• Sentence boundary detection

• Tokenization

• Part-of-speech tagging: Can he can me for kicking a can?

• Lemmatization

• Parsing

• Coreference resolution

• Understanding dates: 10th of April 2017; April 10, 2017; 04/10/2017; 04/10/17; 04/10; April 10; 2017-04-10
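The date variants above can be normalized with a small format table. This is only a sketch: the list of formats, their order, and the US month-first reading of 04/10 are assumptions of the example, and variants such as "10th of April 2017" or "04/10" (no year) are deliberately left unhandled.

```python
from datetime import datetime

# Formats tried in order; the list and its order are assumptions of this sketch.
FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y", "%m/%d/%y"]

def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

for s in ["2017-04-10", "April 10, 2017", "04/10/2017", "04/10/17", "10th of April 2017"]:
    print(s, "->", parse_date(s))
```

A real system also has to decide whether 04/10 means April 10 or October 4, which depends on the locale of the author.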


Approaches to NLP


Approaches to NLP

• Rule based approach – circa from 1950

• Machine learning – circa from 1980

Approaches to NLP – Rule based

• Chomsky (1957): Syntactic Structures

• Machine translation from IBM & Georgetown (Georgetown–IBM experiment, 1954)


Time flies like an arrow.


Approaches to NLP – Machine learning

• Statistical methods, machine learning

• Importance of exact evaluation

• Data, data, data

• Annotation

Machine learning

• Unsupervised
  • Finding hidden structure in data
  • For example clustering

• Supervised
  • Requires training data with correct answers

Data – Corpora

● Morphology, tagging: Penn Treebank, PDT, ...
● Parallel corpora: European, Canadian parliament, movie subtitles
● Specialised (e.g. for sentiment)


Sentiment analysis


Sentiment Analysis

• German economy is booming.

• Human trafficking is booming in California.

• Burger King has better fries than McDonald's.

• Battery is good, but the display is terrible.


Sentiment Analysis

• I was happy.

• I was sad.

• I was not happy.

• I have never been happy in my life.

• I have never been so happy in my life.

• It’s not good, but I still love it.

Sentiment Analysis

• She is pretty.

• She is pretty annoying.

Sentiment Analysis

• Well, that was a success. (sarcasm?)

• Go read the book.


Sentiment Analysis

• The previous version was absolutely great, it was a pleasure to work with, but now, I am a little confused.

Sentiment Analysis

• Harmony Smith drives me crazy.

• Bob’s Bad Breath Burger is delicious.

• That was a bad ass burger!


Sentiment classification - python

go to https://jupyter.geneea.com
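For readers without access to the notebook, a sentiment classifier can be sketched in a few lines of plain Python. The example below is a minimal bag-of-words Naive Bayes classifier with add-one smoothing; it is not the notebook's code, and the six training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made training set; purely illustrative, not real data.
TRAIN = [
    ("the room service was excellent", "pos"),
    ("we had a great time", "pos"),
    ("the battery is good", "pos"),
    ("the display is terrible", "neg"),
    ("the last invoice was wrong again", "neg"),
    ("way too expensive", "neg"),
]

def train(examples):
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of training documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)   # log prior
        for w in text.lower().split():
            if w in vocab:   # words never seen in training are ignored
                # add-one (Laplace) smoothing
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(classify("the service was great", model))
print(classify("the invoice was terrible", model))
```

Because the model only counts words, it has no way to treat "not happy" differently from "happy", which is exactly the kind of problem the preceding slides illustrate.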


Evaluation

Evaluation: Precision/Recall

False positive → bad precision

False negative → bad recall

Evaluation: Confusion matrix

Reality \ Prediction   Predicted positive   Predicted negative
Real positive          True positive        False negative
Real negative          False positive       True negative

Evaluation: Confusion matrix

Reality \ Prediction   Predicted positive   Predicted negative
Real positive          10                   5
Real negative          3                    16
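The usual metrics follow directly from these four counts. The sketch below assumes the matrix is read row by row (real positive: 10 predicted positive, 5 predicted negative; real negative: 3 and 16), matching the layout of the previous slide.

```python
# Counts taken from the confusion matrix, read row by row (an assumption of this sketch)
tp, fn = 10, 5    # real positive: predicted positive / predicted negative
fp, tn = 3, 16    # real negative: predicted positive / predicted negative

precision = tp / (tp + fp)                      # how many predicted positives are real
recall = tp / (tp + fn)                         # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision = {precision:.3f}")   # 0.769
print(f"recall    = {recall:.3f}")      # 0.667
print(f"F1        = {f1:.3f}")          # 0.714
print(f"accuracy  = {accuracy:.3f}")    # 0.765
```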


Evaluation

• Multiple possible answers

• Not all errors are equally important

• Inter-annotator agreement (very low for tagging with an open tag set)


Machine learning – Overfitting

Discovery analysis
● explore the data
● prefer recall over precision
● malformed or irrelevant tags not a big deal (as opposed to media tags)


Yelp Sample – 160k Restaurant Reviews


Command line tools

• Simple and generic tools for text transformation

• Where to get:
  • Linux (and Mac) – part of the operating system
  • Windows – install cmder.net or cygwin

https://jupyter.geneea.com/tree/data (click the file, then choose File > Save; do NOT right-click and "Save as"!)

Example

• How many lines, words and characters?
  • The first command: wc <filename>

• How to show a file:
  • cat <filename> (prints the whole file)
  • less <filename> (pages; use space to advance and "q" to quit)
  • head / tail <filename> (first / last lines)

• All commands: parameter --help


Encoding

• https://en.wikipedia.org/wiki/Character_encoding

• ASCII – 7 bits
  • Does not cover all languages’ characters
• More encodings for the same language (e.g. windows-1250, iso-8859-2, utf-8 for Czech)

Conversion from one encoding to another (iso-8859-2 to utf-8):

iconv -f iso-8859-2 -t utf-8 text_orig.txt > text_utf8.txt
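The same conversion can be illustrated in Python, where the difference between encodings is easy to inspect. This is a sketch on an in-memory string (the Czech sample sentence is an assumption of the example), not a replacement for iconv on files.

```python
text = "Příliš žluťoučký kůň"           # Czech sample with diacritics

latin2 = text.encode("iso-8859-2")      # the bytes a Latin-2 file would contain
utf8 = latin2.decode("iso-8859-2").encode("utf-8")   # decode, then re-encode

print(len(latin2), len(utf8))           # accented letters need two bytes in UTF-8
print(utf8.decode("utf-8") == text)     # the round trip preserves the text

# Decoding with the wrong encoding raises no error here; it silently gives wrong letters.
print(latin2.decode("windows-1250"))
```

The last line shows why the encoding of a file has to be known: the bytes decode "successfully" under windows-1250, but several letters come out wrong.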

Other commands and generic principles

• sort
  • Sorting the file alphabetically or numerically (-n)

• Each command has input and output

• It is possible to make a chain of commands (output of one command becomes input for another one)
  • Using character | (pipe)
  • cat text_utf8.txt | sort

• How to send output to a file?
  • Using character >
  • cat text_utf8.txt | sort > text_sorted.txt

CSV processing

• csvkit
  • https://csvkit.readthedocs.io
  • pip install csvkit

• csvcut -c 2 file.csv
  • Print only the second column

• in2csv data.xls > data.csv
  • Conversion from Excel to csv

• csvcut -n data.csv
  • Print column names


Other commands – tr, uniq, cut

• tr
  • Replace one character with another:
    • tr 'a' 'b'
    • tr ' ' '\n'
    • tr '[:punct:]' '\n'

• uniq
  • Exclude repeating rows
  • It’s necessary to have the input sorted
  • cat text_utf8.txt | tr '[:punct:]' '\n' | tr ' ' '\n' | sort | uniq -c | sort -n

• cut
  • Filters columns or characters and prints only selected ones
  • cut -f 1 -d ' '
  • Works well with tsv (tab separated)
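For comparison, the word-frequency pipeline (tr, sort, uniq -c, sort) can be sketched in Python. The function name and the sample sentence are made up for the example; like the shell version, the count is case-sensitive and only ASCII punctuation is treated as a separator.

```python
import string
from collections import Counter

def word_frequencies(text):
    """Count words after replacing ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    return Counter(words)

sample = "the cat sat on the mat. The cat slept."
freq = word_frequencies(sample)
for word, count in freq.most_common(3):
    print(count, word)
```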

Other commands - grep

• Filtering rows
  • grep "foo"

• Regular expressions:
  • Patterns (templates) that match many words/texts
  • [a-z] … characters from a to z

• Exercises:
  • Print rows containing at least one number
  • How many unique words does the file contain?
  • What are the most frequent words starting with a specific letter?

Other commands

• wget
  • Download (crawl) pages from a website

• echo
  • Prints text to console

• sed
  • More complex tool for replacing strings
  • echo "wine" | sed -e 's/wine/beer/'

• dos2unix, unix2dos
  • Conversion of end-of-line characters

Other tools

• Notepad++
  • With TextFX plug-in


Simple tagging

1. Tokenization: split text into words

2. Drop unimportant words: stop words

3. Find important words: tf-idf


tf – term frequency

So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!

tf(pizza) = 5
tf(the) = 15


idf – inverse document frequency

idf(the) = log(160,000 / 152,000) = log(1.05)

idf(pizza) = log(160,000 / 11,800) = log(13.56)

idf(Tokyo) = log(160,000 / 194) = log(824.74)

tf-idf

w(t, doc) = log(1 + tf(t)) * log(N / df(t))     (log is base 2 in the numbers below)

w(the, doc) = log(1 + 15) * log(1.05) = 0.28

w(pizza, doc) = log(1 + 5) * log(13.56) = 9.72

w(Tokyo, doc) = log(1 + 0) * log(824.74) = 0

Simple, yet works well
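The weights can be recomputed with a few lines of Python. The sketch assumes base-2 logarithms, which is what the numbers on the slide imply; the function name is made up for the example. With the unrounded ratio 160,000 / 152,000 the weight of "the" comes out as about 0.30 rather than 0.28.

```python
import math

N = 160_000   # number of reviews in the sample

def tf_idf(tf, df, n_docs=N):
    """w(t, doc) = log2(1 + tf) * log2(N / df); base 2 is an assumption."""
    return math.log2(1 + tf) * math.log2(n_docs / df)

for term, tf, df in [("the", 15, 152_000), ("pizza", 5, 11_800), ("Tokyo", 0, 194)]:
    print(f"{term}: {tf_idf(tf, df):.2f}")
```

A frequent but uninformative word ("the") and a rare word that does not occur in the document ("Tokyo") both end up near zero; only "pizza" stands out.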

Tokenization

Merriam-Webster's, Tumu-M'Pongo

10-year, one-liners, self-proclaimed

United Kingdom-United States relations

5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00

10000, 10 000, 10,000

3.14159, 10.12., 10. prosince, U.S.A., H2O

km/h, A/C, s/he, °C

N40° 44.9064', W073° 59.0735'

www.some-news.com/article-about-stuff, [email protected]


Tokenization

Arbitrary decisions have to be made.

Stick to them consistently. Pre-trained models might work poorly if fed with differently tokenized data
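To make the "arbitrary decisions" concrete, here is a small regular-expression tokenizer. It is only one possible set of choices (keep hyphenated words and apostrophes together, keep digit groups joined by ".", "," or ":" together, split off all other punctuation), not a recommended standard.

```python
import re

TOKEN_RE = re.compile(r"""
    \d+(?:[.,:]\d+)+      # digit groups joined by . , or :   (10,000  3.14159  14:34)
  | \w+(?:[-'/]\w+)*      # words, optionally joined by - ' or /   (10-year  s/he)
  | [^\w\s]               # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Merriam-Webster's 10-year plan costs 10,000."))
print(tokenize("It was 14:34, s/he left."))
```

A different rule set would be just as defensible, for example splitting "10-year" into three tokens; what matters is applying the same rules to training data and to new input.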

eat x ate x eaten

OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!

tf(eat) = 1

tf(eat) + tf(ate) + tf(eaten) = 3


Lemmatization & Morphology

Processing Morphology

Lemmatization: word → lemma (dictionary form)
Peter saw her. → Peter see she .

POS Tagging: word → tag
Peter saw her. → noun, verb, pronoun, punct

Morphological analysis: ignores context
saw → {[see, verb.past], [saw, noun.sg]}

Morpheme segmentation: de-nation-al-iz-ation

Generation: see + verb.past → saw


Morphology – not so easy

city – citi-es, goose – geese, sheep – sheep, go – went

Stuhl – Stühl-e, Vater – Väter

matk-a – mat-e-k – matc-e – matč-in

Morphology – not so easy

Tagalog (Philippines):

basa ‘read’       b-um-asa ‘read (past)’
sulat ‘write’     s-um-ulat ‘wrote’

rare in English:   abso-bloody-lutely

Arabic, Hebrew – templates

Choice of lemma depth

inflection: debates → debate
brought, brings, bringing → bring

negation: unreasonable → reasonable

gradation: highest → high (Highest Court)


Morphology: not so easy - derivation

solution – solve; kind – kindly – kindness

un-happy – in-comprehensive – im-possible – ir-rational

unloosen = loosen

unnerve, unearth

Zipf’s law

A word's frequency is inversely proportional to its frequency rank.

Unique token count: 145k
Total token count: 20M

rank      word        freq
1         the         801132
2         and         635035
3         I           521421
4         a           509089
5         to          398695
...
53038     turorials   2
>68000                1

First (of ranked words)   Covers (of all tokens)
1%                        84%
10%                       97.7%
20%                       98.9%
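A coverage figure of this kind can be computed from a token list as sketched below. The function and the toy sentence are made up for the example, and it assumes that "first x%" means the most frequent x% of unique words.

```python
from collections import Counter

def coverage(tokens, fraction):
    """Share of all tokens covered by the most frequent `fraction` of unique words."""
    counts = Counter(tokens)
    top_n = max(1, int(len(counts) * fraction))
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Toy corpus: 11 tokens, 7 unique words
tokens = "the cat and the dog and the bird saw the fish".split()
print(f"{coverage(tokens, 0.2):.2f}")   # the single most frequent word covers 4 of 11 tokens
```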

Consequences

Pareto’s rule (80 : 20)

One can achieve “reasonable” quality fast

Costs of additional improvements rise “exponentially” (long tail)

Ambiguity and fuzziness on every layer of language


Part-of-speech tagging

Part-of-speech tagging

I    love  hiking  through  the  woods  on  weekends  .
PRP  VBP   NN      IN       DT   NNS    IN  NNS       .

Petrov et al – (Google) Universal POS Tagset

VERB - verbs (all tenses and modes)

NOUN - nouns (common and proper)

PRON - pronouns

ADJ - adjectives

ADV - adverbs

ADP - adpositions (prepositions and postpositions)

CONJ - conjunctions

DET - determiners

NUM - cardinal numbers

PRT - particles or other function words

X - other: foreign words, typos, abbreviations

. - punctuation


Penn Treebank tagset


Ambiguity

Mrs. Shaefer never got around/RP to joining.

All we gotta do is go around/IN the corner.

Chateau Petrus costs around/RB 2,500.

They were married/VBN by the Justice of the Peace yesterday at 5:00.

At the time, she was already married/JJ.


Entities

Example – Švejk – characters


Švejk


Entities – named and non-named

Named entities: personal names, organizations, geographical names

Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time

Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain

Entities – some basic challenges

Types – fuzziness, hierarchy

Facebook – product or company?

European Union – organization or place?

Embedded entities

[Dr.] Martin Luther King [Jr.]

[The [New England] Journal of Medicine]

[Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]

List look-up not enough

Washington, The police, ANO (Yes)

Entities – ML

• annotation – tag tokens with labels like PERSON_START, PERSON_CONT
• popular classifier – CRF (conditional random field)
• features
  • word shape (case, is alphanumeric etc.)
  • morphological features
  • gazetteers
  • distsim, word2vec
  • labels already assigned to previous word(s)
• add features of surrounding tokens, previous instances of the same word, use n-grams …
• can use two passes
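The annotation scheme mentioned in the list (PERSON_START, PERSON_CONT) can be sketched as a conversion from entity spans to per-token labels. The function name, the "O" label for tokens outside any entity, and the LOCATION type are assumptions of this example.

```python
def label_tokens(tokens, spans):
    """Turn entity spans (start, end, type), end exclusive, into one label per token."""
    labels = ["O"] * len(tokens)              # "O" = outside any entity (assumed label)
    for start, end, etype in spans:
        labels[start] = f"{etype}_START"
        for i in range(start + 1, end):
            labels[i] = f"{etype}_CONT"
    return labels

tokens = ["Martin", "Luther", "King", "visited", "Washington", "."]
spans = [(0, 3, "PERSON"), (4, 5, "LOCATION")]

for token, label in zip(tokens, label_tokens(tokens, spans)):
    print(f"{token}\t{label}")
```

A sequence classifier such as a CRF is then trained to predict these labels from the token features.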

Entities/Tags – remaining issues

Coreference – mentions such as pronouns or “the president” should increase the tf of the entity they refer to

Standardization

iPads > iPad

Windows != Window, United States != Unite State

The first stage has landed on Of Course I Still Love You.

He sang Bratříčku zavírej vrátka.

Normalization

USA = United States of America = United States ~ America

Hillary Rodham = Hillary Clinton


Syntax & Parsing


Old men and women are hard to live with.

I saw her duck.

The chickens are too hot to eat.

The mayor is a dirty street fighter.

Happily they left.

Terry loves his wife and so do I.


Vectors


Vector methods

One way to bridge natural language and classical ML

After transforming to vectors, integration with ML systems is easy

Applications: Search

Text classification

Preprocessing / feature extraction for any ML task, e.g., neural networks: image -> vector -> text


Vector methods: bag of words

Preprocessing – tokenization, stemming/lemmatization, cleaning

Create a vector d with dimension V (size of vocabulary)

d_i = tf_i (term frequency of the i-th word)

A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
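The cat example can be reproduced with a short sketch. The stop-word set and the one-entry lemma dictionary are toy assumptions standing in for real preprocessing.

```python
from collections import Counter

STOP_WORDS = {"a", "and", "on"}     # assumed stop list for this example
LEMMAS = {"slept": "sleep"}         # toy lemma dictionary (hypothetical)

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, lemmatize, count."""
    tokens = text.lower().split()
    return Counter(LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS)

def to_vector(bag, vocabulary):
    """One dimension per vocabulary word; the value is its term frequency."""
    return [bag[word] for word in vocabulary]

bag = bag_of_words("A black cat and a white cat slept on a mat")
vocabulary = ["black", "white", "cat", "sleep", "mat"]
print(dict(bag))
print(to_vector(bag, vocabulary))   # [1, 1, 2, 1, 1]
```

With a real vocabulary the vector has one dimension per known word, so most entries are zero.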


Vector methods: bag of words improved

Fancier values instead of tf. (e.g., tf-idf)

Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}


Vector methods: dimensionality reduction

Each term is a feature - very big dimension

Dimensionality reduction

LSI (LSA) – term-document matrix decomposition

LDA – topic inference using probabilistic graphical model

Word2vec – transform words to vectors of given size, capture their context

gensim Python library

Vector methods: Latent Semantic Indexing (LSI)

goal: map semantically similar documents to similar vectors

{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}

reduce dimensionality by singular value decomposition (SVD) of the term-document matrix

somewhat addresses synonymy and, to a lesser extent, homonymy

From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...

Source: Jialu Liu: Topic Model


Vector methods: Latent Dirichlet Allocation (LDA)

topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten

topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy

Finds probability distributions of topics for documents and words for topics

Source: Jialu Liu: Topic Model


Vector methods: Latent Dirichlet Allocation (LDA)

From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...

0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace + ...

Vector methods: Word2vec

Takes the local context of words into account (unlike bag of words); uses either skip-gram or continuous bag of words (CBOW)

vector arithmetic: king - man + woman ≈ queen

uses neural networks

research shows analogy to matrix factorization
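The vector arithmetic can be illustrated without a trained model. The three-dimensional vectors below are hand-made toy values chosen so that the analogy works; real word2vec vectors are learned from a corpus and have hundreds of dimensions.

```python
import math

# Hypothetical toy vectors, not the output of any trained model.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Word whose vector is closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))   # queen
```

The result is the nearest vector, not an exact match, which is why the equation is usually read as "approximately equals".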

Vector methods: Word2vec

model.most_similar(positive=['nuclear'])
(newer gensim versions expose this as model.wv.most_similar)

[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]