Natural Language Processing
10 April 2017
Who are we?
Jiri Tom Rado Martin
Infrastructure
● Keboola project invitation
● Python 3+ (preferably Anaconda) installed
● cmder.net on Windows (Mac and Linux should be fine)
NLP – Why do we care?
Problem
Huge amount of text, growing faster and faster
Computers process mostly structured data
Businesses are forced to ignore crucial data
We had a great time in Royal Plaza. We had to wait a while at hte reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!
Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.
Solution = analyze text automatically
We had a great time in Royal Plaza. We had to wait a while at the reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!
Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.
Royal Plaza: great
Room service: excellent
Reception: waiting
Bathrooms: rusty
Speed Taxi: expensive!
Retention: I’m leaving
Billing: last invoice was wrong
Technical support: bad signal at home
Customer support: battery lasts only few hours
Example – demo.geneea.com
Example – Customer feedback – Relations
Just text analysis is not enough
Connection with structured data:
• Time
• Location
• Popularity of text (likes/dislikes, retweets)
• Financial data
• Age, gender of the author of the text
What else?
• Machine translation
• Information retrieval
• Finding similar documents (plagiarism)
• Summarization
• Dictation, IVR, automatic closed-captioning, text-to-speech
• Email/ticket routing
• Grammar/spelling checking
• Recognition of people, companies …; relations between them
• Detection of sentiment, uncertainty
• Intent detection
• much more
Low level tasks
• Sentence boundary detection
• Tokenization
• Part-of-speech tagging: Can he can me for kicking a can?
• Lemmatization
• Parsing
• Coreference resolution
• Understanding dates: 10th of April 2017; April 10, 2017; 04/10/2017; 04/10/17; 04/10; April 10; 2017-04-10
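Understanding dates, the last task above, can be sketched with Python’s standard library. This is a toy normalizer, not a complete date parser; the `FORMATS` list and `parse_date` helper are illustrative assumptions. Note that `04/10` alone is genuinely ambiguous (April 10 in US order, October 4 in European order), which is exactly the kind of decision a real system has to make.

```python
import re
from datetime import date, datetime

# Patterns for the variants above; %B is locale-dependent (English assumed).
# %m/%d/%y must come before %m/%d/%Y, because %Y would happily parse "17" as year 17.
FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y", "%B %d, %Y", "%d of %B %Y"]

def parse_date(text):
    """Normalize a date expression to a datetime.date, or return None."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text.strip())  # 10th -> 10
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

All of `10th of April 2017`, `April 10, 2017`, `04/10/2017`, `04/10/17` and `2017-04-10` then normalize to the same `date(2017, 4, 10)`.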
Approaches to NLP
• Rule based approach – circa from 1950
• Machine learning – circa from 1980
Approaches to NLP – Rule based
• Chomsky (1957): Syntactic Structures
• Machine translation: the Georgetown–IBM experiment (1954)
Time flies like an arrow.
Approaches to NLP – Machine learning
• Statistical methods, machine learning
• Importance of exact evaluation
• Data, data, data
• Annotation
Machine learning
• Unsupervised
  • Finding hidden structure in data
  • For example clustering
• Supervised
  • Requires training data with correct answers
Data – Corpora
● Morphology, tagging: Penn Treebank, PDT, ...
● Parallel corpora: European and Canadian parliament proceedings, movie subtitles
● Specialised (e.g. for sentiment)
Sentiment analysis
• German economy is booming.
• Human trafficking is booming in California.
• Burger King has better fries than McDonald’s.
• Battery is good, but the display is terrible.
Sentiment Analysis
• I was happy.
• I was sad.
• I was not happy.
• I have never been happy in my life.
• I have never been so happy in my life.
• It’s not good, but I still love it.
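A naive lexicon-based scorer shows why the examples above are hard. This is a toy sketch: the word lists and the negation rule are illustrative assumptions, nothing like a real sentiment model.

```python
# Toy word lists -- a real system would use a large sentiment lexicon.
LEXICON = {"happy": 1, "good": 1, "great": 1, "love": 1, "sad": -1, "terrible": -1}
NEGATORS = {"not", "never"}

def naive_sentiment(text):
    """Sum lexicon polarities, flipping the next polar word after a negator."""
    score, negate = 0, False
    for token in text.lower().replace(".", " ").replace(",", " ").split():
        if token in NEGATORS:
            negate = True
        elif token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    return score
```

It gets “I was not happy.” right (−1), but it also flips “I have never been so happy in my life.” to −1 and scores “It’s not good, but I still love it.” as 0 — exactly the failure modes the sentences above are meant to illustrate.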
Sentiment Analysis
•She is pretty.
•She is pretty annoying.
Sentiment Analysis
•Well, that was a success. (sarcasm?)
•Go read the book.
Sentiment Analysis
• The previous version was absolutely great, it was a pleasure to work with, but now, I am a little confused.
Sentiment Analysis
•Harmony Smith drives me crazy.
•Bob’s Bad Breath Burger is delicious.
•That was a bad ass burger!
Sentiment classification - python
go to https://jupyter.geneea.com
Evaluation
Evaluation: Precision/Recall
False positive → bad precision
False negative → bad recall
Evaluation: Confusion matrix

                   Predicted positive   Predicted negative
Real positive      true positive        false negative
Real negative      false positive       true negative
Evaluation: Confusion matrix

                   Predicted positive   Predicted negative
Real positive      10                   5
Real negative      3                    16
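From the counts in the confusion matrix above, precision, recall and accuracy follow directly; a minimal sketch:

```python
# Counts from the confusion matrix above.
tp, fn = 10, 5    # real positives: predicted positive / predicted negative
fp, tn = 3, 16    # real negatives: predicted positive / predicted negative

precision = tp / (tp + fp)                    # 10/13
recall    = tp / (tp + fn)                    # 10/15
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 26/34

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.77 recall=0.67 accuracy=0.76
```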
Evaluation
• Multiple possible answers
• Not all errors are equally important
• Inter-annotator agreement (very low for tagging with an open tag set)
Machine learning – Overfitting
Discovery analysis
● explore the data
● prefer recall over precision
● malformed or irrelevant tags are not a big deal (as opposed to media tags)
Yelp Sample – 160k Restaurant Reviews
Command line tools
• Simple and generic tools for text transformation
• Where to get:
  • Linux (and Mac) – part of the operating system
  • Windows – install: cmder.net or cygwin
https://jupyter.geneea.com/tree/data
(click the file, then choose File > Save; do NOT right-click and “Save as”!)
Example
• How many lines, words and characters?
  • The first command: wc <filename>
• How to show a file:
  • cat <filename> (prints the whole file)
  • less <filename> (pages; press space to advance, “q” to quit)
  • head / tail <filename>
• All commands: parameter --help
Encoding
• https://en.wikipedia.org/wiki/Character_encoding
• ASCII – 7 bits
  • Does not cover all languages’ characters
• More encodings for the same language (e.g. windows-1250, iso-8859-2, utf-8 for Czech)
Conversion from one encoding to another (iso-8859-2 to utf-8)
iconv -f iso-8859-2 -t utf-8 text_orig.txt > text_utf8.txt
Other commands and generic principles
• sort
  • Sorts the file alphabetically or numerically (-n)
• Each command has input and output
• It is possible to make a chain of commands (output of one command becomes input for another one)
• Using the character | (pipe)
  • cat text_utf8.txt | sort
• How to send output to a file?
  • Using the character >
  • cat text_utf8.txt | sort > text_sorted.txt
CSV processing
• csvkit
  • https://csvkit.readthedocs.io
  • pip install csvkit
• csvcut -c 2 file.csv
• in2csv data.xls > data.csv
  • Conversion from Excel to csv
• csvcut -n data.csv
  • Print column names
Other commands – tr, uniq, cut
• tr
  • Replace one character with another:
  • tr 'a' 'b'
  • tr ' ' '\n'
  • tr '[:punct:]' '\n'
• uniq
  • Excludes repeated rows
  • The input must be sorted first
  • cat text_utf8.txt | tr '[:punct:]' '\n' | tr ' ' '\n' | sort | uniq -c | sort
• cut
  • Filters columns or characters and prints only selected ones
  • cut -f 1 -d " "
  • Works well with tsv (tab-separated files)
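The tr/sort/uniq pipeline above computes word frequencies; as a cross-check, the same computation in Python (a rough sketch — `\w+` only approximates splitting on punctuation and spaces):

```python
from collections import Counter
import re

def word_frequencies(text):
    # Rough equivalent of:
    #   tr '[:punct:]' '\n' | tr ' ' '\n' | sort | uniq -c | sort
    # i.e., split on punctuation/whitespace, then count.
    return Counter(re.findall(r"\w+", text))

print(word_frequencies("the cat and the dog saw the cat").most_common(2))
```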
Other commands - grep
• Filtering rows
  • grep "foo"
• Regular expressions:
  • Patterns that match multiple words/texts
  • [a-z] … characters from a to z
• Exercises:
  • Print rows containing at least one number
  • How many unique words does the file contain?
  • What are the most frequent words starting with a specific letter?
Other commands
• wget
  • Downloads pages from a website
• echo
  • Prints text to the console
• sed
  • More complex tool for replacing strings
  • echo "wine" | sed -e 's/wine/beer/'
• dos2unix, unix2dos
  • Convert end-of-line characters
Other tools
• Notepad++
  • With the TextFX plug-in
Simple tagging
1. Tokenization: split text into words
2. Drop unimportant words: stop words
3. Find important words: tf-idf
tf – term frequency

So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!
tf(pizza) = 5
tf(the) = 15
idf – inverse document frequency
idf(the) = log(160,000 / 152,000) = log(1.05)
idf(pizza) = log(160,000 / 11,800) = log(13.56)
idf(Tokyo) = log(160,000 / 194) = log(824.74)
tf-idf
w(t, doc) = log(1 + tf(t)) * log(N / df(t))

w(the, doc) = log(1 + 15) * log(1.05) = 0.28
w(pizza, doc) = log(1 + 5) * log(13.56) = 9.72
w(Tokyo, doc) = log(1 + 0) * log(824.74) = 0
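The weights above can be reproduced in Python. Using base-2 logarithms matches the slide numbers (an inference — the slides do not state the base; any base works, it only rescales all weights uniformly):

```python
import math

N = 160_000  # documents in the Yelp sample

def weight(tf, df):
    """tf-idf weight: w(t, doc) = log(1 + tf) * log(N / df), base-2 logs."""
    return math.log2(1 + tf) * math.log2(N / df)

print(round(weight(5, 11_800), 2))    # pizza -> 9.72
print(round(weight(0, 194), 2))       # Tokyo -> 0.0 (absent from the document)
print(round(weight(15, 152_000), 2))  # the   -> 0.3
# (the slide shows 0.28 for "the" because it rounds 160000/152000 to 1.05 first)
```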
Simple, yet works well
Tokenization

Merriam-Webster's, Tumu-M'Pongo
10-year, one-liners, self-proclaimed
United Kingdom-United States relations
5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00
10000, 10 000, 10,000
3.14159, 10.12., 10. prosince (Czech for “December 10”), U.S.A., H2O
km/h, A/C, s/he, °C
N40° 44.9064', W073° 59.0735'
www.some-news.com/article-about-stuff, [email protected]
Tokenization
Arbitrary decisions have to be made.
Stick to them consistently: pre-trained models may work poorly when fed differently tokenized data.
eat x ate x eaten

OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!

tf(eat) + tf(ate) + tf(eaten) = 3
tf(eat) = 1
Lemmatization & Morphology
Processing Morphology

Lemmatization: word → lemma (dictionary form)
Peter saw her. → Peter see she .

POS Tagging: word → tag
Peter saw her. → noun, verb, pronoun, punct

Morphological analysis: ignores context
saw → {[see, verb.past], [saw, noun.sg]}
Morpheme segmentation: de-nation-al-iz-ation
Generation: see + verb.past → saw
Morphology – not so easy
city – citi-es, goose – geese, sheep – sheep, go – went
Stuhl – Stühl-e, Vater – Väter
matk-a – mat-e-k – matc-e – matč-in
Morphology – not so easy
Tagalog (Philippines):
basa ‘read’    b-um-asa ‘read (past)’
sulat ‘write’  s-um-ulat ‘wrote’
rare in English: abso-bloody-lutely
Arabic, Hebrew – templates
Choice of lemma depth
inflection: debates → debate
brought, brings, bringing → bring
negation: unreasonable → reasonable
gradation: highest → high (Highest Court)
Morphology: not so easy - derivation
solution – solve; kind – kindly – kindness
un-happy – in-comprehensive – im-possible – ir-rational
unloosen = loosen
unnerve, unearth
Zipf’s law

Word frequency is inversely proportional to its frequency rank.
Unique token count: 145k
Total token count: 20M

rank     word       freq
1        the        801132
2        and        635035
3        I          521421
4        a          509089
5        to         398695
...
53038    turorials  2
>68000   (any word) 1
First % of unique words    covers % of all tokens
1% 84%
10% 97.7%
20% 98.9%
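The coverage figures above follow from the frequency list alone; a minimal sketch of the computation (the `coverage` helper and the toy frequency list are illustrative, not the 145k-word Yelp vocabulary):

```python
def coverage(freqs, top_fraction):
    """Share of all tokens covered by the most frequent `top_fraction` of unique words."""
    ranked = sorted(freqs, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Toy frequency list: 5 unique words, 16 tokens total.
freqs = [8, 4, 2, 1, 1]
print(coverage(freqs, 0.2))  # -> 0.5   (the top word alone covers half the tokens)
print(coverage(freqs, 0.4))  # -> 0.75
```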
Consequences
Pareto’s rule (80 : 20)
One can achieve “reasonable” quality fast.
Costs of additional improvements rise “exponentially” (long tail).
Ambiguity and fuzziness on every layer of language
Part-of-speech tagging
Part-of-speech tagging
I love hiking through the woods on weekends .
PRP VBP NN IN DT NNS IN NNS .
Petrov et al. – (Google) Universal POS Tagset

VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
Penn Treebank tagset
Ambiguity
Mrs. Shaefer never got around/RP to joining.
All we gotta do is go around/IN the corner.
Chateau Petrus costs around/RB 2,500.
They were married/VBN by the Justice of the Peace yesterday at 5:00.
At the time, she was already married/JJ.
Entities
Example – Svejk – characters
Švejk
Entities – named and non-named
Named entities: personal names, organizations, geographical names
Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time
Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain
Entities – some basic challenges

Types – fuzziness, hierarchy
Facebook – product or company?
European Union – organization or place?
Embedded entities
[Dr.] Martin Luther King [Jr.]
[The [New England] Journal of Medicine]
[Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]
List look-up not enough
Washington, The police, ANO (a Czech party; the name means “Yes”)
Entities – ML
• annotation – tag tokens with labels like PERSON_START, PERSON_CONT
• popular classifier – CRF
• features:
• word shape (case, is alphanumeric etc.)
• morphological features
• gazetteers
• distsim, word2vec
• labels already assigned to previous word(s)
• add features of surrounding tokens, previous instances of the same word, use n-grams …
• can use two passes
Entities/Tags – remaining issues
Coreference – resolve pronouns and descriptions (he, the president) to increase tf of the entity
Standardization
iPads > iPad
Windows != Window, United States != Unite State
The first stage has landed on Of Course I Still Love You.
He sang Bratříčku zavírej vrátka.
Normalization
USA = United States of America = United States ~ America
Hillary Rodham = Hillary Clinton
Syntax & Parsing
Old men and women are hard to live with.
I saw her duck.
The chicken is too hot to eat.
The mayor is a dirty street fighter.
Happily they left.
Terry loves his wife and so do I.
Vectors
Vector methods
One way to bridge natural language and classical ML
After transforming to vectors, integration with ML systems is easy
Applications: Search
Text classification
Preprocessing / feature extraction for any ML task, e.g., neural networks: image -> vector -> text
Vector methods: bag of words
Preprocessing – tokenization, stemming/lemmatization, cleaning
Create a vector d with dimension V (size of vocabulary)
di = tfi (term frequency of the i-th word)
A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
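The steps above can be sketched in a few lines. This is a toy version: the stop-word set and the one-entry lemma table stand in for real preprocessing.

```python
from collections import Counter

STOP_WORDS = {"a", "and", "on", "the"}
LEMMAS = {"slept": "sleep"}  # toy lemma table standing in for a real lemmatizer

def bag_of_words(text):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    bag = Counter(LEMMAS.get(t, t) for t in tokens)
    vocab = sorted(bag)                   # fix a dimension order
    return bag, [bag[w] for w in vocab]   # counts, plus the vector the slide sketches

bag, vector = bag_of_words("A black cat and a white cat slept on a mat")
print(dict(bag))  # -> {'black': 1, 'cat': 2, 'white': 1, 'sleep': 1, 'mat': 1}
print(vector)     # -> [1, 2, 1, 1, 1]  (alphabetical vocab: black, cat, mat, sleep, white)
```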
Vector methods: bag of words improved
Fancier values instead of raw tf (e.g., tf-idf)
Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}
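Adding n-grams to the bag is a one-liner; a sketch for bigrams (the helper name is mine):

```python
def add_bigrams(tokens):
    """Append adjacent-word bigrams so phrases like 'black cat' become features too."""
    return list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(add_bigrams(["black", "cat", "slept"]))
# -> ['black', 'cat', 'slept', 'black cat', 'cat slept']
```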
Vector methods: dimensionality reduction
Each term is a feature - very big dimension
Dimensionality reduction
LSI (LSA) – term-document matrix decomposition
LDA – topic inference using probabilistic graphical model
Word2vec – transform words to vectors of given size, capture their context
gensim Python library
Vector methods: Latent Semantic Indexing (LSI)

goal: map semantically similar documents to similar vectors

{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}
reduce dimensionality by singular value decomposition (SVD) of the term-document matrix
addresses synonymy to some extent, homonymy to a lesser extent
From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...
Source: Jialu Liu: Topic Model
Vector methods: Latent Dirichlet Allocation (LDA)
topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten
topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy
Finds probability distributions of topics for documents and words for topics
Vector methods: Latent Dirichlet Allocation (LDA)
From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...
0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace +
Vector methods: Word2vec

Doesn’t ignore word order; uses either skip-grams or continuous bag of words (CBOW)
vector arithmetic: king - man + woman ≈ queen
uses neural networks
research shows it is closely related to implicit matrix factorization
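The vector-arithmetic trick can be demonstrated with hand-picked toy vectors (real embeddings are learned and typically 100–300-dimensional; these 2-d values are chosen so the man→woman offset equals the king→queen offset):

```python
import math

VECTORS = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.5, 0.8],
    "woman": [0.5, 0.2],
    "apple": [0.1, 0.5],  # a distractor word
}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), inputs excluded."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    return max((w for w in VECTORS if w not in (a, b, c)),
               key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```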
Vector methods: Word2vec
model.most_similar(positive=['nuclear'])
[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]