Natural Language Processing
10 April 2017
Who are we?
Jiri Tom Rado Martin
Infrastructure
● Keboola project invitation
● Python 3+ (preferably Anaconda) installed
● cmder.net on Windows (Mac and Linux should be fine)
NLP – Why do we care?
Problem
Huge amount of text, growing faster and faster
Computers process mostly structured data
Businesses are forced to ignore crucial data
We had a great time in Royal Plaza. We had to wait a while at hte reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!
Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.
Solution = analyze text automatically
We had a great time in Royal Plaza. We had to wait a while at the reception before check in, Mike seemed to be very busy handling other guests. Bathrooms were a little bit rusty, but the room service was escellent. Speed Taxi service to Gatwick was way too expensive!
Im completely dissappointed. nothing works. The phone battery lasts only few hours before I have to charge it again. I barely have any reception at home. And the last invoice was wrong again. Third time in a row! You either fix it now or I’m leaving.
Royal Plaza: great
Room service: excellent
Reception: waiting
Bathrooms: rusty
Speed Taxi: expensive!
Retention: I’m leaving
Billing: last invoice was wrong
Technical support: bad signal at home
Customer support: battery lasts only few hours
Example – demo.geneea.com
Example – Customer feedback – Relations
Just text analysis is not enough
Connection with structured data:
• Time
• Location
• Popularity of text (likes/dislikes, retweets)
• Financial data
• Age, gender of the author of the text
What else?
• Machine translation
• Information retrieval
• Finding similar documents (plagiarism)
• Summarization
• Dictation, IVR, automatic closed-captioning, text-to-speech
• Email/ticket routing
• Grammar/spelling checking
• Recognition of people, companies …; relations between them
• Detection of sentiment, uncertainty
• Intent detection
• much more
Low level tasks
• Sentence boundary detection
• Tokenization
• Part-of-speech tagging: Can he can me for kicking a can?
• Lemmatization
• Parsing
• Coreference resolution
• Understanding dates: 10th of April 2017; April 10, 2017; 04/10/2017; 04/10/17; 04/10; April 10; 2017-04-10
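Understanding dates, the last task above, can be sketched with Python’s standard library. This is a toy normalizer, not a complete date parser; the `FORMATS` list and `parse_date` helper are illustrative assumptions. Note that `04/10` alone is genuinely ambiguous (April 10 in US order, October 4 in European order), which is exactly the kind of decision a real system has to make.

```python
import re
from datetime import date, datetime

# Patterns for the variants above; %B is locale-dependent (English assumed).
# %m/%d/%y must come before %m/%d/%Y, because %Y would happily parse "17" as year 17.
FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y", "%B %d, %Y", "%d of %B %Y"]

def parse_date(text):
    """Normalize a date expression to a datetime.date, or return None."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text.strip())  # 10th -> 10
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

All of `10th of April 2017`, `April 10, 2017`, `04/10/2017`, `04/10/17` and `2017-04-10` then normalize to the same `date(2017, 4, 10)`.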
Approaches to NLP
• Rule based approach – circa from 1950
• Machine learning – circa from 1980
Approaches to NLP – Rule based
• Chomsky (1957): Syntactic Structures
• Machine translation: the Georgetown–IBM experiment (1954)
Time flies like an arrow.
Approaches to NLP – Machine learning
• Statistical methods, machine learning
• Importance of exact evaluation
• Data, data, data
• Annotation
Machine learning
• Unsupervised
  • Finding hidden structure in data
  • For example clustering
• Supervised
  • Requires training data with correct answers
Data – Corpora
● Morphology, tagging: Penn Treebank, PDT, ...
● Parallel corpora: European and Canadian parliament proceedings, movie subtitles
● Specialised (e.g. for sentiment)
Sentiment analysis
• German economy is booming.
• Human trafficking is booming in California.
• Burger King has better fries than McDonald’s.
• Battery is good, but the display is terrible.
Sentiment Analysis
• I was happy.
• I was sad.
• I was not happy.
• I have never been happy in my life.
• I have never been so happy in my life.
• It’s not good, but I still love it.
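A naive lexicon-based scorer shows why the examples above are hard. This is a toy sketch: the word lists and the negation rule are illustrative assumptions, nothing like a real sentiment model.

```python
# Toy word lists -- a real system would use a large sentiment lexicon.
LEXICON = {"happy": 1, "good": 1, "great": 1, "love": 1, "sad": -1, "terrible": -1}
NEGATORS = {"not", "never"}

def naive_sentiment(text):
    """Sum lexicon polarities, flipping the next polar word after a negator."""
    score, negate = 0, False
    for token in text.lower().replace(".", " ").replace(",", " ").split():
        if token in NEGATORS:
            negate = True
        elif token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    return score
```

It gets “I was not happy.” right (−1), but it also flips “I have never been so happy in my life.” to −1 and scores “It’s not good, but I still love it.” as 0 — exactly the failure modes the sentences above are meant to illustrate.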
Sentiment Analysis
•She is pretty.
•She is pretty annoying.
Sentiment Analysis
•Well, that was a success. (sarcasm?)
•Go read the book.
Sentiment Analysis
• The previous version was absolutely great, it was a pleasure to work with, but now, I am a little confused.
Sentiment Analysis
•Harmony Smith drives me crazy.
•Bob’s Bad Breath Burger is delicious.
•That was a bad ass burger!
Sentiment classification - python
go to https://jupyter.geneea.com
Evaluation
Evaluation: Precision/Recall
False positive → bad precision
False negative → bad recall
Evaluation: Confusion matrix

                   Predicted positive   Predicted negative
Real positive      true positive        false negative
Real negative      false positive       true negative
Evaluation: Confusion matrix

                   Predicted positive   Predicted negative
Real positive      10                   5
Real negative      3                    16
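From the counts in the confusion matrix above, precision, recall and accuracy follow directly; a minimal sketch:

```python
# Counts from the confusion matrix above.
tp, fn = 10, 5    # real positives: predicted positive / predicted negative
fp, tn = 3, 16    # real negatives: predicted positive / predicted negative

precision = tp / (tp + fp)                    # 10/13
recall    = tp / (tp + fn)                    # 10/15
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 26/34

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.77 recall=0.67 accuracy=0.76
```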
Evaluation
• Multiple possible answers
• Not all errors are equally important
• Inter-annotator agreement (very low for tagging with an open tag set)
Machine learning – Overfitting
Discovery analysis
● explore the data
● prefer recall over precision
● malformed or irrelevant tags are not a big deal (as opposed to media tags)
Yelp Sample – 160k Restaurant Reviews
Command line tools
• Simple and generic tools for text transformation
• Where to get:
  • Linux (and Mac) – part of the operating system
  • Windows – install: cmder.net or cygwin
https://jupyter.geneea.com/tree/data
(click the file, then choose File > Save; do NOT right-click and “Save as”!)
Example
• How many lines, words and characters?
  • The first command: wc <filename>
• How to show a file:
  • cat <filename> (prints the whole file)
  • less <filename> (pages; press space to advance, “q” to quit)
  • head / tail <filename>
• All commands: parameter --help
Encoding
• https://en.wikipedia.org/wiki/Character_encoding
• ASCII – 7 bits
  • Does not cover all languages’ characters
• More encodings for the same language (e.g. windows-1250, iso-8859-2, utf-8 for Czech)
Conversion from one encoding to another (iso-8859-2 to utf-8)
iconv -f iso-8859-2 -t utf-8 text_orig.txt > text_utf8.txt
Other commands and generic principles
• sort
  • Sorts the file alphabetically or numerically (-n)
• Each command has input and output
• It is possible to make a chain of commands (output of one command becomes input for another one)
• Using the character | (pipe)
  • cat text_utf8.txt | sort
• How to send output to a file?
  • Using the character >
  • cat text_utf8.txt | sort > text_sorted.txt
CSV processing
• csvkit
  • https://csvkit.readthedocs.io
  • pip install csvkit
• csvcut -c 2 file.csv
• in2csv data.xls > data.csv
  • Conversion from Excel to csv
• csvcut -n data.csv
  • Print column names
Other commands – tr, uniq, cut
• tr
  • Replace one character with another:
  • tr 'a' 'b'
  • tr ' ' '\n'
  • tr '[:punct:]' '\n'
• uniq
  • Excludes repeated rows
  • The input must be sorted first
  • cat text_utf8.txt | tr '[:punct:]' '\n' | tr ' ' '\n' | sort | uniq -c | sort
• cut
  • Filters columns or characters and prints only selected ones
  • cut -f 1 -d " "
  • Works well with tsv (tab-separated files)
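The tr/sort/uniq pipeline above computes word frequencies; as a cross-check, the same computation in Python (a rough sketch — `\w+` only approximates splitting on punctuation and spaces):

```python
from collections import Counter
import re

def word_frequencies(text):
    # Rough equivalent of:
    #   tr '[:punct:]' '\n' | tr ' ' '\n' | sort | uniq -c | sort
    # i.e., split on punctuation/whitespace, then count.
    return Counter(re.findall(r"\w+", text))

print(word_frequencies("the cat and the dog saw the cat").most_common(2))
```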
Other commands - grep
• Filtering rows
  • grep "foo"
• Regular expressions:
  • Patterns that match multiple words/texts
  • [a-z] … characters from a to z
• Exercises:
  • Print rows containing at least one number
  • How many unique words does the file contain?
  • What are the most frequent words starting with a specific letter?
Other commands
• wget
  • Downloads pages from a website
• echo
  • Prints text to the console
• sed
  • More complex tool for replacing strings
  • echo "wine" | sed -e 's/wine/beer/'
• dos2unix, unix2dos
  • Convert end-of-line characters
Other tools
• Notepad++
  • With the TextFX plug-in
Simple tagging
1. Tokenization: split text into words
2. Drop unimportant words: stop words
3. Find important words: tf-idf
tf – term frequency

So bummed, our company ordered pizza from here today and I tried to give the person who answered the phone the name of our company but she stated just give me your first name...due to that fact when the pizza was delivered over an hour later and we are less then 3 minutes down the street it was ICE COLD!!! I even called Dina's up questioning if the pizza was on the way yet and was told it left a while a go I don't understand how you don't have it yet. We ordered the taco specialty pizza so try warming that up with the lettuce and tomato all over the top of it. Not to mention the fact that the toppings completely fell off the pizza and were all on the corner of the box. WHAT A WASTE OF MONEY THIS WAS!!!!!!!!!!
tf(pizza) = 5
tf(the) = 15
idf – inverse document frequency
idf(the) = log(160,000 / 152,000) = log(1.05)
idf(pizza) = log(160,000 / 11,800) = log(13.56)
idf(Tokyo) = log(160,000 / 194) = log(824.74)
tf-idf
w(t, doc) = log(1 + tf(t)) * log(N / df(t))

w(the, doc) = log(1 + 15) * log(1.05) = 0.28
w(pizza, doc) = log(1 + 5) * log(13.56) = 9.72
w(Tokyo, doc) = log(1 + 0) * log(824.74) = 0
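The weights above can be reproduced in Python. Using base-2 logarithms matches the slide numbers (an inference — the slides do not state the base; any base works, it only rescales all weights uniformly):

```python
import math

N = 160_000  # documents in the Yelp sample

def weight(tf, df):
    """tf-idf weight: w(t, doc) = log(1 + tf) * log(N / df), base-2 logs."""
    return math.log2(1 + tf) * math.log2(N / df)

print(round(weight(5, 11_800), 2))    # pizza -> 9.72
print(round(weight(0, 194), 2))       # Tokyo -> 0.0 (absent from the document)
print(round(weight(15, 152_000), 2))  # the   -> 0.3
# (the slide shows 0.28 for "the" because it rounds 160000/152000 to 1.05 first)
```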
Simple, yet works well
Tokenization

Merriam-Webster's, Tumu-M'Pongo
10-year, one-liners, self-proclaimed
United Kingdom-United States relations
5-3, 5-3+1, U+2010, 2:4, 14:34, 10:00-14:00
10000, 10 000, 10,000
3.14159, 10.12., 10. prosince (Czech for “December 10”), U.S.A., H2O
km/h, A/C, s/he, °C
N40° 44.9064', W073° 59.0735'
www.some-news.com/article-about-stuff, [email protected]
Tokenization
Arbitrary decisions have to be made.
Stick to them consistently: pre-trained models may work poorly when fed differently tokenized data.
eat x ate x eaten

OK, not cheap but not outrageously expensive either. I've eaten here twice, the last time during May 2009, I enjoyed both the food & atmosphere. I suppose you could call the place a Bistro. The food is Scottish & locally sourced, caters for vegetarians & has a pretty varied menu without being ridiculously extensive. I seem to remember a good selection of wines but don't think they serve anything but bottled beer. Damned if I can remember what I ate but had fish once that was extremely tasty & their veg isn't undercooked that can be the fashion. The service was friendly with no unseemly waiting! A great night out in New Town. There are two sister restaurants: A Room in the West End & A Room in Leith. Enjoy a great place to eat in a fabulous city!

tf(eat) + tf(ate) + tf(eaten) = 3
tf(eat) = 1
Lemmatization & Morphology
Processing Morphology

Lemmatization: word → lemma (dictionary form)
Peter saw her. → Peter see she .

POS Tagging: word → tag
Peter saw her. → noun, verb, pronoun, punct

Morphological analysis: ignores context
saw → {[see, verb.past], [saw, noun.sg]}
Morpheme segmentation: de-nation-al-iz-ation
Generation: see + verb.past → saw
Morphology – not so easy
city – citi-es, goose – geese, sheep – sheep, go – went
Stuhl – Stühl-e, Vater – Väter
matk-a – mat-e-k – matc-e – matč-in
Morphology – not so easy
Tagalog (Philippines):
basa ‘read’    b-um-asa ‘read (past)’
sulat ‘write’  s-um-ulat ‘wrote’
rare in English: abso-bloody-lutely
Arabic, Hebrew – templates
Choice of lemma depth
inflection: debates → debate
brought, brings, bringing → bring
negation: unreasonable → reasonable
gradation: highest → high (Highest Court)
Morphology: not so easy - derivation
solution – solve; kind – kindly – kindness
un-happy – in-comprehensive – im-possible – ir-rational
unloosen = loosen
unnerve, unearth
Zipf’s law

Word frequency is inversely proportional to its frequency rank.
Unique token count: 145k
Total token count: 20M

rank     word       freq
1        the        801132
2        and        635035
3        I          521421
4        a          509089
5        to         398695
...
53038    turorials  2
>68000   (any word) 1
First % of unique words    covers % of all tokens
1% 84%
10% 97.7%
20% 98.9%
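The coverage figures above follow from the frequency list alone; a minimal sketch of the computation (the `coverage` helper and the toy frequency list are illustrative, not the 145k-word Yelp vocabulary):

```python
def coverage(freqs, top_fraction):
    """Share of all tokens covered by the most frequent `top_fraction` of unique words."""
    ranked = sorted(freqs, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Toy frequency list: 5 unique words, 16 tokens total.
freqs = [8, 4, 2, 1, 1]
print(coverage(freqs, 0.2))  # -> 0.5   (the top word alone covers half the tokens)
print(coverage(freqs, 0.4))  # -> 0.75
```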
Consequences
Pareto’s rule (80 : 20)
One can achieve “reasonable” quality fast.
Costs of additional improvements rise “exponentially” (long tail).
Ambiguity and fuzziness on every layer of language
Part-of-speech tagging
Part-of-speech tagging
I love hiking through the woods on weekends .
PRP VBP NN IN DT NNS IN NNS .
Petrov et al. – (Google) Universal POS Tagset

VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
Penn Treebank tagset
Ambiguity
Mrs. Shaefer never got around/RP to joining.
All we gotta do is go around/IN the corner.
Chateau Petrus costs around/RB 2,500.
They were married/VBN by the Justice of the Peace yesterday at 5:00.
At the time, she was already married/JJ.
Entities
Example – Svejk – characters
Švejk
Entities – named and non-named
Named entities: personal names, organizations, geographical names
Other interesting entities: URL, e-mail, phone numbers, money amounts and other quantities, date and time
Custom entities for given domain: bacon, onion, tomato, cheese for a burger chain
Entities – some basic challenges

Types – fuzziness, hierarchy
Facebook – product or company?
European Union – organization or place?
Embedded entities
[Dr.] Martin Luther King [Jr.]
[The [New England] Journal of Medicine]
[Gymnázium [Jozefa Gregora Tajovského] v [Banskej Bystrici]]
List look-up not enough
Washington, The police, ANO (a Czech party; the name means “Yes”)
Entities – ML
• annotation – tag tokens with labels like PERSON_START, PERSON_CONT
• popular classifier – CRF
• features:
• word shape (case, is alphanumeric etc.)
• morphological features
• gazetteers
• distsim, word2vec
• labels already assigned to previous word(s)
• add features of surrounding tokens, previous instances of the same word, use n-grams …
• can use two passes
Entities/Tags – remaining issues
Coreference – resolve pronouns and descriptions (he, the president) to increase tf of the entity
Standardization
iPads > iPad
Windows != Window, United States != Unite State
The first stage has landed on Of Course I Still Love You.
He sang Bratříčku zavírej vrátka.
Normalization
USA = United States of America = United States ~ America
Hillary Rodham = Hillary Clinton
Syntax & Parsing
Old men and women are hard to live with.
I saw her duck.
The chicken is too hot to eat.
The mayor is a dirty street fighter.
Happily they left.
Terry loves his wife and so do I.
Vectors
Vector methods
One way to bridge natural language and classical ML
After transforming to vectors, integration with ML systems is easy
Applications: Search
Text classification
Preprocessing / feature extraction for any ML task, e.g., neural networks: image -> vector -> text
Vector methods: bag of words
Preprocessing – tokenization, stemming/lemmatization, cleaning
Create a vector d with dimension V (size of vocabulary)
di = tfi (term frequency of the i-th word)
A black cat and a white cat slept on a mat -> {black:1, white:1, cat:2, sleep:1, mat:1} -> [1, 1, 2, 1, 1, ...]
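The steps above can be sketched in a few lines. This is a toy version: the stop-word set and the one-entry lemma table stand in for real preprocessing.

```python
from collections import Counter

STOP_WORDS = {"a", "and", "on", "the"}
LEMMAS = {"slept": "sleep"}  # toy lemma table standing in for a real lemmatizer

def bag_of_words(text):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    bag = Counter(LEMMAS.get(t, t) for t in tokens)
    vocab = sorted(bag)                   # fix a dimension order
    return bag, [bag[w] for w in vocab]   # counts, plus the vector the slide sketches

bag, vector = bag_of_words("A black cat and a white cat slept on a mat")
print(dict(bag))  # -> {'black': 1, 'cat': 2, 'white': 1, 'sleep': 1, 'mat': 1}
print(vector)     # -> [1, 2, 1, 1, 1]  (alphabetical vocab: black, cat, mat, sleep, white)
```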
Vector methods: bag of words improved
Fancier values instead of raw tf (e.g., tf-idf)
Add n-grams/phrases/entities to the bag {..., black cat:1, white cat:1, ...}
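Adding n-grams to the bag is a one-liner; a sketch for bigrams (the helper name is mine):

```python
def add_bigrams(tokens):
    """Append adjacent-word bigrams so phrases like 'black cat' become features too."""
    return list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(add_bigrams(["black", "cat", "slept"]))
# -> ['black', 'cat', 'slept', 'black cat', 'cat slept']
```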
Vector methods: dimensionality reduction
Each term is a feature - very big dimension
Dimensionality reduction
LSI (LSA) – term-document matrix decomposition
LDA – topic inference using probabilistic graphical model
Word2vec – transform words to vectors of given size, capture their context
gensim Python library
Vector methods: Latent Semantic Indexing (LSI)

goal: map semantically similar documents to similar vectors

{(car), (truck), (flower)} –> {(1.345 * car + 0.282 * truck), (flower)}
reduce dimensionality by singular value decomposition (SVD) of the term-document matrix
addresses synonymy to some extent, homonymy to a lesser extent
From EP corpus: 0.365*fishery + 0.342*fishing + 0.197*fish + -0.153*tax + -0.140*food + 0.116*aquaculture + ...
Source: Jialu Liu: Topic Model
Vector methods: Latent Dirichlet Allocation (LDA)
topic1 –> 0.1 milk, 0.09 meow, 0.08 kitten
topic2 –> 0.12 bark, 0.11 bone, 0.07 puppy
Finds probability distributions of topics for documents and words for topics
Vector methods: Latent Dirichlet Allocation (LDA)
From EP corpus: 0.018*transport + 0.013*passenger + 0.011*airline + 0.010*road + 0.009*safety + 0.007*simplify + 0.007*rail + 0.006*travel + ...
0.025*Israel + 0.017*Palestinian + 0.015*Jerusalem + 0.015*Gaza + 0.012*Prime + 0.011*Israeli + 0.009*peace +
Vector methods: Word2vec

Doesn’t ignore word order; uses either skip-grams or continuous bag of words (CBOW)
vector arithmetic: king - man + woman ≈ queen
uses neural networks
research shows it is closely related to implicit matrix factorization
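The vector-arithmetic trick can be demonstrated with hand-picked toy vectors (real embeddings are learned and typically 100–300-dimensional; these 2-d values are chosen so the man→woman offset equals the king→queen offset):

```python
import math

VECTORS = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.5, 0.8],
    "woman": [0.5, 0.2],
    "apple": [0.1, 0.5],  # a distractor word
}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), inputs excluded."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    return max((w for w in VECTORS if w not in (a, b, c)),
               key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```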
Vector methods: Word2vec
model.most_similar(positive=['nuclear'])
[('stations', 0.6321508884429932),
 ('reactor', 0.6199184060096741),
 ('plants', 0.6013395190238953),
 ('atomic', 0.5934208035469055),
 ('coal-fired', 0.5920413732528687),
 ('reactors', 0.549136221408844),
 ('solar', 0.5483176112174988),
 ('weapons', 0.5343624353408813),
 ('disarmament', 0.5275484919548035),
 ('plant', 0.5141536593437195)]