Crash-course in Natural Language Processing

Transcript of Crash-course in Natural Language Processing

Page 1: Crash-course in Natural Language Processing

Crash Course in Natural Language Processing

Vsevolod Dyomkin, 08/2015

Page 2: Crash-course in Natural Language Processing

A Bit about Me

* Lisp programmer
* Research lead at Grammarly (4 years of practical NLP work)
* Teacher at KPI: Operating Systems course

https://vseloved.github.io

Page 3: Crash-course in Natural Language Processing

A Bit about Grammarly

The best English language writing app:
* Spellcheck
* Grammar check
* Style improvement
* Synonyms and word choice
* Plagiarism check

Page 4: Crash-course in Natural Language Processing

Plan

* Overview of NLP
* Where to get data
* Common NLP problems and approaches
* How to develop an NLP system

Page 5: Crash-course in Natural Language Processing

What Is NLP?

Transforming free-form text into structured data and back

Page 6: Crash-course in Natural Language Processing

What Is NLP?

Transforming free-form text into structured data and back

Intersection of:
* Computational Linguistics
* CompSci & AI
* Stats & Information Theory

Page 7: Crash-course in Natural Language Processing

Linguistic Basis

* Syntax (form)
* Semantics (meaning)
* Pragmatics (intent/logic)

Page 8: Crash-course in Natural Language Processing

Natural Language

* ambiguous
* noisy
* evolving

Page 9: Crash-course in Natural Language Processing

Time flies like an arrow.
Fruit flies like a banana.

I read a story about evolution in ten minutes.
I read a story about evolution in the last million years.

Page 10: Crash-course in Natural Language Processing

NLP & Data

Types of text data:
* structured
* semi-structured
* unstructured

“Data is ten times more powerful than algorithms.”
-- Peter Norvig, The Unreasonable Effectiveness of Data
http://youtu.be/yvDCzhbjYWs

Page 11: Crash-course in Natural Language Processing

Kinds of Data

* Dictionaries
* Databases/Ontologies
* Corpora
* User Data

Page 12: Crash-course in Natural Language Processing

Where to Get Data?

* Linguistic Data Consortium: http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites & the academic community: Stanford, Oxford, CMU, ...

Page 13: Crash-course in Natural Language Processing

Create Your Own!

* Linguists
* Crowdsourcing
* By-product

-- Jonathan Zittrain http://goo.gl/hs4qB

Page 14: Crash-course in Natural Language Processing

Classic NLP Problems

* Linguistically-motivated: segmentation, tagging, parsing
* Analytical: classification, sentiment analysis
* Transformation: translation, correction, generation
* Conversation: question answering, dialog

Page 15: Crash-course in Natural Language Processing

Tokenization

Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "."

Issues:
* Finland’s capital - Finland Finlands Finland’s
* what’re, I’m, isn’t - what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.

Page 16: Crash-course in Natural Language Processing

Regular Expressions

Simplest regex: [^\s]+

More advanced regex:
\w+|[!"#$%&'*+,\./:;<=>?@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’-]

Even more advanced regex:
[+-]?[0-9](?:[0-9,\.]*[0-9])?|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?|["#$%&*+,/:;<=>@^`~…\(\){}\[\|\]⟨⟩«»“”‘’'‒–—―]|[\.!?]+|-+
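Below is a minimal Python sketch (mine, not the slides' exact pattern) of regex-based tokenization in the same spirit: numbers, words, and punctuation come out as separate tokens. Unlike the example output above, it keeps "isn't" whole; splitting contractions is left to the post-processing step on the next slide.

import re

TOKEN = re.compile(r"""
      [+-]?\d+(?:[.,]\d+)*   # numbers like 1.23 or 1,000
    | \w+(?:['’]\w+)?        # words, keeping internal apostrophes for now
    | [^\w\s]                # any single punctuation character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("This is a test that isn't so simple: 1.23."))
# ['This', 'is', 'a', 'test', 'that', "isn't", 'so', 'simple', ':', '1.23', '.']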

Page 17: Crash-course in Natural Language Processing

Post-processing

* concatenate abbreviations and decimals
* split contractions with regexes (see the sketch below):
  2-character: i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
  3-character: (?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
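A tiny Python sketch of this step, using simplified stand-ins for the patterns above (they do not restrict which pronoun precedes the contraction):

import re

TWO_CHAR = re.compile(r"(.+)('m|'s|'d)$", re.I)           # I'm, he's, they'd
THREE_CHAR = re.compile(r"(.+)('ll|'ve|'re|n't)$", re.I)  # we'll, you've, isn't

def split_contraction(token):
    for pat in (THREE_CHAR, TWO_CHAR):
        m = pat.match(token)
        if m:
            return [m.group(1), m.group(2)]
    return [token]

print(split_contraction("isn't"))  # ['is', "n't"]
print(split_contraction("I'm"))    # ['I', "'m"]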

Page 18: Crash-course in Natural Language Processing

Rule-based Approach

* easy to understand and reason about
* can be arbitrarily precise
* iterative, can be used to gather more data

Limitations:
* recall problems
* poor adaptability

Page 19: Crash-course in Natural Language Processing

Rule-based NLP tools

* SpamAssassin
* LanguageTool
* ELIZA
* GATE

Page 20: Crash-course in Natural Language Processing

Statistical Approach

“Probability theory is nothing but common sense reduced to calculation.”
-- Pierre-Simon Laplace

Page 21: Crash-course in Natural Language Processing

Language Models

Question: what is the probability of a sequence of words/sentence?

Page 22: Crash-course in Natural Language Processing

Language Models

Question: what is the probability of a sequence of words/sentence?

Answer: Apply the chain rule

P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * …

where S = w0 w1 w2 …

Page 23: Crash-course in Natural Language Processing

Ngrams

Apply the Markov assumption: each word depends only on the N previous words (in practice N=1..4, which gives bigram to fivegram models, since the current word is also counted).

If N=2:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * …

By the definition of conditional probability:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
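For concreteness, here is a toy bigram (N=1) model sketched in Python along the lines of these formulas, assuming a tiny hand-built corpus and raw counts with no smoothing:

from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words          # sentence-start marker
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_prob(words, unigrams, bigrams):
    words = ["<s>"] + words
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        # P(w2|w1) = count(w1 w2) / count(w1)
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

corpus = [["time", "flies", "like", "an", "arrow"],
          ["fruit", "flies", "like", "a", "banana"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_prob(["time", "flies", "like", "a", "banana"], uni, bi))  # 0.25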

Page 24: Crash-course in Natural Language Processing

Spelling Correction

Problem: given an out-of-dictionary word, return a list of the most probable in-dictionary corrections.

http://norvig.com/spell-correct.html

Page 25: Crash-course in Natural Language Processing

Edit Distance

Minimum edit (Levenshtein) distance: the minimum number of insertions/deletions/substitutions needed to transform string A into B.

Other distance metrics:
* the Damerau-Levenshtein distance adds another operation: transposition
* the longest common subsequence (LCS) metric allows only insertion and deletion, not substitution
* the Hamming distance allows only substitution, hence it only applies to strings of the same length

Page 26: Crash-course in Natural Language Processing

Dynamic Programming

Initialization:
D(i,0) = i
D(0,j) = j

Recurrence relation:
for each i = 1..M
  for each j = 1..N
    D(i,j) = D(i-1,j-1), if X(i) = Y(j)
    otherwise:
    D(i,j) = min( D(i-1,j) + w_del(Y(j)),
                  D(i,j-1) + w_ins(X(i)),
                  D(i-1,j-1) + w_subst(X(i),Y(j)) )
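A direct Python sketch of this recurrence with unit costs (i.e. plain Levenshtein distance); a weighted version would only change the cost constants:

def edit_distance(x, y):
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                             # D(i,0) = i
    for j in range(n + 1):
        d[0][j] = j                             # D(0,j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]       # match, no cost
            else:
                d[i][j] = 1 + min(d[i - 1][j],      # deletion
                                  d[i][j - 1],      # insertion
                                  d[i - 1][j - 1])  # substitution
    return d[m][n]

print(edit_distance("intention", "execution"))  # 5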

Page 27: Crash-course in Natural Language Processing

Noisy Channel ModelGiven an alphabet A, let A* be the set of all finite strings over A. Let the dictionary D of valid words be some subset of A*.

The noisy channel is the matrix G = P(s|w) where w in D is the intended word and s in A* is the scrambled word that was actually received.

P(s|w) = prod(P(x(i)|y(i)))
  for x(i) in s* (s aligned with w)
  for y(i) in w* (w aligned with s)
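Tying this back to the spelling-correction problem above, here is a condensed sketch in the spirit of the Norvig corrector linked earlier. It is not the slides' code: the channel model P(s|w) is crudely approximated by preferring candidates at edit distance 1 over distance 2, and P(w) comes from counts over a stand-in corpus.

import re
from collections import Counter

TEXT = "the quick brown fox jumps over the lazy dog " * 3   # stand-in corpus
WORDS = Counter(re.findall(r"\w+", TEXT.lower()))            # unigram counts ~ P(w)

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)   # argmax P(w) over the candidates

print(correct("teh"))  # 'the'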

Page 28: Crash-course in Natural Language Processing

Machine Learning Approach

Page 29: Crash-course in Natural Language Processing

Spam Filtering

A 2-class classification problem with a bias towards minimizing false positives.

Default approach: rule-based (SpamAssassin)

Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of complex features

Page 30: Crash-course in Natural Language Processing

Bag-of-words Models

* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant

Pros:
* simple
* fast
* scalable

Limitations:
* independence assumption doesn't hold

Initial results: recall 92%, precision 98.84%
Improved results: recall 99.5%, precision 99.97%

http://www.paulgraham.com/spam.html

Page 31: Crash-course in Natural Language Processing

Naive Bayes Classifier

P(Y|X) = P(Y) * P(X|Y) / P(X)

select Y = argmax P(Y|X)

Naive step:
P(Y|X) ∝ P(Y) * prod(P(x|Y)) for all x in X

(P(X) can be dropped because it's the same for all Y)
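A minimal multinomial Naive Bayes sketch over bag-of-words features, with hand-made toy data and add-one smoothing (my illustration, not a production spam filter):

import math
from collections import Counter, defaultdict

def train(docs):   # docs: list of (list_of_words, label)
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        # log P(Y) + sum over x of log P(x|Y), with add-one smoothing
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("buy cheap pills now".split(), "spam"),
        ("limited offer buy now".split(), "spam"),
        ("meeting schedule for monday".split(), "ham"),
        ("project status and schedule".split(), "ham")]
model = train(docs)
print(classify("cheap pills offer".split(), *model))  # spam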

Page 32: Crash-course in Natural Language Processing

Dependency Parsing

nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)

https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

Page 33: Crash-course in Natural Language Processing

Shift-reduce Parsing

Page 34: Crash-course in Natural Language Processing

Shift-reduce Parsing

Page 35: Crash-course in Natural Language Processing

ML-based ParsingThe parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input.

SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

def parse(words, tags):
    n = len(words)
    deps = init_deps(n)
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps

Page 36: Crash-course in Natural Language Processing

Averaged Perceptron

import random

def train(model, number_iter, examples):
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # perceptron update: promote the true tag, demote the guess
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)
    # (the averaging of weights over iterations is omitted in this excerpt)

Page 37: Crash-course in Natural Language Processing

Features

* Word and tag unigrams, bigrams, trigrams
* The first three words of the buffer
* The top three words of the stack
* The two leftmost children of the top of the stack
* The two rightmost children of the top of the stack
* The two leftmost children of the first word in the buffer
* Distance between the top of the buffer and the stack

Page 38: Crash-course in Natural Language Processing

Discriminative ML Models

Linear:
* (Averaged) Perceptron
* Maximum Entropy / Log-linear / Logistic Regression; Conditional Random Fields
* SVM

Non-linear:
* Decision Trees, Random Forests
* Other ensemble classifiers
* Neural networks

Page 39: Crash-course in Natural Language Processing

Semantics

Question: how to model relationships between words?

Page 40: Crash-course in Natural Language Processing

Semantics

Question: how to model relationships between words?
Answer: build a graph

Wordnet, Freebase, DBPedia

Page 41: Crash-course in Natural Language Processing

Word Similarity

Next question: now, how do we measure those relations?

Page 42: Crash-course in Natural Language Processing

Word Similarity

Next question: now, how do we measure those relations?

* different Wordnet similarity measures

Page 43: Crash-course in Natural Language Processing

Word Similarity

Next question: now, how do we measure those relations?

* different Wordnet similarity measures

* PMI(x,y) = log(p(x,y) / (p(x) * p(y))) (see the sketch below)
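A small sketch of estimating PMI from sentence-level co-occurrence counts (toy corpus; real estimates need far more data):

import math

def pmi(x, y, sentences):
    n = len(sentences)
    cx = sum(x in s for s in sentences)
    cy = sum(y in s for s in sentences)
    cxy = sum(x in s and y in s for s in sentences)
    return math.log2((cxy / n) / ((cx / n) * (cy / n)))

sentences = [{"fruit", "flies", "like", "banana"},
             {"time", "flies", "like", "arrow"},
             {"banana", "split", "recipe"},
             {"time", "management"}]
print(pmi("fruit", "banana", sentences))  # 1.0: they co-occur more often than chance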

Page 44: Crash-course in Natural Language Processing

Distributional Semantics

Distributional hypothesis:
"You shall know a word by the company it keeps"
-- John Rupert Firth

Word representations:
* Explicit representation (sparse co-occurrence counts; see the sketch below). Number of nonzero dimensions: max: 474234, min: 3, mean: 1595, median: 415
* Dense representation (word2vec, GloVe)
* Hierarchical representation (Brown clustering)
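To make the explicit representation concrete, here is a sketch that builds sparse co-occurrence vectors over a toy corpus and compares them with cosine similarity (illustration only; real explicit vectors have the huge dimensionality quoted above):

import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    vectors = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(u, v):
    def norm(c):
        return math.sqrt(sum(x * x for x in c.values()))
    dot = sum(u[k] * v[k] for k in u if k in v)
    return dot / (norm(u) * norm(v)) if u and v else 0.0

sents = [["i", "drink", "strong", "coffee"],
         ["i", "drink", "strong", "tea"],
         ["i", "write", "lisp", "code"]]
vecs = cooccurrence_vectors(sents)
print(cosine(vecs["coffee"], vecs["tea"]))   # 1.0: identical contexts in this toy corpus
print(cosine(vecs["coffee"], vecs["code"]))  # 0.0: no shared contexts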

Page 45: Crash-course in Natural Language Processing

Steps to Develop an NLP System

* Translate real-world requirements into a measurable goal
* Find a suitable level and representation
* Find initial data for experiments
* Find and utilize existing tools and frameworks where possible
* Don't trust research results
* Set up and perform a proper experiment (or a series of experiments)

Page 46: Crash-course in Natural Language Processing

Going into Prod

* NLP tasks are usually CPU-intensive but stateless
* General-purpose NLP frameworks are (mostly) not production-ready
* Value pre- and post-processing
* Gather user feedback

Page 47: Crash-course in Natural Language Processing

Final Words

We have discussed:
* the linguistic basis of NLP - although some people manage to do NLP without it: http://arxiv.org/pdf/1103.0398.pdf
* rule-based & statistical/ML approaches
* different concrete tasks

We haven't covered:
* all the different tasks, such as MT, question answering, etc. (but they use the same techniques)
* deep learning for NLP
* natural language understanding (which remains an unsolved problem)