
Page 1

I256 Applied Natural Language Processing
Fall 2009
Lecture 8

• Words
  – Lexical acquisition
  – Collocations
  – Similarity
  – Selectional preferences

Barbara Rosario

Page 2: Lexical acquisition

• Develop algorithms and statistical techniques for filling the holes in existing dictionaries and lexical resources by looking at the occurrences of patterns of words in large text corpora
  – Collocations
  – Semantic similarity
  – Logical metonymy
  – Selectional preferences

Page 3: The limits of hand-encoded lexical resources

• Manual construction of lexical resources is very costly

• Because language keeps changing, these resources have to be continuously updated

• Quantitative information (e.g., frequencies, counts) has to be computed automatically anyway

Page 4: The coverage problem

From CS 224N / Ling 280, Stanford, Manning

Page 5: Lexical acquisition

• Examples:
  – “insulin” and “progesterone” are in WordNet 2.1, but “leptin” and “pregnenolone” are not.
  – “HTML” and “SGML”, but not “XML” or “XHTML”.
  – “Google” and “Yahoo”, but not “Microsoft” or “IBM”.

• We need some notion of word similarity to know where to locate a new word in a lexical resource

Page 6: Lexical acquisition

• Lexical acquisition problems
  – Collocations
  – Semantic similarity
  – Logical metonymy
  – Selectional preferences

Page 7: Collocations

• A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things
  – Noun phrases: weapons of mass destruction, stiff breeze (but why not *stiff wind?)
  – Verbal phrases: to make up
  – Not necessarily contiguous: knock … door

• Limited compositionality
  – An expression is compositional if its meaning can be predicted from the meaning of its parts
  – Idioms are the most extreme examples of non-compositionality
    • Kick the bucket
  – In collocations there is an element of meaning added to the combination (i.e., the exact meaning cannot be derived directly from its components)
    • White hair, white wine, white woman

Page 8: Collocations

• Non-substitutability
  – Cannot substitute words in a collocation
    • *yellow wine

• Non-modifiability
  – To get a frog in one’s throat
    • *To get an ugly frog in one’s throat

• Useful for
  – Language generation
    • *Powerful tea, *take a decision
  – Machine translation
    • An easy way to test whether a combination is a collocation is to translate it into another language
      – Make a decision: *faire une décision (prendre), *fare una decisione (prendere)

Page 9: Subclasses of collocations

• Light verbs
  – Make a decision, do a favor

• Phrasal verbs
  – To tell off, make up

• Proper names
  – San Francisco, New York

• Terminological expressions
  – Hydraulic oil filter
    • This is compositional, but we need to make sure, for example, that it is always translated the same way

Page 10: Finding collocations

• Frequency
  – If two words occur together a lot, that may be evidence that they have a special function
  – But if we sort by frequency of pairs C(w1, w2), then “of the” is the most frequent pair
  – Filter by POS patterns: A N (linear function), N N (regression coefficients), etc. (see the sketch after this slide)

• Mean and variance of the distance between the words
  – For non-contiguous collocations
    • She knocked at his door (d = 2)
    • A man knocked on the metal front door (d = 4)

• Hypothesis testing (see page 162, Stat NLP)
  – How do we know it’s really a collocation?
  – A low mean distance can be accidental (new company)
  – We need to know whether two words occur together by chance or not (because they are a collocation)
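A minimal sketch of the frequency-plus-POS-filter idea using NLTK; the Brown corpus and the JJ/NN tag prefixes are illustrative choices, not part of the original slide:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Count bigrams over POS-tagged tokens so we can filter by tag pattern
# (assumes the 'brown' corpus has been downloaded).
tagged = nltk.corpus.brown.tagged_words()            # (word, tag) pairs
finder = BigramCollocationFinder.from_words(tagged)
finder.apply_freq_filter(5)                          # drop rare pairs

# Keep only A N and N N patterns; everything else (e.g. "of the") is removed.
def bad_pattern(w1, w2):
    return not (w1[1].startswith(('JJ', 'NN')) and w2[1].startswith('NN'))

finder.apply_ngram_filter(bad_pattern)
measures = BigramAssocMeasures()
print(finder.nbest(measures.raw_freq, 10))
```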

Page 11: Finding collocations

• Mutual information measure
  – A measure of how much one word tells us about the other, i.e., the reduction in uncertainty about one word due to knowing the other
  – 0 when the two words are independent
  – (see Stat NLP, pages 66 and 178)

I(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]
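A short hedged sketch of ranking bigrams by this measure (pointwise mutual information) with NLTK; the frequency filter is there because PMI overweights rare pairs:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = [w.lower() for w in nltk.corpus.brown.words()]  # illustrative corpus
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(10)    # PMI overweights rare pairs, so filter first
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 10))
```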

Page 12: Lexical acquisition

• Lexical acquisition problems
  – Collocations
  – Semantic similarity
  – Logical metonymy
  – Selectional preferences

Page 13: Lexical and semantic similarity

• Lexical and distributional notions of meaning similarity
• How can we work out how similar in meaning words are?
• What is it useful for?
  – IR
  – Generalization
    • Semantically similar words behave similarly
  – QA, inference, …

• We could use anything in the thesaurus
  – Meronymy
  – Example sentences/definitions
  – In practice, by “thesaurus-based” we usually just mean using the is-a/subsumption/hypernym hierarchy

• Word similarity versus word relatedness
  – Similar words are near-synonyms
  – Related words could be related in any way
    • Car, gasoline: related, not similar
    • Doctor, nurse, fever: related (topic)
    • Car, bicycle: similar

Page 14: Semantic similarity

• Similar if contextually interchangeable
  – The degree to which one word can be substituted for another in a given context
    • Suit is similar to litigation (but only in the legal context)

• Measures of similarity
  – WordNet-based
  – Vector-based
  – Detecting hyponymy and other relations

Page 15: WordNet: Semantic Similarity

• Whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:
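A small sketch of that lookup with NLTK’s WordNet interface (the synset names below are the usual ones, but verify with wn.synsets if your WordNet version differs):

```python
from nltk.corpus import wordnet as wn

# Deeper synsets are more specific; 'entity' is the (near-)root.
for name in ['entity.n.01', 'vertebrate.n.01', 'whale.n.02', 'baleen_whale.n.01']:
    print(name, wn.synset(name).min_depth())
```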

Page 16: WordNet: Semantic Similarity

• path_similarity: two words are similar if they are nearby in the thesaurus hierarchy (i.e., there is a short path between them)
  – path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy

• The numbers don’t mean much in themselves, but they decrease as we move away from the semantic space of sea creatures to inanimate objects.
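A hedged sketch, using the sea-creature synsets the NLTK book uses for this same point (exact scores depend on the WordNet version):

```python
from nltk.corpus import wordnet as wn

right_whale = wn.synset('right_whale.n.01')
for name in ['orca.n.01', 'tortoise.n.01', 'novel.n.01']:
    other = wn.synset(name)
    # Scores shrink as we move from sea creatures to inanimate objects.
    print(name, right_whale.path_similarity(other))
```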

Page 17: WordNet: Path Similarity

From CS 224N / Ling 280, Stanford, Manning

Page 18: WordNet: Path Similarity

• Problems with path similarity
  – It assumes each link represents a uniform distance
  – Instead, we want a metric that lets us represent the cost of each edge independently
  – There has been a whole slew of methods that augment the thesaurus with notions from a corpus (Resnik, Lin, …)

From CS 224N / Ling 280, Stanford, Manning
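For example, NLTK exposes Resnik- and Lin-style corpus-augmented measures; a sketch, assuming the 'wordnet_ic' data package (which provides 'ic-brown.dat') is installed:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # corpus-derived information content
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print(dog.res_similarity(cat, brown_ic))  # Resnik: IC of the lowest common subsumer
print(dog.lin_similarity(cat, brown_ic))  # Lin: normalized variant in [0, 1]
```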

Page 19: Vector-based lexical semantics

• Very old idea: the meaning of a word can be specified in terms of the values of certain ‘features’ (COMPONENTIAL SEMANTICS)
  – dog: ANIMATE = +, EAT = MEAT, SOCIAL = +
  – horse: ANIMATE = +, EAT = GRASS, SOCIAL = +
  – cat: ANIMATE = +, EAT = MEAT, SOCIAL = −

• Similarity / relatedness: proximity in feature space

From CS 224N / Ling 280, Stanford, Manning

Page 20: Vector-based lexical semantics

From CS 224N / Ling 280, Stanford, Manning

Page 21: General characterization of vector-based semantics

• Vectors as models of concepts
• The CLUSTERING approach to lexical semantics:
  1. Define the properties one cares about, and give values to each property (generally numerical)
  2. Create a vector of length n for each item to be classified
  3. Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another

• What changes between models:
  1. The properties used in the vector
  2. The distance metric used to decide if two points are ‘close’
  3. The algorithm used to cluster

From CS 224N / Ling 280, Stanford, Manning

Page 22: Distributional similarity: using words as features in a vector-based semantics

• The old decompositional semantic approach requires
  – i. Specifying the features
  – ii. Characterizing the value of these features for each lexeme

• Simpler approach: use as features the WORDS that occur in the proximity of that word / lexical entry
  – Intuition: “You shall know a word by the company it keeps.” (J. R. Firth)

• More specifically, you can use as ‘values’ of these features
  – The FREQUENCIES with which these words occur near the words whose meaning we are defining
  – Or perhaps the PROBABILITIES that these words occur next to each other

• Some psychological results support this view.

From CS 224N / Ling 280, Stanford, Manning

Page 23: Using neighboring words to specify the meaning of words

• Take, e.g., the following corpus:
  – John ate a banana.
  – John ate an apple.
  – John drove a lorry.

• We can extract the following co-occurrence matrix (sketched below):
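The matrix itself appears to have been an image lost from this transcript; a minimal sketch that reconstructs sentence-level co-occurrence counts for this toy corpus:

```python
from collections import Counter
from itertools import permutations

corpus = ["John ate a banana", "John ate an apple", "John drove a lorry"]
cooc = Counter()
for sent in corpus:
    words = sent.lower().split()
    for w, v in permutations(words, 2):   # every ordered pair in a sentence
        cooc[(w, v)] += 1

# 'banana' and 'apple' share contexts (john, ate), so their rows look alike;
# 'lorry' differs on the verb dimension (drove vs. ate).
print(cooc[('banana', 'ate')], cooc[('apple', 'ate')], cooc[('lorry', 'drove')])
```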

Page 24: Acquiring lexical vectors from a corpus

• To construct a vector C(w) for each word w:
  1. Scan a text
  2. Whenever a word w is encountered, increment all cells of C(w) corresponding to the words v that occur in the vicinity of w, typically within a window of fixed size

• Differences among methods (see the sketch after this slide):
  – Size of the window
  – Weighted or not
  – Whether every word in the vocabulary counts as a dimension (including function words such as the or and) or whether instead only some specially chosen words are used (typically the m most common content words in the corpus, or perhaps modifiers only)
  – The words chosen as dimensions are often called CONTEXT WORDS
  – Whether dimensionality reduction methods are applied

From CS 224N / Ling 280, Stanford, Manning
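A sketch of that counting loop under one set of choices (symmetric window of 2, every word a dimension, no weighting); all the knobs listed above are assumptions here:

```python
from collections import defaultdict, Counter

def lexical_vectors(tokens, window=2):
    vectors = defaultdict(Counter)        # C(w): context counts per word
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for v in tokens[lo:i] + tokens[i + 1:hi]:
            vectors[w][v] += 1            # increment cell of C(w) for neighbor v
    return vectors

vecs = lexical_vectors("john ate a banana john ate an apple".split())
print(vecs["ate"].most_common())
```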

Page 25: Variant: using only modifiers to specify the meaning of words

From CS 224N / Ling 280, Stanford, Manning

Page 26: The CLUSTERING approach to lexical semantics

– Create a vector of length n for each item to be classified
  • Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another
– Define a similarity measure (the distance metric used to decide if two points are ‘close’)
  • For example, cosine similarity (the slide’s own example is not in this transcript; see the sketch after this slide)
– (Eventually) a clustering algorithm

From CS 224N / Ling 280, Stanford, Manning
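A minimal cosine-similarity sketch; cosine is the measure the HAL slide below uses, though whether it was this slide’s lost example is an assumption:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two count vectors; 1.0 = same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine([2, 1, 1, 0], [2, 1, 0, 1]))  # similar but not identical contexts
```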

Page 27: The HAL model

• Burgess and Lund (1995, 1998)
  – A 160-million-word corpus of articles extracted from all newsgroups containing English dialogue
  – Context words: the 70,000 most frequently occurring symbols within the corpus
  – Window size: 10 words to the left and the right of the word
  – Measure of similarity: cosine
  – Nearest neighbors found:
    • Frightened: scared, upset, shy, embarrassed, anxious, worried, afraid
    • Harmed: abused, forced, treated, discriminated, allowed, attracted, taught
    • Beatles: original, band, song, movie, album, songs, lyrics, British

From CS 224N / Ling 280, Stanford, Manning

Page 28: Latent Semantic Analysis

• Landauer et al. (1997, 1998)
  – Goal: extract expected contextual usage from passages
  – Steps (sketched after this slide):
    • Build a word/document co-occurrence matrix
    • ‘Weight’ each cell (e.g., tf.idf)
    • Perform a DIMENSIONALITY REDUCTION
  – Argued to correlate well with humans on a number of tests

From CS 224N / Ling 280, Stanford, Manning
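A hedged sketch of those three steps with scikit-learn (not Landauer’s implementation; the toy passages are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["john ate a banana", "john ate an apple", "john drove a lorry"]
X = TfidfVectorizer().fit_transform(docs)  # weighted word/document matrix
lsa = TruncatedSVD(n_components=2).fit(X)  # the dimensionality-reduction step
print(lsa.transform(X))                    # passages in the latent space
```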

Page 29: Detecting hyponymy and other relations with patterns

• Goal: discover new hyponyms, and add them to a taxonomy under the appropriate hypernym
  – “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
  – What does Gelidium mean? How do you know?

Page 30: Hearst approach

• Hearst hand-built lexical patterns (shown in a figure not included in this transcript; the best known is “NP_hypernym such as NP_hyponym”):

From CS 224N / Ling 280, Stanford, Manning
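A rough sketch of applying that one pattern to the Gelidium sentence from the previous slide; the regex is illustrative, not Hearst’s actual implementation:

```python
import re

# "X, such as Y" suggests Y is a hyponym of X; keep up to two words of X.
pattern = re.compile(r'(\w+(?: \w+)?), such as (\w+)')
text = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use.")
for hypernym, hyponym in pattern.findall(text):
    print(f"{hyponym} is a kind of {hypernym}")  # Gelidium is a kind of red algae
```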

Page 31: Trained algorithm to discover patterns

• Snow, Jurafsky, Ng (2005)
• Collect noun pairs from corpora
  – (752,311 pairs from 6 million words of newswire)
• Identify each pair as a positive or negative example of the hypernym/hyponym relationship
  – (14,387 yes, 737,924 no)
• Parse the sentences, extract patterns (lexical patterns and parse paths)
• Train a hypernym classifier on these patterns (sketched after this slide)

From CS 224N / Ling 280, Stanford, Manning
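A hedged sketch of that last step; the feature names and the choice of logistic regression are assumptions for illustration, not Snow et al.’s exact setup:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each noun pair is a bag of the patterns it was seen in, labeled hypernym-or-not.
pairs = [({"such_as": 3, "and_other": 1}, 1),  # hypothetical positive pair
         ({"including": 2}, 1),
         ({"verb_object": 5}, 0)]              # hypothetical non-hypernym pair
vec = DictVectorizer()
X = vec.fit_transform(features for features, _ in pairs)
y = [label for _, label in pairs]
clf = LogisticRegression().fit(X, y)
print(clf.predict(vec.transform([{"such_as": 1}])))  # classify an unseen pair
```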

Page 32

From CS 224N / Ling 280, Stanford, Manning

Page 33: Evaluation: precision and recall

• Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness.

• Used in information retrieval
  – A perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved), whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).
  – Precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved.
  – Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).
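A worked sketch of those two definitions on made-up retrieval results:

```python
retrieved = {"d1", "d2", "d3", "d4"}  # what the search returned
relevant  = {"d1", "d2", "d7"}        # everything that should have been returned

true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved)  # 2 / 4 = 0.50
recall    = true_positives / len(relevant)   # 2 / 3 ≈ 0.67
print(precision, recall)
```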

Page 34: Evaluation: precision and recall

• Classification context
  – A perfect precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly).
  – A perfect recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C).

Page 35: Precision and recall: trade-off

• Often there is an inverse relationship between precision and recall: it is possible to increase one at the cost of reducing the other.

• For example, a search engine can increase its recall by retrieving more documents, at the cost of an increasing number of irrelevant documents retrieved (decreasing precision).

• Similarly, a classification system for deciding whether or not, say, a fruit is an orange can achieve high precision by only classifying fruits with exactly the right shape and color as oranges, but at the cost of low recall, due to the number of false negatives from oranges that did not quite match the specification.

Page 36

From CS 224N / Ling 280, Stanford, Manning  

Page 37: Lexical acquisition

• Lexical acquisition problems
  – Collocations
  – Semantic similarity
  – Logical metonymy
  – Selectional preferences

Page 38: Other lexical semantics tasks

• Metonymy is a figure of speech in which a thing or concept is not called by its own name, but by the name of something intimately associated with that thing or concept.
  – Example: “The White House” for the administration.

• Logical metonymy
  – enjoy the book means enjoy reading the book, and easy problem means a problem that is easy to solve.

Page 39: Other lexical semantics tasks

From CS 224N / Ling 280, Stanford, Manning

Page 40

From CS 224N / Ling 280, Stanford, Manning

Page 41

From CS 224N / Ling 280, Stanford, Manning

Page 42: Lexical acquisition

• Lexical acquisition problems
  – Collocations
  – Semantic similarity
  – Logical metonymy
  – Selectional preferences

Page 43: Selectional preferences

• Most verbs prefer arguments of a particular type: selectional preferences or restrictions
  – Objects of eat tend to be food, subjects of think tend to be people, etc.
  – “Preferences”, to allow for metaphors
    • Fear eats the soul

• Why is it important for NLP?

Page 44: Selectional preferences

• Why important?
  – To infer meaning from selectional restrictions
    • Suppose we don’t know the word durian (it is not in the vocabulary)
    • Susan ate a very fresh durian
    • Infer that durian is a type of food
  – To rank the possible parses of a sentence
    • Give higher scores to parses where the verb has “natural” arguments

Page 45: Model of selectional preferences

• Resnik, 1993 (see page 288, Stat NLP)
• Two main concepts (sketched below):
  1. Selectional preference strength
     – How strongly the verb constrains its direct object
       • Eat, find, see
  2. Selectional association between the verb and the object semantic class
     • Eat and food

• The higher 1 and 2, the less important it is to state the object explicitly (i.e., the more likely the implicit-object construction is)
  • Bo ate, but *Bo saw
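A hedged sketch of the two quantities as defined in Resnik’s model (preference strength as the KL divergence between P(c | v) and P(c); association as a class’s share of that strength). The class probabilities below are invented for illustration:

```python
import math

p_class = {"food": 0.2, "person": 0.3, "artifact": 0.5}              # prior P(c)
p_class_given_eat = {"food": 0.9, "person": 0.05, "artifact": 0.05}  # P(c | eat)

# 1. Selectional preference strength S(eat) = D( P(c|eat) || P(c) )
strength = sum(p * math.log2(p / p_class[c])
               for c, p in p_class_given_eat.items())

# 2. Selectional association A(eat, food): food's share of that strength
assoc_food = (p_class_given_eat["food"]
              * math.log2(p_class_given_eat["food"] / p_class["food"])) / strength
print(strength, assoc_food)
```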

Page 46: Next class

• Next time: review

• Classification

• Project ideas (likely on October 6)

• Two more assignments (most likely)

• Project proposals (1–2 page description)

• Projects