Overview of Peter D. Turney’s Work on Similarity From 2001-2008.
Similarity
Attributional similarity (2001-2003): the degree to which two words are synonymous; also known as semantic relatedness and semantic association.
Relational similarity (2005-2008): the degree to which two relations are analogous.
Objective evaluation of the approaches:
Attributional similarity: 80 TOEFL synonym questions
Relational similarity: 374 SAT analogy questions
2001: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning, pages 491–502, Springer, Berlin, 2001.
1 Introduction
Synonym recognition: given a word and a set of candidate words, pick the candidate whose meaning is closest to the given word. Core idea: co-occurrence: "a word is characterized by the company it keeps".
1 Introduction: idea
Given a word problem and a set of candidates {choice1, choice2, ..., choicen}, compute score(choicei) for each candidate; the highest-scoring candidate is taken as the synonym. PMI-IR uses Pointwise Mutual Information (PMI) to analyze statistical data collected by Information Retrieval (IR):

score(choicei) = log2 [ p(problem & choicei) / (p(problem) p(choicei)) ]
2 Formula
Score 1: score1(choicei) = hits(problem AND choicei) / hits(choicei)
Score 2 (NEAR means within ten words): score2(choicei) = hits(problem NEAR choicei) / hits(choicei)
2 Formula
Score 3 (to avoid antonyms such as big vs. small):
score3(choicei) = hits((problem NEAR choicei) AND NOT ((problem OR choicei) NEAR "not")) / hits(choicei AND NOT (choicei NEAR "not"))
Score 4 (introduces context; only one context word is chosen, to keep the sample count up):
score4(choicei) = hits((problem NEAR choicei) AND context AND NOT ((problem OR choicei) NEAR "not")) / hits(choicei AND context AND NOT (choicei NEAR "not"))
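The hit-count scores above can be sketched as follows. The hits table is a stand-in for a search engine that supports AND and NEAR operators (as AltaVista did at the time); the counts and query-string format here are made up for illustration, not from the paper.

```python
# Hypothetical hit counts standing in for search-engine queries.
# NEAR = the two words occur within ten words of each other.
HITS = {
    "big AND large": 800.0,
    "big NEAR large": 500.0,
    "large": 10000.0,
}

def score1(problem, choice, hits):
    # score1 = hits(problem AND choice) / hits(choice)
    return hits[f"{problem} AND {choice}"] / hits[choice]

def score2(problem, choice, hits):
    # score2 = hits(problem NEAR choice) / hits(choice)
    return hits[f"{problem} NEAR {choice}"] / hits[choice]
```

Score 3 and score 4 have the same shape; the AND NOT "not" clauses and the context word are simply folded into both query strings.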
3 Experiments
Comparison with LSA (Latent Semantic Analysis): an initial matrix X of 61,000 * 30,473 is built from an encyclopedia, with whole documents as the document fragments; SVD compresses the dimensionality; the elements are tf-idf weights; similarity is cosine. Human baseline: students' TOEFL scores.
Dataset: 80 TOEFL questions and 50 ESL test questions.
3 Experiments: PMI-IR vs. LSA
Time efficiency: PMI-IR is simple and fast: 2 s/query * 8 queries, with almost all of the time spent on network interaction; run in parallel, about 2 s total. LSA is slow: compressing the 61,000 * 30,473 matrix to 61,000 * 300 takes about three hours on a UNIX workstation.
3 Experiments
On the 80 TOEFL questions and 50 ESL questions: PMI-IR 73.75% (59/80) and 74% (37/50); foreign students 64.5% (51.6/80); LSA 64.4% (51.5/80).
Performance: PMI-IR wins by about 10%. Reasons: the NEAR operator and a smaller chunk size. LSA 64.4%; PMI-IR with AND 62.5%; PMI-IR with NEAR 72.5%.
4 Conclusion
Combining PMI and IR uses co-occurrence to measure the degree of relatedness between words. The PMI statistics are collected by sending queries to a search engine, which alleviates the data-sparseness problem.
2003: Combining Independent Modules in Lexical Multiple-Choice Problems. In RANLP-03, pages 482–489, Borovets, Bulgaria. (RANLP: Recent Advances in Natural Language Processing.)
1 Introduction
There are several approaches to natural language problems, and no single one is best for all problem instances. How about combining them?
1 Introduction
Two main contributions: it introduces and evaluates several new modules for answering multiple-choice synonym questions and analogy questions, and it presents a novel product merging rule, comparing it with two other similar merging rules.
2 Merging rules: the parameters
p_ij^h >= 0 is the probability assigned by the i-th module (1 <= i <= n) to choice j (1 <= j <= k) of instance h (1 <= h <= m). D_j^{h,w} is the probability assigned by the merging rule to choice j of training instance h when the weights are set to w. a(h), with 1 <= a(h) <= k, is the correct answer for instance h. Training picks the weights that maximize the probability of the correct answers:

w' = argmax_w prod_h D_{a(h)}^{h,w}
2 Merging rules: old
Mixture rule (very common): M_j^{h,w} = sum_i w_i p_ij^h, normalized over the k choices:
D_j^{h,w} = M_j^{h,w} / sum_{j=1..k} M_j^{h,w}
Logarithmic rule: L_j^{h,w} = exp(sum_i w_i ln p_ij^h) = prod_i (p_ij^h)^{w_i}, normalized the same way:
D_j^{h,w} = L_j^{h,w} / sum_{j=1..k} L_j^{h,w}
2 Merging rules: novel
Product rule: P_j^{h,w} = prod_i (w_i p_ij^h + (1 - w_i) / k), normalized:
D_j^{h,w} = P_j^{h,w} / sum_{j=1..k} P_j^{h,w}
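A minimal sketch of the three merging rules, assuming p[i][j] holds module i's probability for choice j and w[i] is the per-module weight (the argument names are mine, not from the paper):

```python
import math

def mixture(w, p):
    # M_j = sum_i w_i * p_ij, then normalize over the choices j.
    k = len(p[0])
    raw = [sum(wi * pi[j] for wi, pi in zip(w, p)) for j in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def logarithmic(w, p):
    # L_j = exp(sum_i w_i * ln p_ij) = prod_i p_ij^{w_i}, normalized.
    k = len(p[0])
    raw = [math.exp(sum(wi * math.log(pi[j]) for wi, pi in zip(w, p)))
           for j in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def product(w, p):
    # P_j = prod_i (w_i * p_ij + (1 - w_i) / k): each module's opinion is
    # blended with the uniform distribution before multiplying.
    k = len(p[0])
    raw = [1.0] * k
    for wi, pi in zip(w, p):
        for j in range(k):
            raw[j] *= wi * pi[j] + (1.0 - wi) / k
    total = sum(raw)
    return [r / total for r in raw]
```

Note how the product rule with w_i = 0 ignores module i entirely (its factor becomes uniform), while the logarithmic rule with all w_i = 1 reduces to a plain product of the module probabilities.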
3 Synonym: dataset
A training set of 431 4-choice synonym questions, randomly divided into 331 training questions and 100 testing questions. The weights w are optimized on the training set.
3 Synonym: modules
LSA; PMI-IR; Thesaurus, which queries Wordsmyth (www.wordsmyth.net), creates synonym lists for both the stem and the choices, and scores them by their overlap; and Connector, which uses the summary pages from querying Google with a pair of words, taking a weighted sum of the number of times the words appear separated by one of the symbols [, ", :, ,, =, /, (, ], by "means", "defined", "equals", "synonym", or by whitespace, plus the number of times "dictionary" or "thesaurus" appears.
3 Synonym: combined results
The three rules' accuracies are nearly identical, but the product and logarithmic rules assign higher probabilities to the correct answers, as evidenced by the mean likelihood.
3 Synonym: compare with other approaches
4 Analogies: dataset
374 5-choice instances, randomly split into 274 training instances and 100 testing instances. E.g. cat:meow :: (a) mouse:scamper, (b) bird:peck, (c) dog:bark, (d) horse:groom, (e) lion:scratch.
4 Analogies: modules
Phrase vectors: create a vector r to represent the relationship between X and Y, using phrases built from 128 patterns such as "X for Y", "Y with X", "X in the Y", "Y on X"; query a search engine, record the number of hits, and measure similarity by cosine.
Thesaurus paths (WordNet): degree of similarity between paths.
4 Analogies: modules (continued)
Lexical relation modules: a set of more specific modules using WordNet; 9 modules, each checking one relationship: Synonym, Antonym, Hypernym, Hyponym, Meronym:substance, Meronym:part, Meronym:member, Holonym:substance, Holonym:member. Each checks the stem first, then the choices.
Similarity modules make use of definitions: Similarity:dict uses dictionary.com and Similarity:wordsmyth uses wordsmyth.net. Given A:B::C:D, similarity = sim(A, C) + sim(B, D).
5 Conclusion
Applied three trained merging rules to TOEFL questions; accuracy: 97.5%.
Provided the first results on a challenging analogy task, with a set of novel modules that use both lexical databases and statistical information; accuracy: 45%.
The popular mixture rule was consistently weaker than the logarithmic and product rules at assigning high probabilities to correct answers.
State of the art (accuracy)
Synonym questions: LSA 64.4%, HUMAN 64.5%, PMI-IR (2001) 73.75%, HYBRID (2003) 97.5%
Analogies: HYBRID (2003) 45%, HUMAN 57%
2005: Corpus-based Learning of Analogies and Semantic Relations. In IJCAI 2005, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30-August 5, 2005.
1 Introduction
Verbal analogies (A:B :: C:D) are solved with the VSM; the novelty of the paper is the application of the VSM to measuring the similarity between relationships. Noun-modifier pair relations are classified with a supervised nearest neighbour algorithm. Dataset: Nastase and Szpakowicz (2003), 600 noun-modifier pairs.
1 Introduction: examples
Analogy; noun-modifier pair relations, e.g. "laser printer" (relation: instrument).
2 Solving Analogy Problems
Assign scores to candidate analogies A:B::C:D; for multiple-choice questions, guess the highest-scoring choice. The difficulty with sim(R1, R2) is that R1 and R2 are implicit; the paper attempts to learn R1 and R2 by unsupervised learning from a very large corpus.
2 Solving Analogy Problems: Vector Space Model
Create vectors r1 and r2 that represent features of R1 and R2; measure the similarity of R1 and R2 by the cosine of the angle θ between r1 and r2.
2 Solving Analogy Problems: simplified diagram
Generate a vector for each word pair A:B. The 64 joining terms (e.g. "X for Y", "Y with X", "X in the Y", "Y on X") yield phrases; search, record the hit counts, and take logs:
vector = [log(hits1), log(hits2), ..., log(hits128)]
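The diagram above can be sketched as follows. Here hit_count is a stand-in for a web-search query, and only 4 of the 128 joining phrases are shown, so treat the template strings and counts as illustrative only:

```python
import math

# A few of the joining terms; the paper uses 64 terms, each in both
# orders, giving 128 vector elements per word pair.
JOINING_TERMS = ["{X} for {Y}", "{Y} with {X}", "{X} in the {Y}", "{Y} on {X}"]

def pair_vector(x, y, hit_count):
    # One element per joining phrase: log of the (smoothed) hit count.
    phrases = [t.format(X=x, Y=y) for t in JOINING_TERMS]
    return [math.log(hit_count(p) + 1.0) for p in phrases]

def cosine(u, v):
    # Relational similarity of two pairs = cosine of their vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```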
2 Solving Analogy Problems: experiment
3 Noun-Modifier Semantic Relations
First attempt to classify semantic relations without a lexicon.
30 Semantic Relations of training data
3 Noun-Modifier Semantic Relations: algorithm
Supervised nearest-neighbour learning, where nearest neighbour = highest cosine: cosine(training pair, testing pair) over vectors of 128 elements, using the same joining terms as before.
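A sketch of the nearest-neighbour step, assuming each pair has already been mapped to its 128-element vector (the helper names are mine):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(test_vector, training):
    # training: list of (vector, relation_label) pairs; the test pair
    # gets the label of its single nearest (highest-cosine) neighbour.
    best_vec, best_label = max(training, key=lambda t: cosine(test_vector, t[0]))
    return best_label
```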
3 Noun-Modifier Semantic Relations: Experiment for the 30 Classes
30 Semantic Relations
F when precision and recall are balanced: 26.5%; F for random guessing: 3.3%. Much better than random guessing, but still much room for improvement. 30 classes is hard: there are too many possibilities for confusing classes. So try 5 classes instead, grouping classes together.
5 Semantic Relations

F for the 5 Classes
5 Semantic Relations
F when precision and recall are balanced: 43.2%; F for random guessing: 20.0%. Better than random guessing and better than the 30-class setting (26.5%), but there is still room for improvement.
Execution Time
The experiments presented here required 76,800 queries to AltaVista (600 word pairs * 128 queries per word pair). As a courtesy to AltaVista, a five-second delay was inserted between queries; processing the 76,800 queries took about five days.
Conclusion
The cosine metric in the VSM is used to solve analogies and to classify semantic relations. It performs much better than random guessing, but below human levels.
State of the art
Accuracy, Analogies: HYBRID (2003) 45%, VSM (2005) 47%, HUMAN 57%
F-measure, Noun-Modifier (5 classes): VSM (2005) 43.2%
2006a: Similarity of Semantic Relations. Computational Linguistics, 32(3):379–416.
1 Introduction
Latent Relational Analysis (LRA) extends the VSM approach of Turney and Littman (2005) in three ways: the connecting patterns are derived automatically from the corpus, instead of using a fixed set of patterns; Singular Value Decomposition (SVD) is used to smooth the frequency data; and automatically generated synonyms are used to explore variations of the word pairs.
2 A short description of LRA: simplified diagram
Generate a vector for each word pair A:B: expand the pair with synonyms (A':B, A:B'); build phrases from the 64 joining terms plus automatically derived patterns; search and record the hit counts; weight each element by entropy * log(hits); assemble the matrix; apply SVD; and score candidates by the average cosine.
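The weighting-plus-SVD stage can be sketched with NumPy. The log1p weighting below is a simplification of the paper's entropy-times-log weighting, and the matrix is a toy one, so this is a sketch of the idea rather than the paper's exact pipeline:

```python
import numpy as np

def lra_reduce(freq_matrix, k):
    # Weight the raw pair-by-pattern frequencies (the paper multiplies
    # an entropy factor by log; plain log1p is used here for brevity),
    # then keep the top-k singular components as smoothed row vectors.
    weighted = np.log1p(freq_matrix)
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]

def avg_cosine(a, b):
    # Relational similarity of two row vectors in the reduced space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Rows with similar pattern profiles stay close after the reduction, which is what makes the final average-cosine scoring meaningful.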
3 Experiment: Word Analogy Questions
Baseline LRA. Matrix: 17,232 * 8,000, density 5.8%. Time required: 209:49:36, about 9 days.
Experiment: Word Analogy Questions: LRA vs. VSM
Corpus size: AltaVista, 5*10^11 English words; WMTS, 5*10^10 English words.
Experiment: Word Analogy Questions: Varying the Parameters
Experiment: Word Analogy Questions: Ablation Experiments
No SVD: the drop is not significant, but might become significant with more word pairs. No synonyms: recall drops. Neither SVD nor synonyms: recall drops. VSM: the drop is significant.
Experiments with Noun-Modifier Relations
Dataset: 600 noun-modifier pairs, hand-labeled with 30 classes of semantic relations. Algorithm: baseline LRA with a single nearest neighbour, where LRA provides the distance (nearness) measure.
Discussion
For word analogy questions, performance is not yet adequate for practical application, and speed is a concern. For noun-modifier classification, more hand-labeled data would help but is expensive, and the choice of classification scheme for the semantic relations matters. A hybrid approach could combine the corpus-based approach of LRA with the lexicon-based approach of Veale (2004).
Conclusion of 2006a
LRA extends the VSM (2005): patterns are derived automatically; SVD is used to smooth and compress the data; and automatically generated synonyms are used to explore variations of the word pairs.
State of the art
Accuracy, Analogies: HYBRID (2003) 45%, VSM (2005) 47%, LRA (2006a) 56.8%, HUMAN 57%
F-measure, Noun-Modifier (5 classes): VSM (2005) 43.2%, LRA (2006a) 54.6%
2006b: Expressing Implicit Semantic Relations without Supervision. Coling/ACL-06.
Introduction
Hearst (1992): pattern → X:Y. The pattern "Y such as the X" can be used to mine large text corpora for hypernym-hyponym pairs: if a search using the pattern finds the string "bird such as the ostrich", we can infer that "ostrich" is a hyponym of "bird". Here we consider the inverse problem, X:Y → pattern: can we mine a large text corpus for patterns that express the implicit relations between X and Y?
Introduction
Discovering high-quality patterns: pertinence is the measure of quality; pertinent patterns are reliable for mining further word pairs with the same semantic relations.
2 Pertinence
The first formal measure of quality for text-mining patterns. Given a set of word pairs W = {X1:Y1, ..., Xn:Yn} and a set of patterns P = {P1, ..., Pm}, Pi is pertinent to Xj:Yj if word pairs Xk:Yk that are highly typical for Pi tend to be relationally similar to Xj:Yj. Pertinence tends to be highest for unambiguous patterns.

pertinence(Xj:Yj, Pi) = sum_{k=1..n} p(Xk:Yk | Pi) * sim_r(Xj:Yj, Xk:Yk)
2 Pertinence: computation
f_{k,i} is the number of occurrences in a corpus of the word pair Xk:Yk with the pattern Pi. A direct estimate would be

p(Xk:Yk | Pi) = p(Xk:Yk, Pi) / p(Pi) = f_{k,i} / sum_{j=1..n} f_{j,i}

Instead, estimate

p(Pi | Xk:Yk) = f_{k,i} / sum_{j=1..m} f_{k,j}

and apply Bayes' theorem:

p(Xk:Yk | Pi) = p(Xk:Yk) p(Pi | Xk:Yk) / sum_{j=1..n} p(Xj:Yj) p(Pi | Xj:Yj)

Smoothing: take p(Xj:Yj) = 1/n, so that

p(Xk:Yk | Pi) = p(Pi | Xk:Yk) / sum_{j=1..n} p(Pi | Xj:Yj)
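A small sketch of the smoothed estimate, assuming a frequency table f[k][i] (pair k observed with pattern i); the function names are mine:

```python
def cond_prob_pattern(f):
    # p(Pi | Xk:Yk) = f[k][i] / sum_j f[k][j]  (row-normalize).
    return [[fki / sum(row) for fki in row] for row in f]

def cond_prob_pair(f):
    # Smoothed p(Xk:Yk | Pi) = p(Pi | Xk:Yk) / sum_j p(Pi | Xj:Yj),
    # i.e. Bayes with a uniform prior p(Xj:Yj) = 1/n over word pairs.
    p = cond_prob_pattern(f)
    n, m = len(p), len(p[0])
    cols = [sum(p[k][i] for k in range(n)) for i in range(m)]
    return [[p[k][i] / cols[i] for i in range(m)] for k in range(n)]
```

Note that each column of the result sums to 1: for a fixed pattern Pi, the smoothed probabilities over word pairs form a distribution.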
3 Related Work
Hearst (1992) describes a method for finding patterns like "Y such as the X", but her method requires human judgment. Riloff and Jones (1999) use a mutual bootstrapping technique that can find patterns automatically, but the bootstrapping requires an initial seed of manually chosen examples. Other work likewise requires training examples or initial seed patterns for each relation.
3 Related Work
Turney (2006a): LRA maps each pair X:Y to a high-dimensional vector v and then calculates cosines; pertinence is based on it. A limitation: the semantic content of the vectors is difficult to interpret.
The Algorithm
1. Find phrases. 2. Generate patterns, noting pattern frequency (TF), a local frequency count. 3. Count pair frequency, a global frequency count (DF). 4. Map pairs to rows, for both Xj:Yj and Yj:Xj. 5. Map patterns to columns, dropping all patterns with a pair frequency less than 10 (1,706,845 distinct patterns reduce to 42,032).
The Algorithm (continued)
6. Build a sparse matrix whose elements are frequencies. 7. Apply the log and entropy weighting, which gives more weight to patterns that vary substantially in frequency across pairs. 8. Apply SVD. 9. Calculate cosines. 10. Calculate the conditional probability p(Xk:Yk | Pi) = p(Pi | Xk:Yk) / sum_{j=1..n} p(Pi | Xj:Yj) for every word pair and every pattern. 11. Calculate pertinence.
The Algorithm: simplified diagram
Semantic similarity = similarity of pattern lists. From the set of word pairs, build a matrix of (pair 1, pattern list 1), ..., (pair n, pattern list n): search the corpus, count the patterns, then compute pertinence and rank.
5 Experiments with Word Analogies
Dataset: 374 college-level multiple-choice word analogies, taken from the SAT test; 6 * 374 = 2,244 pairs; 4,194 rows * 84,064 columns; sparse-matrix density 0.91%.
Score = (rank_stem + rank_choice) / 2
the four highest ranking patterns for the stem and solution for the first example

the top five pairs match the pattern “Y such as the X”.

Comparing with other measures

Experiments with Noun-Modifiers
Method and Result
Method: a single nearest neighbour algorithm with leave-one-out cross-validation; the distance between two noun-modifier pairs is measured by the average rank of their best shared pattern.
Result
More
For the 5 general classes

Comparing with other measures
Discussion
Time: word analogies take 5 hours, vs. 5 days (2005) and 9 days (2006a); noun-modifiers take 9 hours; the majority of the time is spent searching.
Performance: near the level of the average senior high-school student (54.6% vs. 57%). For applications such as building a thesaurus, lexicon, or ontology, this level of performance suggests that the algorithm could assist, but not replace, a human expert.
Conclusion
LRA is a black box. The main contribution of this paper is the idea of pertinence, used to find patterns that express the implicit semantic relations between two words.
State of the art
Accuracy, Analogies: HYBRID (2003) 45%, VSM (2005) 47%, LRA (2006a) 56.8%, pertinence (2006b) 55.7%, HUMAN 57%
F-measure, Noun-Modifier (5 classes): VSM (2005) 43.2%, LRA (2006a) 54.6%, pertinence (2006b) 50.2%
2008: A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, August 2008, pages 905-912.
1 Introduction
There are too many kinds of semantic relations to supply a special algorithm for each, so we restrict our attention to word pairs that are analogous, synonymous, antonymous, or associated. As far as we know, the algorithm proposed here is the first attempt to deal with all four tasks using a uniform approach.
1 Introduction: idea
Reduce everything to analogy:
Synonymous: X:Y is analogous to the pair levied:imposed.
Antonymous: X:Y is analogous to the pair black:white.
Associated: X:Y is analogous to the pair doctor:hospital.
1 Introduction: Why not WordNet?
WordNet contains all of the needed relations, but a corpus-based algorithm is BETTER than a lexicon: on the 374 multiple-choice SAT analogy questions, WordNet (Veale, 2004) scores 43% while the corpus-based approach (Turney, 2006a) scores 56%. A corpus-based approach also requires less human labor and is easy to extend to other languages.
1 Introduction: experiments
SAT (college entrance test), TOEFL, ESL, and a set of word pairs labeled similar, associated, and both, developed for experiments in cognitive psychology.
2 Algorithm: PairClass
View the task of recognizing word analogies as a problem of classifying word pairs: a standard classification problem for supervised machine learning.
2 Algorithm: resources
Corpus: 5 * 10^10 words, consisting of web pages gathered by a web crawler (Clarke, Charles L.A., 2003). Search engine: Wumpus (http://www.wumpus-search.org/), an efficient search engine for passage retrieval from large corpora, built to study issues that arise in indexing dynamic text collections in multi-user environments.
[Page 90]
2 Algorithm: PairClass (training set & testing set)
- Step 1: generate morphological variations — e.g., mason:stone → masons:stones
- Step 2: search a large corpus for all phrases of the form [0 to 1 words] X [0 to 3 words] Y [0 to 1 words] — e.g., "the mason cut the stone with"
- Step 3: generate patterns from each phrase — e.g., "the X cut * Y with", "* X * the Y *"; an n-word phrase yields 2^(n−2) patterns
- Step 4: reduce the number of patterns — keep the top kN patterns, with k = 20 and N the number of word pairs
- Step 5: generate feature vectors
- Step 6: apply a standard supervised learning algorithm — Weka's SMO with an RBF kernel
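The pattern-generation step can be sketched as follows: each context word in a matched phrase may either be kept or replaced by a wildcard, so an n-word phrase yields 2^(n−2) patterns. This is a minimal Python sketch; the function name and the `*` wildcard syntax are illustrative, not taken from the paper.

```python
from itertools import product

def generate_patterns(phrase, x, y):
    # Replace X and Y by placeholders; each remaining context word may be
    # kept or replaced by the wildcard '*', giving 2^(n-2) patterns for an
    # n-word phrase (sketch of the pattern-generation step).
    words = ['X' if w == x else 'Y' if w == y else w for w in phrase.split()]
    context = [i for i, w in enumerate(words) if w not in ('X', 'Y')]
    patterns = set()
    for keep in product([True, False], repeat=len(context)):
        pat = list(words)
        for flag, i in zip(keep, context):
            if not flag:
                pat[i] = '*'
        patterns.add(' '.join(pat))
    return patterns

pats = generate_patterns("the mason cut the stone with", "mason", "stone")
# a 6-word phrase yields 2^(6-2) = 16 patterns, e.g. "the X cut * Y with"
```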
[Page 91]
PairClass vs. LRA (Turney, 2006a)
- PairClass does not use a lexicon to find synonyms for the input word pairs: a pure corpus-based algorithm can handle synonyms without a lexicon.
- PairClass uses a support vector machine (SVM) instead of a nearest-neighbour (NN) learning algorithm.
- PairClass does not use SVD to smooth the feature vectors. It has been our experience that SVD is not necessary with SVMs.
[Page 92]
- Measure of similarity — PairClass: probability estimates (more useful); Turney (2006): cosine.
- The automatically generated patterns are slightly more general — PairClass: [0 to 1 words] X [0 to 3 words] Y [0 to 1 words]; Turney (2006): X [0 to 3 words] Y.
- The morphological processing in PairClass (Minnen et al., 2001) is more sophisticated than in Turney (2006).
[Page 93]
3 Experiment: SAT Analogies
Use a set of 374 multiple-choice questions from the SAT college entrance exam, treated as a binary classification problem.
[Page 94]
3 Experiment: SAT Analogies
1st DIFFICULTY: no negative examples — the training set consists of one positive example (the stem pair), and the testing set consists of five unlabeled examples (the five choice pairs).
Solution: randomly choose the stem pair of one of the other 373 questions as a negative example, use PairClass to estimate the probability that each testing example is positive, and guess the testing example with the highest probability.
[Page 95]
[Page 96]
3 Experiment: SAT Analogies
2nd DIFFICULTY: the algorithm is very unstable, for lack of examples.
Solution: to increase stability, repeat the learning process 10 times, using a different randomly chosen negative training example each time, and average the 10 probability estimates.
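The two fixes together can be sketched as a small loop. This is a hypothetical sketch: `train_and_score(pos, neg, pair)` stands in for training Weka's SMO on one positive and one negative pair and returning the probability that `pair` is positive.

```python
import random

def solve_sat_question(stem, choices, other_stems, train_and_score, runs=10):
    # Repeat the learning process `runs` times with a different randomly
    # chosen negative example each time, average the probability estimates,
    # and guess the choice pair with the highest average probability.
    totals = [0.0] * len(choices)
    for _ in range(runs):
        negative = random.choice(other_stems)
        for i, pair in enumerate(choices):
            totals[i] += train_and_score(stem, negative, pair)
    return max(range(len(choices)), key=lambda i: totals[i] / runs)
```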
[Page 97]
PairClass: accuracy of 52.1%
[Page 98]
3 Experiment: TOEFL Synonyms
Recognizing synonyms: a set of 80 multiple-choice synonym questions from the TOEFL.
[Page 99]
View it as a binary classification problem
[Page 100]
3 Experiment: TOEFL Synonyms
The 80 questions give 320 word pairs: 80 positive and 240 negative.
Apply PairClass using ten-fold cross-validation: in each random fold, 90% of the pairs are used for training and 10% for testing. For each fold, the model learned from the training set assigns probabilities to the pairs in the testing set. The folds are non-overlapping, so together they cover the whole dataset.
Choice: the one with the highest probability.
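The non-overlapping folds can be sketched as follows (a generic ten-fold split, not Turney's exact code):

```python
import random

def ten_fold_splits(n_items, folds=10, seed=0):
    # Shuffle the item indices and deal them into `folds` disjoint test
    # sets; together the test sets cover the whole dataset exactly once.
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::folds]) for i in range(folds)]

splits = ten_fold_splits(320)  # 80 TOEFL questions x 4 choices = 320 pairs
```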
[Page 101]
PairClass: accuracy of 76.1%
[Page 102]
3 Experiment: Synonyms and Antonyms
a set of 136 ESL practice questions
[Page 103]
3 Experiment: Synonyms and Antonyms
Patterns: Lin et al. (2003) hand-coded two patterns, "from X to Y" and "either X or Y".
Antonyms occasionally appear in a large corpus in one of these two patterns; synonyms very rarely appear in them.
PairClass learns its patterns automatically.
[Page 104]
3 Experiment: Synonyms and Antonyms
RESULT — PairClass (ten-fold cross-validation): accuracy of 75.0%.
Baseline (always guessing the majority class): accuracy of 65.4%.
No comparison with other systems is available for this dataset.
[Page 105]
3 Experiment: Similar, Associated, and Both
Lund et al. (1995) evaluated their corpus-based algorithm for measuring word similarity with word pairs that were labeled similar, associated, or both.
These 144 labeled pairs were originally created for cognitive psychology experiments with human subjects
[Page 106]
3 Experiment: Similar, Associated, and Both
Lund et al. (1995) did not measure accuracy; they showed that their algorithm's similarity scores were correlated with the response times of human subjects in priming tests.
PairClass with ten-fold cross-validation: accuracy of 77.1%.
Baseline: since the three classes are of equal size, both majority-class guessing and random guessing give 33.3%.
[Page 107]
3 Experiment: summary
For the first two experiments, PairClass is not the best, but it performs competitively.
For the second two experiments, PairClass performs significantly above the baselines.
[Page 108]
State of the art

| Year  | Algorithm | Type         | Synonyms (TOEFL) | Analogies (SAT) |
|-------|-----------|--------------|------------------|-----------------|
| 2001  | PMI-IR    | Corpus-based | 73.75%           |                 |
| 2003  | PR        | Hybrid       | 97.50%           |                 |
| 2005  | VSM       | Corpus-based |                  | 47.1%           |
| 2006a | LRA       | Corpus-based |                  | 56.1%           |
| 2006b | PERT      | Corpus-based |                  | 53.5%           |
| 2008  | PairClass | Corpus-based | 76.1%            | 52.1%           |
| Human |           |              | 64.5%            | 57.0%           |
[Page 109]
That's all! o_0
Any Questions?