1 Mining the Web to Determine Similarity Between Words, Objects, and Communities Author : Mehran Sahami Reporter : Tse Ho Lin 2007/9/10 FLAIRS, 2006

Mining the Web to Determine Similarity Between Words, Objects, and Communities

Author : Mehran Sahami

Reporter : Tse Ho Lin

2007/9/10

FLAIRS, 2006

Outline

Motivation
Objectives
Methodology
    Words
    Objects
    Communities
Experiments
Conclusion
Personal Comments

Motivation

Words: Many similarity measures are term-wise, so they miss the relationship between short texts that share no terms, e.g. Cos("space exploration", "NASA").

Objects: Users may be looking to find the same item sold at different vendors on the web.

Communities: Users are seeking to find others with similar interests.

Objectives

We begin by describing a robust method for measuring the semantic similarity between short texts.

We then examine the use of machine learning to produce similarity functions between semi-structured data elements.

We measure the similarity between on-line communities of users as part of a recommendation system.

Methodology – Words

[Diagram, applied to Query x and Query y alike: retrieved documents $d_1, d_2, \ldots, d_n$ -> compute the TFIDF term vector $v_i$ of each document -> truncate each vector to its top m weighted terms]
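The pipeline on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation: the search-engine retrieval step is replaced by hard-coded document lists, the IDF is computed over the retrieved set only, and the final combining step (centroid of the normalized vectors, then a dot product) does not appear on the slide and is assumed here. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF; the exact weighting is an assumption, not from the slide.
        vectors.append({t: c * math.log(1 + n_docs / df[t]) for t, c in tf.items()})
    return vectors


def truncate(vector, m):
    """Keep only the m highest-weighted terms (ties broken alphabetically)."""
    top = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:m]
    return dict(top)


def l2_normalize(vector):
    """Scale a sparse vector to unit length; an empty or zero vector stays empty."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else {}


def expand_query(docs, m=50):
    """Assumed combining step: normalized centroid of the truncated vectors."""
    centroid = Counter()
    for v in tfidf_vectors(docs):
        for term, weight in l2_normalize(truncate(v, m)).items():
            centroid[term] += weight
    return l2_normalize(dict(centroid))


def similarity(docs_x, docs_y, m=50):
    """Dot product of the two expanded-query vectors (cosine, since both are unit length)."""
    qx, qy = expand_query(docs_x, m), expand_query(docs_y, m)
    return sum(w * qy.get(t, 0.0) for t, w in qx.items())


# Stand-ins for the documents a search engine would return for each query.
docs_for_x = ["nasa space exploration mission", "space shuttle nasa launch"]
docs_for_y = ["nasa launch space station", "nasa rover mars exploration"]
print(similarity(docs_for_x, docs_for_y))
```

Because the comparison runs on the expanded vectors rather than on the query strings, two queries with no terms in common can still receive a non-zero score when their retrieved documents overlap in vocabulary.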

Methodology – Objects

Example records $R_1$ and $R_2$, each with fields $f_1$, $f_2$, $f_3$:

         Product Name                                ISBN          Categorization
$R_1$    Me Talk Pretty One Day Paperback Edition    0316776963    Books
$R_2$    The Tiny Book of Boss Jokes                 0007152604    Books

[Diagram: compute similarity between fields, $f_1(R_{11}, R_{12})$, $f_2(R_{21}, R_{22})$, $f_3(R_{31}, R_{32})$ -> training the parameters -> clustering]
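The first two stages of this pipeline (per-field similarity, then parameter training) can be illustrated as follows. This is a rough sketch under assumptions: the slide does not name the field similarity functions or the learner, so token-level Jaccard overlap, exact match, and a plain perceptron are stand-ins chosen for brevity. The shortened product names in the training pairs are hypothetical variants of the two records on the slide, and the clustering stage is omitted.

```python
def token_jaccard(a, b):
    """Overlap of lowercase word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def exact_match(a, b):
    """1.0 when the two field values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


# One similarity function per field: name, ISBN, categorization (assumed choices).
FIELD_SIMS = [token_jaccard, exact_match, exact_match]


def field_similarities(r1, r2):
    """Feature vector of per-field similarities for a pair of records."""
    return [f(x, y) for f, x, y in zip(FIELD_SIMS, r1, r2)]


def score(r1, r2, weights, bias):
    """Weighted linear combination of the field similarities."""
    feats = field_similarities(r1, r2)
    return sum(w * f for w, f in zip(weights, feats)) + bias


def train_perceptron(pairs, labels, epochs=20, lr=0.1):
    """Learn the combination weights from labeled record pairs (1 = same product)."""
    weights, bias = [0.0] * len(FIELD_SIMS), 0.0
    for _ in range(epochs):
        for (r1, r2), label in zip(pairs, labels):
            pred = 1 if score(r1, r2, weights, bias) > 0 else 0
            err = label - pred  # 0 when correct, +1 or -1 when wrong
            for i, f in enumerate(field_similarities(r1, r2)):
                weights[i] += lr * err * f
            bias += lr * err
    return weights, bias


def same_product(r1, r2, weights, bias):
    """Decide whether two records describe the same product."""
    return score(r1, r2, weights, bias) > 0


r1 = ("Me Talk Pretty One Day Paperback Edition", "0316776963", "Books")
r2 = ("The Tiny Book of Boss Jokes", "0007152604", "Books")
# Hypothetical listings of the same two books from another vendor.
r1_alt = ("Me Talk Pretty One Day", "0316776963", "Books")
r2_alt = ("Tiny Book of Boss Jokes", "0007152604", "Books")

pairs = [(r1, r1_alt), (r2, r2_alt), (r1, r2), (r1_alt, r2_alt)]
labels = [1, 1, 0, 0]
weights, bias = train_perceptron(pairs, labels)
print(same_product(r1, r1_alt, weights, bias), same_product(r1, r2, weights, bias))
```

In this toy data both records share the category "Books", so training drives the weight on that field toward zero and leaves the name and ISBN fields to carry the decision.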

Methodology – Communities

Joachims’ Combine Ranking

B, R: Community
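As an illustration of comparing two communities B and R, the sketch below treats each community as a set of member ids and computes two normalized-overlap scores. The names L1 and L2 match measures listed in the experiments, but the formulas on the original slide did not survive extraction, so the definitions used here (overlap divided by the product of the sizes, and by its square root) are assumptions, and the member ids are made up.

```python
import math


def l1_similarity(b, r):
    """Assumed L1-normalized overlap: |B & R| / (|B| * |R|)."""
    if not b or not r:
        return 0.0
    return len(b & r) / (len(b) * len(r))


def l2_similarity(b, r):
    """Assumed L2-normalized overlap: |B & R| / sqrt(|B| * |R|).

    This equals the cosine between the binary membership vectors of B and R.
    """
    if not b or not r:
        return 0.0
    return len(b & r) / math.sqrt(len(b) * len(r))


# Hypothetical member ids for a base community B and a related community R.
base = {1, 2, 3, 4}
related = {3, 4, 5, 6}
print(l1_similarity(base, related), l2_similarity(base, related))
```

Ranking candidate communities R by such a score against a user's community B gives one recommendation list per measure, which is what makes the measures comparable in an experiment.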

Experiments

Words

Objects

Experiments – Communities

m: The user is already a member of the recommended community
n: The user visits but does not join the recommended community
j: The user joins the recommended community

L2, MI1, MI2, IDF, L1, LogOdds.

Conclusion

In this paper we have presented several web-based applications where measuring the similarity between different entities is an important element for success.

Personal Comments

Application: Similarity measures, record linkage.

Advantage: The proposed approaches use the large quantity of information available on-line.

Drawback: The author does not compare against other related methods in the experiments.


Parameters Training
