The Google Similarity Distance

Post on 01-Feb-2016


Intelligent Database Systems Lab

National Yunlin University of Science and Technology (國立雲林科技大學)

Presenter: Chien-Hsing Chen

Authors: Rudi L. Cilibrasi, Paul M.B. Vitanyi

The Google Similarity Distance

IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007

Outline

Motivation

Objective

NGD

Experiments

Conclusions

Personal Opinion

Motivation

Designing structures capable of manipulating knowledge is very costly.

High-quality content must be entered into these structures by knowledgeable human experts.

These efforts are long-running and large-scale.

Objective

The authors develop a method that uses only the name of an object to obtain knowledge about the similarity of objects.

By contrast, regular formal concept analysis (FCA), as used in ontology work, derives similarity from objects and their attributes.

The Google Similarity Distance

Kolmogorov complexity
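The slide gives only the keyword here. As a reminder of how Kolmogorov complexity leads to NGD, the standard definitions can be written as below. K denotes Kolmogorov complexity, f(x) the number of pages containing term x, f(x, y) the number of pages containing both terms, and N the total number of indexed pages. The notation is chosen for this note and may differ from the original slide.

```latex
% Normalized information distance, defined via Kolmogorov complexity K
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}

% Normalized Google distance: page counts stand in for the (uncomputable) complexities
\mathrm{NGD}(x, y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x, y)}
                          {\log N - \min\{\log f(x),\, \log f(y)\}}
```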


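Example: NGD(horse, rider) ≈ 0.443, computed from the following page counts:

"horse": 46,700,000 pages

"rider": 12,200,000 pages

"horse, rider" (both terms): 2,630,000 pages

N = 8,058,044,651 indexed pages

Further NGD values listed on the slide:

NGD(pepsi, cola) = 0.797

NGD(賓拉登, 攻擊) = 0.64 (Bin Laden, attack)

NGD(horse, rider) = 0.898

NGD(book, drink) = 0.694

NGD(web, network) = 0.2768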
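The horse/rider figure can be checked with a few lines of Python. This is a minimal sketch that assumes the usual NGD definition, (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))); the function name `ngd` and its parameter names are illustrative, not taken from the slides.

```python
import math

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance from page counts.

    fx, fy: pages containing each term alone
    fxy:    pages containing both terms
    n:      total number of indexed pages
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Page counts quoted on the slide for "horse" and "rider".
horse_rider = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(horse_rider, 3))
```

The base of the logarithm cancels in the ratio, so natural logarithms give the same value as base 2.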

Applications and Experiments

Hierarchical Clustering

Given a set of objects in a space equipped with a distance measure, the distance matrix has as its entries the pairwise distances between the objects.
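To make the distance matrix concrete, here is a small sketch that fills it with NGD values and then groups the terms. Two things are assumptions: all page counts except those for "horse" and "rider" are invented for illustration, and plain single-linkage agglomerative merging is swapped in as the clustering step, with no claim that it is the tree-building method used in the paper.

```python
import math
from itertools import combinations

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def distance_matrix(names, single, pair, n):
    """Symmetric matrix whose (i, j) entry is NGD(names[i], names[j])."""
    size = len(names)
    d = [[0.0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        key = frozenset((names[i], names[j]))
        d[i][j] = d[j][i] = ngd(single[names[i]], single[names[j]], pair[key], n)
    return d

def single_linkage(d, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [{i} for i in range(len(d))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(d[i][j] for i in clusters[ab[0]] for j in clusters[ab[1]]),
        )
        clusters[a] |= clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters

N = 8_058_044_651
names = ["horse", "rider", "book", "paper"]
# "horse" and "rider" counts come from the slide; the rest are made up.
single = {"horse": 46_700_000, "rider": 12_200_000,
          "book": 500_000_000, "paper": 300_000_000}
pair = {
    frozenset(("horse", "rider")): 2_630_000,
    frozenset(("book", "paper")): 100_000_000,
    frozenset(("horse", "book")): 5_000_000,
    frozenset(("horse", "paper")): 3_000_000,
    frozenset(("rider", "book")): 1_000_000,
    frozenset(("rider", "paper")): 800_000,
}

d = distance_matrix(names, single, pair, N)
clusters = single_linkage(d, 2)
print([[names[i] for i in sorted(c)] for c in clusters])
```

With these counts the two closest pairs are horse/rider and book/paper, so the terms end up in those two groups.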


Hierarchical clustering dataset: 17th-century painters


SVM-NGD Learning

The authors use six anchor words $a_1, \dots, a_6$ to convert each of the 40 training words $w_1, \dots, w_{40}$ into a 6-dimensional training vector $v_1, \dots, v_{40}$.

The entry $v_{j,i}$ of $v_j = (v_{j,1}, \dots, v_{j,6})$ is defined as $v_{j,i} = \mathrm{NGD}(w_j, a_i)$ for $1 \le j \le 40$, $1 \le i \le 6$.
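The word-to-vector step can be sketched as follows. Everything named here is hypothetical: `feature_vector` is an illustrative helper, the anchors and words are placeholders, and `toy_distance` merely stands in for a real NGD lookup (which would need live page counts), so the numbers it produces carry no meaning.

```python
def feature_vector(word, anchors, ngd):
    """Return (NGD(word, a_1), ..., NGD(word, a_k)) for the given anchors."""
    return [ngd(word, anchor) for anchor in anchors]

def toy_distance(w, a):
    """Placeholder for NGD: letter-set dissimilarity in [0, 1]. NOT a real NGD."""
    shared = len(set(w) & set(a))
    return 1.0 - shared / max(len(set(w)), len(set(a)))

# Six placeholder anchors and 40 placeholder training words.
anchors = ["red", "green", "blue", "one", "two", "three"]
words = [f"word{j}" for j in range(1, 41)]

# One 6-dimensional vector per training word, i.e. a 40 x 6 matrix.
vectors = [feature_vector(w, anchors, toy_distance) for w in words]
print(len(vectors), len(vectors[0]))
```

The resulting rows would then be passed, together with their class labels, to an SVM trainer.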

NGD Translation

Comparison to WordNet semantics

100 semantic categories are randomly selected from the WordNet database.

For each category, an SVM is trained on 50 labeled training samples. Positive examples come from WordNet; the others come from a dictionary.

Each experiment uses a total of six anchors: three from WordNet and three from a dictionary.

The test set consists of 20 new examples.

100 experiments are run in total.

The authors ignore false negatives.

Conclusion

The WordNet knowledge base was created over the course of decades by paid human experts.

Google has already indexed more than 8 billion pages and shows no signs of slowing down.

The estimate of 8 billion indexed pages dates from 2004.

Opinion

Advantage

The Google search engine has recently gained recognition as a basis for similarity measurement.

Drawback

Anchor selection and the accuracy measure (false negatives are ignored).

NGD is not a novel idea so much as a straightforward demonstration.

Application