
Measuring Semantic Similarity between Words Using Page Counts and Snippets

Manasa.Ch

Computer Science & Engineering,

SR Engineering College

Warangal, Andhra Pradesh, India

Email: [email protected]

V.Ramana

Assistant Professor, CSE

SR Engineering College,

Warangal, Andhra Pradesh, India

Email:[email protected]

S.P. Ananda Raj

Sr. Assistant Professor, CSE

SR Engineering College,

Warangal, Andhra Pradesh, India

Email: [email protected]

Abstract

Web mining involves activities such as document clustering and community mining performed on the web. Such tasks require measuring the semantic similarity between words, which in turn makes many web mining applications easier. However, accurately measuring the semantic similarity between any two words is a difficult task. In this paper a new approach is proposed to measure the similarity between words. The approach is based on text snippets and page counts, both obtained from the results of a web search engine such as Google. Lexical patterns are extracted from the text snippets, word co-occurrence measures are defined using the page counts, and the results of the two are combined. Moreover, we propose pattern extraction and pattern clustering algorithms in order to find the various relationships between any given pair of words. Support Vector Machines (SVMs), a data mining technique, are used to optimize the results. The empirical results reveal that the proposed techniques produce results that compare well with human ratings and improve accuracy in web mining activities.

Key Words - Text snippets, word count, semantic similarity, web mining, lexical patterns

1. INTRODUCTION

Web mining has gained popularity as a huge amount of information is being made available over the web, and the automated processing of such data is the need of the hour. Applications of web mining include entity disambiguation, relation detection and community extraction. Information retrieval and natural language processing are two important aspects involved in all web mining applications. Lexical dictionaries such as WordNet are widely used for natural language processing; however, WordNet is a general-purpose lexical ontology. As part of web mining, documents must be compared and analyzed programmatically. This is a tedious task, as the meanings of words change across domains and over time. The problem with lexical dictionaries is that they lack diverse information about words in various contexts. For instance, the word "apple" is related to computer science, as there is a company named "Apple" which has been instrumental in bringing many computer hardware and software technologies to market. However, this sense is ignored by some lexical dictionaries, which treat the word only as a fruit. As new words are created and new meanings become associated with existing words, lexical dictionaries have proved inadequate for handling words whose new meanings and relationships with other words have not yet been recorded.

To overcome the drawbacks mentioned above, we propose a method that automatically finds the semantic similarity between words or entities based on the page counts and text snippets retrieved from web search engines such as Google. A page count is an estimate of the number of pages that contain the query words. A snippet is a piece of text extracted by the web search engine based on

Manasa Ch et al., International Journal of Computer Science & Communication Networks, Vol 2(4), 553-558, ISSN: 2249-5789


the given query term. The following is the text snippet obtained from the Google search engine for the query string "apple".

"Apple Inc. (NASDAQ: AAPL; formerly Apple Computer, Inc.) is an American multinational corporation that designs and sells consumer electronics, computer ..."

Fig. 1: Text snippet returned by the Google search engine for the search word "apple"

Similarity measures based on text snippets have been used for query expansion [4], personal name disambiguation [9], and community mining [17]. Text snippets and page counts can be obtained automatically from search engines and used in web mining. However, they have the following drawbacks:

- page count analysis ignores the position of a word within a page;
- the page count of a polysemous word (a word with multiple senses) might combine the counts of all its senses;
- because of the large number of documents in a result set, only the snippets of the top-ranking results for a query can be processed.

We propose a method that overcomes the problems mentioned above. We use both snippets and page counts, and propose lexical pattern extraction and pattern clustering algorithms to accurately measure the semantic similarity between words. The main contributions of this paper include the extraction of lexical patterns to identify relations between words, and the use of an SVM to integrate a machine learning approach that optimizes the results.

2. RELATED WORK

In [15] a taxonomy of words is used to calculate the similarity between two words by finding the length of the shortest path connecting them. The information content concept was used by Resnik [9], who introduced a similarity measure between two concepts; the maximum similarity between any pair of concepts to which the words belong is used as the similarity between the words. Information content and structural semantic information were combined by Li et al. [3] to obtain a similarity measure; this technique showed very high accuracy on the Charles [11] benchmark data set. Lin [8] defined similarity as the information that is common to both concepts, while Cilibrasi and Vitanyi [12] proposed a distance metric defined using page counts retrieved from web search engines.

Snippets were used by [4] to measure the semantic similarity between two given words; each snippet was represented as a TF-IDF weighted term vector. A double-checking model based on the snippets returned by a web search engine was developed by Chen et al. [4]. The concept of measuring semantic similarity is used in various web mining applications such as word sense disambiguation [6], language modeling [13], synonym extraction [5], and thesauri extraction [4].

3. PROPOSED METHOD

The proposed method, which finds the similarity between two words A and B, returns a value between 0.0 and 1.0. The value 0.0 indicates that there is no similarity between the words, while 1.0 indicates absolute similarity. The method makes use of the page counts and text snippets retrieved by a search engine such as Google. For instance, the words "gem" and "jewel" are given to Google, and the resulting page counts and text snippets are used by our method to find the similarity between the words. The proposed method is visualized in Fig. 2.

As illustrated in Fig. 2, two words such as "gem" and "jewel" are given as input to the search engine, which returns page counts and text snippets. These are extracted and given as input to the proposed techniques. The page counts are fed to word co-occurrence measures such as WebJaccard, WebOverlap, WebDice, and WebPMI, and the results of these measures are given to the SVM. The text snippets, on the other hand, are given to the proposed algorithms, which generate pattern clusters that are in turn given to the SVM. The SVM therefore has two inputs: the word co-occurrence measures and the pattern clusters. The SVM is trained with these, and finally an accurate semantic similarity is calculated for the given two words.
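The data flow just described can be sketched end to end. This is a minimal illustration in which the search-engine calls are stubbed out with canned data (all counts and snippets below are hypothetical), and the co-occurrence and pattern features are reduced to one score each:

```python
# A rough sketch of the data flow in Fig. 2. The search-engine call is
# replaced by canned placeholder data; a real implementation would query
# a live search API instead.
def search(query):
    # Hypothetical (page_count, snippets) results, for illustration only.
    canned = {
        "gem": (50_000_000, []),
        "jewel": (30_000_000, []),
        "gem AND jewel": (4_000_000, ["every gem is a jewel of nature"]),
    }
    return canned.get(query, (0, []))

def feature_vector(word_a, word_b):
    """Assemble the SVM input: a page-count co-occurrence score plus a
    snippet-derived pattern feature (both heavily simplified here)."""
    pc_a, _ = search(word_a)
    pc_b, _ = search(word_b)
    pc_ab, snippets = search(word_a + " AND " + word_b)
    # Jaccard-style co-occurrence score computed from the page counts.
    jaccard = pc_ab / (pc_a + pc_b - pc_ab) if pc_ab else 0.0
    # Count snippets in which both words actually co-occur.
    hits = sum(1 for s in snippets if word_a in s and word_b in s)
    return [jaccard, hits]

print(feature_vector("gem", "jewel"))  # co-occurrence score and snippet-hit count
```

In the full method, the single Jaccard-style score is replaced by the four measures of Section 3.1 and the snippet-hit count by the pattern-cluster features of Sections 3.2-3.3.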


Fig. 2: Outline of the proposed method

3.1 Page Count Based Co-Occurrence Measures

For two given words A and B, page counts are returned by the search engine when the words are given as input. Four well-known word co-occurrence measures, Jaccard, Overlap (Simpson), Dice, and pointwise mutual information (PMI), are used in the proposed design in order to find the similarity between the words.
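The four web-adapted measures can be sketched as follows, using the page counts H(A), H(B), and H(A AND B). This follows the standard definitions of WebJaccard, WebOverlap, WebDice, and WebPMI; the noise threshold c and the total indexed-page count N are assumed values, since the paper does not restate the formulas here:

```python
import math

def web_jaccard(pc_a, pc_b, pc_ab, c=5):
    # WebJaccard: H(A AND B) / (H(A) + H(B) - H(A AND B)),
    # forced to 0 when the conjunctive count is below a noise threshold c.
    if pc_ab <= c:
        return 0.0
    return pc_ab / (pc_a + pc_b - pc_ab)

def web_overlap(pc_a, pc_b, pc_ab, c=5):
    # WebOverlap (Simpson): H(A AND B) / min(H(A), H(B)).
    if pc_ab <= c:
        return 0.0
    return pc_ab / min(pc_a, pc_b)

def web_dice(pc_a, pc_b, pc_ab, c=5):
    # WebDice: 2 * H(A AND B) / (H(A) + H(B)).
    if pc_ab <= c:
        return 0.0
    return 2.0 * pc_ab / (pc_a + pc_b)

def web_pmi(pc_a, pc_b, pc_ab, n=10**10, c=5):
    # WebPMI: log2( (H(A AND B)/N) / ((H(A)/N) * (H(B)/N)) ),
    # where N is the (assumed) total number of indexed pages.
    if pc_ab <= c:
        return 0.0
    return math.log2((pc_ab / n) / ((pc_a / n) * (pc_b / n)))

# Hypothetical page counts for "gem", "jewel", and "gem AND jewel".
print(round(web_jaccard(50_000_000, 30_000_000, 4_000_000), 4))  # 0.0526
```

The threshold c guards against accidental co-occurrences on a web-scale corpus; raw counts this small are treated as noise and mapped to zero similarity.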

3.2 Lexical Pattern Extraction

To overcome the drawbacks of using text snippets directly, we propose a "lexical pattern extraction" algorithm based on text snippets. The algorithm is meant to find the semantic relations that exist between the given words. Similar techniques have been used in various natural language processing tasks such as extracting hypernyms [1], [7], question answering [10], meronym extraction [14], and paraphrase extraction. Lexical patterns are patterns that satisfy the following criteria:

1. A subsequence must contain exactly one occurrence of each of A and B.
2. The maximum length of a subsequence is L words.
3. One or more words may be skipped in a subsequence; however, no more than g words may be skipped consecutively.
4. Only negation contractions in a context are expanded.
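A simplified sketch of the extraction step under the criteria above, assuming whitespace tokenization and contiguous subsequences only (the g = 0 case; handling skipped words is omitted for brevity), with the two target words replaced by X and Y markers:

```python
def extract_patterns(snippet, word_a, word_b, max_len=5):
    """Extract lexical patterns: contiguous word sequences of the snippet
    that contain exactly one occurrence of each target word and span at
    most max_len (i.e., L) words. The targets become X and Y markers."""
    tokens = snippet.lower().split()
    # Replace the target words with placeholder markers X and Y.
    marked = ["X" if t == word_a else "Y" if t == word_b else t for t in tokens]
    patterns = set()
    n = len(marked)
    for start in range(n):
        # Windows of at most max_len words starting at `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            window = marked[start:end]
            if window.count("X") == 1 and window.count("Y") == 1:
                patterns.add(" ".join(window))
    return patterns

print(extract_patterns("a gem is a precious jewel indeed", "gem", "jewel"))
```

Replacing the words by markers lets the same pattern (e.g., "X is a precious Y") be counted across many different word pairs, which is what the clustering step in Section 3.3 relies on.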

3.3 Lexical Pattern Clustering

The extracted lexical patterns are clustered based on their similarity to the existing clusters; each cluster contains patterns that express a similar semantic relation. Algorithm 1 returns such clusters, sorted so that the most useful clusters appear at the top.

Algorithm 1: Steps for pattern clustering
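Since Algorithm 1 itself is not reproduced in the text, the following is a minimal greedy sketch of the clustering idea, assuming each pattern is represented as a sparse vector of the word pairs it matches and using an illustrative similarity threshold theta:

```python
def cosine(u, v):
    # Cosine similarity between two sparse vectors given as dicts.
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_patterns(pattern_vectors, theta=0.7):
    """Greedy single-pass clustering: each pattern (a sparse vector over
    the word pairs it matches) joins the most similar cluster centroid if
    the similarity exceeds theta; otherwise it starts a new cluster."""
    clusters = []  # list of (centroid_dict, [pattern names])
    for name, vec in pattern_vectors.items():
        best, best_sim = None, theta
        for c in clusters:
            sim = cosine(c[0], vec)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append((dict(vec), [name]))
        else:
            # Merge the pattern into the cluster centroid (vector sum).
            for k, x in vec.items():
                best[0][k] = best[0].get(k, 0) + x
            best[1].append(name)
    # Sort so the largest (most useful) clusters come first.
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    return [c[1] for c in clusters]
```

For example, patterns that fire on the same synonym pairs (such as "X is a Y" and "X is called Y") end up in one cluster, while a pattern matching unrelated pairs starts its own.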

4. TRAINING WITH SVM

A two-class SVM is trained with both synonymous and non-synonymous word pairs generated from WordNet. Word pairs are extracted for 3000 words, so the total number of words in the training data is 6000. Lexical patterns are then extracted subject to a specified threshold, clustered, and given to the SVM. The SVM acts upon both the results of the word co-occurrence measures and the pattern clusters in order to calculate the semantic similarity between two given words.
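The training step can be illustrated as follows. As a dependency-free stand-in for the two-class SVM, this sketch trains a simple perceptron-style linear classifier on feature vectors that concatenate the four co-occurrence scores with a pattern-cluster feature; all data values below are invented for illustration:

```python
def train_linear(features, labels, epochs=100, lr=0.1):
    """Train a linear classifier with the perceptron update rule
    (a stand-in for the two-class SVM used in the paper)."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # +1, 0, or -1
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b

def predict(model, x):
    # 1 = synonymous, 0 = non-synonymous.
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy feature vectors: [WebJaccard, WebOverlap, WebDice, WebPMI, cluster hit]
# (all values invented for illustration).
X = [
    [0.8, 0.9, 0.85, 0.7, 1.0],   # synonymous pair
    [0.7, 0.8, 0.75, 0.6, 0.9],   # synonymous pair
    [0.1, 0.2, 0.15, 0.1, 0.0],   # unrelated pair
    [0.05, 0.1, 0.1, 0.0, 0.1],   # unrelated pair
]
y = [1, 1, 0, 0]
model = train_linear(X, y)
print(predict(model, [0.75, 0.85, 0.8, 0.65, 0.95]))  # prints 1
```

In practice the paper's setup would use a real SVM implementation and the 3000-pair WordNet-derived training set; the signed distance from the decision boundary can then be mapped into the [0, 1] similarity score.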

5. EXPERIMENTAL RESULTS

The experimental results include the semantic similarities between given pairs of words, computed using the SVM together with the page counts and text snippets retrieved from search engines for the given words.



Fig. 3: Home page

We enter a keyword in the given text box to search in the search engine; for example, "google" and "opera" are the words searched in Fig. 4 and Fig. 6.

Fig. 4: Entering the word "google" to search

When we click on the search button, the page counts and snippets are displayed as the result. For example, the page counts and snippets for "google" and "opera" are shown in Fig. 5 and Fig. 7.

Fig. 5: Page counts and snippets retrieved for the given word "google"

Fig. 6: Entering the word "opera" to search

Fig. 7: Page counts and snippets retrieved for the given word "opera"

We enter two words to measure the semantic similarity between them; the measurement ranges from 0 to 1. For the given words "google" and "opera" the semantic similarity is 0.8.

Fig. 8: Semantic similarity between "google" and "opera" shown as 0.8

The semantic similarity can be measured for various pairs of words. The result is close to 1 when the words are semantically close and close to 0 when they are not. The output is shown in the form of graphs and tables as follows.



Table 1: Semantic similarities for various word pairs

Graph 1: Semantic similarities for various word pairs

6. CONCLUSION

We proposed a semantic similarity measure for two words based on the page counts and text snippets returned by a web search engine such as Google. The aim of this paper is to measure the semantic similarity between any two given words with utmost accuracy. To achieve this, pattern extraction and pattern clustering techniques are introduced; these algorithms help in finding the various relationships between words. An SVM was trained with the relationships identified between the given words. The experiments were made with synonymous and non-synonymous word pairs collected from WordNet synsets. The experimental results show that the proposed method performs considerably better than the existing approaches employed to measure the semantic similarity between words.

7. REFERENCES

[1] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proc. of 3rd Text REtrieval Conference, pages 69-80, 1994.
[2] D. Bollegala, Y. Matsuo, and M. Ishizuka. Disambiguating personal names on the web using automatically extracted key phrases. In Proc. of the 17th European Conference on Artificial Intelligence, pages 553-557, 2006.
[3] D. R. Cutting, J. O. Pedersen, D. Karger, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR '92, pages 318-329, 1992.
[4] D. Lin. An information-theoretic definition of similarity. In Proc. of the 15th ICML, pages 296-304, 1998.
[5] D. Lin. Automatic retrieval and clustering of similar words. In Proc. of the 17th COLING, pages 768-774, 1998.


[6] F. Keller and M. Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459-484, 2003.
[7] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28, 1998.
[8] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proceedings of the International Conference on Digital Libraries, 2005.
[9] J. Curran. Ensemble methods for automatic thesaurus extraction. In Proc. of EMNLP, 2002.
[10] J. Mori, Y. Matsuo, and M. Ishizuka. Extracting keyphrases to represent relations in social networks from the web. In Proc. of the 20th IJCAI, 2007.
[11] M. Fleischman and E. Hovy. Multi-document person name resolution. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, 2004.


[12] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proc. of the 14th COLING, pages 539-545, 1992.
[13] M. Lapata and F. Keller. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):1-31, 2005.
[14] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proc. of the 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 206-214, 1998.
[15] P. Cimiano, S. Handschuh, and S. Staab. Towards the self-annotating web. In Proc. of the 13th WWW, 2004.
[16] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In Proceedings of the World Wide Web Conference (WWW), pages 463-470, 2005.
[17] Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In Proceedings of the 15th International World Wide Web Conference, 2006.

8. ABOUT THE AUTHORS

Manasa.Ch received the M.C.A degree from Kamala Institute of Technology and Science, Huzurabad, Karimnagar, A.P., India. She is currently pursuing an M.Tech in Computer Science and Engineering at SR Engineering College, Warangal, India. Her research interests include Knowledge and Data Engineering. She has participated in the ISTE-approved National Conference on "Mobile Communications and Data Engineering" at VITS, Karimnagar, A.P., and in the Women Student Congress at NIT, Warangal, organized by the IEEE WIE student branch.

V.Ramana received a B.Tech (CSE) degree from JNTU, Hyderabad, in 2006 and an M.Tech (AI) from the University of Hyderabad in 2010. He has 2 years of teaching experience. His areas of interest are Artificial Intelligence and Machine Learning. He has published papers in international journals and at international and national conferences, and has attended national workshops, FDPs, and seminars. He is a member of CSI.

S.P.Anandaraj received a B.E (CSE) degree from Madras University, Chennai, in 2004, and an M.Tech (CSE) with a Gold Medal from the Dr. MGR Educational and Research Institute University in 2007 (Distinction with Honors). He is now pursuing a Ph.D. at St. Peter's University, Chennai. He has 8 years of teaching experience. His areas of interest are Information Security and Sensor Networks. He has published papers in international journals and at international and national conferences, and has attended nearly 15 national workshops, FDPs, and seminars. He is a member of ISTE, CSI, IEEE, IACSIT, and IAENG.
