Measuring Semantic Similarity between Words Using Page Counts and
Snippets
Manasa.Ch
Computer Science & Engineering,
SR Engineering College
Warangal, Andhra Pradesh, India
Email: [email protected]
V.Ramana
Assistant Professor, CSE
SR Engineering College,
Warangal, Andhra Pradesh, India
Email:[email protected]
S.P. Ananda Raj
Sr. Assistant Professor, CSE
SR Engineering College,
Warangal, Andhra Pradesh, India
Email: [email protected]
Abstract
Web mining involves activities such as document
clustering and community mining that are performed
on the web. Such tasks require measuring the semantic
similarity between words, which makes web mining
activities easier in many applications. However,
measuring the semantic similarity between any two
words accurately is a difficult task. In this paper, a
new approach is proposed to measure the similarity
between words. The approach is based on text snippets
and page counts, both of which are taken from the
results of a search engine such as Google. Lexical
patterns are extracted from the text snippets, word
co-occurrence measures are defined using the page
counts, and the results of the two are combined.
Moreover, we propose pattern extraction and pattern
clustering algorithms in order to find the various
relationships between any two given words. Support
Vector Machines, a data mining technique, are used to
optimize the results. The empirical results show that
the proposed techniques yield results that compare
well with human ratings and with the accuracy needed
in web mining activities.
Key Words - Text snippets, word count, semantic
similarity, web mining, lexical patterns
1. INTRODUCTION
Web mining has gained popularity because a huge
amount of information is available on the web and
automated processing of that information is the need
of the hour. Applications of web mining include
entity disambiguation, relation detection, and
community extraction. Information retrieval and
natural language processing are two important
aspects of all web mining applications. Lexical
dictionaries such as WordNet are widely used for
natural language processing. However, WordNet is a
general-purpose lexical ontology. As part of web
mining, documents have to be compared and analyzed
programmatically. This is a tedious task because the
meanings of words change across domains and over
time. The problem with lexical dictionaries is that
they lack diverse information about words in various
contexts. For instance, the word “apple” is related to
computer science because a company named “Apple”
has been instrumental in introducing many computer
hardware and software technologies. However, some
lexical dictionaries ignore this sense because they
treat “apple” only as a fruit. As new words are
created and new meanings become associated with
existing words, lexical dictionaries prove inadequate
whenever words have meanings and relationships with
other words that have not yet been recorded in them.
To overcome the drawbacks mentioned above, we
propose a method that automatically measures the
semantic similarity between words or entities based
on the page counts and text snippets retrieved from
web search engines such as Google. A page count is an
estimate of the number of pages that contain the
query words. A snippet is a short piece of text that
the web search engine extracts based on
Manasa Ch et al , International Journal of Computer Science & Communication Networks,Vol 2(4), 553-558
553
ISSN:2249-5789
the query term given. The following text snippet was
obtained from the Google search engine for the
query string “apple”.
“Apple Inc. (NASDAQ: AAPL; formerly Apple
Computer, Inc.) is an American multinational
corporation that designs and sells consumer
electronics, computer ...”
Fig. 1: Text snippet returned by the Google search engine for the search word “apple”
Similarity measures based on text snippets have been
used for query expansion [4], personal name
disambiguation [9], and community mining [17]. Text
snippets and page counts are obtained automatically
from search engines and used in web mining. However,
they have the following drawbacks:
1. Page count analysis ignores the position of a
word in a page.
2. The page count of a polysemous word (a word
with multiple senses) might contain a
combination of all its senses.
3. Because of the large number of documents in
the result set, only the snippets for the
top-ranking results of a query can be processed.
We propose a method that overcomes the problems
mentioned above. We use both snippets and page
counts, and propose lexical pattern extraction and
pattern clustering algorithms to measure the semantic
similarity between words accurately. The main
contributions of this paper are the extraction of
lexical patterns to identify relations between words
and the use of an SVM, a machine learning approach,
to optimize the results.
2. RELATED WORK
In [15], a taxonomy of words is used to calculate the
similarity between two words by finding the length of
the shortest path connecting them. Resnik [9]
introduced a similarity measure between two concepts
based on information content; the similarity between
two words is the maximum similarity between any of
the concepts to which the words belong. Li et al. [3]
combined information content with structural semantic
information to obtain a similarity measure, which
showed very high accuracy on the Charles [11]
benchmark data set. Lin [8] defined similarity as the
information that is common to both concepts, while
Cilibrasi and Vitanyi [12] proposed a distance metric
defined using page counts retrieved from web search
engines.
Snippets were used in [4] to measure the semantic
similarity between two given words, with each snippet
represented as a TF-IDF weighted term vector. Chen et
al. [4] developed a double-checking model based on
the snippets returned by a web search engine.
Measuring semantic similarity is used in various web
mining applications such as word sense
disambiguation [6], language modeling [13], synonym
extraction [5], and thesauri extraction [4].
3. PROPOSED METHOD
The proposed method takes two words A and B and
returns a similarity value between 0.0 and 1.0. The
value 0.0 indicates that there is no similarity
between the words, while 1.0 indicates absolute
similarity. The method uses the page counts and text
snippets retrieved by a search engine such as Google.
For instance, the words gem and jewel are submitted
to Google, and the resulting page counts and text
snippets are used by our method to find the
similarity between the words. The proposed method is
outlined in Fig. 2.
As illustrated in Fig. 2, two words such as gem and
jewel are given as input to the search engine, which
returns page counts and text snippets. These are
extracted and passed as input to our proposed
techniques. The page counts are passed to the word
co-occurrence measures WebJaccard, WebOverlap,
WebDice, and WebPMI. The results of these measures
are given to the SVM. The text snippets, on the other
hand, are passed to the proposed algorithms, which
generate pattern clusters that are in turn given to
the SVM. The SVM thus has two inputs: the word
co-occurrence measures and the pattern clusters. The
SVM is trained with these inputs, and finally the
semantic similarity is calculated for the two given
words, such as gem and jewel.
Fig. 2: outline of proposed method
3.1 Page Count Based Co-Occurrence Measures
For two given words A and B, the search engine
returns page counts when the words are submitted as
queries. Four well-known word co-occurrence measures,
Jaccard, Overlap (Simpson), Dice, and pointwise
mutual information (PMI), are used in the proposed
design to find the similarity between the words.
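The section names the four measures but does not reproduce their formulas. The sketch below uses the commonly cited page-count forms of these measures. The cut-off c, the index size n, and all function names are assumptions made for illustration and are not values taken from this paper. Here h_a and h_b are the page counts for the single-word queries and h_ab is the page count for the conjunctive query "A AND B".

```python
import math

# Assumed cut-off: very small co-occurrence counts are treated as noise.
DEFAULT_C = 5


def web_jaccard(h_a, h_b, h_ab, c=DEFAULT_C):
    """Jaccard coefficient computed from page counts."""
    if h_ab <= c:
        return 0.0
    return h_ab / (h_a + h_b - h_ab)


def web_overlap(h_a, h_b, h_ab, c=DEFAULT_C):
    """Overlap (Simpson) coefficient computed from page counts."""
    if h_ab <= c:
        return 0.0
    return h_ab / min(h_a, h_b)


def web_dice(h_a, h_b, h_ab, c=DEFAULT_C):
    """Dice coefficient computed from page counts."""
    if h_ab <= c:
        return 0.0
    return 2 * h_ab / (h_a + h_b)


def web_pmi(h_a, h_b, h_ab, n=1e10, c=DEFAULT_C):
    """Pointwise mutual information; n is an assumed number of indexed pages."""
    if h_ab <= c:
        return 0.0
    return math.log2((h_ab / n) / ((h_a / n) * (h_b / n)))
```

For example, with page counts of 100, 200, and 50, `web_jaccard(100, 200, 50)` evaluates 50 / (100 + 200 - 50). Note that the PMI value is not confined to the range 0 to 1, so it would need scaling before being compared directly with the other three scores.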
3.2 Lexical Pattern Extraction
To overcome the drawbacks of using text snippets
directly, we propose a “lexical pattern extraction
algorithm” based on text snippets. The algorithm
finds the semantic relations that exist between the
given words. Lexical pattern extraction has been used
in various natural language processing tasks such as
extracting hypernyms [1], [7], question answering
[10], extracting meronyms [14], and paraphrase
extraction. Lexical patterns are subsequences of a
snippet that satisfy the following criteria.
1. A subsequence must contain exactly one
occurrence of each of A and B.
2. The maximum length of a subsequence is L words.
3. A subsequence may skip one or more words.
However, the number of words skipped consecutively
must be less than g.
4. Only negation contractions in a context are
expanded.
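The four criteria above can be sketched as a small enumeration over one snippet. This is a minimal illustration and not the authors' algorithm: the function name `extract_patterns`, the placeholders X and Y for the two words, the tokenization, the contraction rules, and the default values of `max_len` (L) and `max_gap` (g) are all assumptions.

```python
import re


def extract_patterns(snippet, word_a, word_b, max_len=5, max_gap=2):
    """Return lexical patterns (space-joined word subsequences) from one snippet."""
    text = snippet.lower()
    # Criterion 4: expand negation contractions (crude, illustrative rules).
    for short, full in {"can't": "can not", "won't": "will not"}.items():
        text = text.replace(short, full)
    text = re.sub(r"(\w+)n't\b", r"\1 not", text)

    # Replace the two query words with the placeholders X and Y.
    tokens = []
    for tok in re.findall(r"[a-z0-9]+", text):
        if tok == word_a.lower():
            tokens.append("X")
        elif tok == word_b.lower():
            tokens.append("Y")
        else:
            tokens.append(tok)

    patterns = set()

    def extend(indices):
        words = [tokens[i] for i in indices]
        # Criterion 1: exactly one occurrence of each of X and Y.
        if words.count("X") == 1 and words.count("Y") == 1:
            patterns.add(" ".join(words))
        # Criterion 2: at most max_len words in a subsequence.
        if len(indices) == max_len:
            return
        last = indices[-1]
        # Criterion 3: fewer than max_gap words skipped between neighbours.
        for nxt in range(last + 1, min(last + max_gap, len(tokens) - 1) + 1):
            if tokens[nxt] in ("X", "Y") and tokens[nxt] in words:
                continue  # a second X or Y would violate criterion 1
            extend(indices + [nxt])

    for start in range(len(tokens)):
        extend([start])
    return patterns
```

Calling `extract_patterns("A gem is a jewel", "gem", "jewel")` yields patterns such as "X is a Y" and "X is Y", while "X Y" is rejected because it would skip two consecutive words.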
3.3 Lexical Pattern Clustering
The extracted lexical patterns are clustered based on
their similarity to the existing clusters. Each
cluster contains patterns that express similar
semantic relations. Algorithm 1 returns such
clusters. The clusters are sorted so that the most
useful clusters are at the top.
Steps for pattern clustering
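The steps of Algorithm 1 are not reproduced in this text. As a hedged illustration only, the sketch below shows one common way to cluster patterns sequentially: each pattern is compared with the existing clusters and either joins the most similar one or starts a new cluster. The representation of a pattern as a vector of word-pair frequencies, the cosine similarity, the threshold `theta`, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_patterns(pattern_vectors, theta=0.5):
    """Greedy sequential clustering.

    pattern_vectors maps each pattern to a dict of word pair -> frequency.
    Returns a list of clusters, most frequent first.
    """
    # Visit frequent patterns first so they seed the clusters.
    ordered = sorted(pattern_vectors,
                     key=lambda p: sum(pattern_vectors[p].values()),
                     reverse=True)
    clusters = []
    for pattern in ordered:
        vec = pattern_vectors[pattern]
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > theta:
            # Join the most similar cluster and update its (summed) centroid.
            best["patterns"].append(pattern)
            for key, value in vec.items():
                best["centroid"][key] = best["centroid"].get(key, 0.0) + value
        else:
            clusters.append({"patterns": [pattern], "centroid": dict(vec)})
    # Put the clusters with the highest total frequency at the top.
    clusters.sort(key=lambda c: sum(c["centroid"].values()), reverse=True)
    return clusters
```

The centroid is kept as a sum rather than a mean because cosine similarity does not depend on vector length. A higher `theta` produces more, tighter clusters.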
4. TRAINING WITH SVM
A two-class SVM is trained with both synonymous
and non-synonymous word pairs generated from
WordNet. Word pairs are extracted for 3000 words,
and the total number of words in the training data
is 6000. Lexical patterns are then extracted subject
to a specified threshold. The extracted lexical
patterns are clustered and given to the SVM. The SVM
acts on both the word co-occurrence measures and the
pattern clusters in order to calculate the semantic
similarity between two given words.
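One way to picture the two inputs of the SVM is as a single feature vector per word pair. The sketch below is an assumption about how such a vector could be assembled and is not taken from this paper: the four co-occurrence scores come first, followed by one feature per pattern cluster, computed as the share of the pair's pattern occurrences that fall in that cluster. The function and parameter names are hypothetical.

```python
def build_feature_vector(cooccurrence_scores, pattern_counts, clusters):
    """Combine page-count scores and pattern-cluster features for one word pair.

    cooccurrence_scores: four floats (WebJaccard, WebOverlap, WebDice, WebPMI).
    pattern_counts: dict mapping a lexical pattern to its frequency in the
        snippets retrieved for this word pair.
    clusters: list of clusters, each a list of lexical patterns.
    """
    total = sum(pattern_counts.values())
    cluster_features = []
    for cluster in clusters:
        hits = sum(pattern_counts.get(pattern, 0) for pattern in cluster)
        # Normalize so that word pairs with many snippets do not dominate.
        cluster_features.append(hits / total if total else 0.0)
    return list(cooccurrence_scores) + cluster_features
```

Vectors built this way for synonymous and non-synonymous pairs could then be passed, together with their class labels, to any two-class SVM implementation.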
5. EXPERIMENTAL RESULTS
The experimental results consist of the semantic
similarities computed between two given words by
using the SVM together with the page counts and text
snippets retrieved from search engines for those words.
Fig. 3: Home page
A keyword has to be entered in the given text box to search in the search engine. For example, “google”
and “opera” are the words searched for in Fig. 4 and
Fig. 6.
Fig. 4: Entering the word “google” to search
When the search button is clicked, the page counts
and snippets are displayed as the result. For example, the page
counts and snippets for “google” and “opera” are
shown in Fig. 5 and Fig. 7.
Fig. 5: Page counts and snippets retrieved for the word “google”
Fig. 6: Entering the word “opera” to search
Fig. 7: Page counts and snippets retrieved for the word “opera”
Two words have to be entered to measure the semantic similarity between them. The measurement ranges from 0 to 1. For
the words “google” and “opera”, the semantic similarity
is 0.8.
Fig. 8: Semantic similarity between “google” and
“opera” is 0.8
For various word pairs, we can measure the semantic similarity between them. The result is close to 1 when the
words are semantically close and close to 0 when they are
not. The output is shown in the form of tables and
graphs as follows.
Table 1: Semantic similarities for various word pairs
Graph 1: Semantic similarities for various word pairs
6. CONCLUSION
We used the results that a web search engine returns
for two words and proposed a semantic similarity
measure based on the page counts and text snippets
returned by a web search engine such as Google. The
aim of this paper is to measure the semantic
similarity between any two given words with the
utmost accuracy. To achieve this, techniques such as
pattern extraction and pattern clustering are
introduced. These algorithms help in finding various
relationships between words. An SVM was trained with
the relationships identified between the given words.
The experiments were carried out with synonymous and
non-synonymous word pairs collected from WordNet
synsets. The experimental results show that the
proposed method performs far better than the existing
approaches used to measure semantic similarity
between words.
7. REFERENCES
[1] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proc. of 3rd Text REtrieval Conference, pages 69-80, 1994.
[2] D. Bollegala, Y. Matsuo, and M. Ishizuka. Disambiguating personal names on the web using automatically extracted key phrases. In Proc. of the 17th European Conference on Artificial Intelligence, pages 553-557, 2006.
[3] D. R. Cutting, J. O. Pedersen, D. Karger, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings SIGIR '92, pages 318-329, 1992.
[4] D. Lin. An information-theoretic definition of similarity. In Proc. of the 15th ICML, pages 296-304, 1998.
[5] D. Lin. Automatic retrieval and clustering of similar words. In Proc. of the 17th COLING, pages 768-774, 1998.
[6] F. Keller and M. Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459-484, 2003.
[7] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28, 1998.
[8] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proceedings of the International Conference on Digital Libraries, 2005.
[9] J. Curran. Ensemble methods for automatic thesaurus extraction. In Proc. of EMNLP, 2002.
[10] J. Mori, Y. Matsuo, and M. Ishizuka. Extracting keyphrases to represent relations in social networks from web. In Proc. of 20th IJCAI, 2007.
[11] M. Fleischman and E. Hovy. Multi-document person name resolution. In Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, 2004.
[12] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proc. of 14th COLING, pages 539-545, 1992.
[13] M. Lapata and F. Keller. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):1-31, 2005.
[14] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proc. of 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 206-214, 1998.
[15] P. Cimano, S. Handschuh, and S. Staab. Towards the self-annotating web. In Proc. of 13th WWW, 2004.
[16] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In Proceedings of the World Wide Web Conference (WWW), pages 463-470, 2005.
[17] Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In Proceedings of 15th International World Wide Web Conference, 2006.
8. ABOUT THE AUTHORS
Manasa.Ch received the M.C.A. degree from
Kamala Institute of Technology and Science,
Huzurabad, Karimnagar, A.P., India. She is currently
pursuing an M.Tech in Computer Science and
Engineering at SR Engineering College, Warangal,
India. Her research interests include Knowledge and
Data Engineering. She has participated in the
ISTE-approved National Conference on “Mobile
Communications and Data Engineering” at VITS,
Karimnagar, A.P., and in the Women Student Congress
at NIT, Warangal, organized by the IEEE WIE student
branch.
V.Ramana received the B.Tech (CSE) degree from
JNTU, Hyderabad, in the year 2006 and the M.Tech (AI)
degree from the University of Hyderabad in the year
2010. He has 2 years of teaching experience. His
areas of interest are Artificial Intelligence and
Machine Learning. He has published papers in
International Journal, International Conference and
National Conference and has attended National
Workshops/FDP/Seminars. He is a member of CSI.
S.P.Anandaraj received the B.E. (CSE) degree from
Madras University, Chennai, in the year 2004 and the
M.Tech (CSE) degree with Gold Medal from Dr. MGR
Educational and Research Institute, University, in
the year 2007 (Distinction with Honors). He is now
pursuing a Ph.D. at St. Peter’s University, Chennai.
He has 8 years of teaching experience. His areas of
interest are Information Security and Sensor
Networks. He has published papers in International
Journal, International Conference and National
Conference and has attended nearly 15 National
Workshops/FDP/Seminars. He is a member of ISTE, CSI,
IEEE, IACSIT, and IAENG.