Semantic Relatedness based on Searching Engines


Ning Yang, Lei Guo, Jun Fang and Xiaoyu Chen
Dept. of Automation, Northwestern Polytechnical University, Xi'an, P.R. China

Abstract-Semantic relatedness plays an important role in Natural Language Processing because it imitates human judgments. The most popular recent methods depend on the hierarchy of WordNet, which restricts both the objects of calculation and the language, and their accuracy is still unsatisfactory. This paper presents a new semantic relatedness measure that considers the number of web pages returned when the measured words are searched, separately and in combination, in a search engine. The underlying assumption is that words appearing in the same web page have some relatedness. The search engine relatedness method has no limitation on part of speech or language. Experimental results show that the method outperforms the current leading methods, with accuracy above eighty percent.

    Keywords- semantic similarity, semantic relatedness, algorithm, web search

I. INTRODUCTION

Semantic relatedness and similarity are important directions in Natural Language Processing; they matter in many applications such as word sense disambiguation [1], information retrieval [2] and spelling correction [3]. Furthermore, integrating semantic relatedness into the network can make it more intelligent, so that the network better fulfills people's requirements.

We should emphasize that semantic relatedness and semantic similarity are two different conceptions. Semantic relatedness covers any kind of lexical (metonymy, antonymy and so on) or functional association that may exist between two words; it can be estimated from co-occurrence statistics in corpora. Semantic similarity, a special case of relatedness, refers to a relationship between concepts based on information found in an is-a hierarchy. Semantically similar concepts are typically defined via the lexical relation of synonymy [4]. The computation of semantic relatedness can be more complicated, whereas semantic similarity can be read fairly directly from a hierarchy such as WordNet. Because semantic relatedness is widely used in many applications, our research focuses on it in this paper.

Investigations indicate that most current semantic relatedness measures are based on WordNet. WordNet is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. Every synset contains a group of synonymous words or collocations; different senses of a word are in different synsets.

Currently, there are several classic methods for semantic relatedness, such as the path-based measure [5], the information content measure [6, 7], the gloss-based measure [8] and the context vector measure [9]. However, most of these measures rely on a hierarchical corpus (WordNet), and this requirement on corpus structure inevitably limits them. For example, all of the measures except the context vector measure can handle only English nouns. The gloss-based method is not suitable for precise computation because a gloss offers little information, and the context vector method achieves high precision only at the cost of abundant computation.

To overcome the above limitations, this paper presents a novel semantic relatedness calculation method based on search engines. It is well known that the results returned by a search engine are very close to human judgment, and words appearing in the same web page have some relatedness, so we can use these results to measure semantic relatedness. We implemented our method by invoking the open APIs provided by search engines. Experimental results show that our method achieves relatively high precision; the correlation with human judgments can be as high as 80% on both English and Chinese datasets.

    II. RELATED CLASSIC ALGORITHMS

2.1 Path-based Method

1) Leacock and Chodorow. The measure of Leacock and Chodorow is related to that of Rada et al. in that it is based on the length of the shortest path between noun concepts in an is-a hierarchy. The shortest path is the one that includes the fewest intermediate concepts. The measure is formulated as follows:

sim_lch(c1, c2) = max[-log(length(c1, c2) / (2 * D))]   (1)

where length(c1, c2) is the shortest path length (i.e., the path having the minimum number of nodes) between the two concepts and D is the maximum depth of the taxonomy.
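As a minimal illustration of Eq. (1), the Java sketch below assumes the shortest is-a path length and the taxonomy depth D have already been computed (for instance from WordNet); the max over candidate sense pairs is omitted, and the numbers in main are made up:

    // Minimal sketch of Eq. (1). The shortest is-a path length and the
    // taxonomy depth D are assumed precomputed; names are illustrative.
    public final class LeacockChodorow {
        /**
         * @param pathLength number of nodes on the shortest is-a path (>= 1)
         * @param maxDepth   maximum depth D of the taxonomy (>= 1)
         */
        public static double similarity(int pathLength, int maxDepth) {
            if (pathLength < 1 || maxDepth < 1) {
                throw new IllegalArgumentException("arguments must be >= 1");
            }
            // -log(length / (2 * D)); larger values mean more similar concepts
            return -Math.log((double) pathLength / (2.0 * maxDepth));
        }

        public static void main(String[] args) {
            // Example: path of length 3 in a taxonomy of depth 16
            System.out.println(similarity(3, 16)); // ~ 2.37
        }
    }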

    2) Hirst and St.Onge. Hirst and St.Onge introduce a measure of relatedness that considers many other relations in WordNet beyond is-a relations. The measure has four levels of relatedness: extra strong, strong, medium strong and weak. An extra strong relation is based on the surface form of the

words. The medium-strong relation is determined by a set of allowable paths between concepts. If a path that is neither too long nor too winding exists, then there is a medium-strong relation between the concepts. The score given to a medium-strong relation considers the path length between the concepts and the number of changes in direction of the path:

path_weight = C - path_length - (k * number_of_changes_in_direction)   (2)

    Following Budanitsky and Hirst, we set C to 8 and k to 1. The value of strong relations is defined to be 2C. Thus, two concepts that exhibit a strong relation will receive a score of 16, while two concepts with a medium-strong relation will have a maximum score of 8, and two concepts that have no relation will receive a score of zero [2].
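A small sketch of this scoring scheme with the constants above (C = 8, k = 1); finding the allowable paths in WordNet is assumed to have been done elsewhere, so the path statistics passed in are illustrative:

    // Sketch of the Hirst-St.Onge scoring levels with C = 8 and k = 1.
    // Finding allowable paths in WordNet is assumed done elsewhere; this
    // only turns a found path into a relatedness score.
    public final class HirstStOnge {
        static final int C = 8;
        static final int K = 1;

        /** Strong relations are defined to score 2 * C = 16. */
        public static int strongScore() {
            return 2 * C;
        }

        /** Eq. (2): medium-strong score for an allowable path. */
        public static int mediumStrongScore(int pathLength, int directionChanges) {
            return C - pathLength - K * directionChanges;
        }

        public static void main(String[] args) {
            System.out.println(strongScore());           // 16
            System.out.println(mediumStrongScore(0, 0)); // 8, the medium-strong maximum
            System.out.println(mediumStrongScore(3, 1)); // 4
        }
    }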

2.2 Information Content Method

This measure is based on the assumption that the frequency counts associated with concepts higher up in the is-a hierarchy are always greater than or equal to those of concepts lower down in the hierarchy.

1) Jiang and Conrath. The principle of Jiang and Conrath's method is that the distance between two concepts can be scaled by the difference between the information content of the individual concepts and that of their subsuming concept. The formulation is as follows:

dist_jcn(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2))   (3)

Since the measure computes a distance between concepts, the more related two concepts are, the lower the score they get. To be consistent with the other methods, we take the reciprocal of the result:

rel_jcn(c1, c2) = 1 / dist_jcn(c1, c2)   (4)

2) Lin. Lin measures the ratio of the information content needed to state the commonality of the two concepts, as represented by their lowest common subsumer, to the amount of information needed to describe them individually. The similarity is computed as:

sim_lin(c1, c2) = 2 * IC(lcs(c1, c2)) / (IC(c1) + IC(c2))   (5)

The information content measure takes the depth of the hierarchy structure into account. It computes the relatedness between two concepts according to their frequency in a corpus, so the method works wherever a frequency model is available. Its potential limitation is that quite a few concept pairs might share the same least common subsumer and will therefore be assigned identical similarity values.
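The sketch below shows Eqs. (3)-(5) given precomputed information content values; the IC values and the least common subsumer lookup are assumed to come from a frequency model as described above, and the numbers in main are invented:

    // Sketch of the information-content measures, Eqs. (3)-(5). IC values
    // and the least-common-subsumer lookup are assumed to be precomputed
    // from corpus frequencies; the inputs below are illustrative.
    public final class InformationContent {

        /** Eq. (3): Jiang-Conrath distance. */
        public static double jcnDistance(double icC1, double icC2, double icLcs) {
            return icC1 + icC2 - 2.0 * icLcs;
        }

        /** Eq. (4): reciprocal of the distance, so higher means more related. */
        public static double jcnRelatedness(double icC1, double icC2, double icLcs) {
            return 1.0 / jcnDistance(icC1, icC2, icLcs);
        }

        /** Eq. (5): Lin similarity, in [0, 1] for consistent IC values. */
        public static double linSimilarity(double icC1, double icC2, double icLcs) {
            return 2.0 * icLcs / (icC1 + icC2);
        }

        public static void main(String[] args) {
            double ic1 = 7.2, ic2 = 6.8, icLcs = 5.5; // hypothetical IC values
            System.out.println(jcnRelatedness(ic1, ic2, icLcs)); // 1 / 3.0
            System.out.println(linSimilarity(ic1, ic2, icLcs));  // 11.0 / 14.0
        }
    }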

    293

2.3 Gloss-based Method

The gloss overlap measure and the extended gloss overlap measure both belong to the gloss-based measures. These measures determine how related two concepts are by counting the number of shared words (overlaps) in the glosses of the concepts, as well as in the glosses of words that are related to those concepts according to the dictionary. We mainly discuss the extended gloss overlap measure in this paper.

The extended gloss overlap measure extends the glosses of the target concepts by finding overlaps not only between their own definitions but also among the glosses of concepts to which they are related. The measure eases the problem of short glosses to a certain extent; moreover, extended gloss overlaps can find relations between adjectives and verbs as well (plain gloss overlaps can only handle computation between nouns). The main principle of this method is that multiple-word matches are scored higher than single-word matches: the square of the number of consecutive words matched is added to the score of the gloss pair.

Analyzing two concepts c1 and c2: first consider c1 and a set C1 of glosses corresponding to c1 (initially empty). The gloss of concept c1 is added to the set C1, and all glosses connected to c1 (corresponding to all the WordNet relations) are added to C1. Similarly we create a set C2 for the second concept c2. To find the relatedness of c1 and c2, the gloss overlap scores of each gloss in C1 with each gloss in C2 are added up, and the sum is the semantic relatedness of concepts c1 and c2. This measure has a lower bound of 0 and no upper bound [9].
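As an illustration of the scoring rule, the following simplified sketch adds the square of each maximal run of consecutive shared words; it is a rough approximation of the extended gloss overlap scoring (a greedy left-to-right scan), not the full measure:

    import java.util.Arrays;
    import java.util.List;

    // Simplified sketch of the overlap scoring rule: each maximal run of
    // consecutive shared words contributes the square of its length.
    public final class GlossOverlap {

        public static int overlapScore(String gloss1, String gloss2) {
            List<String> w1 = Arrays.asList(gloss1.toLowerCase().split("\\s+"));
            String padded2 = " " + String.join(" ", gloss2.toLowerCase().split("\\s+")) + " ";
            int score = 0;
            int i = 0;
            while (i < w1.size()) {
                // Grow the longest phrase starting at i that also occurs in gloss2.
                int len = 0;
                while (i + len < w1.size()) {
                    String phrase = " " + String.join(" ", w1.subList(i, i + len + 1)) + " ";
                    if (!padded2.contains(phrase)) break;
                    len++;
                }
                if (len > 0) {
                    score += len * len; // square of consecutive words matched
                    i += len;
                } else {
                    i++;
                }
            }
            return score;
        }

        public static void main(String[] args) {
            // "ice cream" matches as a 2-word run (4), "cold" as 1 word (1).
            System.out.println(overlapScore("a cold ice cream treat",
                                            "ice cream served cold")); // 5
        }
    }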

2.4 Context Vector Method

The context vector measure is common in the information retrieval and natural language processing areas. The first-order co-occurrence matrix is the most familiar one; it is constructed from the concepts that co-occur with the target concept in a given corpus. Considering concepts c1 and c2, we start by creating word vectors, which are first-order context vectors. The dimensions of these vectors are content words from the same corpus of text (each dimension corresponding to one content word). The vector for a concept c is created as follows: 1) initialize the first-order context vector to a zero vector; 2) find every occurrence of c in the given corpus; 3) for each occurrence of c, increment by 1 those dimensions of the vector which correspond to the words present in a specified window of context around c.

    The semantic relatedness of two concepts c1 and c2 is then computed as the cosine of the angle between their context vectors:

rel(c1, c2) = cos(v1, v2) = (v1 · v2) / (|v1| * |v2|)   (6)

where v1 and v2 are the context vectors corresponding to c1 and c2, respectively.
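The following sketch builds first-order context vectors from a toy corpus with a fixed context window and compares them with the cosine of Eq. (6); the corpus, window size and target words are made up for illustration:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the context vector measure: build first-order context
    // vectors with a fixed window, then compare them via Eq. (6).
    public final class ContextVector {
        static final int WINDOW = 2; // words on each side counted as context

        static Map<String, Integer> contextVector(String[] corpus, String target) {
            Map<String, Integer> vec = new HashMap<>();
            for (int i = 0; i < corpus.length; i++) {
                if (!corpus[i].equals(target)) continue;
                int lo = Math.max(0, i - WINDOW);
                int hi = Math.min(corpus.length - 1, i + WINDOW);
                for (int j = lo; j <= hi; j++) {
                    if (j != i) vec.merge(corpus[j], 1, Integer::sum);
                }
            }
            return vec;
        }

        static double cosine(Map<String, Integer> v1, Map<String, Integer> v2) {
            double dot = 0, n1 = 0, n2 = 0;
            for (Map.Entry<String, Integer> e : v1.entrySet()) {
                dot += e.getValue() * v2.getOrDefault(e.getKey(), 0);
                n1 += e.getValue() * e.getValue();
            }
            for (int x : v2.values()) n2 += x * x;
            return (n1 == 0 || n2 == 0) ? 0 : dot / (Math.sqrt(n1) * Math.sqrt(n2));
        }

        public static void main(String[] args) {
            String[] corpus = ("the bank approved the loan the river bank was muddy "
                    + "the loan office near the river").split(" ");
            System.out.println(cosine(contextVector(corpus, "bank"),
                                      contextVector(corpus, "loan")));
        }
    }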

The context vector measure is able to compute the relatedness between concepts from any part of a given corpus; it does not depend on a strict hierarchy structure, and it comprehensively reflects the similarity and difference between concepts in syntax, word sense and word usage. However, the method depends heavily on training data, which can affect the computed results; the computation is heavy and the method is quite complex, which makes it easily disturbed by sparse data and noise.

    III. SEMANTIC RELATEDNESS BASED ON SEARCHING ENGINE

3.1 Basic principle of the method

Our algorithm uses the statistical information of a search engine to compute semantic relatedness. It treats the open web resources as a corpus and makes the following reasonable assumption.

Assumption: if two words appear together in one web page, there is some relatedness between them.

According to this assumption, the more web pages the two concepts share, the more related they are. The page counts can be obtained through a search engine's open APIs. The concrete steps of our method are as follows.

First, we assign the target words to the strings word1 and word2. Then, we call the API of a search engine to perform keyword searches for word1 and word2, which return the numbers of matching web pages (from the separate searches and from the joint search). Finally, these counts are put into Eq. (7) to obtain the relatedness between the two concepts.

rel = hits(w1 + w2) / min(hits(w1), hits(w2))   (7)

where hits(w1 + w2) is the number of pages returned by the joint search for word1 and word2, and min(hits(w1), hits(w2)) is the minimum number of pages returned by the separate searches. The semantic relatedness calculated by Eq. (7) lies between 0 and 1.

As long as the two keywords word1 and word2 are related, the result pages from the separate searches and the joint search are associated to a certain extent; the closer the relatedness value calculated by Eq. (7) is to 1, the more closely the two words are related. Eq. (7) uses the minimum of the separate search counts rather than alternatives such as hits(w1) x hits(w2) in order to avoid the following phenomenon: one word (e.g., w1) has many hits while the other (e.g., w2) has few, yet most appearances of w2 are accompanied by w1, meaning the two are strongly related. Under such circumstances the result would be incorrect if we chose the product of hits(w1) and hits(w2).

Noise pages on the web can affect the computed results. This was confirmed by our experiments, where the value computed by Eq. (7) was slightly smaller than human judgment. In spite of that, the whole idea of using a search engine is going the right way, and Eq. (7) can be adjusted to achieve a better result. After many experiments we found that taking the cube root of the original value brings the result very close to human judgment. The modified formula is as follows:


rel = [hits(w1 + w2) / min(hits(w1), hits(w2))]^(1/3)   (8)
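A minimal sketch of Eqs. (7) and (8), reading the adjustment as a cube root; SearchEngine is a hypothetical stand-in for a real hit-count API (the paper invokes Google's open API), and the counts in main are invented:

    import java.util.Map;

    // Sketch of Eqs. (7) and (8). SearchEngine is a stand-in for a real
    // hit-count API; the counts returned below are invented for the demo.
    public final class SearchRelatedness {

        /** Placeholder for a search engine that reports result-page counts. */
        interface SearchEngine {
            long hits(String query);
        }

        /** Eq. (8): joint hits over the smaller separate count, cube-rooted. */
        public static double relatedness(SearchEngine engine, String w1, String w2) {
            long joint = engine.hits(w1 + " " + w2);
            long min = Math.min(engine.hits(w1), engine.hits(w2));
            if (min == 0) return 0.0; // no pages for one word: treat as unrelated
            return Math.cbrt((double) joint / min);
        }

        public static void main(String[] args) {
            // Hard-coded counts standing in for real API responses.
            Map<String, Long> counts = Map.of(
                    "car", 900_000L,
                    "automobile", 120_000L,
                    "car automobile", 60_000L);
            SearchEngine fake = q -> counts.getOrDefault(q, 0L);
            System.out.println(relatedness(fake, "car", "automobile")); // cbrt(0.5) ~ 0.79
        }
    }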

3.2 Proposed Algorithm

Considering that Java is platform independent and has strong network support, we chose Java as the implementation language. We use the Google search engine as the foundation of our algorithm for its abundant open API resources; its API supports many data sources such as websites, pictures, news and videos. Since the algorithm is universally applicable, other search engines with open APIs could do the job as well.

3.3 Advantages and disadvantages of the algorithm

Semantic relatedness based on search engines is a novel and successful method, according to the experimental results described later. Its results are far superior to those of the other methods when applied to the wide-ranging WordSimilarity-353 data. The experimental values are also quite close to the human judgments obtained for our own Chinese dataset (concept group and unity group). Besides these advantages, there are still some drawbacks that we can improve or optimize, which is our primary task for the future. A summary of advantages and disadvantages is given in Table 1.

Figure 1. Flowchart of the algorithm.

TABLE I. PROS AND CONS OF ALL THE METHODS

Searching engine method
  Advantages: can compare relatedness between different languages, between a concept and an entity, and between verbs, nouns, adjectives and so on; results are up to date; simple algorithm; easy to understand.
  Disadvantages: needs network support; weak for synonym calculation.

Path-based method
  Advantages: good calculation results with a dictionary of a given hierarchy; easy to handle.
  Disadvantages: the connection types between concepts must be the same; the path lengths must be the same; limited to the calculation of WordNet nouns.

Gloss-based method
  Advantages: can compute relatedness over any corpus, without a dictionary with a specific hierarchy.
  Disadvantages: overly short glosses cause imprecise calculation results.

Information content method
  Advantages: independent of a dictionary with a hierarchy; applicable wherever there is a probability model.
  Disadvantages: many concept pairs share the same least common subsumer, so their calculated results are identical; limited to the calculation of WordNet nouns.

Context vector method
  Advantages: can compute relatedness without a strict hierarchy structure.
  Disadvantages: depends on the corpus; large amount of calculation; easily disturbed by data noise; limited to the calculation of WordNet nouns.

IV. EXPERIMENTAL SETUP

To evaluate the precision of the search engine algorithm, we organized two groups of experiments. The first is an English dataset experiment using the standard WordSimilarity-353 dataset (351 pairs of nouns); it compares our method with the others on the same corpus. The second is a Chinese dataset experiment in which we collected 100 pairs of Chinese phrases (50 pairs in a concept group and 50 pairs in a unity group); it shows that our method can deal with non-English data and achieve relatively high accuracy, while the other methods cannot compute on a Chinese dataset.

The following classical relatedness measures are compared with our method: Hirst and St.Onge [5] (hso), Jiang and Conrath [6] (jcn), Lin [7] (lin), Lesk [8] (lesk), path [5], and vector_pairs [9]. All these methods are implemented in the WordNet::Similarity package.

First, all the above classical methods are normalized so that they can be compared with each other more easily. For hso we use hso_rel = hso(w1, w2) / 16; for lesk, lesk_rel = lesk(w1, w2) / min(lesk(w1, w1), lesk(w2, w2)); and for lin the formula is lin_rel(w1, w2) = 2 * IC(lcs(w1, w2)) / (IC(w1) + IC(w2)).
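A small sketch of the hso and lesk normalizations above, assuming the raw scores have already been obtained from the WordNet::Similarity package; the inputs are placeholders:

    // Sketch of the score normalizations; raw scores are assumed to come
    // from the WordNet::Similarity package, so the inputs are placeholders.
    public final class Normalize {

        /** hso scores range up to 2C = 16, so divide by 16. */
        public static double hsoRel(double hsoScore) {
            return hsoScore / 16.0;
        }

        /** lesk is normalized by the smaller of the two self-overlap scores. */
        public static double leskRel(double leskW1W2, double leskW1W1, double leskW2W2) {
            return leskW1W2 / Math.min(leskW1W1, leskW2W2);
        }

        public static void main(String[] args) {
            System.out.println(hsoRel(8.0));                // 0.5
            System.out.println(leskRel(30.0, 120.0, 90.0)); // 30 / 90
        }
    }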

To compare with human judgments more conveniently, we divided the 0-1 relatedness range into three parts: irrelevant (0 <= rel < 0.2), normally related (0.2 <= rel < 0.88), and closely related (rel >= 0.88). If there is little relation between two concepts, people assign a very small relatedness value, usually smaller than 0.2; if two words are closely related, people assign a very high score, say 0.88. So we set 0.88 as the threshold, and 0.88-1 is the closely related range.
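For illustration, the three bands can be read as a simple classification step, using the 0.2 and 0.88 thresholds defined above:

    // Bucketing a relatedness value into the three bands defined above.
    public final class RelatednessBand {
        public static String classify(double rel) {
            if (rel < 0.2) return "irrelevant";
            if (rel < 0.88) return "normally related";
            return "closely related";
        }

        public static void main(String[] args) {
            System.out.println(classify(0.1));  // irrelevant
            System.out.println(classify(0.5));  // normally related
            System.out.println(classify(0.93)); // closely related
        }
    }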

4.1 English dataset experiment

All word pairs in the WordSimilarity-353 data are in English, and most of them are concept pairs. Comparing with human judgment according to the three bands defined above, we obtained the correlations between lesk, jcn, hso, path, lin, vector_pairs and human judgment. The results are: lesk 0.164, jcn 0.204, hso 0.363, path 0.397, lin 0.487, vector_pairs 0.532.

The result of the search engine method (referred to as the Google method in this paper) is 0.816. This is the best result: the Google-based measure excels all the measures contained in the WordNet::Similarity package.

The correlations of all methods with human judgment are shown in Table 2:

TABLE II. THE EXPERIMENTAL RESULTS

measure                           Hso    Lesk   Jcn    Path   Lin    vector_pairs  Google
correlation with human judgment   0.363  0.164  0.204  0.397  0.487  0.532         0.816

4.2 Chinese dataset experiment

Since there is no standard Chinese dataset for semantic relatedness, we created our own. It contains 100 pairs of Chinese phrases: 50 pairs in a concept group and 50 pairs in a unity group. After calculating relatedness with the Google measure, we compared the results with human-given reference values. The concept group has a correlation of 0.86 with the reference values, while the unity group has a correlation of 0.88.

We also observe that the Google measure is dynamic and up-to-date, in contrast with the invariance of the other methods based on WordNet. The Google measure works well with new words that people currently care about; for one such newly popular Chinese word pair in our dataset the relatedness value is 0.89, which is closely related. Since WordNet only contains English words, the other methods cannot deal with the Chinese dataset.

TABLE III. THE EXPERIMENTAL RESULTS OF GOOGLE

Google measure                      Concept group   Unity group
correlation with reference value    0.86            0.88

4.3 Discussion

According to the experimental results on the English dataset, the Google measure has the best precision among all the methods, which clearly shows that it excels all of them. The problems of the other methods are probably the following. First, path-based measures rely on a hierarchical corpus (WordNet), and this requirement on corpus structure inevitably limits them. Second, although the gloss-based and context vector methods do not rely on the hierarchical structure, the gloss-based method cannot satisfy the demand for precise computation because a gloss offers little information, while the context vector method achieves high precision only with abundant computation. Third, they all depend on WordNet, which contains no unity pairs like (Maradona, football), and this affects the results significantly.

Meanwhile, because WordNet only contains English words, the other methods cannot deal with non-English datasets either. The experimental results show that the search engine method works well on the Chinese dataset.

In a word, most current methods are based on the hierarchy of a lexicon such as WordNet, which restricts the objects and the language of calculation. In contrast, our method can measure semantic relatedness between words of different parts of speech and in non-English languages, and the computed relatedness is up to date.

    V. CONCLUSIONS AND FUTURE WORK

In this paper, we reviewed the current measures for computing semantic relatedness. Analyzing their pros and cons helped us develop a new method: semantic relatedness based on search engines. This new method uses the statistical information of search engines to compute semantic relatedness, and its results are very close to human judgment.

Although the search engine method excels the others, there is still room for improvement in some aspects. For example, we could combine the search engine method with the context vector method: retrieve the several top-ranked web pages and form them into context vectors that participate in the computation. This could improve the performance of the method significantly.


    ACKNOWLEDGEMENT

This paper is partially supported by the Ph.D. Programs Foundation of the Ministry of Education of China (No. 20096102120037).

REFERENCES

[1] Saif Mohammad and Graeme Hirst. Distributional Measures as Proxies for Semantic Relatedness. Kluwer Academic Publishers, Netherlands, 2005, pp. 32-35.

[2] Ted Pedersen, Satanjeev Banerjee, Siddharth Patwardhan. Maximizing Semantic Relatedness to Perform Word Sense Disambiguation. Preprint submitted to Elsevier Science, 2005, pp. 10-21.

[3] Ricardo Baeza-Yates. Modern Information Retrieval. ACM Press, 1999.

[4] Torsten Zesch, Iryna Gurevych. Automatically creating datasets for measures of semantic relatedness. In: Proceedings of the Workshop on Linguistic Distances, Sydney, July 2006, pp. 16-24.

[5] G. Hirst, D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In: C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998, pp. 305-332.

[6] J. Jiang, D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the International Conference on Research in Computational Linguistics, Taiwan, 1997, pp. 19-33.

[7] D. Lin. Using syntactic dependency as a local context to resolve word sense ambiguity. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, 1997, pp. 64-71.

[8] M. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the 5th Annual International Conference on Systems Documentation, ACM Press, 1986, pp. 24-26.

[9] S. Patwardhan, T. Pedersen. Using WordNet-based context vectors to estimate the semantic relatedness of concepts. In: Proceedings of the EACL 2006 Workshop "Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together", Trento, Italy, 2006, pp. 2-3.