
Open Research Online
The Open University’s repository of research publications and other research outputs

Using Explicit Semantic Analysis for Cross-Lingual Link Discovery

Conference or Workshop Item

How to cite:

Knoth, Petr; Zilka, Lukas and Zdrahal, Zdenek (2011). Using Explicit Semantic Analysis for Cross-Lingual Link Discovery. In: 5th International Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA) at The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), 08-13 Nov 2011, Chiang Mai, Thailand.

For guidance on citations see FAQs.

© Unknown

Version: Accepted Manuscript

Link(s) to article on publisher’s website: http://www.cfilt.iitb.ac.in/clia2011/

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online’s data policy on reuse of materials please consult the policies page.

oro.open.ac.uk


Using Explicit Semantic Analysis for Cross-Lingual Link Discovery

Petr Knoth
KMi, The Open University
p.knoth@open.ac.uk

Lukas Zilka
KMi, The Open University
[email protected]

Zdenek Zdrahal
KMi, The Open University

[email protected]

Abstract

This paper explores how to automatically generate cross-language links between resources in large document collections. The paper presents new methods for Cross-Lingual Link Discovery (CLLD) based on Explicit Semantic Analysis (ESA). The methods are applicable to any multilingual document collection. In this report, we present their comparative study on the Wikipedia corpus and provide new insights into the evaluation of link discovery systems. In particular, we measure the agreement of human annotators in linking articles in different language versions of Wikipedia, and compare it to the results achieved by the presented methods.

1 Introduction

Cross-referencing documents is an essential part of organising textual information. However, keeping links up to date in large, quickly growing document collections is problematic due to the number of possible connections. In multilingual document collections, interlinking semantically related information in a timely manner becomes even more challenging. Suitable software tools that could facilitate the link discovery process by automatically analysing the multilingual content are currently lacking. In this paper, we present new methods for Cross-Lingual Link Discovery (CLLD) applicable across different types of multilingual textual collections.

Our methods are based on Explicit Semantic Analysis (ESA) introduced by Gabrilovich and Markovitch (2007). ESA is a method that calculates the semantic relatedness of two texts by mapping their term vectors to a high dimensional space (typically, but not necessarily, the space of Wikipedia concepts) and by calculating the similarity between these vectors (instead of comparing them directly). The method has received much attention in recent years and it has also been extended to a multilingual version called Cross-Lingual Explicit Semantic Analysis (CL-ESA) (Sorg and Cimiano, 2008). To the best of our knowledge, this method has not yet been applied in the context of automatic link discovery systems.
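As a rough, illustrative sketch of this idea (not the implementation used in this paper), ESA relatedness can be approximated as follows; the pre-built inverted index `term_concept_index`, which maps each term to tf-idf weights over Wikipedia concept articles, and all function names are assumptions introduced for illustration only.

```python
import math
from collections import defaultdict

def esa_vector(text, term_concept_index):
    """Map a text to a weighted vector of Wikipedia concepts.

    term_concept_index maps each term to {concept_id: tf-idf weight of the
    term in that concept's article}; summing these per-term concept vectors
    yields the ESA representation of the text.
    """
    vector = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in term_concept_index.get(term, {}).items():
            vector[concept] += weight
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse {concept_id: weight} vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def esa_relatedness(text_a, text_b, term_concept_index):
    """Semantic relatedness of two texts, computed on their ESA concept
    vectors rather than on the term vectors directly."""
    return cosine(esa_vector(text_a, term_concept_index),
                  esa_vector(text_b, term_concept_index))
```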

Since the CLLD field is relatively young, it is also important to establish a constructive means for evaluating these systems. Our paper provides insight into this problem by investigating the agreement/reliability of man-made links and by presenting a possible approach to the definition of the ground truth, i.e. a gold standard.

The paper makes the following contributions:

(a) It applies Explicit Semantic Analysis to the link discovery and CLLD tasks.

(b) It provides new insights into the evaluation of CLLD systems and into the way people link information in different languages, as measured by their agreement.

2 Related Work

CLLD Methods

Current approaches to link detection can be divided into three groups:

(1) link-based approaches discover new links by exploiting an existing link graph (Itakura and Clarke, 2008; Jenkinson et al., 2008; Lu et al., 2008).

(2) semi-structured approaches try to discover new links using semi-structured information, such as the anchor texts or document titles (Geva, 2007; Dopichaj et al., 2008; Granitzer et al., 2008; Milne and Witten, 2008; Mihalcea and Csomai, 2007).

Page 3: Open Research Onlineoro.open.ac.uk/29584/1/ijcnlp2011.pdf · Using Explicit Semantic Analysis for Cross-Lingual Link Discovery Petr Knoth KMi, The Open University p.knoth@open.ac.uk

(3) purely content-based approaches use only plain text as input. They typically discover related resources by calculating semantic similarity based on document vectors (Allan, 1997; Green, 1998; Zeng and Bloniarz, 2004; Zhang and Kamps, 2008; He, 2008). Some of the mentioned approaches, such as (Lu et al., 2008), combine several of these strategies. To the best of our knowledge, no approach has so far been reported to use Explicit Semantic Analysis to address this task.

The main disadvantage of the link-based and semi-structured approaches is probably the difficulty associated with porting them across different types of document collections. The two well-known solutions to monolingual link detection, Geva’s and Itakura’s algorithms (Trotman et al., 2009), fit in these two categories. While these algorithms have been demonstrated to be effective on a specific Wikipedia set, their performance decreased significantly when they were applied to a slightly different task of interlinking two encyclopedia collections. Purely content-based methods have mostly been found to produce slightly worse results than the two previous classes of methods; however, their advantage is that their performance should remain stable across different document collections. As a result, they can always be used as part of any link discovery system and can even be combined with domain-specific methods that make use of the link graph or semi-structured information. In practice, domain-specific link discovery systems can achieve high precision and recall. For example, Wikify! (Mihalcea and Csomai, 2007) and the link detector presented by Milne and Witten (2008) can be used to identify suitable anchors in text and enrich it with links to Wikipedia by combining multiple approaches with domain knowledge.

In this paper, we present four methods (three purely content-based and one combining the link-based and content-based approaches) for CLLD based on CL-ESA. Measuring semantic similarity using ESA has previously been shown to produce better results than calculating it directly on document vectors using cosine and other similarity measures, and it has also been found to outperform the results that can be obtained by measuring similarity on vectors produced by Latent Semantic Analysis (LSA) (Gabrilovich and Markovitch, 2007). Therefore, the cross-lingual extension of ESA seems a plausible choice.

Evaluation of link discovery systems

The evaluation of link discovery systems is currently problematic as there is no widely accepted gold standard. Manual development of such a standard would be costly, because: (a) the number of possible links is very high even for small collections, (b) the link generation task is subjective (Ellis et al., 1994) and (c) it is not entirely clear how the link generation task should be defined in terms of link granularity (for example, document-to-document links, anchor-to-document links, anchor-to-passage links, etc.). Developing such a CLLD corpus manually would be even more complicated.

As a result, Wikipedia links were extracted and taken as the gold standard (ground truth) in a comparative evaluation in (Huang et al., 2008). The authors admit that Wikipedia links are not perfect (the validity of existing links is sometimes questionable and useful links may be missing), so the comparative evaluation of methods and systems should be considered informative only. For example, it would be naïve to expect that measuring precision/recall characteristics would be accurate.

In this paper we discuss the issues in automatically defining the ground truth for CLLD systems. We take into account the differences in the way people link content in different languages and assess the agreement between the different language versions, with the goal of finding out how well our system performs. Our experiments are conducted on the Wikipedia dataset; however, we use the articles only as a set of documents, abstracting from Wikipedia’s encyclopedic nature.

3 The CLLD methods

This section describes the methods used in our experiments. The whole process of cross-language link detection is shown in Figure 1. The method takes as input a new “orphan” document (i.e. a document that is not linked to other documents) written in the source language and automatically generates a ranked list of documents written in the target language (the suitable link targets for the source document). The task involves two steps: the cross-language step and the link generation step. We have experimented with four different CLLD methods: CL-ESA2Links, CL-ESADirect, CL-ESA2ESA and CL-ESA2Similar, which will be described later on.


Figure 1: Cross-language link discovery process

The names of the methods are derived from the approach applied in the first and the second step. These methods have different characteristics and would be useful in different scenarios.

In the first step, an ESA vector is calculated for each document in the document collection. This results in obtaining a weighted vector of Wikipedia concepts for each document in the target language. The cardinality of the vector is given by the number of concepts (pages) in the target language version of Wikipedia (i.e. about 3.8 million for English, 764,000 for Spanish, etc.). A similar procedure is applied to the orphan document; however, the source language version of ESA is used. The resulting ESA vector is then compared to the ESA vectors that represent documents in the target language collection (CL-ESA approach). A set of candidate vectors representing documents in the target language is acquired as an output of the cross-language step, see Section 3.1.

In the second step, the candidate vectors are taken as a seed and are used to discover documents that are suitable link targets.

Figure 2: CLLD candidates

The four different approaches used in this step distinguish the above-mentioned methods, see Section 3.2.

3.1 The cross-language step

The main rationale for the cross-language step is to find t suitable candidates in the target language that can later be exploited to identify link targets. Target language documents that are semantically similar to the source language document are considered suitable candidates by our methods. To identify such documents, the ESA vector of the source document is compared to the ESA vectors of documents in the target document collection.

Each dimension in an ESA vector expresses the similarity of a document to the given language version of a Wikipedia concept/article. Therefore, the cardinality of the source document vector is different from the cardinality of the vectors representing the documents in the target language collection (Figure 2). In order to calculate the similarity of two vectors, we map the dimensions that correspond to the same Wikipedia concepts in different language versions. In most cases, if a Wikipedia concept is mapped to another language version, there is a one-to-one correspondence between the articles in those two languages. However, there are cases when one page in the source language is mapped to more than one page in the target language and vice versa.¹ For the purpose of similarity calculation, we use the 100 dimensions with the highest weight that are mappable from the source to the target language. The number of candidates to be extracted is controlled by the parameter t. We have experimentally found that its selection has a significant impact on the performance of our methods.

¹ These multiple mappings appear quite rarely, e.g. in 5,889 cases out of 550,134 for Spanish to English and in 2,528 cases out of 163,715 for Czech to English.
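A minimal sketch of this candidate selection step is given below, under the assumption that ESA vectors have already been computed for both collections; `interlanguage_map` (a concept mapping derived from Wikipedia cross-language links) and the other names are illustrative, not the system's actual code.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two sparse {concept_id: weight} vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def cross_language_candidates(source_esa_vector, interlanguage_map,
                              target_esa_vectors, t, top_dims=100):
    """Select the t target-language documents most similar to the orphan document.

    source_esa_vector:  {source_concept_id: weight} for the orphan document.
    interlanguage_map:  {source_concept_id: target_concept_id}, taken from
                        Wikipedia cross-language links (one-to-one in most cases).
    target_esa_vectors: {doc_id: {target_concept_id: weight}} for the target collection.
    """
    # Keep only the top_dims (here 100) highest-weighted dimensions that can
    # be mapped from the source language to the target language.
    mappable = {interlanguage_map[c]: w
                for c, w in source_esa_vector.items() if c in interlanguage_map}
    mapped = dict(heapq.nlargest(top_dims, mappable.items(), key=lambda cw: cw[1]))

    # Rank target-language documents by similarity to the mapped source vector
    # and return the t best candidates as (similarity, doc_id) pairs.
    scored = ((cosine(mapped, vec), doc_id) for doc_id, vec in target_esa_vectors.items())
    return heapq.nlargest(t, scored)
```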


Figure 3: Schematic illustration of the four approaches used by the CLLD methods.

3.2 The link generation step

In the link generation step, the candidate documents are taken and used to produce a ranked list of targets for the original source document. The following approaches, schematically illustrated in Figure 3, are taken by our four methods:

• CL-ESA2Links - This method requires access to the link structure in the target collection. More precisely, the method takes the original orphan document in the source language and tries to link it to an already interlinked target language collection. After applying CL-ESA in the first step, existing links are extracted from the candidate documents. The link targets are then ranked according to their similarity to the source document, i.e. documents that are more similar are ranked higher. This list is then used as a collection of link targets.

• CL-ESADirect - This method applies CL-ESA to the source document and takes the list of candidates directly as link targets.

• CL-ESA2ESA - In this method, the application of CL-ESA is followed by another application of monolingual ESA, which measures the semantic similarity of the candidates with all documents in the document collection, to identify link targets.

• CL-ESA2Similar - Instead of generating the ranked list of link targets using monolingual ESA as in the previous method, which is computationally expensive, we calculate a vector sum from the candidate list of ESA document vectors. We then select the strongly weighted Wikipedia concepts representing these dimensions as the set of targets; a sketch of this step is given after this list. This is equivalent to calculating cosine similarity using tf-idf vectors. Though much quicker, the main disadvantage is that if we wanted to use this method on a collection other than Wikipedia, ESA would have to be used with a different background collection.
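The sketch below illustrates the CL-ESA2Similar step, assuming the candidate ESA vectors come from the cross-language step; the function and parameter names are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def cl_esa2similar_targets(candidate_esa_vectors, n_targets):
    """Rank link targets by summing the ESA vectors of the candidate documents.

    candidate_esa_vectors: list of {concept_id: weight} dicts returned by the
    cross-language step. Since every dimension corresponds to a Wikipedia
    concept (article), the most strongly weighted dimensions of the summed
    vector can be read off directly as link targets.
    """
    summed = defaultdict(float)
    for vector in candidate_esa_vectors:
        for concept, weight in vector.items():
            summed[concept] += weight
    ranked = sorted(summed.items(), key=lambda cw: cw[1], reverse=True)
    return ranked[:n_targets]
```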

All of the methods have different properties. CL-ESA2Links requires knowledge of the link graph in the target document collection. CL-ESA2ESA and CL-ESADirect are two methods that are universal, i.e. they can easily be applied to any document collection. The difference between them is that the former requires significantly fewer document vector comparisons than the latter. CL-ESA2Similar works almost as fast as CL-ESADirect, but it has the disadvantage that ESA has to be used with the specific document collection as a background.

4 The underlying data

Wikipedia has been used as a corpus for the evaluation of the methods. This decision has the following advantages that make it possible for us to test and analyse the methods on a real use case:

• A very large multilingual text collection.

• The articles are well interlinked and the interlinking has been approved by a large community of users.

• A large proportion of articles contain explicit mappings between different language versions.

In our study, we have experimented with the English, Spanish and Czech language versions of Wikipedia. We consider the cases of linking from Spanish to English and from Czech to English, i.e. from a less resourced language to a more resourced one. We believe that this is the more interesting direction for CLLD methods,


as the target language version is more likely to contain relevant information not available in the source language. The language selection has been motivated by the aim to test the methods in two very different environments. The Spanish version is relatively well resourced, containing 764,095 pages (about four times fewer than English); the Czech version is much less resourced, containing 196,494 pages (about four times fewer than Spanish).

5 Evaluation methodology

One of the main obstacles to systematically improving link discovery systems is the difficulty of evaluating the results. Reliable evaluation is problematic for both technical and cognitive reasons. The difficulty in obtaining the “ground truth” for a sufficiently large dataset is caused both by the lack of human resources to manually annotate a very large number of document combinations, and by the inherent subjectivity of the task. As a result, we find it essential to estimate the agreement between annotators and see to what extent the precision and recall characteristics can be measured with respect to interlinked document collections.

We claim that the decision to link two pieces of information is made at the level of semantics, i.e. the annotator has to understand the concepts/ideas described in the two documents to decide whether they should be connected by a link. We claim that this process should be language independent. Thus, an article about London will be related to an article about the United Kingdom regardless of the language the articles are written in.

Therefore, let us define the link generation task in the following way: Given a document² in the source language, find documents in the target language that are suitable link targets for the source document, i.e. there is a semantic relationship between the source document and the linked target documents.

² The term topic is also sometimes used to refer to the document.

Based on the definition, the ground truth for a topic document d is the set of documents that can be considered (semantically) suitable link targets. Though this set is typically unknown to us, we can in our experiment approximate it by taking the existing Wikipedia links as ground truth. Because the Wikipedia link structure has been agreed by a large number of contributing authors, it is likely to be relatively consistent in comparison to content that would be linked just by a single person. To establish the ground truth for the original source document, we can extract all links originating in the source document and pointing to other documents. Since the process of linking information is performed at the semantic level, and is thus language independent, we can enrich our ground truth with link graphs from different language versions of Wikipedia. This makes the ground truth larger, which has two consequences: (1) it increases the reliability of the evaluation, as many relevant links are often omitted (Knoth et al., 2010); (2) it becomes more difficult to achieve high recall.

6 Results

6.1 Experimental setup

The experiment was carried out for two language pairs: Spanish to English and Czech to English. We will denote the source language $L_{source}$ and the target language $L_{target}$. The input for the different CLLD methods consists of two document sets:

• Let $SOURCE_{L_{source}}$ be the set of topic documents, selected as pages that contain a Wikipedia link between different language versions. In our case, 100 pages were selected.

• Let $TARGET_{L_{target}}$ be the collection of documents in the target language from which the link targets are selected. In our case, this collection contains all (3.8 million) Wikipedia pages in English.

The output of the method is a set (ranked list) $LIST_{result} = \langle TARGET_{L_{target}}, score \rangle$. To establish the ground truth we define:

• Let $\rho$ be the mapping from documents in the source language to their target language versions, $\rho : D_{L_{source}} \rightarrow D_{L_{target}}$.

• Let $SOURCE_{L_{target}}$ be the set of topic documents mapped to the target language, $SOURCE_{L_{target}} = \rho(SOURCE_{L_{source}})$.

• Let $\alpha, \beta$ be the mappings from documents to the other documents they link to in the source and target language respectively, $\alpha : D_{L_{source}} \rightarrow D_{L_{source}}$, $\beta : D_{L_{target}} \rightarrow D_{L_{target}}$.


Then we define the ground truth (GT) as the union of the ground truths for different language versions; in this experiment, it is the union of the ground truths for the source and the target language:

$$GT = \alpha(SOURCE_{L_{source}}) \cup \beta(SOURCE_{L_{target}})$$

A generated item $\langle d, score \rangle \in LIST_{result}$ is evaluated as a hit if and only if $d \in GT$.
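Under these definitions, the ground truth construction and the hit-based precision/recall computation could be sketched as follows; all names are illustrative assumptions, and links returned by `alpha` are assumed to be expressed as (mapped to) target-language documents.

```python
def ground_truth(source_doc, rho, alpha, beta):
    """GT = alpha(source_doc) union beta(rho(source_doc)).

    rho maps a source-language document to its target-language version;
    alpha and beta return the set of documents a page links to in the
    source-language and target-language link graphs respectively.
    """
    gt = set(alpha(source_doc))
    target_version = rho(source_doc)
    if target_version is not None:
        gt |= set(beta(target_version))
    return gt

def precision_recall(result_list, gt):
    """result_list is a ranked list of (doc_id, score) pairs;
    a generated item is a hit if and only if its doc_id is in gt."""
    hits = sum(1 for doc_id, _ in result_list if doc_id in gt)
    precision = hits / len(result_list) if result_list else 0.0
    recall = hits / len(gt) if gt else 0.0
    return precision, recall
```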

6.2 Methods evaluation

To investigate the performance of the first part of CLLD, the cross-language step carried out by CL-ESA, we have analysed how well the system finds, for a given topic document in the source language, the duplicate document in the target language. In this step, the system takes a document in the source language and selects from the 3.8 million document set in the target language the documents with the highest similarity. We then check whether the duplicate document ($d = \rho(d_{source})$) appears among the top k retrieved documents. The experiment is repeated for all examples in $SOURCE_{L_{source}}$ and the results are then averaged (Figure 4). The graph suggests that the method performs well, as the document often appears among the first few results. In about 65% of cases, the document is found among the first 50 retrieved items. We believe that if the set of candidates (controlled by the t parameter) contains this document, the CLLD method is likely to produce better results; this is especially true for the CL-ESA2Links method.

Figure 4: The probability (y-axis) of finding the target language version of a given source language document using CL-ESA in the top k retrieved documents (x-axis). Drawn as a cumulative distribution function.
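A simple way to compute such a cumulative top-k curve might look as follows (an illustrative sketch; the names are assumptions):

```python
def topk_probability(duplicate_ranks, k_values):
    """Fraction of topic documents whose target-language duplicate appears in the top k.

    duplicate_ranks: for each topic document, the 1-based rank of its duplicate
    in the retrieved list, or None if the duplicate was not retrieved at all.
    """
    curve = {}
    for k in k_values:
        hits = sum(1 for r in duplicate_ranks if r is not None and r <= k)
        curve[k] = hits / len(duplicate_ranks)
    return curve

# Example: points at k = 50, 100, ..., 500, as on the x-axis of Figure 4.
# curve = topk_probability(duplicate_ranks, range(50, 501, 50))
```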

The overall results for all the methods are presented in Figure 5. We have experimentally set t = 10 for Spanish to English and t = 3 for Czech to English CLLD. CL-ESA2Links performed best in the experiments, achieving 0.2 precision at 0.3 recall. CL-ESA2Similar performed best out of the purely content-based methods.

Figure 5: The precision (y-axis)/recall (x-axis) graphs for Spanish to English (left) and Czech to English (right) CLLD methods.

Though the precision/recall might seem quite low, a number of things should be taken into account:

• A significant number of potentially useful links is still missing from our ground truth, because people typically do not intend to link all relevant information. As a result, many potentially useful connections are not explicitly present in Wikipedia (Knoth et al., 2010).


The problem can be partly mitigated by combining the ground truth from more language versions. Another approach is to measure the agreement instead of precision/recall characteristics (see Section 6.3).

• A significant number of links in Wikipedia are conceptual links. These links do not express a particularly strong relationship at the article level. This makes it very difficult for the purely content-based methods to find them, which results in low recall. It seems that CL-ESA2Links is the only method that does not suffer from this issue.

• The experiment settings make it hard for the methods to achieve high precision/recall performance. The $TARGET_{L_{target}}$ set contains 3.8 million articles, out of which the methods are supposed to identify on average just a small subset of target documents. More precisely, in Spanish to English CLLD, our ground truth contains on average 341 target documents with standard deviation 293; in Czech to English, it contains on average 382 target documents with standard deviation 292.

6.3 Measuring the agreement

To assess the subjectivity of the link generation task and to investigate the reliability of the acquired ground truth, we have compared the link structures from different language versions of Wikipedia.



Spanish vs English
           Y_en      N_en           N/A_en
Y_es       5,563     10,201         3,934
N_es       15,715    539,299,641    99,191,766
N/A_es     5,781     321,326,145    0

Czech vs English
           Y_en      N_en           N/A_en
Y_cz       4,308     8,738          2,194
N_cz       12,961    392,411,445    7,501,806
N/A_cz     9,790     356,532,740    0

Table 1: The agreement of the Spanish and English Wikipedia and of the Czech and English Wikipedia on their link structures, calculated and summed for all pages in $SOURCE_{es}$. Y indicates yes, N no, N/A not available/no decision.

We have iterated over the set of topics from $SOURCE_{L_{source}}$ and recorded for each document in $TARGET_{L_{target}}$ in each step whether it is a valid link target (yes, Y) or not a valid link target (no, N) for the given source document in each language, thus measuring the agreement between the link structures in different languages. The results are presented in Table 1.

As demonstrated in Figure 6, a subset of Wikipedia pages cannot be mapped to other language versions. Either the semantically equivalent page does not exist or the cross-language link is missing. These links were classified as no decision/not available (N/A). The mappable documents were classified in a standard way according to their appearance in the link graphs of the language versions. Only these links are taken into account while measuring the agreement.
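One plausible, simplified reading of this counting procedure is sketched below; `rho`, `alpha` and `beta` follow the definitions of Section 6.1, `rho_inv` is assumed to be the inverse mapping from target-language pages to source-language pages, and the handling of N/A cases is an assumption rather than the authors' exact procedure.

```python
from collections import Counter

def link_agreement_counts(source_topics, target_docs, rho, rho_inv, alpha, beta):
    """Accumulate Y/N/N/A contingency counts in the spirit of Table 1.

    Topics are assumed to be pages with a cross-language link, so rho(topic)
    exists for every topic (Section 6.1).
    """
    counts = Counter()
    for topic in source_topics:
        source_links = set(alpha(topic))        # link targets in the source language
        target_links = set(beta(rho(topic)))    # link targets in the target language
        for doc in target_docs:
            # Decision of the source-language link graph about target page `doc`:
            # N/A when `doc` has no source-language counterpart.
            counterpart = rho_inv(doc)
            if counterpart is None:
                src = "N/A"
            else:
                src = "Y" if counterpart in source_links else "N"
            # Decision of the target-language link graph.
            tgt = "Y" if doc in target_links else "N"
            counts[(src, tgt)] += 1
    return counts
```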

Figure 6: Individual cases of agreement/disagreement/no decision (not available) for two language versions of Wikipedia link graphs.

A common way to assess inter-annotator agreement between two raters in Information Retrieval is Cohen’s Kappa, calculated as:

$$\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)},$$

where $Pr(a)$ is the relative observed frequency of agreement and $Pr(e)$ is the hypothetical probability of chance agreement. $Pr(a)$ is typically calculated as $\frac{|Y,Y| + |N,N|}{|Y,Y| + |Y,N| + |N,Y| + |N,N|}$. Since there is a strong agreement on the negative decisions, this probability will be close to 1. If we ignore the $|N,N|$ cases, which do not carry any useful information, the formula looks as follows:

$$Pr(a) = \frac{|Y,Y|}{|Y,Y| + |Y,N| + |N,Y|}.$$

The probability of a random agreement is extremely low, because the probability of a link connecting any two pages is approximately:³

$$p_{link} = \frac{|links|}{|pages|^2} = \frac{78.3M}{3.2M^2} = 0.000007648.$$

Figure 7: The agreements of the Spanish to English (left) and Czech to English (right) CLLD methods with $GT_{es,en}$ and $GT_{cz,en}$ respectively. The y-axis shows the agreement strength and the x-axis the number of generated examples as a fraction of the number of examples in the ground truth.

Thus, the hypothetical number of items appearing in the $Y,Y$ class by chance is $p_{link}^2 \cdot (|Y,Y| + |Y,N| + |N,Y| + |N,N|)$. This formula estimates the number of agreements achieved by chance. In our case the value is much smaller than 1, hence $Pr(e)$ is close to 0. Therefore, we can calculate the agreement for English and Spanish as:

$$\kappa_{en,es} = \frac{5{,}563}{31{,}479} = 0.177.$$

The agreement for Czech and English is:

$$\kappa_{en,cz} = \frac{4{,}308}{26{,}007} = 0.166.$$
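As a quick check, both values follow directly from the counts in Table 1 (a small illustrative script, with $Pr(e) \approx 0$ as argued above):

```python
def kappa_ignoring_nn(yy, yn, ny):
    """Cohen's kappa with Pr(e) ~ 0 and the |N,N| cases ignored,
    i.e. simply |Y,Y| / (|Y,Y| + |Y,N| + |N,Y|)."""
    return yy / (yy + yn + ny)

print(round(kappa_ignoring_nn(5563, 10201, 15715), 3))  # Spanish vs English: 0.177
print(round(kappa_ignoring_nn(4308, 8738, 12961), 3))   # Czech vs English: 0.166
```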

These values indicate a relatively low inter-annotator agreement. We believe that the fact that such a low agreement has been measured is very interesting, particularly because the link structure in Wikipedia is the result of a collaborative effort of many contributors. Therefore, we would expect that even lower agreement might be observed in other types of text collections.

Motivated by the previous findings, we have calculated the agreement between the output of our methods and the link graphs present in different language versions of Wikipedia. We were especially interested to find out whether this agreement is significantly different from the agreement measured between different language versions of Wikipedia. Using our CLLD methods, we have generated 100% of $|GT|$ links for every orphan document in $SOURCE_{L_{source}}$, i.e. if a particular document is linked in Wikipedia to 57 documents, we generate 57 links. We have then measured the agreement for each topic document and averaged the agreement values. The results of the experiment for Spanish to English and Czech to English CLLD are shown in Figure 7. They suggest that CL-ESA2Links achieved a level of agreement comparable to that of human annotators. A very reasonable level of agreement has also been measured for CL-ESA2Similar, especially for the first 10% of the generated links. CL-ESADirect and CL-ESA2ESA exhibit a lower level of agreement.

³ Following the official Wikipedia statistics. Though different language versions have different $p_{link}$, the differences do not affect the results.

7 Conclusion

In this paper, we have presented and evaluated four different methods for Cross-Language Link Discovery (CLLD). We have used Cross-Lingual Explicit Semantic Analysis as a key component in the development of the four presented methods. The results suggest that methods that are aware of the link graph in the target language achieve slightly better results than those that identify links in the target language only by calculating semantic similarity. However, the former methods cannot be applied in all document collections and thus the latter methods are valuable. Though it might seem at first sight that CLLD methods do not provide very high precision and recall, we have shown that their performance can, in fact, reach the results achieved by human annotators.


References

James Allan. 1997. Building hypertext using information retrieval. Inf. Process. Manage., 33:145–159, March.

Philipp Dopichaj, Andre Skusa, and Andreas Heß. 2008. Stealing anchors to link the wiki. In Geva et al. (Geva et al., 2009), pages 343–353.

David Ellis, Jonathan Furner-Hines, and Peter Willett. 1994. On the measurement of inter-linker consistency and retrieval effectiveness in hypertext databases. In SIGIR ’94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 51–60, New York, NY, USA. Springer-Verlag New York, Inc.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611.

Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors. 2009. Advances in Focused Retrieval, 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany, December 15-18, 2008. Revised and Selected Papers, Lecture Notes in Computer Science. Springer.

Shlomo Geva. 2007. GPX: Ad-hoc queries and automated link discovery in the Wikipedia. In Norbert Fuhr, Jaap Kamps, Mounia Lalmas, and Andrew Trotman, editors, INEX, Lecture Notes in Computer Science. Springer.

Michael Granitzer, Christin Seifert, and Mario Zechner. 2008. Context based Wikipedia linking. In Geva et al. (Geva et al., 2009), pages 354–365.

Stephen J. Green. 1998. Automated link generation: can we do better than term repetition? Comput. Netw. ISDN Syst., 30(1-7):75–84.

Jiyin He. 2008. Link detection with Wikipedia. In Geva et al. (Geva et al., 2009), pages 366–373.

Wei Che Huang, Andrew Trotman, and Shlomo Geva. 2008. Experiments and evaluation of link discovery in the Wikipedia.

Kelly Y. Itakura and Charles L. A. Clarke. 2008. University of Waterloo at INEX 2008: Adhoc, Book, and Link-the-Wiki tracks. In Geva et al. (Geva et al., 2009), pages 132–139.

Dylan Jenkinson, Kai-Cheung Leung, and Andrew Trotman. 2008. Wikisearching and wikilinking. In Geva et al. (Geva et al., 2009), pages 374–388.

Petr Knoth, Jakub Novotny, and Zdenek Zdrahal. 2010. Automatic generation of inter-passage links based on semantic similarity. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 590–598, Beijing, China, August.

Wei Lu, Dan Liu, and Zhenzhen Fu. 2008. CSIR at INEX 2008 Link-the-Wiki track. In Geva et al. (Geva et al., 2009), pages 389–394.

Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM ’07, pages 233–242, New York, NY, USA. ACM.

David Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi, and Abdur Chowdhury, editors, CIKM, pages 509–518. ACM.

Philipp Sorg and Philipp Cimiano. 2008. Cross-lingual information retrieval with explicit semantic analysis. In Working Notes for the CLEF 2008 Workshop.

Andrew Trotman, David Alexander, and Shlomo Geva. 2009. Overview of the INEX 2010 Link the Wiki track.

Jihong Zeng and Peter A. Bloniarz. 2004. From keywords to links: an automatic approach. Information Technology: Coding and Computing, International Conference on, 1:283.

Junte Zhang and Jaap Kamps. 2008. A content-based link detection approach using the vector space model. In Geva et al. (Geva et al., 2009), pages 395–400.