1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu Dept. of Computer Science...


  • Web-Page Summarization Using Clickthrough Data*
    JianTao Sun, Yuchang Lu: Dept. of Computer Science, TsingHua University, Beijing 100084, China
    Dou Shen, Qiang Yang: Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, HK
    HuaJun Zeng, Zheng Chen: Microsoft Research Asia, 5F, Sigma Center, 49 Zhichun Road, Beijing 100080, China

    Presenter: Chen Yi-Ting

  • Reference
    JianTao Sun, Yuchang Lu, Dou Shen, Qiang Yang, HuaJun Zeng, Zheng Chen. Web-Page Summarization Using Clickthrough Data. SIGIR'05, August 15-19, 2005.
    H. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159-165, 1958.

  • Outline
    Introduction
    Summarize Web pages using clickthrough data
        Empirical study on clickthrough data
        Adapted Web-page summarization methods
        Summarize Web pages not covered by clickthrough data
    Experiments
    Conclusions and future work

  • Introduction (1/2)
    Why summarize Web pages?
    Web-page summaries can be abstracts or extracts.
    A Web-page summary can also be either generic or query-dependent.
    A query-dependent summary presents the information most relevant to the initial query.
    A generic summary gives an overall sense of the document's content.
    A generic summary should meet two conditions: maintain wide coverage of the page's topics and keep redundancy low at the same time.
    In this paper, we focus on extract-based generic Web-page summarization.
    The objective of this research is to utilize extra knowledge to improve Web-page summarization.
    Clickthrough data contains users' knowledge of Web pages' content.
    A user's query words often reflect the true meaning of the target Web page's content.

  • Introduction (2/2)
    This is a challenging task:
        Web pages may have no associated query words, since they are not visited by Web users through a search engine.
        The clickthrough data are noisy.
    In this paper, a thematic hierarchy of query terms is constructed.
    The thematic lexicon can be used to compensate for the scarcity of Web-page content even when no clickthrough data was collected for these pages.
    The method can also help filter out noise contained in the query words of an individual Web page through the use of statistics over all Web pages of its category.
    Two text-summarization methods are adapted to summarize Web pages:
        The first approach is based on significant-word selection, adapted from Luhn's method.
        The second method is based on Latent Semantic Analysis (LSA).

  • Summarize web pages using clickthrough data (1/7)
    Empirical study on clickthrough data
    Consider the typical search scenario: a user (u) submits a query (q) to a search engine, and the search engine returns a ranked list of Web pages. The user then clicks on the pages (p) of interest.
    The clickthrough data can be represented by a set of triples <u, q, p>.
    The clickthrough data records how Web users find information through queries.
    The collection of queries is supposed to reflect the topic of the target Web page well.
    Two experiments:
        (1) To investigate whether the query words are related to the topics of the Web page (45.5% of keywords occur in the query words; 13.1% of query words appear as keywords).
        (2) To give evidence that clickthrough data is helpful for summarizing Web pages.
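The later slides assume that each page comes with a collection of query terms. A minimal sketch of how such a collection might be derived from the (u, q, p) triples is given below; the function name, the whitespace tokenization, and the sample log entries are illustrative assumptions and do not come from the paper.

```python
from collections import Counter, defaultdict

def query_terms_by_page(triples):
    """Map each clicked page to a Counter of the query words that led to it."""
    terms = defaultdict(Counter)
    for _user, query, page in triples:
        # Assumption: queries are lowercased and split on whitespace.
        terms[page].update(query.lower().split())
    return terms

# Hypothetical log entries, invented for illustration only.
log = [
    ("u1", "cheap flights", "travel.example/p1"),
    ("u2", "flights to beijing", "travel.example/p1"),
    ("u3", "python tutorial", "code.example/p2"),
]
page_terms = query_terms_by_page(log)
```

Here `page_terms["travel.example/p1"]` counts "flights" twice, because two different queries containing that word led to the same page.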

  • Summarize web pages using clickthrough data (2/7)
    Adapted Web-page Summarization Methods (suppose that we now have a set of query terms for each page)
    Adapted Significant Word (ASW) Method
    The first summarization method is adapted from Luhn's algorithm, a classical algorithm designed for text summarization.
    In Luhn's method, each sentence is assigned a significance factor, and the sentences with high significance factors are selected to form the summary.
    First, a set of significant words is constructed (according to word frequency in the document).
    Then the significance factor of a sentence is computed as follows:
        (1) Set a limit L for the distance at which any two significant words can be considered significantly related.
        (2) Find a portion of the sentence that is bracketed by significant words not more than L non-significant words apart.
        (3) Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion.
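The three steps for the significance factor can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence is already tokenized, the set of significant words is given, and when a sentence contains several bracketed portions the highest-scoring one is used. The function and parameter names are hypothetical.

```python
def significance_factor(words, significant, limit):
    """Luhn-style score: (significant words)^2 / portion length, best portion wins."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        gap = pos - prev - 1  # non-significant words between two significant ones
        if gap <= limit:
            count += 1        # still inside the same bracketed portion
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1  # start a new portion
        prev = pos
    return max(best, count * count / (prev - start + 1))

sentence = "the quick search engine returns a ranked list of pages".split()
keywords = {"search", "engine", "ranked", "pages"}
# With limit=2 the portion runs from "search" to "pages":
# 4 significant words in an 8-word portion, so 16 / 8 = 2.0
score = significance_factor(sentence, keywords, limit=2)
```

Sentences would then be ranked by this score and the highest-scoring ones extracted as the summary.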

  • Summarize web pages using clickthrough data (3/7)
    Adapted Web-page Summarization Methods
    Adapted Significant Word (ASW) Method
    To customize this procedure so that it leverages query terms for Web-page summarization, the significant-word selection method is modified.
    The basic idea is to use both the local content of a Web page and the query terms collected from the clickthrough data to decide whether a word is significant.

    After the significance factors of all words are calculated, the words are ranked and the top N% are selected as significant words.
    Luhn's algorithm is then employed to compute the significance factor of each sentence.
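The slide's formula for the word significance factor did not survive in this transcript, so the sketch below only illustrates the stated idea of combining local page content with query terms and keeping the top N% of words. The linear mix of the two relative frequencies, the parameter names `alpha` and `top_percent`, and the restriction of candidates to words that occur on the page are all assumptions made for illustration, not the paper's definition.

```python
import math
from collections import Counter

def select_significant_words(page_words, query_words, alpha=0.5, top_percent=20):
    """Rank page words by a mix of page and query frequency; keep the top N%."""
    page_tf = Counter(page_words)
    query_tf = Counter(query_words)
    page_total = sum(page_tf.values())
    query_total = sum(query_tf.values()) or 1  # avoid dividing by zero
    # Assumed scoring: weighted sum of relative frequencies (hypothetical).
    scores = {
        w: (1 - alpha) * page_tf[w] / page_total + alpha * query_tf[w] / query_total
        for w in page_tf
    }
    keep = max(1, math.ceil(len(scores) * top_percent / 100))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:keep])

page = "travel deals and travel tips and hotel news".split()
queries = ["hotel", "hotel", "travel"]
chosen = select_significant_words(page, queries, alpha=0.5, top_percent=30)
```

In this toy example the query words lift "hotel" above the more frequent but uninformative "and", which is the kind of effect the adapted selection is meant to achieve.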

  • Summarize web pages using clickthrough data (4/7)
    Adapted Web-page Summarization Methods
    Adapted Latent Semantic Analysis (ALSA) Method
    Gong et al. proposed an extraction-based summarization algorithm:
        First, a term-sentence matrix is constructed from the original text document.
        Next, LSA is conducted on the matrix.
        In the last step, a document summary is produced incrementally.
    The proposed LSA-based summarization method is a variant of Gong's method.
    It utilizes the query-word knowledge by changing the term-sentence matrix: if a term occurs as a query word, its weight is increased according to its frequency in the query-word collection.
    The expectation is to extract sentences whose topics are related to the ones reflected by the query words.
    The term-frequency vector of each sentence can be weighted by different weighting (global and local) and normalization methods.

  • Summarize web pages using clickthrough data (5/7)
    Adapted Web-page Summarization Methods
    Adapted Latent Semantic Analysis (ALSA) Method
    In this paper, a term frequency (TF) approach without weighting or normalization is used to represent the sentences in Web pages.
    Terms in a sentence are augmented by query terms as follows:

    Advantages of the adapted methods
    The extra knowledge of query terms is utilized to help select significant words and to modify the page representation.
    The approach can, to some extent, handle the noise in query words.
    Finally, the ASW approach avoids a problem of Luhn's method: its frequency-cutoff selection may produce a large number of significant words for long pages.
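The augmentation formula for the ALSA term-sentence matrix is missing from this transcript. The sketch below shows one plausible reading of the description: raw term frequencies, with an extra amount added for terms that also occur as query words. The additive form and the `boost` parameter are assumptions, not the paper's formula. The subsequent LSA step (a singular value decomposition of this matrix, followed by sentence selection) is not shown.

```python
def term_sentence_matrix(sentences, query_freq, boost=1.0):
    """Build a term-by-sentence TF matrix, boosting terms seen in queries."""
    vocab = sorted({w for sent in sentences for w in sent})
    matrix = []
    for term in vocab:
        row = []
        for sent in sentences:
            tf = sent.count(term)
            # Assumption: a term present in the sentence gains weight in
            # proportion to its frequency in the query-word collection.
            if tf and term in query_freq:
                tf += boost * query_freq[term]
            row.append(float(tf))
        matrix.append(row)
    return vocab, matrix

sents = [["cheap", "flights", "today"], ["weather", "today"]]
vocab, matrix = term_sentence_matrix(sents, {"flights": 3})
```

Each row corresponds to a term and each column to a sentence, so the first sentence's column now carries extra weight on "flights".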

  • Summarize web pages using clickthrough data (6/7)
    Summarize Web Pages Not Covered by Clickthrough Data
    A hierarchical lexicon is built from the clickthrough data and applied to help summarize those pages.
    All ODP Web pages have been manually organized into a hierarchical taxonomy.
    For each category of the taxonomy, the lexicon contains all query terms that users have submitted to browse Web pages of this category.
    The lexicon is built as follows:
        First, the TS corresponding to each category is set to empty.
        Next, for each page covered by the clickthrough data, its query words are added to the TS of its categories.
        Finally, each term weight in each TS is multiplied by the term's Inverse Category Frequency (ICF).
    For each Web page to be summarized, first look up the lexicon for the TS according to the page's category.
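The lexicon construction steps can be sketched as follows. The slides do not define ICF, so the sketch assumes a form analogous to inverse document frequency, log(number of categories / number of categories whose TS contains the term). It also adds query words only to the page's own categories. Both choices, and all names and sample data, are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def build_lexicon(page_categories, page_query_words):
    """Return {category: {term: weight}} with ICF-scaled query-term counts."""
    lexicon = defaultdict(Counter)  # every TS starts out empty
    for page, words in page_query_words.items():
        for category in page_categories.get(page, ()):
            lexicon[category].update(words)
    n_categories = len(lexicon)
    # Number of categories whose TS contains each term.
    category_freq = Counter(t for ts in lexicon.values() for t in ts)
    return {
        category: {
            # Assumed ICF: log(N / cf), by analogy with IDF (hypothetical).
            t: w * math.log(n_categories / category_freq[t])
            for t, w in ts.items()
        }
        for category, ts in lexicon.items()
    }

# Hypothetical pages, categories and query words, for illustration only.
cats = {"p1": ["Travel"], "p2": ["Travel"], "p3": ["Computers"]}
words = {"p1": ["flights", "free"], "p2": ["flights", "hotel"], "p3": ["python", "free"]}
lex = build_lexicon(cats, words)
```

Under this assumed ICF, a word such as "free" that appears in every category receives weight zero, while category-specific words such as "flights" keep a positive weight.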

  • Summarize web pages using clickthrough data (7/7)
    Summarize Web Pages Not Covered by Clickthrough Data
    The weights of the terms in a TS can be used to select significant words or to update the term-sentence matrix.
    If a page to be summarized has multiple categories, the corresponding TS are merged and the weights are averaged.
    When a TS does not have sufficient terms, the TS of its parent category is used.
    Two advantages:
        First, the category-specific TS provides a distribution of topic terms in this category.
        Second, noisy terms that may be relatively frequent in one page's query words are given a low weight through the use of statistics over all Web pages of this category.
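The lookup rules on this slide (merge and average over multiple categories, fall back to the parent category when a TS is too small) might look like the sketch below. The `min_terms` threshold, the repeated climb through ancestors, and averaging over the number of categories (treating a missing term as weight zero) are assumptions; the slides do not specify these details.

```python
def lookup_terms(categories, lexicon, parent, min_terms=3):
    """Merge the TS of a page's categories, backing off to parent categories."""
    merged = {}
    for category in categories:
        node = category
        # Assumption: keep climbing while the TS has too few terms.
        while len(lexicon.get(node, {})) < min_terms and node in parent:
            node = parent[node]
        for term, weight in lexicon.get(node, {}).items():
            merged[term] = merged.get(term, 0.0) + weight
    n = len(categories)
    return {t: w / n for t, w in merged.items()} if n else {}

# Hypothetical lexicon and taxonomy, for illustration only.
lexicon = {
    "Top/Travel": {"flights": 2.0, "hotel": 1.0, "visa": 1.0},
    "Top/Travel/Asia": {"beijing": 1.0},
    "Top/Food": {"hotel": 3.0, "menu": 1.0, "wine": 1.0},
}
parent = {"Top/Travel/Asia": "Top/Travel"}
backed_off = lookup_terms(["Top/Travel/Asia"], lexicon, parent)
averaged = lookup_terms(["Top/Travel", "Top/Food"], lexicon, parent)
```

The resulting term weights can then stand in for a page's own query terms when selecting significant words or updating the term-sentence matrix.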

  • Experiments (1/6)
    Data Set
    The clickthrough data was collected from the MSN search engine.
    A set of Web pages from the ODP directory was crawled, yielding 1,125,207 Web pages, 260,763 of which were clicked by Web users using 1,586,472 different queries.
    Two different data sets were used for the experiments:
        (1) DAT1 consists of 90 pages selected from the 260,763 browsed pages. Three human evaluators were employed to summarize these pages.

  • Experiments (2/6)
    Data Set
    Two different data sets were used for the experiments:
        (2) DAT2 consists of 10,000 pages randomly selected from the 260,763 browsed pages. The description of each page, which is provided by the page editor to give a general description of the page, is also extracted and used as the ideal summary.

    Performance Evaluation: Precision, Recall and F1
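The slide's formulas are not preserved here. For extract-based summaries, precision, recall and F1 are conventionally computed over the sentences selected by the system versus those selected by a human evaluator. A minimal sketch using these standard definitions follows; the sentence identifiers are hypothetical.

```python
def precision_recall_f1(selected, reference):
    """Standard set-based precision, recall and F1 over sentence identifiers."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# System picks sentences 1, 3, 5; the human evaluator picked 1, 2, 3, 4.
p, r, f1 = precision_recall_f1({1, 3, 5}, {1, 2, 3, 4})
```

In this example two of the three selected sentences are correct, and two of the four reference sentences are recovered.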

    ROUGE Evaluation: N=1
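With N=1, ROUGE measures unigram overlap between a candidate summary and a reference summary, normalized by the reference length (a recall-oriented measure). The sketch below assumes a single reference and pre-tokenized input; it illustrates the measure and is not the evaluation toolkit used in the paper.

```python
from collections import Counter

def rouge_1_recall(candidate_tokens, reference_tokens):
    """Clipped unigram matches divided by the number of reference unigrams."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    if not ref:
        return 0.0
    # Each reference word can be matched at most as often as it occurs
    # in the candidate.
    matched = sum(min(count, cand[tok]) for tok, count in ref.items())
    return matched / sum(ref.values())

rouge = rouge_1_recall(
    "the cat sat on the mat".split(),
    "the cat was on the mat".split(),
)
```

Here five of the six reference words are matched, so the score is 5/6.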

  • Experiments (3/6)
    Experimental Results and Analysis
    On DAT1: (1) To investigate whether the adapted summarizers can benefit from the query terms associated with each page

  • Experiments (4/6)
    Experimental Results and Analysis
    On DAT1: (2) To evaluate the proposed summarization methods using the thematic lexicon approach

  • Experiments (5/6)
    Experimental Results and Analysis
    On DAT2:
        Only the ROUGE-1 measure is used for evaluation.
        Since the descriptions are commonly short and the ROUGE-1 measure is recall-based, the summarization results are relatively poor.
        The thematic lexicon-based methods still lead to better summaries than the summarizers based on local textual content.

  • Experiments (6/6)
    Discussions
    ICF-based re-weighting can help discover the topic terms of a specific category.
    The experiments serve to verify the hypothesis that clickthrough data can complement the textual content of Web pages for summarization tasks.

  • Conclusions and Future Work
    Extra knowledge from clickthrough data is leveraged to improve Web-page summarization.
    It would be interesting to propose a method for determining the parameters automatically.
    Another direction is to study how to leverage other types of knowledge.