Automatic Keyphrase Extraction by Bridging Vocabulary Gap
Xinxiong Chen, Tsinghua University
2013-04-26
04/19/2023 THUNLP, Tsinghua University http://nlp.csai.tsinghua.edu.cn
Main Idea
Vocabulary gap: appropriate keyphrases are not always statistically significant in the given document, and may not appear in it at all.
Use word alignment models from statistical machine translation to learn translation probabilities between the words in documents and the words in keyphrases.
Introduction – Keyphrase
What are keyphrases? A set of terms selected from a document as a short summary of the document.
Introduction – Keyphrase Extraction
Why keyphrase extraction: digital libraries, information retrieval
Goal: automatically extract keyphrases from documents (unsupervised)
Example
A news article (translated from Chinese):
Title: Israeli Military Claims Iran Can Produce Nuclear Bombs and Considering Military Action against Iran
Summary: …
Content: …
Keywords: Israeli, Iran, Nuclear bombs, Nuclear weapon
Example
Existing unsupervised methods:
TFIDF: Nuclear bombs, Iran, Israeli, enriched uranium, speech
TextRank: Iran, Israeli, chief, Nuclear bombs, Military
  Use a fixed-size window to build a word graph
  Use PageRank to decide which words are more important
TF: Israeli (6), Iran (6), intelligence (5), nuclear bombs (4), enriched uranium (3), …, nuclear weapon (1)
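The TextRank procedure described on this slide (a fixed-size co-occurrence window followed by PageRank) can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the talk: the function name `textrank`, the default window size, the damping factor of 0.85, and the toy token list are all assumptions.

```python
from collections import defaultdict

def textrank(words, window=3, damping=0.85, iterations=50):
    """Rank words by PageRank over a co-occurrence graph built with a fixed-size window."""
    # Undirected weighted graph: words that fall in the same window are linked.
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            v = words[j]
            if v != w:
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    nodes = list(graph)
    if not nodes:
        return {}
    # Power iteration, starting from a uniform distribution.
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbour m passes on its score in proportion to the edge weight.
            rank = sum(score[m] * graph[m][n] / sum(graph[m].values()) for m in graph[n])
            new_score[n] = (1.0 - damping) / len(nodes) + damping * rank
        score = new_score
    return score

# Toy input (hypothetical tokens, already segmented and filtered).
tokens = ["iran", "nuclear", "iran", "israeli", "iran", "military"]
scores = textrank(tokens, window=2)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the graph is built only from the document itself, a word that never occurs in the text (such as "nuclear weapon" with a single mention, or a term with none) cannot rank highly, which is the gap the rest of the talk addresses.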
Example
LDA: Iran, England, America, Nation, Speech
  Learn topics from documents
ExpandRank: Iran, enriched uranium, Israeli, atomic energy, Lebanon
  Find k nearest neighbor documents to build word graphs
Idea – Association
If a word is mentioned, it reminds people of other words.
  iPhone – Apple
  Nuclear bombs – Nuclear Weapon
What is the probability of association between “Nuclear bombs” and “Nuclear Weapon”?
Idea – SMT for Keyphrase Extraction
Both the content and the keyphrases are parallel summaries of a news article.
Unsupervised: use the title or summary in place of keyphrases.
Estimate the translation probabilities between the words in the content and the words in the title with word alignment models.
[Diagram: News. Content → Translation → Title (Summarization)]
Translation Probability
Example: translation probabilities from “Nuclear bombs”
  Nuclear bombs: 0.515757
  Liquid: 0.0871815
  Nuclear Weapon: 0.0808868
  Military Action: 0.0239178
  Israeli Military: 0.0215988
  Miniaturization: 0.0118
  Possible: 0.0113688
  enriched uranium: 0.0100252
Keyphrase Extraction Using WAM
Given a news article, rank candidate keyphrases by computing their scores
Iran, Israeli, chief, Nuclear bombs, Military, …
Iran, Israeli, chief, Nuclear bombs, Nuclear weapon, Military, speech
Word Trigger Method (WTM)
Three steps:
  Preparing translation pairs
  Learning a translation model (IBM Model-1)
  Extracting keyphrases for a given resource
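The second step names IBM Model-1 as the translation model. A textbook-style EM sketch of Model-1 is shown below to make the step concrete; it is an illustration under simplifying assumptions (no NULL word, a fixed number of iterations, a hypothetical function name `train_ibm_model1`, and invented toy pairs), not the toolkit the authors used.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate Pr(t | s) from (source_words, target_words) pairs with EM.

    The returned mapping is keyed by (target_word, source_word).
    """
    target_vocab = {t for _, tgt in pairs for t in tgt}
    # Uniform initialisation over the target vocabulary.
    prob = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) alignments
        total = defaultdict(float)  # expected count of each source word
        for src, tgt in pairs:
            for t in tgt:
                # E-step: split each target word over the source words in the same pair.
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source word.
        prob = defaultdict(float, {(t, s): c / total[s] for (t, s), c in count.items()})
    return prob

# Toy (content, title) pairs; the words are invented for illustration.
pairs = [
    (["nuclear_bombs", "iran"], ["nuclear_weapon", "iran"]),
    (["nuclear_bombs", "israeli"], ["nuclear_weapon", "israeli"]),
    (["iran", "israeli"], ["iran", "israeli"]),
]
model = train_ibm_model1(pairs)
p = model[("nuclear_weapon", "nuclear_bombs")]  # Pr(nuclear_weapon | nuclear_bombs)
```

In this toy corpus "nuclear_bombs" always co-occurs with "nuclear_weapon" on the title side, so EM concentrates probability on that pair even though the two strings differ.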
Translation Pairs
Length imbalance problem
  Unable to list all tags on the annotation side
  Tags may have different importance for the resource
Content-Title Pairs
Length imbalance problem
  Unable to list all tags on the annotation side
  Tags may have different importance for the resource
Sampling method
  Tag weighting type: TF_t, TF-IRF_t
  Length ratio
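One way to read the sampling method is as weighted resampling that brings the two sides of a pair to a chosen length ratio. The sketch below is only a guess at the mechanics: it resamples the content side by term frequency (TF) so that it is `length_ratio` times as long as the title. The function name `sample_content`, the choice of which side to resample, and the fixed seed are assumptions; TF-IRF weighting would additionally need corpus-level statistics that are not shown here.

```python
import random
from collections import Counter

def sample_content(content, title, length_ratio=1.0, seed=0):
    """Resample content words by TF so that len(sample) == round(length_ratio * len(title))."""
    tf = Counter(content)
    k = max(1, round(length_ratio * len(title)))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Frequent words are drawn more often; sampling is with replacement.
    return rng.choices(list(tf), weights=list(tf.values()), k=k)

# Toy pair: a long content side and a short title side (invented words).
content = ["iran"] * 6 + ["israeli"] * 6 + ["intelligence"] * 5 + ["nuclear_bombs"] * 4
title = ["israeli", "iran", "nuclear_bombs"]
sampled = sample_content(content, title, length_ratio=2.0)
pair = (sampled, title)  # a length-balanced translation pair for the alignment model
```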
Learning Translation Probabilities
IBM Model-1 as the WAM algorithm
Asymmetric: two directional models
  Pr_d2a(t|w): content → title
  Pr_a2d(t|w): title → content
Linear combination of Pr_d2a(t|w) and Pr_a2d(t|w), weighted by λ
When λ = 1 or λ = 0, it simply uses Pr_d2a(t|w) or Pr_a2d(t|w), respectively
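The exact combination formula is not preserved in this transcript. Assuming a plain weighted sum, which matches the stated behaviour at λ = 1 and λ = 0, the combination of the two directional models could look like the sketch below. The function name `combine`, the dictionary layout keyed by `(t, w)`, and the example probabilities are hypothetical.

```python
def combine(pr_d2a, pr_a2d, lam=0.5):
    """Interpolate two directional models: lam * Pr_d2a(t|w) + (1 - lam) * Pr_a2d(t|w).

    Both inputs map (t, w) pairs to probabilities; missing pairs count as 0.
    """
    keys = set(pr_d2a) | set(pr_a2d)
    return {k: lam * pr_d2a.get(k, 0.0) + (1.0 - lam) * pr_a2d.get(k, 0.0) for k in keys}

# Invented probabilities for a single (t, w) pair.
pr_d2a = {("nuclear_weapon", "nuclear_bombs"): 0.6}
pr_a2d = {("nuclear_weapon", "nuclear_bombs"): 0.4}

only_d2a = combine(pr_d2a, pr_a2d, lam=1.0)  # reduces to Pr_d2a
only_a2d = combine(pr_d2a, pr_a2d, lam=0.0)  # reduces to Pr_a2d
mixed = combine(pr_d2a, pr_a2d, lam=0.5)
```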
Tag Suggestion Using Triggered Words
Given a description, rank tags by computing their scores
Trigger power of the word w in the content:
  TF-IRF_w
  TextRank
  Their product
Keyphrase Extraction Using Triggered Words
Given a description, rank tags by computing their scores
Translation probabilities from words in the description to keyphrases
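The scoring formula itself is missing from the transcript. A plausible reading of the two ingredients named on these slides is that each candidate's score is the sum, over content words w, of the trigger power of w times the translation probability from w to the candidate. The sketch below follows that reading with normalised term frequency standing in for trigger power; the function name `rank_keyphrases`, the dictionary layout, and the probabilities for "Iran" are assumptions, while the "Nuclear bombs" probabilities are the ones listed on the Translation Probability slide.

```python
from collections import Counter, defaultdict

def rank_keyphrases(content_words, translation, top_k=5):
    """Score candidates as sum_w trigger_power(w) * Pr(t | w) and return the top_k."""
    tf = Counter(content_words)
    total = sum(tf.values())
    scores = defaultdict(float)
    for w, n in tf.items():
        trigger_power = n / total  # assumption: normalised TF instead of TF-IRF or TextRank
        for t, p in translation.get(w, {}).items():
            scores[t] += trigger_power * p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

translation = {
    "Nuclear bombs": {"Nuclear bombs": 0.515757, "Nuclear Weapon": 0.0808868},
    "Iran": {"Iran": 0.6},  # hypothetical value
}
content = ["Iran"] * 6 + ["Nuclear bombs"] * 4
ranked = rank_keyphrases(content, translation)
```

Note that "Nuclear Weapon" receives a non-zero score here even though it does not occur in `content`, which is how translation probabilities bridge the vocabulary gap.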
Emphasize Tags Appearing In Content for WTM (EWTM)
Emphasize tags appearing in the description
I_t(w): indicator function to emphasize the tags appearing in the content
  Equals 1 when t = w
  Equals 0 when t ≠ w
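The EWTM formula is also not preserved in the transcript. One way the indicator I_t(w) could enter the score is by interpolating it with the translation probability, as in the sketch below. The interpolation form, the weight `alpha`, the function name `ewtm_score`, and the word weights are all assumptions made for illustration.

```python
def ewtm_score(tag, word_weights, translation, alpha=0.5):
    """Score a tag as sum_w (alpha * I_t(w) + (1 - alpha) * Pr(t | w)) * weight(w)."""
    score = 0.0
    for w, weight in word_weights.items():
        indicator = 1.0 if tag == w else 0.0  # I_t(w): 1 when t = w, else 0
        pr = translation.get(w, {}).get(tag, 0.0)
        score += (alpha * indicator + (1.0 - alpha) * pr) * weight
    return score

# Hypothetical trigger powers for two content words.
word_weights = {"Iran": 0.6, "Nuclear bombs": 0.4}
translation = {"Nuclear bombs": {"Nuclear bombs": 0.515757, "Nuclear Weapon": 0.0808868}}

plain = ewtm_score("Nuclear bombs", word_weights, translation, alpha=0.0)  # no emphasis
boosted = ewtm_score("Nuclear bombs", word_weights, translation, alpha=0.5)
```

With `alpha = 0` the score uses translation probabilities alone; raising `alpha` shifts weight toward tags that literally appear in the content.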
Experiments
Datasets: 13,702 news articles from www.163.com
  Words in documents: 72,900
  Words in keyphrases: 12,405
  Length of documents (average): 971.7 words
  Length of titles (average): 11.6 words
  Length of summaries (average): 45.8 words
  Number of keyphrases (average): 2.4
Evaluation metrics: precision, recall and F-measure
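For reference, the three evaluation metrics can be computed per document as in the sketch below. It assumes exact string matching between extracted and gold keyphrases, which the slides do not specify; the function name `precision_recall_f1` is illustrative. The example reuses the TFIDF output and the gold keywords from the news example earlier in the deck.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F-measure for one document (exact match)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["Israeli", "Iran", "Nuclear bombs", "Nuclear weapon"]
tfidf = ["Nuclear bombs", "Iran", "Israeli", "enriched uranium", "speech"]
p, r, f = precision_recall_f1(tfidf, gold)  # 3 of 5 correct: P = 0.6, R = 0.75
```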
Experiment Results
Parameters – Length Ratio
The length ratio: content/title
Application
SINA APP (http://app.thunlp.org/weibo)
Now we have more than 2 million registered users
Thank you!
Q & A