Latent Topic-semantic Indexing based Automatic Text Summarization
Transcript of Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text Summarization
Jiangsheng Yu, Xue-wen Chen; Presenter: Elaheh Barati
Futurewei Technologies - Wayne State University
December 18, 2016
Introduction
  Automatic summarization
  Latent Dirichlet Allocation
Method
Experiments
Conclusions
An Introduction to Automatic Summarization (AS)
- Automatic summarization (AS), or text summarization, is a challenging task of natural language processing (NLP) and machine learning.
- It transforms source text into summary text while retaining the most important information in the source.
- Many extraction methods have been proposed in the literature, and some of them are implemented as open-source tools or online services.
- In the last decade, topic-driven approaches became popular, and some work based on pLSI and LDA has achieved significantly better performance.
An Introduction to Latent Dirichlet Allocation
- The plate notation of LDA, a three-level hierarchical Bayesian (HB) model, in which θ_{1:M} ~ Dir(α), ϕ_{1:K} ~ Dir(β), z_{m,1:N_m} ~ Mult(θ_m), and w_{mn} ~ Mult(ϕ_{z_mn}).
- For the n-th word in the m-th document, denoted by w_{mn}, where m = 1, ..., M and n = 1, ..., N_m, its topic z_{mn} is a latent variable varying in the set {1, ..., K}, satisfying w_{mn} ~ Mult(ϕ_{z_mn}).
- The discrete distribution of words: w_{mn} ~ Multin(1; ϕ_{z_mn}).
- The N_m latent topics in the m-th document: z_{m,1:N_m} ~ Mult(θ_m), where θ_m, the (vector) parameter of the multinomial distribution of topics for the m-th document, is also Dirichlet-distributed in the way of θ_{1:M} ~ Dir(α).
- LDA models adopt the conjugate prior of the multinomial distribution (the Dirichlet) to describe the priors of the parameters of multinomial distributions.
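The LDA generative process described above can be sketched as a small simulation. The corpus sizes below (M, K, V, N) are hypothetical values chosen only for illustration; they are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: M documents, K topics, V vocabulary words,
# N[m] words in the m-th document.
M, K, V = 4, 3, 50
N = [20, 15, 30, 25]

alpha = np.full(K, 0.1)   # Dirichlet prior on per-document topic weights
beta = np.full(V, 0.1)    # Dirichlet prior on per-topic word distributions

phi = rng.dirichlet(beta, size=K)      # ϕ_{1:K} ~ Dir(β)
theta = rng.dirichlet(alpha, size=M)   # θ_{1:M} ~ Dir(α)

docs = []
for m in range(M):
    # z_{m,1:N_m} ~ Mult(θ_m): one latent topic per word
    z = rng.choice(K, size=N[m], p=theta[m])
    # w_{mn} ~ Mult(ϕ_{z_mn}): each word drawn from its topic's distribution
    w = [int(rng.choice(V, p=phi[k])) for k in z]
    docs.append(w)
```

Running the loop yields M documents of word indices, one latent topic assignment per word.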
Limitations on LDA
While topic models have been successfully applied to automatic summarization, they are limited in several aspects:
- in the LDA-type models, the observation units are restricted to words;
- a topic is usually defined by a discrete distribution over many polysemous words;
- ...
These limitations make the learned topics lack practical significance in many cases, and prevent the topic models from further applications.
Latent Topic-Semantic Indexing
Our Assumptions:
- Words in each observation window are indexed by the same topic, where a window in a text corpus may be a word, a sentence, or even a paragraph.
- K: the number of topics. Each topic ϕ is a discrete distribution over the semantic categories 1, ..., L.
- The topics ϕ_1, ..., ϕ_K satisfy the prior ϕ_{1:K} ~ Dir(β).
Latent Topic-Semantic Indexing
- Proposed two deep probabilistic models for topic-driven summarization.
- The probability of observing a word v in the semantic category l, say ψ_{lv}, is given in the following two ways:
  (a) ψ_{1:L} are given: a prior semantic matrix Ψ^T = (ψ_{lv})_{L×V}, where V is the number of words in the vocabulary, and L the number of semantic labels.
  (b) ψ_{1:L} ~ Dir(γ): a non-informative prior, where ψ_l = (ψ_{l1}, ..., ψ_{lV})^T is a discrete distribution over all words in the vocabulary, l = 1, ..., L.
Latent Topic-Semantic Indexing (cont.)
(a) ψ_{1:L} are given
Assumption of the TSI model: in each window (m, n), the semantics of words w^(1)_{mn}, ..., w^(D_mn)_{mn} are drawn from the same but unknown topic z_{mn}.
For a TSI model, the m-th document is generated as follows:
(1) Choose θ_m ~ Dir(α), where θ_{1:M} ~ Dir(α).
(2) z_{mn}, the topic of window (m, n), is drawn from z_{mn} ~ Mult(θ_m).
(3) The semantics in window (m, n) are generated via s^(1)_{mn}, ..., s^(d)_{mn}, ..., s^(D_mn)_{mn} ~ Mult(ϕ_{z_mn}).
(4) The word w^(d)_{mn} is drawn from semantic category s^(d)_{mn} independently.
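The four generative steps of the TSI model (case (a), ψ_{1:L} given) can be sketched as follows. All sizes here, and the fixed window length D, are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M documents, K topics, L semantic categories,
# V vocabulary words; each document has a fixed number of windows,
# each with D words (fixed here for simplicity).
M, K, L, V = 3, 4, 6, 40
windows_per_doc = 10
D = 5

alpha = np.full(K, 0.1)
beta = np.full(L, 0.1)

# (a) ψ_{1:L} are given: each row of psi is a distribution over words
# for one semantic category (here drawn randomly as a stand-in).
psi = rng.dirichlet(np.ones(V), size=L)

phi = rng.dirichlet(beta, size=K)      # topics over semantic categories
theta = rng.dirichlet(alpha, size=M)   # (1) θ_m ~ Dir(α)

docs = []
for m in range(M):
    doc = []
    for n in range(windows_per_doc):
        z = rng.choice(K, p=theta[m])               # (2) window topic z_mn
        s = rng.choice(L, size=D, p=phi[z])         # (3) semantics ~ Mult(ϕ_{z_mn})
        w = [int(rng.choice(V, p=psi[c])) for c in s]  # (4) words from categories
        doc.append(w)
    docs.append(doc)
```

Note how, unlike LDA, all D words of a window share one topic z_mn, and words are emitted through the semantic layer ψ rather than directly from the topic.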
TSI vs LDA
LDA is a special case of TSI when:
- the observation window is a word,
- the semantic labels are the words themselves, and
- the semantic matrix is an identity matrix.
Experiments Setup
- Topic-based summarizations are tested on the Brown corpus in the public dataset of SemCor-3.0, which contains 186 documents classified into 15 categories.
- The semantic indexing is restricted to nouns and noun phrases. For this, all the fourth-level noun SynSets in the hypernymy tree of WordNet-3.0 are used.
- A total of L = 2017 semantic categories are used in the TSI model.
- The prior semantic matrix is set by ψ_{lv} = n_{lv}/n_l, where n_l is the total number of SynSets in semantic category l, and n_{lv} is the number of SynSets of word v in category l.
- We set α = (0.1, ..., 0.1)^T, β = (0.1, ..., 0.1)^T, which are commonly used as default values in many applications.
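The prior semantic matrix ψ_{lv} = n_{lv}/n_l amounts to row-normalizing a count matrix. A minimal sketch with made-up SynSet counts (real counts would come from the WordNet-3.0 hypernymy tree, not reproduced here):

```python
import numpy as np

# Hypothetical counts: n_lv[l, v] = number of SynSets of word v that fall
# in semantic category l (illustrative values only).
n_lv = np.array([
    [2, 0, 1, 0, 1],
    [0, 3, 0, 1, 0],
    [1, 1, 1, 1, 0],
], dtype=float)

n_l = n_lv.sum(axis=1, keepdims=True)  # n_l: total SynSets in category l
psi = n_lv / n_l                       # ψ_{lv} = n_{lv} / n_l
```

Each row of ψ is then a discrete distribution over words: the rows sum to 1.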
Evaluation of Summarizers
- Suppose there are T summarizers under test, M documents to review, and a number of reviewers.
- Which summarizer has the best performance?
- We use one-way analysis of variance (ANOVA).
- A_{M×T}: the index matrix, where M, T are the numbers of documents and summarizers; the m-th row of A_{M×T} (i.e., (a_{m1}, ..., a_{mt}, ..., a_{mT})) indicates the ordering of the T summary results of document m. The results are scored by 1, 2, ..., T, from the worst to the best.
- B_{M×T}: one human review matrix, in which b_{mt} is the score of summarizer [a_{mt}].
- C_{M×T}: the feedback matrix, in which c_{mt} is the score of summarizer [t] on document m, recovered by c_{m,a_{mt}} = b_{mt}, where m = 1, ..., M, t = 1, ..., T.
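The recovery rule c_{m,a_{mt}} = b_{mt} is a simple scatter of scores back to summarizer positions. A toy sketch with hypothetical data (M = 3 documents, T = 4 summarizers; not the paper's actual reviews):

```python
import numpy as np

# A[m] lists summarizer indices ordered from worst to best for document m;
# B[m, t] is the score (1..T) given to the summarizer at position t.
A = np.array([
    [2, 0, 3, 1],
    [1, 3, 0, 2],
    [0, 2, 1, 3],
])
B = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
])

M, T = A.shape
C = np.zeros((M, T), dtype=int)
for m in range(M):
    for t in range(T):
        C[m, A[m, t]] = B[m, t]   # c_{m, a_mt} = b_{mt}

# C[m, t] is now the score of summarizer t on document m.
```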
Evaluation by One-way ANOVA
1: Input: data A_{M×T}, B_{M×T}, and significance level α.
2: Output: Ranks of summarizers.
3: The comparison matrix H_{T×T} is initialized as zero.
4: We get the feedback matrix C and the mean scores of the T summarizers, then initialize s_1 ≼ ... ≼ s_T.
5: for all possible pairs (i, j) satisfying i < j do
6:   if H_0^{(i,j)} is rejected at a given level α then
7:     s_i ≺ s_j, where ≺ means "is worse than".
8:     Let h_{it} = 1 for all t ≥ j.
9:   end if
10: end for
11: The summarizer s_t is ranked by the sum of the t-th column of H, where t = 1, ..., T.
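The ranking procedure above can be sketched as follows. The slides do not state which pairwise test realizes H_0^{(i,j)}; this sketch assumes a pairwise one-way ANOVA (a two-group F test) via `scipy.stats.f_oneway`, so it is an interpretation of the algorithm, not the authors' code.

```python
import numpy as np
from scipy.stats import f_oneway

def rank_summarizers(C, alpha=0.05):
    """Rank T summarizers from feedback matrix C (M documents x T summarizers).

    Summarizers are ordered by mean score (s_1 <= ... <= s_T); for each pair
    with significantly different means, the worse one is marked against the
    better one and all those ranked above it. Each summarizer's rank is the
    sum of its column of H (how many rivals it significantly beats)."""
    M, T = C.shape
    order = np.argsort(C.mean(axis=0))   # ascending mean score
    H = np.zeros((T, T), dtype=int)
    for a in range(T):
        for b in range(a + 1, T):
            i = order[a]
            _, p = f_oneway(C[:, i], C[:, order[b]])
            if p < alpha:                # H_0^{(i,j)} rejected: s_i worse
                H[i, order[b:]] = 1      # h_{it} = 1 for all t >= j
    return H.sum(axis=0)
```

Columns with larger sums correspond to summarizers that significantly outperform more rivals, matching the counts reported in the overall-performance table.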
Mean scores of four summarizers on the testing Brown corpus.

Reviewer  LDA       TSI       FS        OTS
1         2.541667  3.416667  2.083333  1.958333
2         2.375000  2.916667  2.250000  2.458333
3         2.916667  2.500000  2.125000  2.458333
4         2.791667  3.125000  2.208333  1.875000
5         2.791667  2.708333  2.416667  2.125000
6         2.416667  2.750000  2.666667  2.166667
7         2.333333  3.500000  2.291667  2.041667

- Four summarizers: the topic-based methods and two non-topic-driven references, Open Text Summarizer (OTS) and Free Summarizer (FS).
- Seven volunteers participated in the evaluation.
- For each document, the ordering of the summaries is shuffled (a blind review).
- From the viewpoint of the first reviewer: OTS ≺ FS ≺ LDA ≺ TSI.
- That TSI-based summarization outperforms the other methods can be verified by the one-way ANOVA of the mean scores.
Evaluation of overall performance of four summarizers at the significance level of 0.05

Reviewer  1  2  3  4  5  6  7
LDA       1  0  1  1  1  0  0
TSI       3  1  0  2  0  0  3
FS        0  0  0  0  0  0  0
OTS       0  0  0  0  0  0  0

(TSI, 4) = 2 means that there are 2 summarizers that are significantly worse than the TSI-based method, in the viewpoint of the 4-th reviewer.

The results show that:
- the topic-based summarizers are better than the non-topic-based methods, and
- the TSI-based method achieves the best performance.
Conclusions
- We proposed a novel deep probabilistic approach to:
  - indexing the latent topics and semantics of words in a collection of documents, and
  - applying the topic-semantic indexing (TSI) model to automatic summarization.
- The topic-based summarizers, together with two other non-topic-driven summarizers, FS and OTS, are tested on the Brown corpus in the public dataset of SemCor-3.0.
- The summaries are reviewed by humans.
- The performance of summarization is analyzed by a well-designed blind experiment; each summarizer is evaluated by ranks derived from hypothesis tests of one-way ANOVA.
- The experimental results show that TSI is a promising method for topic-driven summarization.
- In the present TSI-based summarization, each observation window is a word.
- Further work includes more experiments on several distinct sizes of observation windows, efficient extraction strategies and their ensemble learning, etc.