Extraction-Based Automatic Summarization


Transcript of Extraction-Based Automatic Summarization

Page 1: Extraction Based automatic summarization

Extraction-Based Automatic Summarization

Abdelaziz Al-Rihawi

Mohammad Kher Kabbaby

Faculty of Information Technology Engineering

Damascus University

Artificial Intelligence Department

Page 2: Extraction Based automatic summarization

Classification of Summarization Tasks

Page 3: Extraction Based automatic summarization

Summary Type

Extraction-based
• Extracts objects from the entire collection, without modifying the objects themselves.
• The goal is to select whole sentences (without modifying them).

Abstraction-based
• Rephrases the selected content to produce the final summary.

Page 4: Extraction Based automatic summarization

Use of External Resources

Knowledge-Poor
• Does not use any external resources to generate the summary.

Knowledge-Rich
• May utilize external corpora such as Wikipedia, or lexical resources such as WordNet or VerbOcean.
• These resources are used to unravel semantic relations between words, phrases, or sentences.

Page 5: Extraction Based automatic summarization

Task-Specific Constraints

Query-Focused
• A query is provided to the summarizer in addition to the source documents.
• The summarizer constructs a summary that contains the information requested by the query.

Update
• The purpose of an update summary is to identify new pieces of information in the more recent articles, with the assumption that the user has already read the previous ones.

Guided
• A set of aspects that should be covered in the summary is provided.

Page 6: Extraction Based automatic summarization

Summarization Workflow

Preprocessing

Sentence Representation

Similarity Measures

Content Selection

Page 7: Extraction Based automatic summarization

Preprocessing

Sentence Segmentation

Short Sentence Removal

Word Segmentation

Stop Word Removal

Short Word Removal

Stemming

Page 8: Extraction Based automatic summarization

Sentence Segmentation

Splitting a text into sentences using the unsupervised sentence boundary identification algorithm Punkt.
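As a minimal sketch of this step, NLTK ships a pre-trained Punkt model that can be used directly (the example text is illustrative):

    import nltk

    nltk.download("punkt", quiet=True)  # fetch the pre-trained Punkt sentence boundary model

    def segment_sentences(text):
        # Punkt is trained in an unsupervised way on raw text to detect sentence boundaries.
        return nltk.sent_tokenize(text)

    print(segment_sentences("Every weekend, students can take in a free concert. They also see stage plays."))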

Page 9: Extraction Based automatic summarization

Short Sentence Removal

Sentences containing fewer than three words are considered not informative enough or incorrectly segmented.

Page 10: Extraction Based automatic summarization

Word Segmentation

Splitting a sentence into words based on spaces and punctuation, using the Penn Treebank conventions to handle special cases.

Example: “weren’t” is split into two words, “were” and “n’t”.
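A small sketch of Treebank-style tokenization with NLTK (the sample sentence is illustrative):

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    # Penn Treebank conventions split contractions, so "weren't" becomes "were" + "n't".
    print(tokenizer.tokenize("The students weren't at the concert."))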

Page 11: Extraction Based automatic summarization

Stop Word Removal

Stop words like: { and, the, or, … }

A simple lookup function is defined to check whether a word is a stop word and remove it.
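A minimal sketch of such a filter; the stop word list and the is_stop_word helper name are illustrative, not taken from the original work:

    STOP_WORDS = {"and", "the", "or", "a", "an", "in", "of", "to", "is"}  # illustrative subset

    def is_stop_word(word):
        return word.lower() in STOP_WORDS

    def remove_stop_words(words):
        return [w for w in words if not is_stop_word(w)]

    print(remove_stop_words(["students", "in", "the", "city"]))  # -> ['students', 'city']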

Page 12: Extraction Based automatic summarization

Short Word Removal

Words shorter than three letters are considered to be non-content words.

Page 13: Extraction Based automatic summarization

Stemming

A rule-based stemming algorithm, the Porter stemmer.
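A small sketch using NLTK's Porter stemmer; the sample words mirror the preprocessing example on the next slides:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Suffix-stripping rules map e.g. "cities" -> "citi" and "chinese" -> "chines".
    print([stemmer.stem(w) for w in ["cities", "chinese", "students", "weekend"]])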

Page 14: Extraction Based automatic summarization

Output of Preprocessing

Example:

“ Every weekend, students in Zhengzhou can take in a free concert, a traditional Chinese opera or a stage play in the city’s Youth and Children’s Palace.”

Page 15: Extraction Based automatic summarization

Output of Preprocessing

Words:

{ weekend, student, zhengzhou, take, free, concert, trait, chines, opera, stage, play, citi, youth, children, palace}

Page 16: Extraction Based automatic summarization

Sentence Representation

Page 17: Extraction Based automatic summarization

Feature Selection

• Term as Feature

• Features are obtained by selecting unique words from the preprocessed sentences.

• Each sentence is represented as a context vector

• The vectors are accumulated into a representation matrix

Page 18: Extraction Based automatic summarization

Feature Selection

Term Count (TC)

Represents sentences as vectors whose elements are the absolute frequencies of words in the sentence.

Page 19: Extraction Based automatic summarization

Feature Selection

Term frequency-inverse sentence frequency (TF-ISF)

The same weighting scheme as TF-IDF, but sentences are used instead of documents.

TF(w, d) = TC(w, d) / |d|

IDF(w) = log( |D| / (|D(w)| + 1) )

where w is a word, |d| is the number of words in document d, |D| is the number of documents, and |D(w)| is the number of documents containing w; for TF-ISF, sentences take the place of documents.
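A minimal sketch of this weighting over preprocessed sentences, treating each sentence as a "document" as described above (the toy sentences are illustrative):

    import math
    from collections import Counter

    def tf_isf_vectors(sentences, vocabulary):
        n = len(sentences)
        # sentence frequency: number of sentences that contain each word
        sf = {w: sum(1 for s in sentences if w in s) for w in vocabulary}
        vectors = []
        for s in sentences:
            counts = Counter(s)
            vectors.append([(counts[w] / len(s)) * math.log(n / (sf[w] + 1)) for w in vocabulary])
        return vectors

    sents = [["art", "children", "educ"], ["art", "concert", "capit"]]
    vocab = sorted({w for s in sents for w in s})
    print(tf_isf_vectors(sents, vocab))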

Page 20: Extraction Based automatic summarization

Feature Selection

Latent semantic analysis:

A distributed representation model

The matrix V_k, or the matrix product S_k · V_k, obtained from the SVD is used as the sentence representation matrix.

Page 21: Extraction Based automatic summarization

Feature Selection

LSA Algorithm:

Step 1 - Create the count matrix
Step 2 - Modify the counts with TF-IDF
Step 3 - Apply the Singular Value Decomposition
Step 4 - Select sentences for the summary
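A numpy sketch of steps 1 and 3 (the TF-IDF re-weighting of step 2 is omitted for brevity); the toy matrix mirrors the count matrix on the next slide, transposed so that rows are terms and columns are sentences:

    import numpy as np

    # term-by-sentence count matrix (rows: art, concert, capit, citi, children, educ; columns: S1-S4)
    A = np.array([
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 1],
    ], dtype=float)

    U, S, Vt = np.linalg.svd(A, full_matrices=False)  # step 3: singular value decomposition

    k = 3  # number of latent dimensions kept
    # Each column of S_k * V_k^T represents one sentence in the k-dimensional latent space.
    sentence_repr = (S[:k, None] * Vt[:k]).T
    print(sentence_repr)  # one row per sentence, k latent dimensions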

Page 22: Extraction Based automatic summarization

Feature Selection

      art  concert  capit  citi  children  educ
S1     1      0       0      0      1       1
S2     1      1       1      0      0       0
S3     0      1       0      1      0       0
S4     0      0       0      1      0       1

Count Matrix

Page 23: Extraction Based automatic summarization

Feature Selection

      art   concert  capit  citi  children  educ
S1   0.23     0        0      0     0.46    0.23
S2   0.23    0.23     0.46    0      0       0
S3    0      0.35      0     0.35    0       0
S4    0       0        0     0.35    0      0.35

TF-IDF Representation Matrix

Page 24: Extraction Based automatic summarization

Feature Selection

      D1     D2     D3
S1   0.23    0      0
S2   0.23   0.23   0.46
S3    0     0.35    0
S4    0      0      0

LSA Representation Matrix

Page 25: Extraction Based automatic summarization

Similarity Measures

Page 26: Extraction Based automatic summarization

Similarity Measures

Corpus-based
• Measures use term frequencies observed in a corpus to relate contexts to each other.

Knowledge-based
• Measures use predefined semantic relations between terms obtained from lexical resources.

Page 27: Extraction Based automatic summarization

Similarity Measures

Jaccard similarity coefficient

A set-based similarity metric used for measuring similarities between sentences under the TC representation.
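A minimal sketch of the Jaccard coefficient over two sentences treated as sets of words (the sample sentences are illustrative):

    def jaccard_similarity(sentence_a, sentence_b):
        a, b = set(sentence_a), set(sentence_b)
        if not a or not b:
            return 0.0
        # size of the intersection divided by the size of the union
        return len(a & b) / len(a | b)

    s1 = {"weekend", "student", "concert"}
    s2 = {"student", "concert", "opera"}
    print(jaccard_similarity(s1, s2))  # 2 shared words out of 4 distinct -> 0.5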

Page 28: Extraction Based automatic summarization

Similarity Measures

Cosine similarity

A vector-based similarity metric used for representations with real-valued weights, such as TF-ISF and LSA.
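A small sketch of cosine similarity between two real-valued sentence vectors, e.g. rows of the TF-ISF or LSA representation matrix (the example vectors are illustrative):

    import numpy as np

    def cosine_similarity(u, v):
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        # dot product of the vectors divided by the product of their lengths
        return float(u @ v / denom) if denom else 0.0

    print(cosine_similarity([0.23, 0.0, 0.46], [0.23, 0.23, 0.0]))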

Page 29: Extraction Based automatic summarization

Sentence Selection

Page 30: Extraction Based automatic summarization

Sentence Selection

The goal of the selection procedure is to identify a set of sentences that contain important information.

Three criteria are optimized when selecting the sentences:
1. Relevance
2. Redundancy
3. Length

Maximize the relevance while minimizing the redundancy

Page 31: Extraction Based automatic summarization

Sentence Selection

The selection of sentences can be handled by either:

1. Supervised Methods

2. Unsupervised Methods

Page 32: Extraction Based automatic summarization

Sentence Selection

Supervised Methods:

• use a classifier trained on a set of documents coupled with corresponding extracts

• Sentences can be labeled with a binary value:
  • 1 - the sentence is included in the extract
  • 0 - the sentence is not included in the extract

• each sentence should be represented by a feature vector

• a classifier is trained on a set of feature vectors

Page 33: Extraction Based automatic summarization

Sentence Selection

Supervised Methods:

1. Cue phrases and topic terms
2. Position of a sentence in a document
3. Centrality of a sentence
   • for example, the similarity between a sentence and the other sentences
4. Length of a sentence
   • for example, the number of open-class words (i.e. nouns, main verbs, adjectives, adverbs) in a sentence
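A hypothetical sketch of how such a feature vector might be assembled for one sentence; the feature set, helper name, and cue phrases are illustrative, not taken from the original work:

    def sentence_features(words, position, doc_len, centrality, cue_phrases=("in conclusion", "in summary")):
        text = " ".join(words).lower()
        return [
            sum(phrase in text for phrase in cue_phrases),  # cue-phrase count
            position / doc_len,                             # relative position in the document
            centrality,                                     # e.g. average similarity to the other sentences
            len(words),                                     # sentence length in words
        ]

    # A binary classifier would then be trained on such vectors with 1/0 labels
    # indicating whether each sentence appears in the reference extract.
    print(sentence_features(["students", "take", "free", "concert"], position=0, doc_len=10, centrality=0.42))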

Page 34: Extraction Based automatic summarization

Sentence Selection

Unsupervised Methods:

• Unsupervised summarization algorithms are either centroid-based or centrality-based.

Page 35: Extraction Based automatic summarization

Sentence Selection

Centroid-based Algorithm:

• Select sentences that contain informative words

• Referred to as topic signatures

• Calculation of informativeness:
  • using popular weighting schemes such as TF-IDF or the log-likelihood ratio

Page 36: Extraction Based automatic summarization

Sentence Selection

Centroid-based Algorithm:

• When the similarity between each pair of sentences is available

• the centrality of a sentence S can be taken as the average of the similarities between S and all the other sentences

• The algorithms described above rely on superficial features, ignoring higher-level semantic information such as semantic relations between terms, which motivates the use of graph-based methods.
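A small sketch of the average-similarity centrality described above, given a precomputed sentence similarity matrix (the matrix values are illustrative):

    import numpy as np

    def centrality_scores(sim_matrix):
        sim = np.asarray(sim_matrix, dtype=float)
        n = sim.shape[0]
        # average similarity of each sentence to all the others (self-similarity excluded)
        return (sim.sum(axis=1) - sim.diagonal()) / (n - 1)

    sim = [[1.0, 0.5, 0.1],
           [0.5, 1.0, 0.3],
           [0.1, 0.3, 1.0]]
    print(np.argsort(centrality_scores(sim))[::-1])  # sentence indices ranked by centrality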

Page 37: Extraction Based automatic summarization

Use of Graphs in AutomaticSummarization

Page 38: Extraction Based automatic summarization

Graph Representations

• Similarity graph
• Event graph

Page 39: Extraction Based automatic summarization

Graph Representations

• Similarity relations.

• Semantic relations such as semantic roles, cause-consequences, specifications, time relations.

Page 40: Extraction Based automatic summarization

Centrality Measures

• Graph theory and network analysis provide a great number of different methods and algorithms for working with graphs

• The lengths of edges in this graph correspond to actual distances between nodes

• The size of a node is modified according to the centrality of that node

• The centrality is calculated using a specific centrality measure, so that more central nodes appear larger

Page 41: Extraction Based automatic summarization

Centrality Measures

• Degree-based Methods

• Path-based Methods
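As one example, a degree-based measure can be sketched as follows: connect sentences whose similarity exceeds a threshold and count each node's neighbours (the threshold and matrix values are illustrative):

    import numpy as np

    def degree_centrality(sim_matrix, threshold=0.2):
        sim = np.asarray(sim_matrix, dtype=float)
        # adjacency: edges between different sentences with similarity above the threshold
        adjacency = (sim > threshold) & ~np.eye(len(sim), dtype=bool)
        return adjacency.sum(axis=1)

    sim = [[1.0, 0.5, 0.1],
           [0.5, 1.0, 0.3],
           [0.1, 0.3, 1.0]]
    print(degree_centrality(sim))  # neighbour counts per sentence; higher means more central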

Page 42: Extraction Based automatic summarization

Reference

Extraction-Based Automatic Summarization
Gleb Sizov
Master of Science in Computer Science, June 2010

Page 43: Extraction Based automatic summarization

Thanks!!