Text summarization
Dragomir R. Radev
School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics
University of Michigan
http://www.si.umich.edu/~radev
Tutorial
ACM SIGIR
New Orleans, Louisiana
September 9, 2001
Part I Introduction
The BIG problem
• Information overload: 1.39 billion URLs catalogued by Google
• Possible approaches:
– information retrieval
– document clustering
– information extraction
– visualization
– question answering
– text summarization
Some concepts
• Abstracts: “a concise summary of the central subject matter of a document” [Paice90].
• Indicative, informative, and critical summaries
• Extracts (representative sentences)
Informative summaries
Lines sometimes blurred
Net Tax Moratorium Clears House
The House passed a bill to extend the current moratorium on new Internet taxes until 2006. The moratorium forbids states from trying to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers.
House Votes to Ban Internet Taxes for 5 More Years
By LIZETTE ALVAREZ
WASHINGTON, May 10 -- In a Republican bid to woo the high-technology industry and please taxpayers, the House today rushed to the floor and then handily passed a bill to extend the current moratorium on new Internet taxes until 2006.
The moratorium, which is due to expire in October 2001, forbids states to try to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers.
The legislation passed today, which faces an uncertain future in the Senate, does not directly address the question of sales taxes; it would not stop states from trying to collect taxes for goods sold on the Internet.
By failing to address sales taxes, however, the measure alarmed some traditional retailers, as well as state governments that say they have found it nearly impossible to collect taxes for goods sold online.
"The single largest contributor to our economic prosperity has been the growth of information technology -- the Internet," said Representative John R. Kasich, an Ohio Republican. "Why would we try to tax something, why would we try to abuse something, why would we try to limit something that generates unprecedented growth, wealth, opportunity and unprecedented individual power?"
Critics of the bill say the moratorium, while seemingly benign, ignores the thorny question of how state and local governments can best collect taxes on the billions of dollars of merchandise sold over the Internet each year. These taxes are expected to provide a crucial future source of revenue for states, especially as more consumers buy goods online.
The bill's opponents -- a consortium of retailers, small-business groups and governors -- say that consumers who buy merchandise over the Internet can easily circumvent the sales and "use" taxes that would be collected automatically if the same merchandise is bought at a bricks-and-mortar retail store.
The National Governors' Association is working on the best way to collect electronic sales tax. Estimates have put the loss in sales tax revenue to the states at $8 billion a year by 2004.
http://www.nytimes.com/library/tech/00/05/biztech/articles/11tax.html
Retailers and small businesses have complained that the current system unfairly places at a disadvantage the traditional retailers that do not sell their wares online and must charge sales tax.
"It's easy to imagine how these kinds of losses can affect state and local governments' ability to provide essential services," said Representative William D. Delahunt, a Massachusetts Democrat, citing the concerns of many governors. "They will be compelled to cut back local services or raise income taxes or property taxes."
The bill even drew criticism from a few Republicans. Representative Ernest J. Istook Jr. of Oklahoma circulated a letter stating, "The Internet should not be singled out to be taxed, nor to be freed from tax."
Still, the House voted overwhelmingly, 352 to 75, to pass the bill. A number of Democrats approved the measure after they received assurance that Congress would hold hearings concerning sales taxes and would try to come up with a solution.
The moratorium "has absolutely nothing to do with the sales tax -- we will have the opportunity to have that debate," said Representative Robert Goodlatte, a Virginia Republican.
The House bill faces a murkier future in the Senate. Senator John McCain, chairman of the Commerce Committee, who advocates a permanent tax moratorium, canceled a hearing on the bill last month after Republican senators, some of them former governors, expressed reservations about extending the moratorium.
The legislation also faces opposition from the Clinton administration, which signaled support today for a two-year moratorium. The full House today rejected a two-year extension in a separate vote.
Gov. George W. Bush, the likely Republican presidential nominee, has said he will support an extension of the moratorium. But the governor must tread carefully around the issue because Texas, which does not have a state income tax, would stand to lose substantial revenue if sales taxes are not made workable on the Internet.
A spokesman for Al Gore said the vice president supported a two-year extension of the moratorium "at a minimum." If a five-year moratorium is put into place, "it should include flexibility" to adjust federal policies on Internet taxation "to take into account the fast-paced change in the Internet world."
Types of summaries
• dimensions
• genres
• context
Dimensions
• Single-document vs. multi-document
Genres
• headlines
• outlines
• minutes
• biographies
• abridgments
• sound bites
• movie summaries
• chronologies, etc.
[Mani and Maybury 1999]
Context
• Query-specific
• Query-independent
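One way to read the query-specific versus query-independent distinction is as a change in how sentences are scored. The sketch below is an illustration only, not a method taken from the tutorial: `rank_sentences`, `query_overlap`, the `weight` parameter, and the generic scores passed in are all hypothetical names and values chosen for the example.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_overlap(sentence, query):
    """Fraction of query terms that also appear in the sentence."""
    q = tokens(query)
    if not q:
        return 0.0
    return len(tokens(sentence) & q) / len(q)


def rank_sentences(sentences, generic_scores, query=None, weight=0.5):
    """Return sentence indices, best first.

    With query=None the ranking is query-independent and uses only the
    generic scores. With a query, each generic score is blended with the
    sentence's overlap with the query (weight is an assumed mixing factor).
    """
    if query is None:
        combined = list(generic_scores)
    else:
        combined = [
            (1 - weight) * g + weight * query_overlap(s, query)
            for s, g in zip(sentences, generic_scores)
        ]
    return sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)


sentences = [
    "The House passed the bill.",
    "Texas has no state income tax.",
    "Critics raised concerns.",
]
generic = [0.9, 0.2, 0.5]  # made-up query-independent salience scores

generic_ranking = rank_sentences(sentences, generic)
query_ranking = rank_sentences(sentences, generic, query="Texas income tax")
```

With these made-up scores, the query-independent ranking puts the first sentence on top, while the query about Texas promotes the second sentence.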
What does summarization involve?
• Three stages (typically)
– content identification
– conceptual organization
– realization
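The three stages can be sketched as a toy extractive pipeline. This is a minimal illustration under stated assumptions, not the tutorial's own system: word-frequency scoring stands in for content identification, restoring document order stands in for conceptual organization, and plain concatenation stands in for realization. The function names and the small `STOPWORDS` set are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "is", "that"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def identify_content(sentences):
    """Stage 1: score each sentence by the mean frequency of its content words."""
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    return [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]


def organize(scores, k):
    """Stage 2: keep the k best-scoring sentences and restore document order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def realize(sentences, indices):
    """Stage 3: emit the selected sentences as running text."""
    return " ".join(sentences[i] for i in indices)


def summarize(text, k=2):
    sentences = split_sentences(text)
    scores = identify_content(sentences)
    return realize(sentences, organize(scores, k))


text = (
    "The House passed a tax bill. "
    "The bill extends the tax moratorium. "
    "Critics were unhappy."
)
summary = summarize(text, k=2)
```

An abstractive system would replace the last stage with real text generation; here realization is deliberately trivial so that the division of labor between the stages stays visible.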
Spärck Jones’s three sets of factors
• Input factors (source form, subject type, unit)
• Purpose factors (situation, audience, use)
• Output factors (material, format, style)
[Spärck Jones 99]
http://transend.labs.bt.com/prosum/word/index.html
ProSum
• Profile-based summarization
• Control of summarization length
• Retention of user-defined text
• Customizable heading treatment
• Customizable table treatment
• Customizable text differentiation
Example (New York Times)
Net Tax Moratorium Clears House
The House passed a bill to extend the current moratorium on new Internet taxes until 2006. The moratorium forbids states from trying to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers.
Microsoft Autosummarize output
House Votes to Ban Internet Taxes for 5 More Years
The moratorium, which is due to expire in October 2001, forbids states to try to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers. By failing to address sales taxes, however, the measure alarmed some traditional retailers, as well as state governments that say they have found it nearly impossible to collect taxes for goods sold online.
10% summary
Microsoft Autosummarize output
House Votes to Ban Internet Taxes for 5 More Years
The moratorium, which is due to expire in October 2001, forbids states to try to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers. The legislation passed today, which faces an uncertain future in the Senate, does not directly address the question of sales taxes; it would not stop states from trying to collect taxes for goods sold on the Internet. By failing to address sales taxes, however, the measure alarmed some traditional retailers, as well as state governments that say they have found it nearly impossible to collect taxes for goods sold online. The National Governors' Association is working on the best way to collect electronic sales tax. Representative Ernest J. Istook Jr. of Oklahoma circulated a letter stating, "The Internet should not be singled out to be taxed, nor to be freed from tax." Senator John McCain, chairman of the Commerce Committee, who advocates a permanent tax moratorium, canceled a hearing on the bill last month after Republican senators, some of them former governors, expressed reservations about extending the moratorium.
25% summary
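The 10% and 25% summaries differ only in their length budget. How Microsoft's tool picks sentences is not described in these slides, so the sketch below is not a reconstruction of it; it only illustrates one plausible way to honor a compression rate once sentences already have scores. The function name `select_by_rate` and the sample scores are hypothetical.

```python
def select_by_rate(sentences, scores, rate):
    """Greedy extract under a word budget of rate * (total words).

    Sentences are considered from highest to lowest score; a sentence that
    would exceed the budget is skipped and shorter ones may still fit.
    The chosen sentences are returned in their original document order.
    """
    total_words = sum(len(s.split()) for s in sentences)
    budget = rate * total_words
    chosen, used = [], 0
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        n = len(sentences[i].split())
        if used + n <= budget:
            chosen.append(i)
            used += n
    return [sentences[i] for i in sorted(chosen)]


# Toy input: 12 words in total, with made-up salience scores.
sentences = ["a b c d", "e f", "g h i j", "k l"]
scores = [0.9, 0.8, 0.1, 0.5]

half = select_by_rate(sentences, scores, 0.5)      # budget: 6 words
quarter = select_by_rate(sentences, scores, 0.25)  # budget: 3 words
```

At the tighter rate the best-scoring sentence no longer fits, so the extract changes content as well as length. Measuring the budget in words rather than sentences is a design choice made here for the example.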
Outline
I Introduction
II Traditional approaches
III Multi-document summarization
IV Knowledge-rich techniques
V Evaluation methods
VI The MEAD project
VII Language modeling
Part II Traditional approaches
Human summarization and abstracting
• What professional abstractors do
• Ashworth: “To take an original article, understand it and pack it neatly into a nutshell without loss of substance or clarity presents a challenge which many have felt worth taking up for the joys of achievement alone. These are the characteristics of an art form.”
Borko and Bernier 75
• The abstract and its use:
– Abstracts promote current awareness
– Abstracts save reading time
– Abstracts facilitate selection
– Abstracts facilitate literature searches
– Abstracts improve indexing efficiency
– Abstracts aid in the preparation of reviews
Cremmins 82, 96
• American National Standard for Writing Abstracts:
– State the purpose, methods, results, and conclusions presented in the original document, either in that order or with an initial emphasis on results and conclusions.
– Make the abstract as informative as the nature of the document will permit, so that readers may decide, quickly and accurately, whether they need to read the entire document.
– Avoid including background information or citing the work of others in the abstract, unless the study is a replication or evaluation of their work.
![Page 32: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/32.jpg)
Cremmins 82, 96
– Do not include information in the abstract that is not contained in the textual material being abstracted.
– Verify that all quantitative and qualitative information used in the abstract agrees with the information contained in the full text of the document.
– Use standard English and precise technical terms, and follow conventional grammar and punctuation rules.
– Give expanded versions of lesser known abbreviations and acronyms, and verbalize symbols that may be unfamiliar to readers of the abstract.
– Omit needless words, phrases, and sentences.
![Page 33: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/33.jpg)
Cremmins 82, 96
• Original version:
There were significant positive associations between the concentrations of the substance administered and mortality in rats and mice of both sexes.
There was no convincing evidence to indicate that endrin ingestion induced any of the different types of tumors which were found in the treated animals.
• Edited version:
Mortality in rats and mice of both sexes was dose related.
No treatment-related tumors were found in any of the animals.
![Page 34: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/34.jpg)
Redundancy of English
• 75% redundancy of English [Shannon 51]
• [Burton & Licklider 55] show that humans are as good at guessing the next letter after seeing 32 letters as after 10,000 letters.
![Page 35: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/35.jpg)
Morris et al. 92
• Reading comprehension of summaries
• Compare manual abstracts, Edmundson-style extracts, and full documents
• Extracts containing 20% or 30% of original document are effective surrogates of original document
• Performance on 20% and 30% extracts is no different than informative abstracts
![Page 36: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/36.jpg)
Extraction models
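• Extracts vs. abstracts
• Linear model
• Text structure based
• New techniques

$$\text{Compression Ratio} = \frac{|S|}{|D|}$$

$$\text{Retention Ratio} = \frac{i(S)}{i(D)}$$

where $i(\cdot)$ is the information content of the summary $S$ or the document $D$.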
![Page 37: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/37.jpg)
Text compaction techniques
Missam ad amicum pro consolatione epistolam, dilectissime, vestram ad me forte quidam nuper attulit.
Quam ex ipsa statim tituli fronte vestram esse considerans, tanto ardentius eam cepi legere quanto scriptorem ipsum karius amplector, ut cuius rem perdidi verbis saltem tanquam eius quadam imagine recreer.
Erant, memini, huius epistole fere omnia felle et absintio plena, que scilicet nostre conversionis miserabilem hystoriam et tuas, unice, cruces assiduas referebant.
Complesti revera in epistola illa quod in exordio eius amico promisisti, ut videlicet in comparatione tuarum suas molestias nullas vel parvas reputaret; ubi quidem expositis prius magistrorum tuorum in te persequutionibus, deinde in corpus tuum summe proditionis iniuria, ad condiscipulorum quoque tuorum Alberici videlicet Remensis et Lotulfi Lumbardi execrabilem invidiam et infestationem nimiam stilum contulisti.
Missam ad amicum pro consolatione epistolam, dilectissime, vestram ad me forte quidam nuper attulit.
Erant, memini, huius epistole fere omnia felle et absintio plena, que scilicet nostre conversionis miserabilem hystoriam et tuas, unice, cruces assiduas referebant.
![Page 38: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/38.jpg)
Text compaction techniques
Missam ad amicum pro consolatione epistolam, dilectissime, vestram ad me forte quidam nuper attulit.
Erant, memini, huius epistole fere omnia felle et absintio plena, que scilicet nostre conversionis miserabilem hystoriam et tuas, unice, cruces assiduas referebant.
Missam vestram nuper attulit.
Erant, scilicet nostre conversionis miserabilem hystoriam referebant.
![Page 39: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/39.jpg)
Luhn 58
• Very first work in automated summarization
• Computes measures of significance
• Words:
– stemming
– bag of words
[Figure: word FREQUENCY (vertical axis) plotted against WORDS (horizontal axis). Caption: Resolving power of significant words]
![Page 40: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/40.jpg)
Luhn 58
• Sentences:
– concentration of high-score words
• Cutoff values established in experiments with 100 human subjects
[Figure: a SENTENCE in which a bracketed span of 7 words (ALL WORDS, numbered 1 to 7) contains 4 SIGNIFICANT WORDS, marked *]
$\text{SCORE} = 4^2 / 7 \approx 2.3$
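Luhn's sentence score can be sketched in Python as follows. This is a minimal illustration, not Luhn's original procedure: the tokenizer, the `max_gap` value of 4, and the absence of stemming are assumptions made here, and the set of significant words is taken as given rather than derived from frequency cutoffs.

```python
import re

def luhn_score(sentence, significant, max_gap=4):
    """Score a sentence by its densest cluster of significant words.

    A cluster starts and ends with a significant word and never contains
    more than `max_gap` insignificant words in a row (assumed value).
    Cluster score = (significant words in cluster)^2 / (cluster length).
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1             # still inside the current cluster
        else:
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1  # open a new cluster
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

# Four significant words inside a seven-word span, as in the slide's figure.
example = "alpha filler beta gamma filler filler delta"
print(luhn_score(example, {"alpha", "beta", "gamma", "delta"}))
```

The example reproduces the slide's arithmetic: four significant words in a span of seven give 16/7, about 2.3.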
![Page 41: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/41.jpg)
Edmundson 69
• Cue method:
– stigma words (“hardly”, “impossible”)
– bonus words (“significant”)
• Key method:
– similar to Luhn
• Title method:
– title + headings
• Location method:
– sentences under headings
– sentences near beginning or end of document and/or paragraphs (also [Baxendale 58])
![Page 42: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/42.jpg)
Edmundson 69
• Linear combination of four features:
$$\alpha_1 C + \alpha_2 K + \alpha_3 T + \alpha_4 L$$
• Manually labelled training corpus
• Key not important!
[Figure: bar chart on a 0 to 100% scale comparing RANDOM, KEY, TITLE, CUE, LOCATION, C + K + T + L, and C + T + L]
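The linear combination of the cue (C), key (K), title (T), and location (L) scores can be sketched as below. The weights and feature values are placeholders chosen for illustration, not Edmundson's trained values; setting the key weight to zero mirrors the C + T + L configuration from the comparison.

```python
FEATURES = ("cue", "key", "title", "location")

def edmundson_score(features, weights):
    """Weighted sum of the four per-sentence feature scores."""
    return sum(weights[name] * features[name] for name in FEATURES)

def rank_sentences(sentence_features, weights):
    """Return sentence indices ordered from highest to lowest score."""
    scores = [edmundson_score(f, weights) for f in sentence_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical weights: key method switched off (C + T + L).
weights = {"cue": 1.0, "key": 0.0, "title": 1.0, "location": 1.0}
sentences = [
    {"cue": 0.0, "key": 3.0, "title": 0.0, "location": 0.0},
    {"cue": 1.0, "key": 0.0, "title": 2.0, "location": 1.0},
    {"cue": -1.0, "key": 1.0, "title": 0.0, "location": 2.0},
]
print(rank_sentences(sentences, weights))
```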
![Page 43: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/43.jpg)
Paice 90
• Survey up to 1990
• Techniques that (mostly) failed:
– syntactic criteria [Earl 70]
– indicator phrases (“The purpose of this article is to review…”)
• Problems with extracts:
– lack of balance
– lack of cohesion
  • anaphoric reference
  • lexical or definite reference
  • rhetorical connectives
![Page 44: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/44.jpg)
Paice 90
• Lack of balance
– later approaches based on text rhetorical structure
• Lack of cohesion
– recognition of anaphors [Liddy et al. 87]
• Example: “that” is
– nonanaphoric if preceded by a research-verb (e.g., “demonstrat-”),
– nonanaphoric if followed by a pronoun, article, quantifier,…,
– external if no later than 10th word, else
– internal
![Page 45: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/45.jpg)
Brandow et al. 95
• ANES: commercial news from 41 publications
• “Lead” achieves acceptability of 90% vs. 74.4% for “intelligent” summaries
• 20,997 documents
• words selected based on tf*idf
• sentence-based features:
– signature words
– location
– anaphora words
– length of abstract
![Page 46: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/46.jpg)
Brandow et al. 95
• Sentences with no signature words are included if between two selected sentences
• Evaluation done at 60, 150, and 250 word length
• Non-task-driven evaluation:
“Most summaries judged less-than-perfect would not be detectable as such to a user”
![Page 47: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/47.jpg)
Lin & Hovy 97
• Optimum position policy
• Measuring yield of each sentence position against keywords (signature words) from Ziff-Davis corpus
• Preferred order
[(T) (P2,S1) (P3,S1) (P2,S2) {(P4,S1) (P5,S1) (P3,S2)} {(P1,S1) (P6,S1) (P7,S1) (P1,S3)(P2,S3) …]
![Page 48: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/48.jpg)
Kupiec et al. 95
• Extracts of roughly 20% of original text
• Feature set:
– sentence length
  • |S| > 5
– fixed phrases
  • 26 manually chosen
– paragraph
  • sentence position in paragraph
– thematic words
  • binary: whether sentence is included in manual extract
– uppercase words
  • not common acronyms
• Corpus:
  • 188 document + summary pairs from scientific journals
![Page 49: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/49.jpg)
Kupiec et al. 95
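• Uses Bayesian classifier:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$$

• Assuming statistical independence:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$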
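The independence form of the classifier can be sketched in Python by estimating each probability from counts over labelled sentences. This is an illustration under stated assumptions, not the authors' implementation: the feature name `lead`, the toy training data, and the choice to skip feature values never seen in training are all hypothetical, and no smoothing is applied.

```python
from collections import defaultdict

def train(examples):
    """examples: list of (features, in_summary) pairs with discrete feature values.

    Assumes at least one example is labelled as being in the summary.
    """
    n = len(examples)
    n_pos = sum(1 for _, in_summary in examples if in_summary)
    cond = defaultdict(int)  # counts of (feature, value) among summary sentences
    marg = defaultdict(int)  # counts of (feature, value) among all sentences
    for feats, in_summary in examples:
        for item in feats.items():
            marg[item] += 1
            if in_summary:
                cond[item] += 1
    return {"n": n, "n_pos": n_pos, "cond": cond, "marg": marg}

def score(feats, model):
    """P(s in S) * prod_j P(F_j | s in S) / prod_j P(F_j), estimated from counts."""
    result = model["n_pos"] / model["n"]            # prior P(s in S)
    for item in feats.items():
        if model["marg"][item] == 0:
            continue                                # unseen value: skipped (assumption)
        p_f_given_s = model["cond"][item] / model["n_pos"]
        p_f = model["marg"][item] / model["n"]
        result *= p_f_given_s / p_f
    return result

# Toy data: one binary feature marking whether the sentence opens a paragraph.
data = [
    ({"lead": True}, True),
    ({"lead": True}, True),
    ({"lead": False}, False),
    ({"lead": True}, False),
]
model = train(data)
print(score({"lead": True}, model))
```

With a single feature the product reduces to Bayes' rule, so the score for `lead=True` equals the fraction of lead sentences that appear in the summary (two out of three in the toy data).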
![Page 50: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/50.jpg)
Kupiec et al. 95
• Performance:
– For 25% summaries, 84% precision
– For smaller summaries, 74% improvement over Lead
![Page 51: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/51.jpg)
Salton et al. 97
• document analysis based on semantic hyperlinks (among pairs of paragraphs related by a lexical similarity significantly higher than random)
• Bushy paths (or paths connecting highly connected paragraphs) are more likely to contain information central to the topic of the article
![Page 52: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/52.jpg)
Salton et al. 97
…
…
![Page 53: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/53.jpg)
Salton et al. 97
Overlap between manual extracts: 46%

| Algorithm | Optimistic | Pessimistic | Intersection | Union |
|---|---|---|---|---|
| Global bushy | 45.60% | 30.74% | 47.33% | 55.16% |
| Global depth-first | 43.98% | 27.76% | 42.33% | 52.48% |
| Segmented bushy | 45.48% | 26.37% | 38.17% | 52.95% |
| Random | 39.16% | 22.07% | 38.47% | 44.24% |
![Page 54: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/54.jpg)
Marcu 97-99
• Based on RST (nucleus+satellite relations)
• text coherence
• 70% precision and recall in matching the most important units in a text
• Example: evidence
[The truth is that the pressure to smoke in junior high is greater than it will be any other time of one’s life:][we know that 3,000 teens start smoking each day.]
• N+S combination increases R’s belief in N [Mann and Thompson 88]
![Page 55: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/55.jpg)
[Figure: RST tree over the ten text units below. Each internal node is labelled with its promoted unit and relation.]

Text units:
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) Most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.

Node labels in the tree: 2 Elaboration; 2 Elaboration; 8 Example; 2 Background, Justification; 3 Elaboration; 8 Concession; 10 Antithesis; 4, 5 Contrast; 5 Evidence; Cause
![Page 56: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/56.jpg)
Barzilay and Elhadad 97
• Lexical chains [Stairmand 96]
Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
![Page 57: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/57.jpg)
Barzilay and Elhadad 97
• WordNet-based
• three types of relations:
– extra-strong (repetitions)
– strong (WordNet relations)
– medium-strong (link between synsets is longer than one + some additional constraints)
![Page 58: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/58.jpg)
Barzilay and Elhadad 97
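• Scoring chains:
– Length
– Homogeneity index:

$$\text{Homogeneity} = 1 - \frac{\#\,\text{distinct words in chain}}{\text{Length}}$$

$$\text{Score} = \text{Length} \times \text{Homogeneity}$$

• Strength criterion:

$$\text{Score} > \text{Average} + 2 \times \text{st.dev.}$$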
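The chain scoring and threshold can be sketched as below, assuming the homogeneity index divides the number of distinct words by the chain length. A chain is represented here as a plain list of word occurrences, which is an assumption for illustration; building the chains from WordNet is not shown, and the use of the population standard deviation (`pstdev`) is also a choice made here.

```python
from statistics import mean, pstdev

def chain_score(chain):
    """Score = Length * Homogeneity, with Homogeneity = 1 - distinct / length."""
    length = len(chain)
    if length == 0:
        return 0.0
    homogeneity = 1 - len(set(chain)) / length
    return length * homogeneity

def strong_chains(chains):
    """Keep chains whose score exceeds average + 2 * standard deviation."""
    scores = [chain_score(c) for c in chains]
    threshold = mean(scores) + 2 * pstdev(scores)
    return [c for c, s in zip(chains, scores) if s > threshold]

# Hypothetical chain loosely based on the anesthetic-machine passage.
machine = ["machine", "micro-computer", "machine", "device", "micro-computer", "pump"]
print(chain_score(machine))  # 6 occurrences, 4 distinct words
```

Because the threshold sits two standard deviations above the mean, only a small number of unusually repetitive chains pass it in any one document.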
![Page 59: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/59.jpg)
Other approaches
• Salience-based [Boguraev and Kennedy 97]
• Computational linguistics papers [Teufel and Moens 97]
![Page 60: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/60.jpg)
Part III Multi-document summarization
![Page 61: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/61.jpg)
Mani & Bloedorn 97,99
• Summarizing differences and similarities across documents
• Single event or a sequence of events
• Text segments are aligned
• Evaluation: TREC relevance judgments
• Significant reduction in time with no significant loss of accuracy
![Page 62: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/62.jpg)
Carbonell & Goldstein 98
• Maximal Marginal Relevance (MMR)
• Query-based summaries
• Law of diminishing returns
C = doc collection
Q = user query
R = IR(C, Q, θ)
S = already retrieved documents
Sim = similarity metric used

$$\text{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda\, \text{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \text{Sim}_2(D_i, D_j) \right]$$
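A greedy selection loop for MMR can be sketched as follows. The similarity functions are passed in as callables, and the names `mmr_select`, `query_sim`, `pair_sim`, `lam`, and the toy similarity values are all hypothetical; the candidate list stands in for the retrieved set R.

```python
def mmr_select(candidates, query_sim, pair_sim, lam=0.7, k=3):
    """Pick up to k items, trading query relevance against redundancy.

    query_sim(d)   -> similarity of d to the query (Sim1)
    pair_sim(d, s) -> similarity between two candidates (Sim2)
    lam            -> weight on relevance; 1.0 ignores redundancy
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(d):
            # Redundancy is the closest match among items already selected.
            redundancy = max((pair_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim(d) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" is less relevant but novel.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
overlap = {frozenset("ab"): 0.95, frozenset("ac"): 0.1, frozenset("bc"): 0.1}
sim1 = relevance.get
sim2 = lambda x, y: overlap[frozenset((x, y))]

print(mmr_select(["a", "b", "c"], sim1, sim2, lam=0.5))
print(mmr_select(["a", "b", "c"], sim1, sim2, lam=1.0))
```

With equal weight on relevance and novelty, the near-duplicate is pushed behind the novel item; with `lam=1.0` the ordering falls back to plain relevance ranking.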
![Page 63: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/63.jpg)
Radev et al. 00
• MEAD
• Centroid-based
• Based on sentence utility
• Topic detection and tracking initiative [Allen et al. 98, Wayne 98]
[Figure: timeline; axis label: TIME]
![Page 64: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/64.jpg)
1. Algerian newspapers have reported that 18 decapitated bodies have been found by authorities in the south of the country.
2. Police found the ``decapitated bodies of women, children and old men, with their heads thrown on a road'' near the town of Jelfa, 275 kilometers (170 miles) south of the capital Algiers.
3. In another incident on Wednesday, seven people -- including six children -- were killed by terrorists, Algerian security forces said.
4. Extremist Muslim militants were responsible for the slaughter of the seven people in the province of Medea, 120 kilometers (74 miles) south of Algiers.
5. The killers also kidnapped three girls during the same attack, authorities said, and one of the girls was found wounded on a nearby road.
6. Meanwhile, the Algerian daily Le Matin today quoted Interior Minister Abdul Malik Silal as saying that ``terrorism has not been eradicated, but the movement of the terrorists has significantly declined.''
7. Algerian violence has claimed the lives of more than 70,000 people since the army cancelled the 1992 general elections that Islamic parties were likely to win.
8. Mainstream Islamic groups, most of which are banned in the country, insist their members are not responsible for the violence against civilians.
9. Some Muslim groups have blamed the army, while others accuse ``foreign elements conspiring against Algeria.''
1. Eighteen decapitated bodies have been found in a mass grave in northern Algeria, press reports said Thursday, adding that two shepherds were murdered earlier this week.
2. Security forces found the mass grave on Wednesday at Chbika, near Djelfa, 275 kilometers (170 miles) south of the capital.
3. It contained the bodies of people killed last year during a wedding ceremony, according to Le Quotidien Liberte.
4. The victims included women, children and old men.
5. Most of them had been decapitated and their heads thrown on a road, reported the Es Sahafa.
6. Another mass grave containing the bodies of around 10 people was discovered recently near Algiers, in the Eucalyptus district.
7. The two shepherds were killed Monday evening by a group of nine armed Islamists near the Moulay Slissen forest.
8. After being injured in a hail of automatic weapons fire, the pair were finished off with machete blows before being decapitated, Le Quotidien d'Oran reported.
9. Seven people, six of them children, were killed and two injured Wednesday by armed Islamists near Medea, 120 kilometers (75 miles) south of Algiers, security forces said.
10. The same day a parcel bomb explosion injured 17 people in Algiers itself.
11. Since early March, violence linked to armed Islamists has claimed more than 500 lives, according to press tallies.
ARTICLE 18854: ALGIERS, May 20 (UPI)
ARTICLE 18853: ALGIERS, May 20 (AFP)
![Page 65: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/65.jpg)
Vector-based representation
[Figure: a Document vector and a Centroid vector plotted in a space with axes Term 1, Term 2, and Term 3]
![Page 66: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/66.jpg)
Vector-based matching
• The cosine measure
$$\text{sim}(D, C) = \frac{\sum_k d_k \cdot c_k \cdot \text{idf}(k)}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$$
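The idf-weighted cosine between a document and a centroid can be sketched as below, assuming the numerator sums d_k · c_k · idf(k) over shared terms and the denominator is the product of the two vector norms. Vectors are stored as sparse term-to-weight dictionaries; the default idf of 1.0 for terms missing from the table, and the toy weights, are assumptions made here.

```python
import math

def centroid_similarity(doc, centroid, idf):
    """Cosine-style similarity between two sparse term vectors, idf-weighted."""
    shared = doc.keys() & centroid.keys()
    numerator = sum(doc[t] * centroid[t] * idf.get(t, 1.0) for t in shared)
    norm = math.sqrt(
        sum(v * v for v in doc.values()) * sum(v * v for v in centroid.values())
    )
    return numerator / norm if norm else 0.0

# Hypothetical weights; with every idf equal to 1 this is the plain cosine.
doc = {"space": 1.0, "shuttle": 1.0}
centroid = {"space": 2.0, "nasa": 1.0}
print(centroid_similarity(doc, centroid, idf={}))
```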
![Page 67: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/67.jpg)
CIDR
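[Figure: clustering decision based on comparing the document-centroid similarity with a threshold T: sim ≥ T versus sim < T]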
![Page 68: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/68.jpg)
Centroids
C 00022 (N=44) (10000): diana 1.93, princess 1.52
C 00025 (N=19) (10000): albanians 3.00
C 00026 (N=10) (10000): universe 1.50, expansion 1.00, bang 0.90
C 10007 (N=11) (10000): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
C 00035 (N=22) (10000): airlines 1.45, finnair 0.45
C 00031 (N=34) (10000): el 1.85, nino 1.56
C 00008 (N=113) (10000): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C 10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02
![Page 69: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/69.jpg)
MEAD
...
...
![Page 70: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/70.jpg)
MEAD
• INPUT: Cluster of d documents with n sentences (compression rate = r)
• OUTPUT: (n * r) sentences from the cluster with the highest values of SCORE
$$\text{SCORE}(s) = \sum_i \left( w_c C_i + w_p P_i + w_f F_i \right)$$
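The selection step can be sketched as below: each sentence gets a weighted sum of three precomputed feature values, and the top n * r sentences are returned in their original order. The slide does not define C, P, and F; the reading assumed here, taken from the MEAD literature, is centroid value, positional value, and first-sentence overlap. The weights, the feature values, and the truncation of n * r to an integer are placeholders.

```python
def mead_select(sentences, r, wc=1.0, wp=1.0, wf=1.0):
    """Return the texts of the top n * r sentences by weighted feature score.

    sentences: list of dicts with keys 'text', 'C', 'P', 'F' (precomputed).
    r: compression rate between 0 and 1.
    """
    k = max(1, int(len(sentences) * r))  # at least one sentence (assumption)

    def score(i):
        s = sentences[i]
        return wc * s["C"] + wp * s["P"] + wf * s["F"]

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    keep = sorted(ranked[:k])            # restore original sentence order
    return [sentences[i]["text"] for i in keep]

# Hypothetical feature values for a four-sentence cluster.
cluster = [
    {"text": "s0", "C": 3.0, "P": 1.0, "F": 0.0},
    {"text": "s1", "C": 1.0, "P": 0.0, "F": 0.0},
    {"text": "s2", "C": 2.0, "P": 0.0, "F": 3.0},
    {"text": "s3", "C": 0.0, "P": 0.0, "F": 0.0},
]
print(mead_select(cluster, r=0.5))
```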
![Page 71: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/71.jpg)
[Barzilay et al. 99]
• Theme intersection (paraphrases)
• Identifying common phrases across multiple sentences:
– evaluated on 39 sentence-level predicate-argument structures
– 74% of p-a structures automatically identified
![Page 72: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/72.jpg)
Other multi-document approaches
• Reformulation [McKeown et al. 99]
• Generation by Selection and Repair [DiMarco et al. 97]
• Topic and event distinctions [Fukumoto & Suzuki 00]
![Page 73: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/73.jpg)
Part IV Knowledge-rich approaches
![Page 74: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/74.jpg)
Overview
• Schank and Abelson 77
– scripts
• DeJong 79
– FRUMP (slot-filling from UPI news)
• Graesser 81
– Ratio of inferred propositions to those explicitly stated is 8:1
• Young & Hayes 85
– banking telexes
![Page 75: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/75.jpg)
Radev and McKeown 98
MESSAGE: ID TST3-MUC4-0010
MESSAGE: TEMPLATE 2
INCIDENT: DATE 30 OCT 89
INCIDENT: LOCATION EL SALVADOR
INCIDENT: TYPE ATTACK
INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY TERRORIST ACT
PERP: INDIVIDUAL ID "TERRORIST"
PERP: ORGANIZATION ID "THE FMLN"
PERP: ORG. CONFIDENCE REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION "1 CIVILIAN"
HUM TGT: TYPE CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER 1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER
![Page 76: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/76.jpg)
Generating text from templates
On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.
![Page 77: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/77.jpg)
[Figure: system architecture. Input: cluster of templates T1, T2, ….., Tm. Components: conceptual combiner (combiner, paragraph planner) and linguistic realizer (sentence planner, lexical chooser, sentence generator). Resources: planning operators, domain ontology, lexicon, SURGE. Output: base summary.]
![Page 78: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/78.jpg)
Excerpts from four articles
JERUSALEM - A Muslim suicide bomber blew apart 18 people on a Jerusalem bus and wounded 10 in a mirror-image of an attack one week ago. The carnage could rob Israel's Prime Minister Shimon Peres of the May 29 election victory he needs to pursue Middle East peacemaking. Peres declared all-out war on Hamas but his tough talk did little to impress stunned residents of Jerusalem who said the election would turn on the issue of personal security.
JERUSALEM - A bomb at a busy Tel Aviv shopping mall killed at least 10 people and wounded 30, Israel radio said quoting police. Army radio said the blast was apparently caused by a suicide bomber. Police said there were many wounded.
A bomb blast ripped through the commercial heart of Tel Aviv Monday, killing at least 13 people and wounding more than 100. Israeli police say an Islamic suicide bomber blew himself up outside a crowded shopping mall. It was the fourth deadly bombing in Israel in nine days. The Islamic fundamentalist group Hamas claimed responsibility for the attacks, which have killed at least 54 people. Hamas is intent on stopping the Middle East peace process. President Clinton joined the voices of international condemnation after the latest attack. He said the ``forces of terror shall not triumph'' over peacemaking efforts.
TEL AVIV (Reuter) - A Muslim suicide bomber killed at least 12 people and wounded 105, including children, outside a crowded Tel Aviv shopping mall Monday, police said. Sunday, a Hamas suicide bomber killed 18 people on a Jerusalem bus. Hamas has now killed at least 56 people in four attacks in nine days. The windows of stores lining both sides of Dizengoff Street were shattered, the charred skeletons of cars lay in the street, the sidewalks were strewn with blood. The last attack on Dizengoff was in October 1994 when a Hamas suicide bomber killed 22 people on a bus.
[The four excerpts are numbered 1 to 4 in the order shown.]
![Page 79: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/79.jpg)
Four templates
Template 1
MESSAGE: ID TST-REU-0001
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 3, 1996 11:30
PRIMSOURCE: SOURCE
INCIDENT: DATE March 3, 1996
INCIDENT: LOCATION Jerusalem
INCIDENT: TYPE Bombing
HUM TGT: NUMBER "killed: 18" "wounded: 10"
PERP: ORGANIZATION ID

Template 2
MESSAGE: ID TST-REU-0002
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 07:20
PRIMSOURCE: SOURCE Israel Radio
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER "killed: at least 10" "wounded: more than 100"
PERP: ORGANIZATION ID

Template 3
MESSAGE: ID TST-REU-0003
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 14:20
PRIMSOURCE: SOURCE
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER "killed: at least 13" "wounded: more than 100"
PERP: ORGANIZATION ID "Hamas"

Template 4
MESSAGE: ID TST-REU-0004
SECSOURCE: SOURCE Reuters
SECSOURCE: DATE March 4, 1996 14:30
PRIMSOURCE: SOURCE
INCIDENT: DATE March 4, 1996
INCIDENT: LOCATION Tel Aviv
INCIDENT: TYPE Bombing
HUM TGT: NUMBER "killed: at least 12" "wounded: 105"
PERP: ORGANIZATION ID
![Page 80: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/80.jpg)
Fluent summary with comparisons
Reuters reported that 18 people were killed on Sunday in a bombing in Jerusalem. The next day, a bomb in Tel Aviv killed at least 10 people and wounded 30 according to Israel radio. Reuters reported that at least 12 people were killed and 105 wounded in the second incident. Later the same day, Reuters reported that Hamas has claimed responsibility for the act.
(OUTPUT OF SUMMONS)
![Page 81: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/81.jpg)
Operators
• If there are two templates
AND the location is the same
AND the time of the second template is after the time of the first template
AND the source of the first template is different from the source of the second template
AND at least one slot differs
THEN combine the templates using the contradiction operator...
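The rule above can be sketched as a predicate over two templates. This is a minimal illustration, not the SUMMONS implementation: the `Template` class, its field names, and the choice to treat the primary source as the reporting source are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Template:
    # Hypothetical, simplified view of an extraction template.
    source: str
    time: datetime
    location: str
    slots: dict


def contradiction_applies(first: Template, second: Template) -> bool:
    """Check the preconditions listed in the rule, one clause per line."""
    same_location = first.location == second.location
    second_is_later = second.time > first.time
    different_source = first.source != second.source
    shared = first.slots.keys() & second.slots.keys()
    some_slot_differs = any(first.slots[k] != second.slots[k] for k in shared)
    return same_location and second_is_later and different_source and some_slot_differs


# Values loosely based on templates 2 and 4 (source attribution is simplified).
early = Template("Israel Radio", datetime(1996, 3, 4, 7, 20), "Tel Aviv",
                 {"killed": "at least 10", "wounded": "more than 100"})
late = Template("Reuters", datetime(1996, 3, 4, 14, 30), "Tel Aviv",
                {"killed": "at least 12", "wounded": "105"})

applies = contradiction_applies(early, late)
reversed_order = contradiction_applies(late, early)
```

Swapping the argument order fails the time clause, so the operator is only triggered for the later report.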
![Page 82: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/82.jpg)
Operators: Change of Perspective
Change of perspective
March 4th, Reuters reported that a bomb in Tel Aviv killed at least 10 people and wounded 30. Later the same day, Reuters reported that exactly 12 people were actually killed and 105 wounded.
Precondition: The same source reports a change in a small number of slots
![Page 83: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/83.jpg)
Operators: Contradiction
Contradiction
The afternoon of February 26, 1993, Reuters reported that a suspected bomb killed at least six people in the World Trade Center. However, Associated Press announced that exactly five people were killed in the blast.
Precondition: Different sources report contradictory values for a small number of slots
![Page 84: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/84.jpg)
Operators: Refinement and Agreement
Refinement
On Monday morning, Reuters announced that a suicide bomber killed at least 10 people in Tel Aviv. In the afternoon, Reuters reported that Hamas claimed responsibility for the act.
Agreement
The morning of March 1st 1994, both UPI and Reuters reported that a man was kidnapped in the Bronx.
![Page 85: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/85.jpg)
Operators: Generalization
Generalization
According to UPI, three terrorists were arrested in Medellín last Tuesday. Reuters announced that the police arrested two drug traffickers in Bogotá the next day.
A total of five criminals were arrested in Colombia last week.
![Page 86: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/86.jpg)
Other conceptual methods
• Operator-based transformations using terminological knowledge representation [Reimer and Hahn 97]
• Topic interpretation [Hovy and Lin 98]
![Page 87: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/87.jpg)
Part V Evaluation techniques
![Page 88: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/88.jpg)
Overview of techniques
• Extrinsic techniques (task-based)
• Intrinsic techniques
![Page 89: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/89.jpg)
• Can you recreate what’s in the original?
– the Shannon Game [Shannon 1947–50].
– but often only some of it is really important.
• Measure info retention (number of keystrokes):
– 3 groups of subjects, each must recreate text:
  • group 1 sees original text before starting.
  • group 2 sees summary of original text before starting.
  • group 3 sees nothing before starting.
• Results (# of keystrokes; two different paragraphs):

| Group 1 | Group 2 | Group 3 |
|---|---|---|
| approx. 10 | approx. 150 | approx. 1100 |

Hovy 98
![Page 90: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/90.jpg)
• Burning questions:
1. How do different evaluation methods compare for each type of summary?
2. How do different summary types fare under different methods?
3. How much does the evaluator affect things?
4. Is there a preferred evaluation method?

• Small Experiment
– 2 texts, 7 groups.
• Results:
– No difference!
– As other experiment…
– ? Extract is best?

Evaluation methods: Shannon, Q&A, Classification
Original 1 1 1 1 1
Abstract Background 1 3 1 1 1
Just-the-News 3 1 1 1
Regular 1 2 1 1 1
Extract Keywords 2 4 1 1 1
Random 3 1 1 1
No Text 3 5
1-2: 50%, 2-3: 50%
1-2: 30%, 2-3: 20%, 3-4: 20%, 4-5: 100%

Hovy 98
![Page 91: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/91.jpg)
Precision and Recall
| | Relevant | Non-relevant |
|---|---|---|
| System: relevant | A | B |
| System: non-relevant | C | D |
![Page 92: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/92.jpg)
Precision and Recall
Precision: $$P = \frac{A}{A+B}$$

Recall: $$R = \frac{A}{A+C}$$

$$F = \frac{2PR}{P+R}$$
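As a quick check of the definitions, the three measures can be computed directly from the contingency counts A, B and C. This is a minimal sketch; the function name and the sample counts are made up for illustration.

```python
def precision_recall_f(a: int, b: int, c: int):
    """a: selected and relevant, b: selected but not relevant, c: relevant but missed."""
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Illustrative counts: the system picks 3 sentences, 2 of them among 4 relevant ones.
p, r, f = precision_recall_f(a=2, b=1, c=2)
```

The count D (correctly rejected items) does not enter any of the three formulas.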
![Page 93: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/93.jpg)
Jing et al. 98
• Small experiment with 40 articles
• When summary length is given, humans are pretty consistent in selecting the same sentences
• Percent agreement
• Different systems achieved maximum performance at different summary lengths
• Human agreement higher for longer summaries
![Page 94: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/94.jpg)
SUMMAC [Mani et al. 98]
• 16 participants
• 3 tasks:
– ad hoc: indicative, user-focused summaries
– categorization: generic summaries, five categories
– question-answering
• 20 TREC topics
• 50 documents per topic (short ones are omitted)
![Page 95: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/95.jpg)
SUMMAC [Mani et al. 98]
• Participants submit a fixed-length summary limited to 10% and a “best” summary, not limited in length.
• variable-length summaries are as accurate as full text
• over 80% of summaries are intelligible
• technologies perform similarly
![Page 96: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/96.jpg)
Goldstein et al. 99
• Reuters, LA Times
• Manual summaries
• Summary length rather than summarization ratio is typically fixed
• Normalized version of R & F.

$$R' = \frac{A}{\min(A+B,\ A+C)}$$

$$F' = \frac{2PR'}{P+R'}$$
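The effect of the normalization shows up when the system summary is shorter than the reference. The sketch below uses the contingency counts A, B, C from the Precision and Recall slides; the function name and the sample counts are illustrative assumptions.

```python
def normalized_recall_f(a: int, b: int, c: int):
    """Return (R', F') with recall normalized by the shorter of the two summaries."""
    if a == 0:
        return 0.0, 0.0
    precision = a / (a + b)
    r_norm = a / min(a + b, a + c)
    f_norm = 2 * precision * r_norm / (precision + r_norm)
    return r_norm, f_norm


# A 2-sentence system summary, both sentences found in a 5-sentence reference.
plain_recall = 2 / (2 + 3)
r_norm, f_norm = normalized_recall_f(a=2, b=0, c=3)
```

Plain recall penalizes the short summary (0.4 here), while R' credits it for using its length budget perfectly.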
![Page 97: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/97.jpg)
Goldstein et al. 99
• How to measure relative performance?

$$p' = \frac{p-b}{1-b}$$

$$\frac{s-g}{g} \qquad\qquad \frac{s'-g'}{g'} = \frac{s-g}{g-b}$$

p = performance
b = baseline
g = “good” system
s = “superior” system
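A small numeric sketch of why the baseline matters when comparing two systems. The scores 0.5, 0.6 and 0.7 are invented for illustration and the function names are not from the source.

```python
def normalize(p: float, b: float) -> float:
    """Rescale a score so that the baseline maps to 0 and a perfect score to 1."""
    return (p - b) / (1 - b)


def relative_gain(s: float, g: float) -> float:
    """Improvement of the superior system over the good one, on raw scores."""
    return (s - g) / g


def normalized_relative_gain(s: float, g: float, b: float) -> float:
    """The same comparison after both scores are normalized by the baseline."""
    s_n, g_n = normalize(s, b), normalize(g, b)
    return (s_n - g_n) / g_n


b, g, s = 0.5, 0.6, 0.7   # illustrative baseline, "good" and "superior" scores
raw = relative_gain(s, g)
norm = normalized_relative_gain(s, g, b)
```

On raw scores the superior system looks about 17% better; measured from the baseline, it doubles the good system's margin, which agrees with (s - g) / (g - b).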
![Page 98: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/98.jpg)
Radev et al. 00
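Cluster-Based Sentence Utility

| | Ideal | System 1 | System 2 |
|---|---|---|---|
| S1 | + | + | - |
| S2 | + | + | + |
| S3 | - | - | - |
| S4 | - | - | + |
| S5 | - | - | - |
| S6 | - | - | - |
| S7 | - | - | - |
| S8 | - | - | - |
| S9 | - | - | - |
| S10 | - | - | - |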
![Page 99: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/99.jpg)
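Cluster-Based Sentence Utility

Summary sentence extraction method

| | Ideal | System 1 | System 2 |
|---|---|---|---|
| S1 | + | + | - |
| S2 | + | + | + |
| S3 | - | - | - |
| S4 | - | - | + |
| S5 | - | - | - |
| S6 | - | - | - |
| S7 | - | - | - |
| S8 | - | - | - |
| S9 | - | - | - |
| S10 | - | - | - |

CBSU method

| | Ideal | System 1 | System 2 |
|---|---|---|---|
| S1 | 10 (+) | 10 (+) | 5 |
| S2 | 8 (+) | 9 (+) | 8 (+) |
| S3 | 2 | 3 | 4 |
| S4 | 7 | 6 | 9 (+) |

CBSU(system, ideal) = % of ideal utility covered by system summary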
![Page 100: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/100.jpg)
Interjudge agreement
| | Judge 1 | Judge 2 | Judge 3 |
|---|---|---|---|
| Sentence 1 | 10 | 10 | 5 |
| Sentence 2 | 8 | 9 | 8 |
| Sentence 3 | 2 | 3 | 4 |
| Sentence 4 | 5 | 6 | 9 |
![Page 103: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/103.jpg)
Relative utility
| | Judge 1 | Judge 2 | Judge 3 |
|---|---|---|---|
| Sentence 1 | 10 | 10 | 5 |
| Sentence 2 | 8 | 9 | 8 |
| Sentence 3 | 2 | 3 | 4 |
| Sentence 4 | 5 | 6 | 9 |

$$RU = \frac{13}{17} = 0.765$$
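One reading that reproduces 13/17 from the table: Judge 1's two-sentence extract (sentences 1 and 2) is scored with Judge 3's utilities (5 + 8 = 13) and divided by the best two-sentence total available under Judge 3's scores (9 + 8 = 17). The sketch below assumes that reading and a 50% extract; the helper names are illustrative.

```python
# Utility scores from the table: one list per judge, one entry per sentence.
UTILITY = {
    "Judge 1": [10, 8, 2, 5],
    "Judge 2": [10, 9, 3, 6],
    "Judge 3": [5, 8, 4, 9],
}


def top_k(scores, k):
    """Indices of the k highest-scoring sentences (the judge's own extract)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def relative_utility(extract, reference_scores):
    """Utility the extract collects under the reference judge, over the best achievable."""
    achieved = sum(reference_scores[i] for i in extract)
    best = sum(sorted(reference_scores, reverse=True)[:len(extract)])
    return achieved / best


extract_j1 = top_k(UTILITY["Judge 1"], 2)                # sentences 1 and 2
ru = relative_utility(extract_j1, UTILITY["Judge 3"])    # 13 / 17
```

Unlike co-selection, the measure gives partial credit: sentence 1 is not in Judge 3's extract but still carries 5 points of utility.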
![Page 104: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/104.jpg)
Normalized System Performance
| | Judge 1 | Judge 2 | Judge 3 | Average |
|---|---|---|---|---|
| Judge 1 | 1.000 | 1.000 | 0.765 | 0.883 |
| Judge 2 | 1.000 | 1.000 | 0.765 | 0.883 |
| Judge 3 | 0.722 | 0.789 | 1.000 | 0.756 |

$$D = \frac{S-R}{J-R}$$

D = normalized system performance
S = system performance
J = interjudge agreement
R = random performance
![Page 107: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/107.jpg)
Random Performance
$$D = \frac{S-R}{J-R}$$

R = average of all $\frac{n!}{(n(1-r))!\,(r \cdot n)!}$ systems

{12} {13} {14} {23} {24} {34}
![Page 109: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/109.jpg)
Examples
$$D_{\{14\}} = \frac{S-R}{J-R} = \frac{0.833 - 0.732}{0.841 - 0.732} = 0.927$$

$$D_{\{24\}} = 0.963$$
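The figures in these examples can be reproduced from the judges' utility table under two assumptions that are inferred from the numbers rather than stated on the slides: S is the mean relative utility of an extract across the three judges, and R is the mean of S over all six possible two-sentence extracts. J = 0.841 is copied from the slides, and intermediate values are rounded to three decimals as the slides appear to do.

```python
from itertools import combinations

# Utility scores from the slides: one list per judge, one entry per sentence.
JUDGES = [
    [10, 8, 2, 5],   # Judge 1
    [10, 9, 3, 6],   # Judge 2
    [5, 8, 4, 9],    # Judge 3
]


def relative_utility(extract, scores):
    best = sum(sorted(scores, reverse=True)[:len(extract)])
    return sum(scores[i] for i in extract) / best


def system_performance(extract):
    """S: relative utility of an extract, averaged over all judges (assumed)."""
    return sum(relative_utility(extract, s) for s in JUDGES) / len(JUDGES)


def random_performance(n, k):
    """R: average S over every possible k-sentence extract of n sentences (assumed)."""
    extracts = list(combinations(range(n), k))
    return sum(system_performance(e) for e in extracts) / len(extracts)


def normalized_performance(s, r, j):
    return (s - r) / (j - r)


J = 0.841                                        # interjudge agreement, from the slides
R = round(random_performance(4, 2), 3)
S_14 = round(system_performance((0, 3)), 3)      # extract {14}
S_24 = round(system_performance((1, 3)), 3)      # extract {24}
D_14 = round(normalized_performance(S_14, R, J), 3)
D_24 = round(normalized_performance(S_24, R, J), 3)
```

With unrounded intermediate values the results shift slightly, so the rounding step is part of the assumption.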
![Page 110: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/110.jpg)
Normalized evaluation of {14}
[Figure: two scales from 0.0 to 1.0. R = 0.732 maps to R’ = 0.0, S = 0.833 maps to S’ = 0.927 = D, and J = 0.841 maps to J’ = 1.0.]
![Page 111: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/111.jpg)
Cross-sentence Informational Subsumption and Equivalence
• Subsumption: If the information content of sentence a (denoted as I(a)) is contained within sentence b, then a becomes informationally redundant and the content of b is said to subsume that of a:
$$I(a) \subset I(b)$$
• Equivalence: If $I(a) \subset I(b) \wedge I(b) \subset I(a)$
![Page 112: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/112.jpg)
Example
(1) John Doe was found guilty of the murder.
(2) The court found John Doe guilty of the murder of Jane Doe last August and sentenced him to life.
![Page 113: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/113.jpg)
Cross-sentence Informational Subsumption
| | Article 1 | Article 2 | Article 3 |
|---|---|---|---|
| S1 | 10 | 10 | 5 |
| S2 | 8 | 9 | 8 |
| S3 | 2 | 3 | 4 |
| S4 | 7 | 6 | 9 |
![Page 114: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/114.jpg)
Evaluation

| Cluster | # docs | # sents | source | news sources | topic |
|---|---|---|---|---|---|
| A | 2 | 25 | clari.world.africa.northwestern | AFP, UPI | Algerian terrorists threaten Belgium |
| B | 3 | 45 | clari.world.terrorism | AFP, UPI | The FBI puts Osama bin Laden on the most wanted list |
| C | 2 | 65 | clari.world.europe.russia | AP, AFP | Explosion in a Moscow apartment building (Sept. 9, 1999) |
| D | 7 | 189 | clari.world.europe.russia | AP, AFP, UPI | Explosion in a Moscow apartment building (Sept. 13, 1999) |
| E | 10 | 151 | TDT-3 corpus, topic 78 | AP, PRI, VOA | General strike in Denmark |
| F | 3 | 83 | TDT-3 corpus, topic 67 | AP, NYT | Toxic spill in Spain |
![Page 115: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/115.jpg)
[Figure: Inter-judge agreement versus compression. x-axis: compression rate (r), 10 to 100; y-axis: agreement (J), 0.75 to 1; one curve per cluster, A through F.]
![Page 116: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/116.jpg)
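Evaluating Sentence Subsumption

| Sent | Judge 1 | Judge 2 | Judge 3 | Judge 4 | Judge 5 | + score | - score |
|---|---|---|---|---|---|---|---|
| A1-1 | - | A2-1 | A2-1 | - | A2-1 | 3 | |
| A1-2 | A2-5 | A2-5 | - | - | A2-5 | 3 | |
| A1-3 | - | - | - | - | A2-10 | | 4 |
| A1-4 | A2-10 | A2-10 | A2-10 | - | A2-10 | 4 | |
| A1-5 | - | A2-1 | - | A2-2 | A2-4 | | 2 |
| A1-6 | - | - | - | - | A2-7 | | 4 |
| A1-7 | - | - | - | - | A2-8 | | 4 |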
![Page 117: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/117.jpg)
Subsumption (Cont’d)
$$\mathrm{SCORE}(s) = \sum_i (w_c C_i + w_p P_i + w_f F_i) - w_R R_s$$

$R_s$ = cross-sentence word overlap

$R_s$ = 2 * (# overlapping words) / (# words in sentence 1 + # words in sentence 2)

$$w_R = \max_s(\mathrm{SCORE}(s))$$
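The overlap term can be sketched as follows. Tokenization is not specified on the slide, so lowercasing, splitting on word characters, and counting repeated words as a multiset are assumptions of this example; the test sentences are the John Doe pair from the earlier Example slide.

```python
import re
from collections import Counter


def words(sentence: str):
    # Assumed tokenization: lowercase runs of word characters.
    return re.findall(r"\w+", sentence.lower())


def word_overlap(sentence_1: str, sentence_2: str) -> float:
    """R_s = 2 * (# overlapping words) / (# words in sentence 1 + # words in sentence 2)."""
    w1, w2 = words(sentence_1), words(sentence_2)
    overlapping = sum((Counter(w1) & Counter(w2)).values())
    return 2 * overlapping / (len(w1) + len(w2))


def penalized_score(raw_score: float, w_r: float, r_s: float) -> float:
    """Subtract the redundancy penalty w_R * R_s from a sentence's raw score."""
    return raw_score - w_r * r_s


s1 = "John Doe was found guilty of the murder."
s2 = ("The court found John Doe guilty of the murder of Jane Doe "
      "last August and sentenced him to life.")
r_s = word_overlap(s1, s2)
```

Here 7 of the 8 words in the first sentence also occur in the second, which has 19 words, giving 14/27.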
![Page 118: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/118.jpg)
Subsumption analysis
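| # judges agreeing | A + | A - | B + | B - | C + | C - | D + | D - | E + | E - | F + | F - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 7 | 0 | 24 | 0 | 45 | 0 | 88 | 1 | 73 | 0 | 61 |
| 4 | 1 | 6 | 3 | 6 | 1 | 10 | 9 | 37 | 8 | 35 | 0 | 11 |
| 3 | 3 | 6 | 4 | 5 | 4 | 4 | 28 | 20 | 5 | 23 | 3 | 7 |
| 2 | 1 | 1 | 2 | 1 | 1 | 0 | 7 | 0 | 7 | 0 | 1 | 0 |

Total: 558 sentences, full agreement on 292 (1+291), partial on 406 (23+383).
Of 80 sentences with some indication of subsumption, only 24 had agreement of 4 or more judges.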
![Page 119: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/119.jpg)
Results
10% 20% 30% 40% 50% 60% 70% 80% 90%
Cluster A 0.855 0.572 0.427 0.759 0.862 0.910 0.554 1.001 0.584
Cluster B 0.365 0.402 0.690 0.714 0.867 0.640 0.845 0.713 1.317
Cluster C 0.753 0.938 0.841 1.029 0.751 0.819 0.595 0.611 0.683
Cluster D 0.739 0.764 0.683 0.723 0.614 0.568 0.668 0.719 1.100
Cluster E 1.083 0.937 0.581 0.373 0.438 0.369 0.429 0.487 0.261
Cluster F 1.064 0.893 0.928 1.000 0.732 0.805 0.910 0.689 0.199
MEAD performed better than Lead in 29 (in bold) out of 54 cases.
MEAD+Lead performed better than the Lead baseline in 41 cases
![Page 120: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/120.jpg)
Donaway et al. 00
• Sentence-rank based measures
– IDEAL={2,3,5}: compare {2,3,4} and {2,3,9}
• Content-based measures
– vector comparisons of summary and document
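One way to read the IDEAL example: under plain co-selection the two candidate extracts are indistinguishable, which is presumably the motivation for the alternative measures listed. The sketch below only demonstrates the co-selection tie; it does not implement the measures proposed by Donaway et al., and the function name is illustrative.

```python
def co_selection(ideal: set, extract: set):
    """Precision and recall of an extract measured by sentence overlap with the ideal."""
    hits = len(ideal & extract)
    return hits / len(extract), hits / len(ideal)


IDEAL = {2, 3, 5}
scores_a = co_selection(IDEAL, {2, 3, 4})
scores_b = co_selection(IDEAL, {2, 3, 9})
```

Both extracts share exactly two sentences with the ideal, so they receive identical scores regardless of what sentences 4 and 9 actually say.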
![Page 121: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/121.jpg)
Proposed TIDES evaluation
• Creation of corpora
• Development of evaluation software
• TREC-style evaluation
• Intrinsic and extrinsic evaluations
• Multilingual summaries (over time)
• Question-answering evaluation
![Page 122: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/122.jpg)
Part VII The MEAD project
![Page 123: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/123.jpg)
Background
• Summer 2001
• Eight weeks
• Johns Hopkins University
• Participants: Dragomir Radev, Simone Teufel, Horacio Saggion, Wai Lam, Elliott Drabek, Hong Qi, Danyu Liu, John Blitzer, and Arda Çelebi
![Page 124: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/124.jpg)
Technical objectives
• Develop a summarization toolkit including a modular state-of-the-art summarizer: single-document, multi-document, generic, query-based
• Develop a summarization evaluation toolkit allowing comparisons between extractive and non-extractive summaries
• Produce an annotated corpus for further research in text summarization
![Page 125: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/125.jpg)
Sample scenarios
• Evaluate an existing summarizer
• Build a summarizer from scratch
• Test a summarization feature
• Test a new evaluation metric
• Test a machine translation system
![Page 126: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/126.jpg)
Resources
• manual summaries (extracts and abstracts)
• baseline summaries
• automatic summaries
• manual and automatic relevance judgements
• XREF, lemmatized, tagged versions of the corpus
• manual and automatic query translations
• sentence segmentation
• sentence alignments
• XML DTDs, converters
• subsumption judgements
• guidelines for judges
• guidelines for building summarizers
• evaluation software
• modular, trainable summarizer
![Page 127: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/127.jpg)
Sample English Query

<?xml version='1.0'?>
<!DOCTYPE QUERY SYSTEM "../../../dtd/query.dtd" >
<QUERY QID="Q-241-E" QNO="241" TRANSLATED="NO">
<TITLE>Fire safety, building management concerns</TITLE>
</QUERY>

Sample Chinese Query

<?xml version='1.0'?>
<!DOCTYPE QUERY SYSTEM "../../../dtd/query.dtd" >
<QUERY QID="Q-241-C" QNO="241" TRANSLATED="NO">
<TITLE>防火意識,大廈管理</TITLE>
</QUERY>
![Page 128: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/128.jpg)
Sample Retrieval Result for Full-length Documents
<?xml version='1.0'?>
<!DOCTYPE DOC-JUDGE SYSTEM "/export/ws01summ/dtd/docjudge.dtd" >
<DOC-JUDGE QID="Q-241-E" SYSTEM="SMART" LANG="ENG">
<D DID="D-20000126_008.e" RANK="1" SCORE="135.0000" CORR-DOC="D-20000126_012.c"/>
<D DID="D-19980625_007.e" RANK="2" SCORE="99.0000" CORR-DOC="D-19980625_006.c"/>
<D DID="D-19990126_017.e" RANK="3" SCORE="98.0000" CORR-DOC="D-19990126_018.c"/>
<D DID="D-19981007_018.e" RANK="4" SCORE="91.0000" CORR-DOC="D-19981007_023.c"/>
<D DID="D-19980121_004.e" RANK="5" SCORE="78.0000" CORR-DOC="D-19980121_009.c"/>
<D DID="D-19971016_004.e" RANK="6" SCORE="72.0000" CORR-DOC="D-19971016_005.c"/>
Sample Retrieval Result for Lead-Based Summary (5%)
<?xml version='1.0'?>
<!DOCTYPE DOC-JUDGE SYSTEM "/export/ws01summ/dtd/docjudge.dtd" >
<DOC-JUDGE QID="Q-241-E" SYSTEM="SMART" LANG="ENG">
<D DID="D-20000126_008.e" RANK="1" SCORE="14.0000" CORR-DOC="D-20000126_012.c"/>
<D DID="D-19991214_002.e" RANK="2" SCORE="11.0000" CORR-DOC="D-19991214_001.c"/>
<D DID="D-19980810_006.e" RANK="3" SCORE="10.0000" CORR-DOC="D-19980810_003.c"/>
<D DID="D-19990505_028.e" RANK="4" SCORE="9.0000" CORR-DOC="D-19990505_034.c"/>
<D DID="D-19980115_009.e" RANK="4" SCORE="9.0000" CORR-DOC="D-19980115_013.c"/>
:
![Page 129: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/129.jpg)
Single-document situation
[Diagram components: query, SMART, LDC Judges, two ranked document lists, IR results, document, Summarizer, Baselines, Extract, summary comparison (1. Co-selection, 2. Similarity), Correlation.]
![Page 130: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/130.jpg)
Multi-document situation
[Diagram components: document cluster, LDC Judges, Manual sum., Summarizer, Baselines, Extracts, summary comparison (1. Co-selection, 2. Similarity).]
![Page 131: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/131.jpg)
Summaries produced
• Single-document extracts
– automatic (135 runs on 18,146 documents each): 10 compression rates, Word/Sentence, English/Chinese/Xlingual, 10 summarization methods
– manual (80 runs on 200 documents each): 10 compression rates, Word/Sentence, (3 judges + average)
![Page 132: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/132.jpg)
Summaries produced
• Multi-document summaries
– 3 lengths, 3 judges, 14 queries (out of 40)
• Multi-document extracts
– automatic (160 extracts) = 8 compression rates (5-40%, 50-200AW) x 20 clusters
– manual (320 extracts) = 8 compression rates x 10 clusters x (3 judges + average)
![Page 133: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/133.jpg)
List of summarizers
• MEAD, Websumm, Summarist, LexChains, Align
• English, Chinese
• Single-document, Multi-document
![Page 134: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/134.jpg)
MEAD architecture
[Diagram components: Feature scorer, Relation scorer, SVM, Extractor, Subsumption.]
![Page 135: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/135.jpg)
WEBSUMM:
Some of them are taking temporary shelter at Lung Hang Estate Community Centre in Sha Tin, and Shek Lei Estate Community Centre and Princess Alexandra Community Centre in Tsuen Wan.
Emergency relief by SWD
The Social Welfare Department has provided relief articles and hot meals to 114 people who were affected by the rainstorm or mudslip throughout the territory. The people, comprising adults and children, come from 30 families. Some of them are taking temporary shelter at Lung Hang Estate Community Centre in Sha Tin, and Shek Lei Estate Community Centre and Princess Alexandra Community Centre in Tsuen Wan. The Regional Social Welfare Officer (New Territories East), Mrs Lily Wong, visited victims at Lung Hang State Community Centre this (Thursday) afternoon to offer any necessary assistance. Six victims have so far requested for Comprehensive Social Security Allowance and the applications are being processed. Social workers also escorted an 88-year old man who was feeling unwell to the Prince of Wales hospital for medical checkup.
MEAD:
The Social Welfare Department has provided relief articles and hot meals to 114 people who were affected by the rainstorm or mudslip throughout the territory. The Regional Social Welfare Officer (New Territories East), Mrs Lily Wong, visited victims at Lung Hang State Community Centre this (Thursday) afternoon to offer any necessary assistance.
RANDOM:
The Social Welfare Department has provided relief articles and hot meals to 114 people who were affected by the rainstorm or mudslip throughout the territory. Some of them are taking temporary shelter at Lung Hang Estate Community Centre in Sha Tin, and Shek Lei Estate Community Centre and Princess Alexandra Community Centre in Tsuen Wan.
LEAD:
The Social Welfare Department has provided relief articles and hot meals to 114 people who were affected by the rainstorm or mudslip throughout the territory. The people, comprising adults and children, come from 30 families.
![Page 136: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/136.jpg)
![Page 137: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/137.jpg)
[Figure: Humans: Percent Agreement (20-cluster average) and compression. x-axis: compression (5, 10, 20, 30, 40, 50, 60, 70, 80, 90); y-axis: % agreement (0 to 1).]
![Page 138: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/138.jpg)
Figure: Humans: precision/recall (cluster average) and compression. Series: Random, Humans. x-axis: compression (5, 10, 20, 30, 40, 50, 60, 70, 80, 90); y-axis: p/r (0 to 1).
![Page 139: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/139.jpg)
Kappa
• N: number of items (index i)
• n: number of categories (index j)
• k: number of annotators
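$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

$$P(A) = \frac{1}{N k (k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^{2} - \frac{1}{k-1}$$

$$P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{N k} \right)^{2}$$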
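The Kappa definition on this slide can be sketched in a few lines. The code assumes that m[i][j] is the number of annotators who assigned item i to category j (the slide does not spell out m) and that every item was judged by the same k annotators. The function name is illustrative.

```python
def kappa(m, k):
    """Kappa for N items, n categories, k annotators per item.

    m[i][j] is assumed to be the number of annotators who put
    item i into category j, so each row of m sums to k.
    """
    N = len(m)          # number of items
    n = len(m[0])       # number of categories
    # P(A): observed agreement
    p_a = sum(x * x for row in m for x in row) / (N * k * (k - 1)) - 1 / (k - 1)
    # P(E): agreement expected by chance
    p_e = sum((sum(m[i][j] for i in range(N)) / (N * k)) ** 2 for j in range(n))
    return (p_a - p_e) / (1 - p_e)

# Toy data: 2 items, 2 categories, 3 annotators.
perfect = kappa([[3, 0], [0, 3]], k=3)   # all annotators agree on every item
split = kappa([[2, 1], [1, 2]], k=3)     # annotators split 2:1 on every item
```

With full agreement the value is 1; with the 2:1 split on each item the observed agreement falls below chance and the value is negative.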
![Page 140: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/140.jpg)
Figure: Humans: Kappa and compression. x-axis: compression (5, 10, 20, 30, 40, 50, 60, 70, 80, 90); y-axis: K (0 to 1).
![Page 141: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/141.jpg)
Figure: Kappa, human agreement, 40%. x-axis: cluster no (2, 46, 54, 60, 61, 62, 112, 125, 199, 323, 398, 447, 551, 827, 883, 885, 1014, 1197, 241, 1018); y-axis: K (0 to 0.7).
![Page 142: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/142.jpg)
Figure: Multi-document summaries of length 50 words, kappa on 10 clusters. Series: MEAD, Humans. x-axis: cluster no (112, 125, 199, 241, 323, 398, 551, 883, 1014, 1197); y-axis: K (0 to 0.7).
![Page 143: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/143.jpg)
Relative utility (upper and lower bounds), Q125, 5%

| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| R | 0.648 | 0.65 | 0.652 | 0.465 | 0.626 | 0.727 | 0.509 | 0.497 | 0.644 | 0.566 |
| J | 0.715 | 0.666 | 0.859 | 0.726 | 0.876 | 0.944 | 0.909 | 0.776 | 0.71 | 0.869 |
![Page 144: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/144.jpg)
Relative utility (upper and lower bounds), Q125, 20%

| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| R | 0.69 | 0.685 | 0.679 | 0.523 | 0.642 | 0.741 | 0.541 | 0.553 | 0.699 | 0.595 |
| J | 0.827 | 0.73 | 0.866 | 0.828 | 0.838 | 0.913 | 0.861 | 0.876 | 0.736 | 0.874 |
![Page 145: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/145.jpg)
Relative utility (upper and lower bounds), Q125, 40%

| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| R | 0.74 | 0.738 | 0.724 | 0.653 | 0.695 | 0.77 | 0.647 | 0.679 | 0.764 | 0.664 |
| J | 0.836 | 0.754 | 0.878 | 0.954 | 0.91 | 0.952 | 0.919 | 0.954 | 0.811 | 0.904 |
![Page 146: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/146.jpg)
Relative Utility (RU) per summarizer and compression rate (Single-document)

| Summarizer \ Compression rate | 5 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|
| J | 0.785 | 0.79 | 0.81 | 0.833 | 0.853 | 0.875 | 0.913 | 0.94 | 0.962 | 0.982 |
| R | 0.636 | 0.65 | 0.68 | 0.711 | 0.738 | 0.765 | 0.804 | 0.84 | 0.896 | 0.961 |
| WEBS | 0.761 | 0.765 | 0.776 | 0.801 | 0.828 | | | | | |
| MEAD | 0.748 | 0.756 | 0.764 | 0.782 | 0.808 | 0.834 | 0.863 | 0.895 | 0.921 | 0.968 |
| LEAD | 0.733 | 0.738 | 0.772 | 0.797 | 0.829 | 0.85 | 0.877 | 0.906 | 0.936 | 0.973 |
![Page 147: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/147.jpg)
Relative Utility (RU) per compression rate (Multi-document)

| Compression rate | 5 | 10 | 20 | 30 |
|---|---|---|---|---|
| R | 0.6116 | 0.6302 | 0.6614 | 0.6894 |
| S | 0.6928 | 0.7246 | 0.7476 | 0.766 |
| J | 0.6886 | 0.7296 | 0.7582 | 0.7904 |
![Page 148: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/148.jpg)
Relevance correlation (RC)
$$r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^{2}}\,\sqrt{\sum_{i} (y_i - \bar{y})^{2}}}$$
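The correlation coefficient on this slide can be computed directly from its definition. In the sketch below, x and y are assumed to be two lists of relevance scores for the same items, for example scores obtained with full documents and with their summaries; that pairing is an assumption for illustration, and the function name is hypothetical.

```python
import math

def relevance_correlation(x, y):
    """Linear correlation r between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

# Toy scores: y rises with x in the first pair and falls in the second.
r_up = relevance_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = relevance_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The function is undefined when either list is constant, since the denominator is then zero.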
![Page 149: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/149.jpg)
Relevance Preservation Value (RPV) as a function of compression rate (RANDOM)

| Summary length (%) | 5 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|
| Query 112 | 0.5 | 0.64 | 0.8 | 0.86 | 0.91 | 0.93 | 0.95 | 0.97 | 0.98 | 0.99 |
| Query 125 | 0.44 | 0.66 | 0.78 | 0.87 | 0.91 | 0.91 | 0.96 | 0.97 | 0.98 | 0.99 |
| Query 241 | 0.68 | 0.77 | 0.87 | 0.91 | 0.94 | 0.96 | 0.97 | 0.98 | 0.99 | 1 |
| Query 323 | 0.63 | 0.78 | 0.85 | 0.9 | 0.93 | 0.95 | 0.97 | 0.98 | 0.99 | 1 |
| Query 551 | 0.52 | 0.69 | 0.79 | 0.88 | 0.92 | 0.94 | 0.95 | 0.97 | 0.98 | 0.99 |
| AVERAGE (10 queries) | 0.553 | 0.687 | 0.8 | 0.874 | 0.912 | 0.932 | 0.956 | 0.973 | 0.984 | 0.992 |
![Page 150: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/150.jpg)
Relevance Preservation Value (RPV) for different summarizers (English, 20%)

| Query | FD | MEAD | WEBS | LEAD | RAND | SUMM |
|---|---|---|---|---|---|---|
| Q125 | 1 | 0.92 | 0.82 | 0.8 | 0.78 | 0.79 |
| Q551 | 1 | 0.9 | 0.88 | 0.81 | 0.79 | 0.81 |
| AVG(10Q) | 1 | 0.903 | 0.843 | 0.802 | 0.8 | 0.775 |
| Q112 | 1 | 0.91 | 0.88 | 0.8 | 0.8 | 0.77 |
| Q241 | 1 | 0.93 | 0.89 | 0.84 | 0.87 | 0.85 |
| Q323 | 1 | 0.92 | 0.91 | 0.85 | 0.85 | 0.88 |
![Page 151: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/151.jpg)
Relevance Preservation Value (RPV) for different summarizers (Chinese, 20%)

| Query | FD | MEAD | SUMM | ALGN | LEAD | RAND |
|---|---|---|---|---|---|---|
| Q112 | 1 | 0.87 | 0.76 | 0.74 | 0.72 | 0.71 |
| Q323 | 1 | 0.66 | 0.84 | 0.59 | 0.58 | 0.6 |
| Q551 | 1 | 0.91 | 0.75 | 0.72 | 0.75 | 0.74 |
| AVG(10Q) | 1 | 0.85 | 0.755 | 0.738 | 0.733 | 0.744 |
| Q125 | 1 | 0.87 | 0.75 | 0.72 | 0.71 | 0.75 |
| Q241 | 1 | 0.93 | 0.85 | 0.83 | 0.83 | 0.85 |
![Page 152: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/152.jpg)
Relevance Preservation Value (RPV) per compression rate and summarizer (English, 5 queries)

| Compression rate | FD | MEAD | WEBS | LEAD | SUMM | RAND |
|---|---|---|---|---|---|---|
| 5% | 1 | 0.724 | 0.73 | 0.66 | 0.622 | 0.554 |
| 10% | 1 | 0.834 | 0.804 | 0.73 | 0.71 | 0.708 |
| 20% | 1 | 0.916 | 0.876 | 0.82 | 0.82 | 0.818 |
| 30% | 1 | 0.946 | 0.912 | 0.88 | 0.848 | 0.884 |
| 40% | 1 | 0.962 | 0.936 | 0.906 | 0.862 | 0.922 |
![Page 153: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/153.jpg)
Relevance Preservation Value (RPV) with and without cutoff (English, 5%)

| Correlation method | SUMM | LEAD | MEAD | RAND | WEBS |
|---|---|---|---|---|---|
| with cutoff | 0.48 | 0.55 | 0.61 | 0.29 | 0.6 |
| no cutoff | 0.61 | 0.59 | 0.74 | 0.44 | 0.63 |
![Page 154: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/154.jpg)
Relevance Preservation Value (RPV) with and without cutoff (English, 10%)

| Correlation method | SUMM | LEAD | MEAD | RAND | WEBS |
|---|---|---|---|---|---|
| with cutoff | 0.65 | 0.65 | 0.76 | 0.56 | 0.7 |
| no cutoff | 0.73 | 0.71 | 0.84 | 0.66 | 0.72 |
![Page 155: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/155.jpg)
Relevance Preservation Value (RPV) with and without cutoff (English, 20%)

| Correlation method | SUMM | LEAD | MEAD | RAND | WEBS |
|---|---|---|---|---|---|
| with cutoff | 0.71 | 0.74 | 0.88 | 0.72 | 0.8 |
| no cutoff | 0.79 | 0.8 | 0.92 | 0.78 | 0.82 |
![Page 156: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/156.jpg)
Relevance Preservation Value (RPV) per MEAD policy (5 queries)

| Query | ASGEMEAD | MEADORIG | MEAD002 | MEAD003 | MEADS002 |
|---|---|---|---|---|---|
| Q551 | 0.88 | 0.9 | 0.89 | 0.89 | |
| Q112 | 0.86 | 0.91 | 0.9 | 0.9 | 0.9 |
| Q-AVG | 0.886 | 0.916 | 0.908 | 0.908 | 0.9125 |
| Q125 | 0.87 | 0.92 | 0.91 | 0.91 | 0.91 |
| Q323 | 0.89 | 0.92 | 0.91 | 0.91 | 0.91 |
| Q241 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 |
![Page 157: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/157.jpg)
Properties of evaluation metrics
Columns: Kappa, P/R, accuracy | RU | Word overlap, cosine, lcs | Relevance preserv.
Agreement human extracts: X X X
Agreement human extracts – automatic extracts: X X X
Agreement human summaries/extracts: X
Non-binary decisions: X X X
Full documents vs. extracts: X X
Systems with different sentence segm.: X X
Multidocument extracts: X X X
Full corpus coverage: X X
![Page 158: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/158.jpg)
Part VII Language modeling
![Page 159: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/159.jpg)
Language modeling
• Source/target language
• Coding process
e → Noisy channel → f → Recovery → e*
![Page 160: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/160.jpg)
Language modeling
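• Source/target language
• Coding process

$$e^{*} = \arg\max_{e}\, p(e \mid f) = \arg\max_{e}\, p(e)\, p(f \mid e)$$

$$p(E) = p(e_1)\, p(e_2 \mid e_1)\, p(e_3 \mid e_1 e_2) \cdots p(e_n \mid e_1 \ldots e_{n-1})$$

$$p(E) \approx p(e_1)\, p(e_2 \mid e_1)\, p(e_3 \mid e_2) \cdots p(e_n \mid e_{n-1})$$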
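The last line of this slide conditions each word only on its predecessor. A minimal sketch of such a bigram model follows, assuming tokenized training sentences and add-one smoothing; the smoothing scheme, the function names, and the toy corpus are illustrative assumptions, not part of the tutorial.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams, bigram contexts, and bigrams from tokenized sentences."""
    unigrams, contexts, bigrams = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        contexts.update(words[:-1])            # words that have a successor
        bigrams.update(zip(words, words[1:]))
    return unigrams, contexts, bigrams

def log_p(words, model):
    """log p(E) = log p(e1) + sum of log p(e_i | e_{i-1}), add-one smoothed."""
    unigrams, contexts, bigrams = model
    total = sum(unigrams.values())
    vocab = len(unigrams)
    lp = math.log((unigrams[words[0]] + 1) / (total + vocab))
    for prev, cur in zip(words, words[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab))
    return lp

# Hypothetical toy corpus.
model = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
seen_order = log_p(["the", "cat", "sat"], model)
scrambled = log_p(["sat", "cat", "the"], model)
```

On the toy corpus the word order seen in training scores higher than the scrambled order, which is the property a noisy-channel system relies on when p(e) ranks candidate outputs.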
![Page 161: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/161.jpg)
Summarization using LM
• Source language: full document
• Target language: summary
![Page 162: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/162.jpg)
Berger & Mittal 00
• Gisting (OCELOT)
• content selection (preserve frequencies)
• word ordering (single words, consecutive positions)
• search: readability & fidelity

$$g^{*} = \arg\max_{g}\, p(g \mid d) = \arg\max_{g}\, p(g)\, p(d \mid g)$$
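The argmax above can be read as a search over candidate gists g scored by p(g) times p(d|g). The sketch below assumes that p(g) stands for readability and p(d|g) for fidelity, and it uses hypothetical hand-set probabilities in place of trained models. It shows only the selection step, not the OCELOT search itself.

```python
import math

def best_gist(candidates, log_p_gist, log_p_doc_given_gist):
    """Return the candidate g that maximizes log p(g) + log p(d | g)."""
    return max(candidates,
               key=lambda g: log_p_gist[g] + log_p_doc_given_gist[g])

# Hypothetical toy scores for two candidate gists of one document d.
candidates = ["birding in georgia", "georgia georgia georgia"]
log_p_gist = {                      # readability term (assumed values)
    "birding in georgia": math.log(0.02),
    "georgia georgia georgia": math.log(0.0001),
}
log_p_doc_given_gist = {            # fidelity term (assumed values)
    "birding in georgia": math.log(0.3),
    "georgia georgia georgia": math.log(0.5),
}
chosen = best_gist(candidates, log_p_gist, log_p_doc_given_gist)
```

Working in log space avoids underflow when the two terms are products of many small word probabilities.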
![Page 163: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/163.jpg)
Berger & Mittal 00
• Limit on top 65K words
• word relatedness = alignment
• Training on 100K summary+document pairs
• Testing on 1046 pairs
• Use Viterbi-type search
• Evaluation: word overlap (0.2-0.4)
• translingual gisting is possible
• No word ordering
![Page 164: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/164.jpg)
Berger & Mittal 00
Sample output:
Audubon society atlanta area savannah georgia chatham and local birding savannah keepers chapter of the audubon georgia and leasing
![Page 165: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/165.jpg)
Banko et al. 00
• Summaries shorter than 1 sentence
• headline generation
• zero-level model: unigram probabilities
• other models: Part-of-speech and position
• Sample output:
Clinton to meet Netanyahu Arafat Israel
![Page 166: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/166.jpg)
Knight and Marcu 00
• Use structured (syntactic) information
• Two approaches:
– noisy channel
– decision based
• Longer summaries
• Higher accuracy
![Page 167: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/167.jpg)
Conclusion
• Summarization is coming of age
• For general domains: sentence extraction
• IR techniques not always appropriate: NLP needed
• New challenges: language modeling, multilingual summaries
![Page 168: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/168.jpg)
APPENDIX
![Page 169: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/169.jpg)
Conferences
• Dagstuhl Meeting, 1993 (Karen Spärck Jones, Brigitte Endres-Niggemeyer)
• ACL/EACL Workshop, Madrid, 1997 (Inderjeet Mani, Mark Maybury)
• AAAI Spring Symposium, Stanford, 1998 (Dragomir Radev, Eduard Hovy)
• ANLP/NAACL, Seattle, 2000 (Udo Hahn, Chin-Yew Lin, Inderjeet Mani, Dragomir Radev)
• NAACL, Pittsburgh, 2001 (Jade Goldstein and Chin-Yew Lin)
• DUC, 2001 (Donna Harman and Daniel Marcu)
![Page 170: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/170.jpg)
Readings
http://mitpress.mit.edu/book-table-of-contents.tcl?isbn=0262133598
(A detailed bibliography is available at the end of this handout)
Advances in Automatic Text Summarization by Inderjeet Mani and Mark T. Maybury (eds.)
![Page 171: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/171.jpg)
1 Automatic Summarizing: Factors and Directions (K. Spärck Jones)
2 The Automatic Creation of Literature Abstracts (H. P. Luhn)
3 New Methods in Automatic Extracting (H. P. Edmundson)
4 Automatic Abstracting Research at Chemical Abstracts Service (J. J. Pollock and A. Zamora)
5 A Trainable Document Summarizer (J. Kupiec, J. Pedersen, and F. Chen)
6 Development and Evaluation of a Statistically Based Document Summarization System (S. H. Myaeng and D. Jang)
7 A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques (C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen)
8 Automated Text Summarization in SUMMARIST (E. Hovy and C. Lin)
9 Salience-based Content Characterization of Text Documents (B. Boguraev and C. Kennedy)
10 Using Lexical Chains for Text Summarization (R. Barzilay and M. Elhadad)
11 Discourse Trees Are Good Indicators of Importance in Text (D. Marcu)
12 A Robust Practical Text Summarizer (T. Strzalkowski, G. Stein, J. Wang, and B. Wise)
13 Argumentative Classification of Extracted Sentences as a First Step Towards Flexible Abstracting (S. Teufel and M. Moens)
14 Plot Units: A Narrative Summarization Strategy (W. G. Lehnert)
15 Knowledge-based Text Summarization: Salience and Generalization Operators for Knowledge Base Abstraction (U. Hahn and U. Reimer)
16 Generating Concise Natural Language Summaries (K. McKeown, J. Robin, and K. Kukich)
17 Generating Summaries from Event Data (M. Maybury)
18 The Formation of Abstracts by the Selection of Sentences (G. J. Rath, A. Resnick, and T. R. Savage)
19 Automatic Condensation of Electronic Publications by Sentence Selection (R. Brandow, K. Mitze, and L. F. Rau)
20 The Effects and Limitations of Automated Text Condensing on Reading Comprehension Performance (A. H. Morris, G. M. Kasper, and D. A. Adams)
21 An Evaluation of Automatic Text Summarization Systems (T. Firmin and M. J. Chrzanowski)
22 Automatic Text Structuring and Summarization (G. Salton, A. Singhal, M. Mitra, and C. Buckley)
23 Summarizing Similarities and Differences among Related Documents (I. Mani and E. Bloedorn)
24 Generating Summaries of Multiple News Articles (K. McKeown and D. R. Radev)
25 An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News (A. Merlino and M. Maybury)
26 Summarization of Diagrams in Documents (R. P. Futrelle)
![Page 172: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/172.jpg)
Collections of papers
• Information Processing and Management, 1995
• Computational Linguistics (in progress), 2002
![Page 173: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/173.jpg)
Web resources
http://www.summarization.com
http://www.cs.columbia.edu/~jing/summarization.html
http://www.dcs.shef.ac.uk/~gael/alphalist.html
http://www.csi.uottawa.ca/tanka/ts.html
http://www.ics.mq.edu.au/~swan/summarization/
![Page 174: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/174.jpg)
Ongoing projects
• Columbia
• ISI
• JHU, Michigan
• CMU, JPRC, etc.
• Sheffield
• elsewhere ...
![Page 175: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/175.jpg)
Existing companies/systems
• Microsoft
• British Telecom
• http://extractor.iit.nrc.ca/
• inXight
• http://www.islandsoft.com/products.html (IslandInTEXT )
• www.pertinence.net
![Page 176: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/176.jpg)
Available corpora
– SUMMAC corpus
• send mail to [email protected]
– <Text+Abstract+Extract> corpus
• send mail to [email protected]
– Open directory project
• http://dmoz.org
– MEAD corpus
• send mail to [email protected]
![Page 177: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/177.jpg)
Possible research topics
• Corpus creation and annotation
• MMM: Multidocument, Multimedia, Multilingual
• Evolving summaries
• Personalized summarization
• Web-based summarization
![Page 178: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/178.jpg)
Cross-document structure theory

| Number | Relationship type | Level | Description |
|---|---|---|---|
| 1 | Identity | Any | The same text appears in more than one location |
| 2 | Equivalence (paraphrasing) | S, D | Two text spans have the same information content |
| 3 | Translation | P, S | Same information content in different languages |
| 4 | Subsumption | S, D | One sentence contains more information than another |
| 5 | Contradiction | S, D | Conflicting information |
| 6 | Historical background | S | Information that puts current information in context |
| 7 | Cross-reference | P | The same entity is mentioned |
| 8 | Citation | S, D | One sentence cites another document |
| 9 | Modality | S | Qualified version of a sentence |
| 10 | Attribution | S | One sentence repeats the information of another while adding an attribution |
| 11 | Summary | S, D | Similar to Summary in RST: one sentence summarizes another |
![Page 179: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/179.jpg)
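Figure: three documents (DOC 1, DOC 2, DOC 3), each shown at four levels (word level, phrase level, paragraph/sentence level, document level) and connected by four link types: word link, phrasal link, cross-sentential link, cross-document link.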
![Page 180: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/180.jpg)
1. Clustering
2. Document Analysis
3. Link Analysis
4. Summarization
![Page 181: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/181.jpg)
Principles of Summarization
• Put a disclaimer indicating that (automated) summaries may not preserve the emphasis and meaning of the document.
• Preserve attribution.
• Always give users a pointer to the original document.
• Indicate that the summary has been generated automatically.
• In case of conflicting sources, give all points of view.
![Page 182: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/182.jpg)
Bibliography
![Page 183: Text summarization Dragomir R. Radev School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics University.](https://reader035.fdocuments.in/reader035/viewer/2022062618/55146220550346284e8b5961/html5/thumbnails/183.jpg)
THE END