We the News
Investigating Blog Punditry
IS256 Applied Natural Language Processing
IS290-6 Web-based services
Yiming Liu, Kevin Lim, Olga Amuzinskaya
Conceptual Outline
NLP analyzer:
Summarizes the blog authors' reactions to a news event
Attempts to extract “interesting” opinions from the blogosphere
A component of an overall blog retrieval, analysis, and output framework
Point/counterpoint formulation and presentation
Core Value Proposition
Blogs are interesting in many ways
But sometimes not for their “truth value”
Often because they are hugely personal and opinionated
Extracting core terms out of news stories, and bringing together professionally and non-professionally generated news, analysis, opinions, and pictures: putting interesting and relevant pieces of information together
NLP Analyzer: summarization
The goal is to pick up the “reactive”, opinion-infused summary sentences:
"Gore's right, there is a catastrophic climate change" vs. "Wear less layers, idiot"
Emotional content and affect: a proxy for “opinion”.
Hypothesis: Highly affective sentences are more likely to convey what the authors' core opinions are.
Conceptual Architecture: Retrieval
[Diagram. Components: NewsFeeds, NewsAdaptor, TermExtractor, BlogAdaptor, PhotosAdaptor, Orchestration, XMLWriter, NLP Analyzer. Labels: REST, Search Terms, articles, terms, blogs, photos, Python data structure, XML.]
Common Data Format: XML
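As an illustration of what a shared XML format between the adaptors and the analyzer might look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`collection`, `post`, `source`, `title`, `text`) and the sample post are hypothetical; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET

def write_collection(topic, posts):
    """Serialize posts into a hypothetical shared XML layout (not the project's real schema)."""
    root = ET.Element("collection", {"topic": topic})
    for post in posts:
        item = ET.SubElement(root, "post", {"source": post["source"]})
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "text").text = post["text"]
    return ET.tostring(root, encoding="unicode")

def read_collection(xml_string):
    """Parse the same hypothetical layout back into a list of Python dicts."""
    root = ET.fromstring(xml_string)
    return [
        {
            "source": item.get("source"),
            "title": item.findtext("title"),
            "text": item.findtext("text"),
        }
        for item in root.iter("post")
    ]

# Placeholder data, invented for illustration only.
xml_doc = write_collection(
    "IE7",
    [{"source": "blog", "title": "Upgrade woes", "text": "It broke again!"}],
)
posts = read_collection(xml_doc)
```

A round trip through one format like this is what lets separately written components exchange Python data structures without sharing code.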
Conceptual Architecture: Summarizer
[Diagram. Components: News collection, NewsTopic, GoldStandard, XMLReader, NLP Analyzer, Naïve Bayes classifier, Simple classifier, XMLWriter, Orchestration. Labels: topic, request, training & testing collections, emotional opinions, classified sentences. Scoring: curse words, capitalization, exclamations.]
Gold standard / training set
Obtained data for our training from Technorati and other blog search engines.
Formatted into the shared XML data format
Manually picked summary sentences out of text
Retrieved blogs relevant to 3 topics:
Elections 2006
Inconvenient Truth
IE7
Summarizer
Multinomial Naïve Bayes classifier
Applied scorers to evaluate blog features:
curse words, bonus cue words, exclamation points, imperative sentences, emotional words, pleasure words, capitalization, strong words, search term, negation words, partisan labels, sentence positions, pronouns, valence of words
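A minimal sketch of how three of these scorers (curse words, exclamation points, capitalization) could turn a sentence into features. The function names, the tiny word list, and the second example sentence are placeholders for illustration; the project's real lexicons and scoring rules are not given in the slides.

```python
import re

# Placeholder lexicon; the word lists actually used are not given in the slides.
CURSE_WORDS = {"damn", "hell", "idiot"}

def score_curse_words(sentence):
    """Count tokens that appear in the (placeholder) curse-word list."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for token in tokens if token in CURSE_WORDS)

def score_exclamations(sentence):
    """Count exclamation points."""
    return sentence.count("!")

def score_capitalization(sentence):
    """Count fully upper-case words of two or more letters (shouting)."""
    return len(re.findall(r"\b[A-Z]{2,}\b", sentence))

SCORERS = {
    "curse_words": score_curse_words,
    "exclamations": score_exclamations,
    "capitalization": score_capitalization,
}

def featurize(sentence):
    """Apply every scorer and return a feature-name -> score mapping."""
    return {name: scorer(sentence) for name, scorer in SCORERS.items()}

features_a = featurize("Wear less layers, idiot")
features_b = featurize("This is SO wrong!!")
```

Keeping each scorer as a small independent function makes it easy to add or drop features while tuning, which matches the stated focus on improving scorers.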
Classifiers
Comparison: Baseline, Multinomial Naïve Bayes
Struggled with SVM
Focused on getting better scorers and data set instead of working on SVM
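For readers unfamiliar with the technique, here is a minimal multinomial Naïve Bayes sketch with add-one (Laplace) smoothing, written with the standard library only. It is not the project's code: the class name, the "opinion"/"neutral" labels, and the toy training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_posterior(self, tokens, label):
        """Unnormalized log P(label | tokens) = log prior + sum of log likelihoods."""
        total_docs = sum(self.class_counts.values())
        logp = math.log(self.class_counts[label] / total_docs)
        denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
        for token in tokens:
            if token in self.vocab:  # tokens never seen in training are ignored
                logp += math.log((self.token_counts[label][token] + self.alpha) / denom)
        return logp

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_posterior(tokens, c))

# Toy training data, invented for illustration only.
docs = [
    ["great", "film", "love"],
    ["terrible", "idiot", "hate"],
    ["released", "today"],
    ["seats", "counted", "today"],
]
labels = ["opinion", "opinion", "neutral", "neutral"]

model = TinyMultinomialNB().fit(docs, labels)
prediction_a = model.predict(["hate", "film"])
prediction_b = model.predict(["counted", "today"])
```

In a setup like the one described, the tokens could be replaced or augmented by discretized scorer outputs, but how the scorers fed the classifier is not specified in the slides.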
A sample ranking
Election:
Terrorists are cheering because Democrats have been championing their cause since 2003 … Islamic throat-cutting fascists know that a Democrat win is a win for Islamic throat-cutting fascists. (correct)
How miserable is your political party when you have the enemy of your country cheering for your victory [sic] … (correct)
Yesterday was a victory for all of you useful idiots who claim to be smarter than everyone else and a victory for the terrorists who played you like idiots against your own government. (miss)
As we improved, a hit or miss became an arbitrary thing.
Machine vs. human summarization
Election:
Machine: ...Democrats ... will have won a stunning 73 % of Senate seats ... .
Human: Enjoy!
Inconvenient Truth:
Machine: You don't have to be a fan of Gore , or his politics, to find his message about global warming worth considering .
Human: An Inconvenient Truth is a powerful film that makes you think about the topic of conservation.
IE7:
Machine: Fortunately, I use Firefox for most things, so I still have web access.
Human: Yes , I know it is hard to imagine incompetence at Microsoft , but I have to bring up the latest turd from Redmond that has bee foisted upon an unsuspecting population : Internet Explorer 7 Or should I say Internet Destroyer 7 ... .
Cross-Validation results
Election: Accuracy: retrieved 25 of actual 26, out of 335 possible. Recall: 0.77. Precision: 0.80
Inconvenient Truth: Accuracy: retrieved 10 of actual 18, out of 137 possible. Recall: 0.56. Precision: 0.67
IE7: Accuracy: retrieved 12 of actual 21, out of 88 possible. Recall: 0.38. Precision: 0.67
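For reference, precision and recall can be computed as in the sketch below. The slides do not state how many retrieved sentences were correct. For the Election topic, a count of 20 correct is assumed here because it reproduces the reported recall of 0.77 and the first listed precision of 0.80 (given 25 retrieved and 26 actual); that count is an inference, not a figure from the source.

```python
def precision_recall(true_positives, retrieved, relevant):
    """Precision = TP / retrieved; recall = TP / relevant (gold-standard) items."""
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall

# Election: 25 retrieved and 26 gold-standard sentences are from the slide;
# 20 correct retrievals is an ASSUMED value chosen to match the reported scores.
precision, recall = precision_recall(true_positives=20, retrieved=25, relevant=26)
print(f"precision={precision:.2f} recall={recall:.2f}")
```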
NLP Analyzer: demo run on test set
Demo: http://harbinger.sims.berkeley.edu/~k7lim/ANLPWebservice/affectservice.wordy.xml
Challenges
Full-text extraction: resolve dependency on blog formats.
Informality of bloggers:
smart quotes, ellipses, etc., which require special handling
our segmenter fails to segment sentences that don't have capitalization
Stemmers are hard to obtain (bottleneck):
morphy is slow
Porter is terrible
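One way to handle the informality issues above is sketched below: map smart quotes and the ellipsis character to ASCII, then split on sentence-final punctuation without requiring the next sentence to start with a capital letter. This is an illustrative sketch, not the project's segmenter. The function names and the sample text are made up, and the simple rule will also split after abbreviations such as "e.g." or after a spaced ellipsis.

```python
import re

# Map typographic punctuation to plain ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2026": "...",                # ellipsis character
})

def normalize(text):
    """Replace smart quotes and the ellipsis character with ASCII."""
    return text.translate(_PUNCT_MAP)

def segment(text):
    """Split after '.', '!' or '?' followed by whitespace, ignoring letter case."""
    parts = re.split(r"(?<=[.!?])\s+", normalize(text).strip())
    return [part for part in parts if part]

# Sample text invented for illustration.
sentences = segment(
    "i upgraded to IE7 yesterday. it broke everything! \u201cnever again\u201d"
)
```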
Future work: The Automatic Pundit
Point/Counterpoint formulation and presentation: automatic agent that can advocate the core arguments on behalf of each side of a given issue
This would require classification of summaries into positive/negative valences…
…and more accurate summaries…
Questions?