We the News Investigating Blog Punditry IS256 Applied Natural Language Processing IS290-6 Web-based...

We the News

Investigating Blog Punditry

IS256 Applied Natural Language Processing

IS290-6 Web-based services

Yiming Liu, Kevin Lim, Olga Amuzinskaya

Conceptual Outline

NLP analyzer:
Summarizes the blog authors' reactions to a news event
Attempts to extract “interesting” opinions from the blogosphere
A component of an overall blog retrieval, analysis, and output framework
Point/counterpoint formulation and presentation

Core Value Proposition

Blogs are interesting in many ways
But sometimes not for their “truth value”
Often because they are hugely personal and opinionated

Extracting core terms out of news stories and bringing together professionally and non-professionally generated news, analysis, opinions, and pictures
Putting interesting and relevant pieces of information together

NLP Analyzer: summarization

The goal is to pick up the “reactive”, opinion-infused summary sentences:
"Gore's right, there is a catastrophic climate change" vs. "Wear less layers, idiot"

Emotional content and affect: a proxy for “opinion”.

Hypothesis: Highly affective sentences are more likely to convey what the authors' core opinions are.

Conceptual Architecture: Retrieval

[Diagram: retrieval pipeline. Components shown: NewsFeeds, Search Terms, NewsAdaptor, TermExtractor, BlogAdaptor, PhotosAdaptor, Orchestration, XMLWriter, NLP Analyzer. Labels on the connections: REST, articles, terms, blogs, photos, Python data structure, XML.]

Common Data Format: XML
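A minimal sketch of how articles, blogs, and photos might be written to and read from one shared XML format. The element and attribute names (`collection`, `item`, `kind`, `url`, `title`, `text`) and the functions `write_collection` and `read_collection` are assumptions for illustration; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET


def write_collection(topic, items):
    """Serialize retrieved items (hypothetical dict layout) to an XML string."""
    root = ET.Element("collection", {"topic": topic})
    for item in items:
        # "kind" lets articles, blogs, and photos share one format (assumed).
        node = ET.SubElement(root, "item", {"kind": item["kind"], "url": item["url"]})
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "text").text = item.get("text", "")
    return ET.tostring(root, encoding="unicode")


def read_collection(xml_string):
    """Parse the XML back into a list of plain Python dicts."""
    root = ET.fromstring(xml_string)
    return [
        {
            "kind": node.get("kind"),
            "url": node.get("url"),
            "title": node.findtext("title"),
            "text": node.findtext("text") or "",
        }
        for node in root.findall("item")
    ]


# Made-up sample data, not taken from the project.
sample = [
    {"kind": "blog", "url": "http://example.org/post", "title": "A post", "text": "Some text!"},
    {"kind": "photo", "url": "http://example.org/p.jpg", "title": "A photo"},
]
xml_out = write_collection("IE7", sample)
items_back = read_collection(xml_out)
```

With a format like this, each adaptor only needs to produce the same dict layout, and the analyzer only needs to read one kind of file.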

Conceptual Architecture: Summarizer

[Diagram: summarizer pipeline. Components shown: XMLReader, NLP Analyzer, Naïve Bayes classifier, Simple classifier, Orchestration, XMLWriter. Inputs shown: News collection, News Topic, Gold Standard, topic training & testing collections. Scoring labels: request, scoring, curse words, capitalization, exclamations, emotional opinions. Output: classified sentences.]

Gold standard / training set

Obtained data for our training from Technorati and other blog search engines
Formatted into the shared XML data format
Manually picked summary sentences out of text
Retrieved blogs relevant to 3 topics: Elections 2006, Inconvenient Truth, IE7

Summarizer

Multinomial Naïve Bayes classifier
Applied scorers to evaluate blog features: curse words, bonus cue words, exclamation points, imperative sentences, emotional words, pleasure words, capitalization, strong words, search term, negation words, partisan labels, sentence positions, pronouns, valence of words
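A minimal sketch of what a few of these scorers could look like, each turning a sentence into a count that a classifier can use as a feature. The word list and all function names here are hypothetical; the slides do not give the actual lexicons or scoring rules.

```python
import re

# Hypothetical word list; the real lexicon is not shown in the slides.
EMOTIONAL_WORDS = {"idiot", "miserable", "stunning", "terrible", "powerful"}


def exclamation_score(sentence):
    """Count exclamation points."""
    return sentence.count("!")


def capitalization_score(sentence):
    """Count fully capitalized words of two or more letters (e.g. 'NEVER')."""
    return len(re.findall(r"\b[A-Z]{2,}\b", sentence))


def emotional_word_score(sentence):
    """Count tokens that appear in the assumed emotional-word list."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for tok in tokens if tok in EMOTIONAL_WORDS)


SCORERS = {
    "exclamations": exclamation_score,
    "capitalization": capitalization_score,
    "emotional_words": emotional_word_score,
}


def score_sentence(sentence):
    """Run every scorer and return a feature-name -> count mapping."""
    return {name: scorer(sentence) for name, scorer in SCORERS.items()}


features = score_sentence("This is a TERRIBLE idea, idiot!")
```

Keeping each scorer as a small independent function makes it easy to add or drop features while tuning, which fits the focus on improving scorers described in these slides.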

Classifiers

Comparison baseline
Multinomial Naïve Bayes
Struggled with SVM
Focused on getting better scorers and data set instead of working on SVM
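For readers unfamiliar with the technique, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier over feature counts, with add-one (Laplace) smoothing. It is not the project's code: the class name, the labels `summary` and `other`, and the toy training data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over feature counts, with add-one smoothing."""

    def fit(self, examples, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(examples, labels):
            for name, count in feats.items():
                self.feature_counts[label][name] += count
                self.vocab.add(name)
        return self

    def log_posteriors(self, feats):
        """Return unnormalized log P(class | features) for every class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # class prior
            for name, count in feats.items():
                if name in self.vocab:  # ignore features never seen in training
                    score += count * math.log((counts[name] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, feats):
        scores = self.log_posteriors(feats)
        return max(scores, key=scores.get)


# Toy training data (made up): feature counts per sentence and a label.
train = [
    ({"exclamations": 2, "emotional_words": 3}, "summary"),
    ({"exclamations": 1, "emotional_words": 2, "capitalization": 1}, "summary"),
    ({"neutral_words": 4}, "other"),
    ({"neutral_words": 3}, "other"),
]
model = MultinomialNB().fit([f for f, _ in train], [y for _, y in train])
prediction = model.predict({"emotional_words": 2, "exclamations": 1})
```

Sentences can then be ranked by the log score of the `summary` class rather than only classified, which is one way to produce a ranking like the sample that follows.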

A sample ranking
Election:

Terrorists are cheering because Democrats have been championing their cause since 2003 … Islamic throat-cutting fascists know that a Democrat win is a win for Islamic throat-cutting fascists. (correct)

How miserable is your political party when you have the enemy of your country cheering for your victory [sic] … (correct)

Yesterday was a victory for all of you useful idiots who claim to be smarter than everyone else and a victory for the terrorists who played you like idiots against your own government. (miss)

As we improved, a hit or miss became an arbitrary thing.

Machine vs. human summarization

Election:
Machine: ...Democrats ... will have won a stunning 73% of Senate seats ... .
Human: Enjoy!

Inconvenient Truth:
Machine: You don't have to be a fan of Gore, or his politics, to find his message about global warming worth considering.
Human: An Inconvenient Truth is a powerful film that makes you think about the topic of conservation.

IE7:
Machine: Fortunately, I use Firefox for most things, so I still have web access.
Human: Yes, I know it is hard to imagine incompetence at Microsoft, but I have to bring up the latest turd from Redmond that has bee foisted upon an unsuspecting population: Internet Explorer 7 Or should I say Internet Destroyer 7 ... .

Cross-Validation results

Election: Accuracy: retrieved 25 of actual 26, out of 335 possible. Recall: 0.77. Precision: 0.80
Inconvenient Truth: Accuracy: retrieved 10 of actual 18, out of 137 possible. Recall: 0.56. Precision: 0.67
IE7: Accuracy: retrieved 12 of actual 21, out of 88 possible. Recall: 0.38. Precision: 0.67
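As a reminder of how recall and precision are computed for sentence selection, here is a minimal sketch. The function name and the sentence ids are made up for illustration and are not the project's data.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved items.

    precision = correct retrieved / all retrieved
    recall    = correct retrieved / all relevant (gold standard)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: ids of sentences picked by hand vs. by the classifier.
gold = {1, 4, 7, 9}
picked = {1, 4, 8}
p, r = precision_recall(picked, gold)
```

Here two of the three picked sentences are in the gold standard (precision 2/3), and two of the four gold sentences were found (recall 1/2).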

NLP Analyzer: demo run on test set

Demo: http://harbinger.sims.berkeley.edu/~k7lim/ANLPWebservice/affectservice.wordy.xml

Challenges
Full-text extraction: resolve dependency on blog formats
Informality of bloggers: smart quotes, ellipses, etc., which require special handling; our segmenter fails to segment sentences that don't have capitalization
Stemmers are hard to obtain (bottleneck): morphy is slow, Porter is terrible
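One possible way to handle the informality issues above is sketched here: map smart quotes and the ellipsis character to plain ASCII, then split sentences on end punctuation without requiring the next sentence to start with a capital letter. This is an illustrative assumption, not the project's segmenter; the names `NORMALIZE`, `normalize`, and `segment` are hypothetical.

```python
import re

# Map typographic characters to plain ASCII equivalents.
NORMALIZE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2026": "...",                # ellipsis character
})


def normalize(text):
    """Replace smart quotes and ellipsis characters with ASCII."""
    return text.translate(NORMALIZE)


def segment(text):
    """Split after '.', '!' or '?' followed by whitespace.

    The case of the next character is not checked, so lowercase-only
    blog text is still segmented.
    """
    parts = re.split(r"(?<=[.!?])\s+", normalize(text).strip())
    return [part for part in parts if part]


sentences = segment("so i upgraded. it broke! why? no idea")
```

The trade-off is that a rule this simple also splits after abbreviations such as "e.g." and after an ellipsis in mid-sentence, so it would need extra rules before being used on real blog text.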

Future work: The Automatic Pundit

Point/Counterpoint formulation and presentation: automatic agent that can advocate the core arguments on behalf of each side of a given issue
This would require classification of summaries into positive/negative valences…
…and more accurate summaries…

Questions?
