I256: Applied Natural Language Processing
Marti Hearst
Nov 8, 2006


Today

Comparing term clustering and category output
Clustering in Weka
Data mining from blogs

LDA

Latent Dirichlet Allocation
Blei, Ng, Jordan, JMLR 03.
LDA is a hierarchical probabilistic model of documents.
“LDA allows you to analyze a corpus, and extract the topics that combined to form its documents.”
http://www.cs.princeton.edu/~blei/lda-c/
Not really clustering, but in the “soft clustering” ballpark.
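To make the "topics that combined to form its documents" idea concrete, here is a minimal sketch of LDA's generative story: each document draws its own topic mixture, and each word is drawn from one topic chosen according to that mixture. The vocabulary, the two topic-word distributions, and the function names are made up for illustration; this is not the lda-c code and it does no inference.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index according to a discrete probability distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_document(topic_word, alpha, vocab, length, rng):
    """Generate one document under LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)        # per-document topic mixture
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)            # pick a topic for this word
        w = sample_index(topic_word[z], rng)    # pick a word from that topic
        words.append(vocab[w])
    return theta, words

# Toy, made-up vocabulary and topic-word distributions (assumptions).
vocab = ["flour", "sugar", "bake", "garlic", "onion", "saute"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # hypothetical "baking" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # hypothetical "savory" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, [0.5, 0.5], vocab, 8, rng)
print(theta, words)
```

Because every document gets a mixture over topics rather than a single label, the result behaves like soft clustering rather than a hard partition.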

LDA on Recipes
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/

Flamenco

CastaNet

(Semi)automated facet creation
Stoica & Hearst
Build up from WordNet
Algorithm is fully automatic, but we think you can improve results manually afterwards.

CastaNet on Recipes
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/

Flamenco

TopicSeek on Enron Email
Technique: pLSI (probabilistic LSI, Hofmann 99)
Hand-picked example for website
http://topicseek.com/enron.html

TopicSeek on Medline
Technique: pLSI (probabilistic LSI, Hofmann 99)
Hand-picked example for website
http://topicseek.com/pubmed.html

CastaNet on Medline Journal Titles
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/

Flamenco

Clustering in Weka

Looking at Clustering Results

Weka lets you save cluster results to an ARFF file.
I wrote some Python code to process this file and pull out the Subject headings for each newsgroup posting in each cluster.
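The original script is not shown, so the following is only a sketch of what such processing might look like. It assumes a dense ARFF file with a string attribute named subject and a final nominal attribute named Cluster; both names, the sample data, and the function names are assumptions for illustration. It also assumes values are single-quoted without embedded apostrophes.

```python
import csv
from collections import defaultdict

def read_arff(lines):
    """Parse a simple dense ARFF file into (attribute_names, rows)."""
    names, data_lines, in_data = [], [], False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue                      # skip blank lines and comments
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # Second whitespace-separated token is the attribute name.
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    rows = list(csv.reader(data_lines, quotechar="'", skipinitialspace=True))
    return names, rows

def subjects_by_cluster(names, rows, subject_attr="subject", cluster_attr="Cluster"):
    """Group the subject field of each instance by its cluster label."""
    s, c = names.index(subject_attr), names.index(cluster_attr)
    groups = defaultdict(list)
    for row in rows:
        groups[row[c]].append(row[s])
    return dict(groups)

# Made-up sample in the assumed layout.
SAMPLE = """\
@relation newsgroups_clustered
@attribute subject string
@attribute numWords numeric
@attribute Cluster {cluster0,cluster1}
@data
'Re: Mac monitor question',120,cluster0
'Goalie masks, again',85,cluster1
'Re: SCSI vs IDE',200,cluster0
"""

names, rows = read_arff(SAMPLE.splitlines())
clusters = subjects_by_cluster(names, rows)
for label in sorted(clusters):
    print(label, clusters[label])
```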

15-way clustering

Cobweb clustering

Blog Analysis

What’s special about blogs?

Blog analysis sites

http://dijest.com/bc/
– Called blogcount; lots of stats and news about blogs

http://blogcensus.net/?page=tools
– Language, location, market share

http://www.perseus.com/blogsurvey/
– Stats about biggest blogs, demographics

http://www.weblogs.com/
– Notify when new content posted

http://blogpulse.com/
– Trends and recent popular topics

Blogs vs. Newsgroups

Posting about products … what can we tell?

Blog:

Newsgroup:

Example from Glance, Hurst, and Tomokiyo ‘04

Analyzing Blogs for Market Data

Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05

Idea: examine comments about a product (or a product’s competition or market) in an automated fashion.
Application area: handheld electronic devices.

Technology used

Post segmentation
Important phrases

Foreground vs. background corpus
– Background: text about product
– Foreground: certain negative paragraphs about product

Sentiment classification
– What do people talk about when saying negative things about product X?

Social network analysis (on discussion boards)
– What does this group of people talk about when saying negative things about product X?

Author dispersion
– Many people talking about it, or just a few?
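The foreground vs. background comparison above can be sketched as a phrase-scoring step: a phrase is "important" if it is much more frequent in the foreground (negative paragraphs) than in the background (all text about the product). The slides do not say which statistic was used, so the smoothed log ratio of bigram frequencies below is an assumption, as are the function names and the toy sentences.

```python
import math
from collections import Counter

def bigrams(text):
    """Split text on whitespace and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def distinctive_phrases(foreground, background, top_n=3):
    """Rank foreground bigrams by smoothed log ratio of foreground to background frequency."""
    fg = Counter(pair for doc in foreground for pair in bigrams(doc))
    bk = Counter(pair for doc in background for pair in bigrams(doc))
    fg_total, bk_total = sum(fg.values()), sum(bk.values())
    vocab_size = len(set(fg) | set(bk))
    scores = {}
    for phrase, count in fg.items():
        # Add-one smoothing so phrases unseen in the background do not divide by zero.
        p_fg = (count + 1) / (fg_total + vocab_size)
        p_bk = (bk[phrase] + 1) / (bk_total + vocab_size)
        scores[phrase] = math.log(p_fg / p_bk)
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]

# Made-up toy corpora (assumptions, not data from the paper).
background = ["the screen is bright", "the screen is sharp", "the battery is fine"]
foreground = ["the battery died", "the battery died again"]

top = distinctive_phrases(foreground, background)
print(top)
```

In this toy example "the battery" scores lowest of the three foreground phrases because it also appears in the background text.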

Example

What common phrases do people use when saying negative things about product X?

Example

What do people in this group say when saying negative things about product X?

Predicting Film Sales

Idea:
– Use discussion before a film’s release to predict its opening weekend box office scores
– Use discussion afterwards to predict longer-term sales
– Separate out topic labels from sentiment labels

Outcome:
– Good predictor for opening weekend, but not for longer term
– Observation: the nature of the discussion changes (and thus becomes harder to analyze) after the film has been out a while.

Example from Mishne & Glance, 2006
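One simple way to check whether pre-release discussion tracks opening weekend sales is a correlation coefficient. The sketch below assumes Pearson's r as the measure (the slide does not name the statistic) and uses made-up numbers for four hypothetical films; it is not the paper's data or method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                         # correlation undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values: positive pre-release blog posts and opening gross (millions).
positive_posts = [120, 300, 80, 500]
opening_gross = [10.0, 24.0, 7.5, 41.0]

r = pearson(positive_posts, opening_gross)
print(round(r, 3))
```

The same function could be run on post-release discussion against longer-term sales to compare the two settings described in the outcome.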

Analyzing Political Blogs

Analyze:
Who links to whom
What the popularity profile looks like

– A power law/Zipf/Pareto, of course

Look at structure of topic-specific blogs

– By #inbound links

Image from blogosphere ecosystem via Shirky
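Ranking blogs by number of inbound links, as described above, needs only an edge list of (linking blog, linked blog) pairs. The blog names and the function name below are hypothetical; this is a sketch of the counting step, not of any particular study's pipeline.

```python
from collections import Counter

def rank_by_inbound_links(links):
    """Return (blog, inbound_count) pairs sorted from most to least linked."""
    # Ignore self-links so a blog cannot inflate its own count.
    inbound = Counter(target for source, target in links if source != target)
    return sorted(inbound.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up link data (assumption).
links = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"),
    ("a", "b"), ("hub", "b"),
    ("c", "a"),
]

ranking = rank_by_inbound_links(links)
print(ranking)
```

On real blog data, plotting these counts against rank on log-log axes is the usual way to eyeball the power-law shape mentioned above.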

Analyzing Political Blogs

Earlier work examined books bought together in pairs at major retailers

Krebs, Divided we Stand??? http://www.orgnet.com/leftright.html

In other domains the groupings are more distributed.

http://www.orgnet.com/booknet.html

http://www.orgnet.com/leftright.html from Jan 2003

http://www.orgnet.com/divided.html from 2004 election

Analyzing Political Blogs

Study by Adamic and Glance, 2005
Analyzed 40 most popular political blogs
2 months preceding 2004 US presidential election
Also studied 1000 political blogs in a one-day snapshot
Findings for the latter:

Liberal and conservative blogs had distinct lists of favorite news sources, people, and topics, with some overlap on current news

– Use labels from aggregator sources

Linking patterns were indeed pretty internal (91% stayed within political leaning)
More, and more frequent, linking among conservatives

– 82% of conservative blogs linked out vs. 74% of liberal blogs
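A figure like "91% stayed within political leaning" is the fraction of links whose source and target carry the same label. The sketch below shows that computation on a made-up graph; the blog names, labels, and function name are hypothetical and the toy result is not the study's number.

```python
def within_leaning_fraction(links, leaning):
    """Fraction of links that connect two blogs with the same political label."""
    # Only count links where both endpoints have a known label.
    labeled = [(s, t) for s, t in links if s in leaning and t in leaning]
    if not labeled:
        return 0.0
    same = sum(1 for s, t in labeled if leaning[s] == leaning[t])
    return same / len(labeled)

# Made-up labels and links (assumptions).
leaning = {"L1": "liberal", "L2": "liberal", "C1": "conservative", "C2": "conservative"}
links = [("L1", "L2"), ("L2", "L1"), ("C1", "C2"), ("C2", "C1"), ("L1", "C1")]

fraction = within_leaning_fraction(links, leaning)
print(fraction)
```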

Analyzing Political Blogs

For the 40 most popular blogs:
Looked for “echo chamber” effect

The conservative blogs are more tightly interlinked.
Question: do they repeat the same concepts more?

– Measured textual similarity among blog posts
– Slightly stronger within a political leaning than between, but not one orientation more than the other.

Looked for interaction with “mainstream” media

Found strong distinctions in which sources were cited
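The echo chamber question above comes down to comparing average post similarity within a leaning against similarity between leanings. The slide does not say how similarity was measured, so this sketch assumes bag-of-words cosine similarity; the posts and function names are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def mean_similarity(posts, same_leaning):
    """Average pairwise similarity over same-leaning or cross-leaning pairs."""
    sims = [
        cosine(t1, t2)
        for (l1, t1), (l2, t2) in combinations(posts, 2)
        if (l1 == l2) == same_leaning
    ]
    return sum(sims) / len(sims) if sims else 0.0

# Made-up (leaning, text) pairs (assumptions).
posts = [
    ("liberal", "tax cuts hurt workers"),
    ("liberal", "tax cuts hurt schools"),
    ("conservative", "tax cuts help growth"),
    ("conservative", "tax cuts help business"),
]

within = mean_similarity(posts, same_leaning=True)
between = mean_similarity(posts, same_leaning=False)
print(within, between)
```

To test whether one orientation repeats itself more than the other, the within-leaning average would be computed separately for each label and the two values compared.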

Image from Adamic & Glance 200

Next Time

Sentiment and Opinion Analysis