CETR: Content Extraction via Tag Ratios

26
Data and Information Systems Laboratory University of Illinois Urbana-Champaign WWW Conference April 30, 2010 CETR: Content Extraction via Tag Ratios Tim Weninger , William H. Hsu and Jiawei Han Department of Computer Science University of Illinois Urbana-Champaign, Urbana, IL Department of Computing and Information Sciences Kansas State University, Manhattan KS

description

CETR: Content Extraction via Tag Ratios. Tim Weninger , William H. Hsu and Jiawei Han. Department of Computer Science University of Illinois Urbana-Champaign, Urbana, IL Department of Computing and Information Sciences Kansas State University, Manhattan KS. Problem: - PowerPoint PPT Presentation

Transcript of CETR: Content Extraction via Tag Ratios

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

CETR: Content Extraction via Tag Ratios

Tim Weninger, William H. Hsu and Jiawei Han

Department of Computer ScienceUniversity of Illinois Urbana-Champaign, Urbana, IL

Department of Computing and Information SciencesKansas State University, Manhattan KS

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Problem:Too much junk in a

web page

Goal:Extract only the

content of a page

Taken from The Hutchinson News on 8/14/2008

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Rendered HTML Document

Text content of the document

Published online 8/13/2008

A home away from schoolDay care has after-school duties as some clients start academic yearBy Kristen Roderick - The Hutchinson News - [email protected]

(Travis Morisse/The Hutchinson News) Mary Waln, 7, and Nija Morris, 6, read “The Magic Mat” together Wednesday at Hadley Day Care.

The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.

Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.

"I played and I did art and I played outside and I went to the gym, and I went inside and did centers," she said. "And then I went to meet the other classes and then we went home."

The school-aged children were a little more wound up on Wednesday, program director Christie Gardner said. The excitement is always higher the first day of school, and not everyone is in a routine.

Example

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Naïve Approach› Remove all HTML tags

Original, Rendered HTML Document

RSSCIRCULATIONYOUR ACCOUNTCONTACT USHomeNews

Top StoriesLocal/Regional NewsBriefsEducationAsk HutchBusinessPublic RecordSpecial sectionsVideosPhotosSlideshowsForums

WeatherObituariesSports

Latest SportsNJCAA Tournament

Opinion EditorialsLetters to the EditorColumnists

Lifestyles FoodHealthReligionOutdoors

Life’s Little MomentsWeddingsEngagementsAnniversariesComing EventsEt cetera

Entertainment PreviewTV listings

JobsAutosClassifiedsMarketplace Archive searchThursday, August 14, 2008 10 : 35 AM Search the Web using Hutch News Weather Forecast today’s top stories

Published online 8/13/2008

A home away from schoolDay care has after-school duties as some clients start academic yearBy Kristen Roderick - The Hutchinson News - [email protected]

(Travis Morisse/The Hutchinson News) Mary

Waln, 7, and Nija Morris, 6, read “The Magic Mat” together Wednesday at Hadley Day Care. The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.

Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.

"I played and I did art and I played outside and I went to the gym, and I went inside and did centers," she said. "And then I went to meet the other classes and then we went home."

All Text of the Document

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Tag-based Approach› Use HTML tags as clues for content› Problem: Style-sheets

Original, Rendered HTML Document

<div><div></div><div>

<div>Eat at Joes

</div></div><div>

<div><div>

The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.

</div><div>

Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.

</div></div>

</div></div>

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

• Wrapper Generation› Learn rules via machine learning from webpage› Problem: pages vary, web changes

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Text-to-Tag Ratio

Algorithm 1: Text-To-Tag Ratio pseudocodeinput

h ← HTML source codebegin

Remove all script, remark tags and empty linesfor each line k to numLines( h ) do

x ← number of non-tag ASCII characters in h[k]y ← number of tags in h[k]if y = 0 then

TTRArray[i] ← xelse

TTRArray[i] ← x / yend if

end forreturn TTRArray

end

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Example

http://www2010.org/www/2010/04/program-guide/

Text: 21 - Tags: 8 -> TTR: 2.63

Text: 22 - Tags: 8 -> TTR: 2.75

Text: 298 - Tags: 6 -> TTR: 49.67

Text: 0 - Tags: 0 -> TTR: 0Text: 0 - Tags: 1 -> TTR: 0

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

1 26 51 76 1011261511762012262512763013263513764014260

50

100

150

200

250

Line Number

Text

To

Tag

Ratio

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Preprocessing – Blur the tag ratios

1 11 21 31 41 51 61 71 81 91 1010

100

200

300

400

500

Line Number

Text

To

Tag

Rati

o

1 9 17 25 33 41 49 57 65 73 81 89 971050

50

100

150

200

250

Line Number

Text

To

Tag

Rati

o

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Apply a Threshold

Threshold based on the standard deviation

1 25 49 73 97 1211451691932172412652893133373613854090

20

40

60

80

100

120

Line Number

Text

To

Tag

Ratio

Std. Dev. Is 20.3TTR for Hutchinson News document

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

What’s wrong with this?

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Worst CaseAmerican Declaration of Independence Web page

American Declaration of IndependenceTTR computed from digital copy at

http://www.ushistory.org/declaration/document/index.htm

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 1060

100

200

300

400

500

Line NumberTe

xt T

o Ta

g Ra

tio

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Histogram Clustering in 2-Dimensions

Looks for jumps in the moving average of TTR

1 50 99 1481972462953443930

20

40

60

80

100

120

Line Number

Text

To

Tag

Ratio

1 50 99 148197246295344393-150

-100

-50

0

50

100

150

Line Number

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Histogram Clustering in 2-Dimensions

Absolute value gives insight

1 52 103154205256307358409-150

-100

-50

0

50

100

150

Line Number

1 46 91 1361812262713163614060

100200300400500600700800

Line Number

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

0 25 50 75 1000

102030405060708090

100

TTR (hʹ)

Diff

eren

ces

(g')

Histogram Clustering in 2-Dimensions

Make a scatterplot

0 25 50 75 1000

20

40

60

80

100

TTR (hʹ)

Diff

eren

ces

(g')

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

0 25 50 75 1000

10

20

30

40

50

60

70

80

90

100

TTR (hʹ)

Diff

eren

ces

(g')

How should we cluster this?

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

0 25 50 75 1000

10

20

30

40

50

60

70

80

90

100

TTR (hʹ)

Diff

eren

ces

(g')

Modified k-Means

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

0 25 50 75 1000

10

20

30

40

50

60

70

80

90

100

TTR (hʹ)

Diff

eren

ces

(g')

Can also use a Max-Margin ApproachBut we don’t

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Evaluation methodCleanEval Dataset (Standard Evaluation dataset)

741 English documents 713 Chinese documents

Myriad 40 Dataset40 News Websites – 206 documents at random

Big 5 DatasetNY Post, Freep, Suntimes, Techweb, Tribune – 50 each

BBC and NYTimes50 documents each

Evaluation MetricsPrecision, Recall, F1-Measure

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

CETR Results

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Case study

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

CETR-TM as the σ coefficient (λ) (i.e. threshold) varies

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Worst Case RevisitedNon-HTML or all content pages

1 11 21 31 41 51 61 71 81 91 1011111211311410

20

40

60

80

100

120

Line Number

Text

To

Tag

Rati

o

approximation

WWW’10 Paper

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Conclusions and Future Work

Tag Ratio ApproachContent extraction technique – outperforms existing

methodsSimple… easy to implement and useStatic method

no training requiredJust give a document and get content

Future WorkUse for page segmentationUse in Search/Retrieval

Data and Information Systems LaboratoryUniversity of Illinois Urbana-Champaign

WWW ConferenceApril 30, 2010

Questions?