Data and Information Systems Laboratory, University of Illinois Urbana-Champaign
WWW Conference, April 30, 2010
CETR: Content Extraction via Tag Ratios
Tim Weninger, William H. Hsu and Jiawei Han
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL
Department of Computing and Information Sciences, Kansas State University, Manhattan, KS
Problem: Too much junk in a web page
Goal: Extract only the content of a page
Taken from The Hutchinson News on 8/14/2008
Rendered HTML Document
Text content of the document
Published online 8/13/2008
A home away from school
Day care has after-school duties as some clients start academic year
By Kristen Roderick - The Hutchinson News - [email protected]
(Travis Morisse/The Hutchinson News) Mary Waln, 7, and Nija Morris, 6, read “The Magic Mat” together Wednesday at Hadley Day Care.
The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.
Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.
"I played and I did art and I played outside and I went to the gym, and I went inside and did centers," she said. "And then I went to meet the other classes and then we went home."
The school-aged children were a little more wound up on Wednesday, program director Christie Gardner said. The excitement is always higher the first day of school, and not everyone is in a routine.
Example
Naïve Approach
› Remove all HTML tags
Original, Rendered HTML Document
RSSCIRCULATIONYOUR ACCOUNTCONTACT USHomeNews
Top StoriesLocal/Regional NewsBriefsEducationAsk HutchBusinessPublic RecordSpecial sectionsVideosPhotosSlideshowsForums
WeatherObituariesSports
Latest SportsNJCAA Tournament
Opinion EditorialsLetters to the EditorColumnists
Lifestyles FoodHealthReligionOutdoors
Life’s Little MomentsWeddingsEngagementsAnniversariesComing EventsEt cetera
Entertainment PreviewTV listings
JobsAutosClassifiedsMarketplace Archive searchThursday, August 14, 2008 10 : 35 AM Search the Web using Hutch News Weather Forecast today’s top stories
Published online 8/13/2008
A home away from schoolDay care has after-school duties as some clients start academic yearBy Kristen Roderick - The Hutchinson News - [email protected]
(Travis Morisse/The Hutchinson News) Mary
Waln, 7, and Nija Morris, 6, read “The Magic Mat” together Wednesday at Hadley Day Care. The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.
Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.
"I played and I did art and I played outside and I went to the gym, and I went inside and did centers," she said. "And then I went to meet the other classes and then we went home."
All Text of the Document
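The naïve approach can be sketched in a few lines. This is a minimal illustration using Python's standard `html.parser`, not the authors' code; the helper name `strip_tags` and the sample page are made up for the example. It shows how navigation text and article text end up run together once the tags are gone.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every text node and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the concatenated text of an HTML string (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


# Made-up page: two menu links followed by one article sentence.
page = "<div><a>Home</a><a>News</a><p>The doors opened Wednesday.</p></div>"
print(strip_tags(page))
```

The menu entries and the story become one undifferentiated string, which is the problem the slide illustrates.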
Tag-based Approach
› Use HTML tags as clues for content
› Problem: Style-sheets
Original, Rendered HTML Document
<div><div></div><div>
<div>Eat at Joes
</div></div><div>
<div><div>
The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of their first day of school.
</div><div>
Nija Morris, a 6-year-old attending Faris Elementary, smiled as she hung her pink-and-blue flowered backpack on a hook and talked to her classmates about her first day.
</div></div>
</div></div>
Wrapper Generation
› Learn rules via machine learning from web pages
› Problem: pages vary, web changes
Text-to-Tag Ratio
Algorithm 1: Text-To-Tag Ratio pseudocode
input: h ← HTML source code
begin
    Remove all script, remark tags and empty lines
    for each line k = 1 to numLines(h) do
        x ← number of non-tag ASCII characters in h[k]
        y ← number of tags in h[k]
        if y = 0 then
            TTRArray[k] ← x
        else
            TTRArray[k] ← x / y
        end if
    end for
    return TTRArray
end
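A runnable sketch of Algorithm 1 follows. It rests on several assumptions that the pseudocode does not settle: tags are found with a simple regular expression (so a tag split across lines is miscounted), "remark tags" is read as HTML comments, and leading and trailing whitespace is not counted as text. The function name `text_to_tag_ratios` is illustrative.

```python
import re

# Assumed, simplified patterns; real-world HTML may need a proper parser.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return one text-to-tag ratio per non-empty line of the HTML source."""
    cleaned = SCRIPT_RE.sub("", html)
    cleaned = COMMENT_RE.sub("", cleaned)
    ratios = []
    for line in cleaned.splitlines():
        if not line.strip():
            continue  # empty lines are dropped, as in the pseudocode
        tags = len(TAG_RE.findall(line))
        text = len(TAG_RE.sub("", line).strip())
        # With no tags on the line, the ratio is just the text length.
        ratios.append(text if tags == 0 else text / tags)
    return ratios


sample = (
    "<div><a href='#'>Home</a></div>\n"
    "\n"
    "<p>Some long paragraph text here.</p>\n"
    "plain text\n"
    "<script>var x = 1;</script>\n"
)
print(text_to_tag_ratios(sample))
```

The menu line scores low (4 characters over 4 tags), while the paragraph line scores much higher (30 characters over 2 tags), which is the signal the rest of the method builds on.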
Example
http://www2010.org/www/2010/04/program-guide/
Text: 21 - Tags: 8 -> TTR: 2.63
Text: 22 - Tags: 8 -> TTR: 2.75
Text: 298 - Tags: 6 -> TTR: 49.67
Text: 0 - Tags: 0 -> TTR: 0
Text: 0 - Tags: 1 -> TTR: 0
[Figure: Text To Tag Ratio against Line Number]
Preprocessing – Blur the tag ratios
[Figures: two plots of Text To Tag Ratio against Line Number]
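One way to blur the tag ratios is a one-dimensional Gaussian filter. The slide does not specify the kernel, so the Gaussian shape, the `radius` and `sigma` defaults, and the edge handling (clamping to the first and last line) in this sketch are all assumptions.

```python
import math


def smooth(ratios, radius=2, sigma=1.0):
    """Blur a list of tag ratios with a normalised Gaussian kernel (assumed)."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(j * j) / (2 * sigma * sigma)) for j in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]  # weights sum to 1

    n = len(ratios)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(offsets, kernel):
            j = min(max(i + offset, 0), n - 1)  # clamp at the array edges
            acc += weight * ratios[j]
        smoothed.append(acc)
    return smoothed


spiky = [0.0, 0.0, 10.0, 0.0, 0.0]
print(smooth(spiky))
```

A lone spike is spread over its neighbours, so a short content line sitting between long ones is less likely to be cut off by a later threshold.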
Apply a Threshold
Threshold based on the standard deviation
[Figure: TTR for Hutchinson News document; Text To Tag Ratio against Line Number]
Std. Dev. is 20.3
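A minimal sketch of the threshold step: lines whose (smoothed) tag ratio reaches a multiple of the standard deviation are kept as content. The coefficient `lam` stands for the σ coefficient (λ) mentioned on a later slide; using the population standard deviation and a `>=` comparison are assumptions, and the sample values are invented.

```python
import statistics


def threshold_lines(ratios, lam=1.0):
    """Return indices of lines whose ratio is at least lam * std. dev. (assumed rule)."""
    cutoff = lam * statistics.pstdev(ratios)
    return [i for i, ratio in enumerate(ratios) if ratio >= cutoff]


# Invented ratios: low navigation lines around a block of long text lines.
ratios = [1, 2, 1, 40, 45, 50, 2, 1]
print(threshold_lines(ratios))
```

Raising `lam` makes the selection stricter; setting it to 0 keeps every line.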
What’s wrong with this?
Worst Case
American Declaration of Independence Web page
[Figure: American Declaration of Independence; Text To Tag Ratio against Line Number]
TTR computed from digital copy at
http://www.ushistory.org/declaration/document/index.htm
Histogram Clustering in 2-Dimensions
Looks for jumps in the moving average of TTR
[Figure: Text To Tag Ratio against Line Number]
[Figure: unlabeled series against Line Number, ranging from -150 to 150]
Histogram Clustering in 2-Dimensions
Absolute value gives insight
[Figure: unlabeled series against Line Number, ranging from -150 to 150]
[Figure: gʹ against Line Number]
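The "jumps in the moving average" can be approximated as the difference between a short forward moving average and the current line's ratio, with the absolute value taken afterwards. The slides give no formula, so the window definition and the `alpha=3` default here are assumptions, not the authors' stated method.

```python
def abs_derivatives(ratios, alpha=3):
    """Absolute difference between a forward moving average and each ratio.

    The window covers the current line and the following alpha - 1 lines
    (assumed); it simply shrinks near the end of the document.
    """
    diffs = []
    for i, ratio in enumerate(ratios):
        window = ratios[i:i + alpha]
        moving_avg = sum(window) / len(window)
        diffs.append(abs(moving_avg - ratio))
    return diffs


# Invented ratios: a flat low region followed by a flat high region.
ratios = [0, 0, 0, 30, 30, 30]
print(abs_derivatives(ratios))
```

Flat regions give values near zero and the lines just before a jump stand out. Pairing each ratio with its difference, for example `list(zip(ratios, abs_derivatives(ratios)))`, yields two-dimensional points of the kind a scatterplot would show.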
Histogram Clustering in 2-Dimensions
Make a scatterplot
[Figures: two scatterplots of Differences (gʹ) against TTR (hʹ)]
[Figure: scatterplot of Differences (gʹ) against TTR (hʹ)]
How should we cluster this?
[Figure: scatterplot of Differences (gʹ) against TTR (hʹ)]
Modified k-Means
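The slide does not say what the modification to k-means is. The sketch below assumes one plausible variant: one centroid is pinned at the origin and never moves, points assigned to it are treated as non-content, and everything else is content. The value `k=3`, the deterministic initialisation, and the function name are also assumptions.

```python
import math


def kmeans_pinned_origin(points, k=3, iterations=50):
    """Cluster (ttr, diff) points; label 0 is the fixed origin cluster (assumed)."""
    if not points:
        return []
    # Deterministic start: origin plus k-1 points spread by distance from it.
    by_norm = sorted(points, key=lambda p: math.hypot(p[0], p[1]))
    centroids = [(0.0, 0.0)] + [
        by_norm[(len(by_norm) - 1) * j // (k - 1)] for j in range(1, k)
    ]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(1, k):  # centroid 0 stays at the origin
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


# Invented points: four near the origin, five clearly away from it.
points = [(1, 0), (2, 1), (0, 1), (1, 1), (60, 5), (62, 4), (58, 6), (20, 40), (22, 42)]
is_content = [label != 0 for label in kmeans_pinned_origin(points)]
print(is_content)
```

Pinning a centroid at the origin encodes the idea that lines with both a low tag ratio and a low difference are boilerplate, without having to pick a numeric cutoff.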
[Figure: scatterplot of Differences (gʹ) against TTR (hʹ)]
Can also use a Max-Margin Approach
But we don’t
Evaluation method
CleanEval Dataset (Standard Evaluation dataset)
› 741 English documents
› 713 Chinese documents
Myriad 40 Dataset
› 40 News Websites – 206 documents at random
Big 5 Dataset
› NY Post, Freep, Suntimes, Techweb, Tribune – 50 each
BBC and NYTimes
› 50 documents each
Evaluation Metrics
› Precision, Recall, F1-Measure
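For reference, precision, recall and F1 can be computed as below. The slides do not say how extracted text is matched against the gold standard, so this sketch assumes a word-level multiset overlap; the function name and the sample strings are illustrative only.

```python
from collections import Counter


def precision_recall_f1(extracted, gold):
    """Word-overlap precision, recall and F1 (matching scheme is an assumption)."""
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    overlap = sum((ext & ref).values())  # words found in both, with multiplicity
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: two stray menu words extracted, one gold word missed.
p, r, f1 = precision_recall_f1(
    "Home News the doors opened", "the doors opened Wednesday"
)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Precision penalises leftover junk in the extraction, recall penalises missing content, and F1 is their harmonic mean.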
CETR Results
Case study
CETR-TM as the σ coefficient (λ) (i.e. threshold) varies
Worst Case Revisited
Non-HTML or all content pages
[Figure: Text To Tag Ratio against Line Number; series labelled "approximation" and "WWW’10 Paper"]
Conclusions and Future Work
Tag Ratio Approach
› Content extraction technique – outperforms existing methods
› Simple… easy to implement and use
Static method
› no training required
› Just give a document and get content
Future Work
› Use for page segmentation
› Use in Search/Retrieval