Web Taxonomy Integration through Co-Bootstrapping
Dell Zhang, National University of Singapore
Wee Sun Lee, National University of Singapore
SIGIR’04
Problem Statement
Master taxonomy (e.g. Google):
Games > Roleplaying
•Final Fantasy Fan
•Dragon Quest Home
Games > Strategy
•Shogun: Total War
Source taxonomy (e.g. Yahoo!):
Games > Online
•EverQuest Addict
•Warcraft III Clan
Games > Single-Player
•Warcraft III Clan
Goal: place the source sites into the master taxonomy:
Games > Roleplaying
•Final Fantasy Fan
•Dragon Quest Home
•EverQuest Addict
•Warcraft III Clan
Games > Strategy
•Shogun: Total War
•Warcraft III Clan
Possible Approach
Train a classifier on the master taxonomy's sites:
Games > Roleplaying
•Final Fantasy Fan
•Dragon Quest Home
Games > Strategy
•Shogun: Total War
Then classify the source sites:
•EverQuest Addict
•Warcraft III Clan
Problem: this ignores the original Yahoo! categories.
Another Approach (1/2)
Use the Yahoo! categories directly.
•Advantage: similar categories carry over.
•Potential problem: the structures differ, so categories do not match exactly.
Another Approach (2/2)
Example: Crayon Shin-chan
•Yahoo!: Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan
•Google: Arts > Animation > Anime > Titles > C > Crayon Shin-chan
This Paper’s Approach
1. Weak Learner (as opposed to Naïve Bayes)
2. Boosting to combine Weak Hypotheses
3. New Idea: Co-Bootstrapping to exploit source categories
Assumptions
•Multi-category data are reduced to binary data:
  "Totoro Fan ∈ Cartoon > My Neighbor Totoro, Toys > My Neighbor Totoro"
  is converted into
  "Totoro Fan ∈ Cartoon > My Neighbor Totoro" and
  "Totoro Fan ∈ Toys > My Neighbor Totoro"
•Hierarchies are ignored: Console > Sega and Console > Sega > Dreamcast are not related.
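The first assumption can be sketched as a one-line flattening step (the function name is illustrative, not from the paper):

```python
# Flatten multi-category examples into binary (doc, category) pairs,
# as in the Totoro Fan example on this slide.
def to_binary(examples):
    # examples: list of (doc, list_of_categories)
    return [(doc, c) for doc, cats in examples for c in cats]
```

Each resulting pair is one training example "document x is in category y".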
Weak Learner
A type of classifier similar to Naïve Bayes.
•+ = accept, − = reject
•A term may be a word, an n-gram, or …
Weak Learner → (after training) → Weak Hypothesis (a term-based classifier)

Weak Hypothesis Example
•A document that contains "Crayon Shin-chan": in "Comics > Crayon Shin-chan", not in "Education > Early Childhood"
•A document that does not contain "Crayon Shin-chan": not in "Comics > Crayon Shin-chan", in "Education > Early Childhood"
Weak Learner Inputs (1/2)
Training data are of the form [x1, y1], [x2, y2], …, [xm, ym]:
•xi is a document
•yi is a category
•[xi, yi] means document xi is in category yi
•D(x, y) is a distribution over all combinations of xi and yi; D(xi, yj) indicates the "importance" of (xi, yj)
•w is the term (found automatically)
Weak Learner Algorithm
For each possible category y, compute four values:

\[
\begin{aligned}
W_0^{-} &= \sum_{i=1}^{m} D(x_i, y)\,[\![\,x_i \text{ does not contain } w \,\wedge\, x_i \notin y\,]\!]\\
W_0^{+} &= \sum_{i=1}^{m} D(x_i, y)\,[\![\,x_i \text{ does not contain } w \,\wedge\, x_i \in y\,]\!]\\
W_1^{-} &= \sum_{i=1}^{m} D(x_i, y)\,[\![\,x_i \text{ contains } w \,\wedge\, x_i \notin y\,]\!]\\
W_1^{+} &= \sum_{i=1}^{m} D(x_i, y)\,[\![\,x_i \text{ contains } w \,\wedge\, x_i \in y\,]\!]
\end{aligned}
\]

Note: pairs (xi, y) with greater D(xi, y) have more influence.
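The four weighted sums above can be sketched directly (a minimal illustration; names and data layout are assumptions, not the paper's code):

```python
# Compute the weak learner's four sums for one candidate term w and
# one category y. D weights each (document, category) pair.
def weak_counts(docs, labels, D, w, y):
    """docs: list of token sets; labels: list of sets of categories;
    D: dict mapping (i, y) -> weight. Returns (W1p, W1m, W0p, W0m)."""
    W1p = W1m = W0p = W0m = 0.0
    for i, (x, ys) in enumerate(zip(docs, labels)):
        d = D[(i, y)]
        if w in x:
            if y in ys: W1p += d   # contains w, in category y
            else:       W1m += d   # contains w, not in y
        else:
            if y in ys: W0p += d   # lacks w, in category y
            else:       W0m += d   # lacks w, not in y
    return W1p, W1m, W0p, W0m
```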
Weak Hypothesis h(x, y)
Given an unclassified document x and a category y:
If x contains w, then
\[
h(x, y) = \frac{1}{2}\ln\frac{W_1^{+}}{W_1^{-}}
\]
(the log-odds that a document containing w is in y rather than not in y).
Else, if x does not contain w, then
\[
h(x, y) = \frac{1}{2}\ln\frac{W_0^{+}}{W_0^{-}}
\]
(the log-odds that a document not containing w is in y rather than not in y).
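A sketch of the hypothesis output as half the log-odds. The small epsilon avoids log(0) when a sum is empty; the paper's actual smoothing constant may differ:

```python
import math

# Weak hypothesis value for one (document, category) pair, given the
# four W sums for the chosen term w.
def weak_h(contains_w, W1p, W1m, W0p, W0m, eps=1e-6):
    if contains_w:
        return 0.5 * math.log((W1p + eps) / (W1m + eps))
    return 0.5 * math.log((W0p + eps) / (W0m + eps))
```

A positive value votes the document into the category; a negative value votes it out.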
Weak Learner Comments
•If sign[h(x, y)] = +, then x is in y; |h(x, y)| is the confidence.
•The term w is found as follows: repeatedly run the weak learner for all possible w, and choose as the model the run with the smallest value of
\[
Z = \sum_{y}\left(\sqrt{W_1^{+} W_1^{-}} + \sqrt{W_0^{+} W_0^{-}}\right)
\]
•Boosting: minimizes the probability of h(x, y) having the wrong sign.
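The term-selection criterion can be sketched as scoring each candidate term by the BoosTexter-style quantity Z = Σ_y (√(W₁⁺W₁⁻) + √(W₀⁺W₀⁻)) and keeping the smallest (a sketch under that assumption; the function name is illustrative):

```python
import math

# Score one candidate term from its per-category W sums; smaller is better.
def z_score(counts_by_category):
    # counts_by_category: {y: (W1p, W1m, W0p, W0m)}
    return sum(math.sqrt(W1p * W1m) + math.sqrt(W0p * W0m)
               for W1p, W1m, W0p, W0m in counts_by_category.values())
```

A perfectly discriminating term (each pair of opposite sums has a zero) scores Z = 0.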
Boosting Idea
1. Train the weak learner on different Dt(x, y) distributions
2. After each run, adjust Dt(x, y) by putting more weight on the most often misclassified training data
3. Output the final hypothesis as a linear combination of weak hypotheses
Boosting Algorithm
Given: [x1, y1], [x2, y2], …, [xm, ym], where xi ∈ X and yi ∈ Y
Initialize D1(x, y) = 1/(mk)
for t = 1, …, T do
  Pass distribution Dt to the weak learner
  Get weak hypothesis ht(x, y)
  Choose αt ∈ ℝ
  Update
  \[
  D_{t+1}(x, y) = \frac{D_t(x, y)\exp\!\big(-\alpha_t\, Y_x[y]\, h_t(x, y)\big)}{Z_t}
  \]
  where Y_x[y] = +1 if x is in category y, −1 otherwise, and Z_t normalizes D_{t+1}
end for
Output the final hypothesis
\[
H(x, y) = \sum_{t=1}^{T} \alpha_t\, h_t(x, y)
\]
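One round's distribution update can be sketched as follows (a minimal illustration of the AdaBoost.MH-style rule; names and the ±1 membership indicator below are as usually defined, not copied from the paper's code):

```python
import math

# One distribution update: shrink weights of correctly classified
# (document, category) pairs, grow the misclassified ones, renormalize.
def update_D(D, h_t, alpha, docs, labels):
    newD = {}
    for (i, y), d in D.items():
        sign = 1.0 if y in labels[i] else -1.0   # Y_x[y]
        newD[(i, y)] = d * math.exp(-alpha * sign * h_t(docs[i], y))
    Z = sum(newD.values())                        # normalization factor Z_t
    return {k: v / Z for k, v in newD.items()}
```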
Boosting Algorithm Initialization
Given: [x1, y1], [x2, y2], …, [xm, ym]
Initialize D(x, y) = 1/(mk)
•k = total number of categories (a uniform distribution)

Boosting Algorithm Loop
for t = 1, …, T do
  Run the weak learner using distribution D
  Get weak hypothesis ht(x, y)
  For each possible pair (x, y) in the training data: if ht(x, y) guesses incorrectly, increase D(x, y)
end for
return
\[
H(x, y) = \sum_{t=1}^{T} \alpha_t\, h_t(x, y)
\]
Recall Example Problem
Games > Online
•EverQuest Addict
•Warcraft III Clan
Games > Single-Player
•Warcraft III Clan
Games > Roleplaying
•Final Fantasy Fan
•Dragon Quest Home
Games > Strategy
•Shogun: Total War
Co-Bootstrapping Algorithm (1/4)
1. Run AdaBoost on Yahoo! sites
• Get classifier Y1
2. Run AdaBoost on Google sites
• Get classifier G1
3. Run Y1 on Google sites
• Get predicted Yahoo! categories for Google sites
4. Run G1 on Yahoo! sites
• Get predicted Google categories for Yahoo! sites
Co-Bootstrapping Algorithm (2/4)
5. Run AdaBoost on Yahoo! sites
• Include Google category as a feature
• Get classifier Y2
6. Run AdaBoost on Google sites
• Include Yahoo! category as a feature
• Get classifier G2
7. Run Y2 on the original Google sites
• Get more accurate Yahoo! categories for Google sites
8. Run G2 on the original Yahoo! sites
• Get more accurate Google categories for Yahoo! sites
Co-Bootstrapping Algorithm (3/4)
9. Run AdaBoost on Yahoo! sites
• Include Google category as a feature
• Get classifier Y3
10. Run AdaBoost on Google sites
• Include Yahoo! category as a feature
• Get classifier G3
11. Run Y3 on the original Google sites
• Get even more accurate Yahoo! categories for Google sites
12. Run G3 on the original Yahoo! sites
• Get even more accurate Google categories for Yahoo! sites
Co-Bootstrapping Algorithm (4/4)
Repeat, repeat, and repeat…
Hopefully, the classification becomes more accurate after each iteration…
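The whole loop from steps 1–12 can be sketched at a high level (train and predict stand in for AdaBoost training and classification; every name here is illustrative):

```python
# Co-bootstrapping sketch: each side's classifier is retrained with the
# other taxonomy's predicted categories appended as extra features.
def co_bootstrap(yahoo_docs, google_docs, train, predict, iterations=3):
    y_feats = [[] for _ in google_docs]   # predicted Yahoo! cats of Google sites
    g_feats = [[] for _ in yahoo_docs]    # predicted Google cats of Yahoo! sites
    Y = G = None
    for _ in range(iterations):
        # Augment each collection with the other taxonomy's predictions.
        Y = train([d + f for d, f in zip(yahoo_docs, g_feats)])
        G = train([d + f for d, f in zip(google_docs, y_feats)])
        # Re-predict cross-taxonomy categories for the next round.
        y_feats = [predict(Y, d) for d in google_docs]
        g_feats = [predict(G, d) for d in yahoo_docs]
    return Y, G
```

Documents are represented as token lists here so that predicted category labels can simply be concatenated on as pseudo-terms.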
Enhanced Naïve Bayes (1/2)
Given a document x and the source category S of x, predict the master category C.
In NB:
\[
\Pr[C \mid x] \propto \Pr[C] \prod_{w \in x} \Pr[w \mid C]^{\,n(x, w)}
\]
•w: a word
•n(x, w): number of occurrences of w in x
In Enhanced NB:
\[
\Pr[C \mid x, S] \propto \Pr[C \mid S] \prod_{w \in x} \Pr[w \mid C]^{\,n(x, w)}
\]

Enhanced Naïve Bayes (2/2)
In NB the prior is
\[
\Pr[C] = \frac{|C|}{\sum_i |C_i|}
\]
Enhanced NB instead estimates
\[
\Pr[C \mid S] = \frac{|C|\,|C \cap S|}{\sum_i |C_i|\,|C_i \cap S|}
\]
•|C ∩ S|: number of docs in S that are classified into C by the NB classifier
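The two prior estimates can be sketched as follows, assuming the counts |C| and |C ∩ S| come from an initial Naïve Bayes pass as the slide describes (function names are illustrative):

```python
# C_sizes[c] = |C|; CS_sizes[c] = |C ∩ S| for a fixed source category S.
def prior(C_sizes, c):
    # Plain NB prior: |C| / sum_i |C_i|
    return C_sizes[c] / sum(C_sizes.values())

def prior_given_source(C_sizes, CS_sizes, c):
    # Enhanced NB: |C| * |C ∩ S| / sum_i |C_i| * |C_i ∩ S|
    num = C_sizes[c] * CS_sizes[c]
    den = sum(C_sizes[k] * CS_sizes[k] for k in C_sizes)
    return num / den
```

Categories that already attract many of S's documents get a sharper prior than the plain NB estimate would give them.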
Datasets

Dataset  Google                               Yahoo!
Book     /Top/Shopping/Publications/Books     /Business and Economy/Shopping and Services/Books/Bookstores
Disease  /Top/Health/Conditions and Diseases  /Health/Diseases and Conditions
Movie    /Top/Arts/Movies/Genres              /Entertainment/Movies and Film/Genres
Music    /Top/Arts/Music/Styles               /Entertainment/Music/Genres
News     /Top/News/By Subject                 /News and Media
Number of Categories*/Dataset (1/2)

Dataset  Google  Yahoo!
Book     49      41
Disease  30      51
Movie    34      25
Music    47      24
News     27      34

*Top-level categories only
Number of Categories*/Dataset (2/2)
Example (Book): top-level categories such as Horror, Science Fiction, Non-fiction; deeper categories such as Biography and History are merged into Non-fiction.
Number of Websites

Dataset  Google  Yahoo!  |G ∪ Y|  |G ∩ Y|
Book     10,842  11,268  21,111     999
Disease  34,047   9,785  41,439   2,393
Movie    36,787  14,366  49,744   1,409
Music    76,420  24,518  95,971   4,967
News     31,504  19,419  49,303   1,620
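The last two columns are the union and intersection of the two collections, which a quick check of the identity |G ∪ Y| = |G| + |Y| − |G ∩ Y| confirms:

```python
# (|G|, |Y|, |G ∪ Y|, |G ∩ Y|) per dataset, from the table above.
rows = {
    "Book":    (10842, 11268, 21111,  999),
    "Disease": (34047,  9785, 41439, 2393),
    "Movie":   (36787, 14366, 49744, 1409),
    "Music":   (76420, 24518, 95971, 4967),
    "News":    (31504, 19419, 49303, 1620),
}
for name, (g, y, union, inter) in rows.items():
    assert g + y - inter == union, name   # inclusion-exclusion holds
```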
Method (1/2)
Classify Yahoo! Book websites into Google Book categories (G←Y):
1. Find G ∩ Y for Book
2. Hide the Google categories of the sites in G ∩ Y
3. Put the G ∩ Y sites into the Yahoo! Book collection
4. Randomly take |G ∩ Y| sites from G − Y as the Google Book collection
Method (2/2)
For each dataset, run G←Y five times and Y←G five times.
•macro F-score: calculate the F-score for each category, then average over all categories
•micro F-score: calculate the F-score on the entire dataset (recall = 100%?)
•The paper doesn't say anything about multi-category ENB.
Results (1/3)
[Two bar charts, macro-averaged (left) and micro-averaged (right) F scores (0–0.9), comparing AB and CB-AB on Book, Disease, Movie, Music, and News, for both G←Y and Y←G.]
Co-Bootstrapping-AdaBoost > AdaBoost
Results (2/3)
[Line chart: maF(G←Y), miF(G←Y), maF(Y←G), and miF(Y←G) versus co-bootstrapping iteration (0–8), Book dataset; F scores range roughly 0 to 0.8.]
Co-Bootstrapping-AdaBoost iteratively improves AdaBoost
Results (3/3)
[Two bar charts, macro-averaged (left) and micro-averaged (right) F scores (0–0.9), comparing ENB and CB-AB on Book, Disease, Movie, Music, and News, for both G←Y and Y←G.]
Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes