Automatic Web Tagging and Person Tagging
Using Language Models
- Qiaozhu Mei†, Yi Zhang‡
Presented by Jessica Gronski‡
† University of Illinois at Urbana-Champaign
‡ University of California at Santa Cruz
Tagging a Web Document
• The dual problem of search/retrieval [Mei et al. 2007]
– Retrieval: short description (query) → relevant documents
– Tagging: document → short description (tag)
  • To summarize the content of documents
  • To access the document in the future
[Figure: "Text Document" and "Query/Tag" linked by two arrows labeled "retrieval" and "tagging"]
Existing Work on Social Bookmarking
• Social bookmarking systems
– Del.icio.us, Digg, CiteULike, etc.
• Enhance social bookmarking systems
– Anti-spam [Koutrika et al. 2007]
– Search & ranking tags [Hotho et al. 2006]
• Utilize social bookmarks
– Visualization [Dubinko et al. 2006]
– Summarization [Boydell et al. 2007]
– Use tags to help web search [Heymann et al. 2008]; [Zhou et al. 2008]
Research Questions
• Can we automatically generate tags for web documents?
– Meaningful, compact, relevant
• Can we generate tags for other web objects, such as web users?
Applications of Automatic Tagging
• Summarize documents / web objects
• Suggest social bookmarks
• Refine queries for web search
– Finding good queries for a document
• Suggest good keywords for online advertising
Rest of the Talk
• A probabilistic approach to tag generation
– Candidate tag selection
– Web document representation
– Tag ranking
• Experiments
– Web document tagging
– Web user tagging
• Summary
Our Method
[Figure: pipeline of the method]
Web Documents → representation → Multinomial word distribution:
data 0.1599, statistics 0.0752, tutorial 0.0660, analysis 0.0372, software 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, algorithm 0.0173, …
User-Generated Corpus (e.g., Del.icio.us, Wikipedia) → Candidate tag pool:
ipod nano, data mining, presidential campaign, index structure, statistics tutorial, computer science, …
Ranking candidate tags:
data mining 0.26, statistics tutorial 0.19, computer science 0.17, index structure 0.01, …, ipod nano 0.00001, presidential campaign 0.0, …
Candidate Tag Selection
• Meaningful, compact, user-oriented
• From social bookmarking data
– E.g., Del.icio.us
– Single tags → tags that other people used
– "Phrases" → statistically significant bigrams
• From other user-generated web content
– E.g., Wikipedia
– Titles of entries in Wikipedia
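The slide does not say which significance test selects the bigram "phrases". As a minimal sketch, assume each bookmark is an ordered list of tags, treat adjacent tag pairs as bigrams, and rank them with a Student's t-score. The function name `significant_bigrams`, the `min_count` cutoff, and the toy data are illustrative assumptions, not the authors' setup.

```python
import math
from collections import Counter

def significant_bigrams(bookmarks, top_k=5, min_count=2):
    """Rank adjacent tag pairs by a Student's t-score (an assumed test)."""
    unigrams, bigrams = Counter(), Counter()
    for tags in bookmarks:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    n = sum(unigrams.values())  # total number of tag occurrences
    scored = []
    for (a, b), c_ab in bigrams.items():
        if c_ab < min_count:  # drop pairs too rare to judge (assumed cutoff)
            continue
        observed = c_ab / n
        expected = (unigrams[a] / n) * (unigrams[b] / n)  # independence baseline
        t = (observed - expected) / math.sqrt(observed / n)
        scored.append((t, f"{a} {b}"))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]

# Toy bookmarks (hypothetical data).
bookmarks = [
    ["data", "mining", "tutorial"],
    ["data", "mining", "statistics"],
    ["statistics", "tutorial"],
    ["data", "mining"],
    ["ipod", "nano"],
]
candidates = significant_bigrams(bookmarks)
```

On this toy input only "data mining" occurs often enough to pass the cutoff. Wikipedia titles could simply be appended to the same candidate pool.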
Representation of Web Documents
• Multinomial distribution of words (unigram language models)
– Commonly used in retrieval and text mining
• Can be estimated from the content of the document, or from social bookmarks (our approach)
– What other people used to tag that document
Example p(w|d): text 0.16, mining 0.08, data 0.07, probabilistic 0.04, independence 0.03, model 0.03, …
Baseline: use the top words in that distribution to tag a document
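A minimal sketch of this representation, assuming the input is the flat list of tags that users assigned to one URL. The maximum-likelihood estimate and the names `document_language_model` and `baseline_tags` are illustrative choices; the slides do not specify the estimator.

```python
from collections import Counter

def document_language_model(tag_assignments):
    """Estimate p(w|d) as the relative frequency of each tag on the document."""
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def baseline_tags(p_w_d, k=3):
    """Baseline tagger: the k most probable words, ties broken alphabetically."""
    ranked = sorted(p_w_d.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Hypothetical tags that different users gave to the same URL.
tags_for_url = ["text", "mining", "text", "data", "text", "mining"]
p_w_d = document_language_model(tags_for_url)
top = baseline_tags(p_w_d, k=2)
```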
Tag Ranking: A Probabilistic Approach
• Web document d → a language model p(w|d)
• A candidate tag t → a language model p(w|t) from its co-occurring tags
• Score and rank t by the KL-divergence of these two language models

$$f(t, d) = -D(d \,\|\, t) = -\sum_w p(w|d) \log \frac{p(w|d)}{p(w|t)}$$

[Figure: the social bookmark collection C supplies the tag model p(w|t, C)]
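The scoring rule can be sketched directly, assuming single-word candidate tags and additive smoothing so that p(w|t) is never zero. The smoothing constant `alpha`, the helper names, and the toy bookmarks are assumptions; the slides do not describe how zero probabilities are handled.

```python
import math
from collections import Counter

def tag_language_model(tag, bookmarks, vocab, alpha=0.01):
    """p(w|t, C): tags co-occurring with `tag`, additively smoothed (assumption)."""
    counts = Counter()
    for tags in bookmarks:
        if tag in tags:
            counts.update(w for w in tags if w != tag)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def score(p_w_d, p_w_t):
    """f(t, d) = -D(d || t); every word of p_w_d must appear in p_w_t."""
    return -sum(p * math.log(p / p_w_t[w]) for w, p in p_w_d.items() if p > 0)

# Toy collection C and document model (hypothetical data).
bookmarks = [
    ["color", "palette", "design"],
    ["color", "palette", "adobe"],
    ["video", "youtube", "web2.0"],
    ["video", "youtube"],
]
vocab = sorted({w for tags in bookmarks for w in tags})
p_w_d = {"color": 0.5, "design": 0.3, "adobe": 0.2}
ranked = sorted(
    ["palette", "youtube"],
    key=lambda t: score(p_w_d, tag_language_model(t, bookmarks, vocab)),
    reverse=True,
)
```

Scores are never positive, and a tag whose co-occurrence profile matches the document scores closest to zero, so "palette" ranks above "youtube" here.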
Rewriting and Efficient Computation
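$$
\begin{aligned}
f(t, d) &= -\sum_w p(w|d) \log \frac{p(w|d)}{p(w|t)} \\
&= -\sum_w p(w|d) \log \frac{p(w|C)}{p(w|t, C)} - \sum_w p(w|d) \log \frac{p(w|d)}{p(w|C)} - \sum_w p(w|d) \log \frac{p(w|t, C)}{p(w|t)} \\
&= \sum_w p(w|d) \log \frac{p(w, t|C)}{p(w|C)\, p(t|C)} - D(d \,\|\, C) + \mathrm{Bias}(t, C)
\end{aligned}
$$

• The log ratio in the first term is PMI(w, t | C)
• D(d || C): bias of using C to represent document d (e.g., del.icio.us)
• Bias(t, C): bias of using C to represent candidate tag t

$$f(t, d) \overset{\text{rank}}{=} E_d\left[\mathrm{PMI}(w, t \mid C)\right]$$

1. Can be pre-computed from the corpus
2. Only store those PMI(w, t | C) > 0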
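A minimal sketch of the pre-computation, assuming probabilities are estimated from bookmark-level co-occurrence in C (each bookmark counts a tag at most once). The names `positive_pmi_table` and `expected_pmi` are hypothetical, and treating unstored pairs as contributing zero is an approximation that follows from storing only positive PMI values.

```python
import math
from collections import Counter
from itertools import permutations

def positive_pmi_table(bookmarks):
    """Pre-compute PMI(w, t | C) and keep only the positive entries."""
    n = len(bookmarks)
    single, pair = Counter(), Counter()
    for tags in bookmarks:
        unique = set(tags)
        single.update(unique)
        pair.update(permutations(unique, 2))  # ordered (w, t) pairs
    table = {}
    for (w, t), c in pair.items():
        # p(w, t|C) / (p(w|C) p(t|C)) with bookmark-level relative frequencies
        pmi = math.log(c * n / (single[w] * single[t]))
        if pmi > 0:
            table[(w, t)] = pmi
    return table

def expected_pmi(p_w_d, tag, table):
    """Rank-equivalent score E_d[PMI(w, t | C)]; unstored pairs count as 0."""
    return sum(p * table.get((w, tag), 0.0) for w, p in p_w_d.items())

# Toy collection C and document model (hypothetical data).
bookmarks = [
    ["color", "palette", "design"],
    ["color", "palette", "adobe"],
    ["video", "youtube", "web2.0"],
    ["video", "youtube"],
]
table = positive_pmi_table(bookmarks)
p_w_d = {"color": 0.5, "design": 0.3, "adobe": 0.2}
palette_score = expected_pmi(p_w_d, "palette", table)
youtube_score = expected_pmi(p_w_d, "youtube", table)
```

Because the table is built once from the corpus, scoring a new document costs only one lookup per word and candidate tag.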
Tagging Web Users
• Summarize the interests and bias of a user
• Web user → a pseudo document
• Estimate a language model from all tags that the user has used
• The rest is similar to web document tagging
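As a rough sketch, the pseudo document can be built by pooling every tag the user assigned across all of their bookmarks. The function name `user_language_model` and the sample data are assumptions for illustration.

```python
from collections import Counter

def user_language_model(user_bookmarks):
    """Pool all tags a user has used into one pseudo document and estimate p(w|u)."""
    counts = Counter(tag for tags in user_bookmarks for tag in tags)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical tagging history of one user: one tag list per bookmark.
user_bookmarks = [
    ["photography", "art"],
    ["photography", "portraits"],
    ["art", "photography"],
]
p_w_u = user_language_model(user_bookmarks)
```

The resulting distribution can be passed to the same ranking step used for web documents in place of p(w|d).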
Experiments
• Dataset:
– Two-week tagging records from Del.icio.us

| Time Span | Bookmarks | Distinct Tags | Distinct Users |
|---|---|---|---|
| 02/13/07 ~ 02/26/07 | 579,652 | 111,381 | 20,138 |

– Candidate tags:
  • Top 15,000 significant 2-grams from Del.icio.us
  • Titles of all Wikipedia entries (5,836,166 entries, around 48,000 appeared in Del.icio.us)
Tagging Web Documents
http://kuler.adobe.com/ (158 bookmarks)
– LM p(w|d): color, design, webdesign, tools, adobe, graphics, flash
– Tag = word: color, colour, palette, colorscheme, colours, picker, cor
– Tag = bigram: adobe color, color design, color colour, color colors, colour design, inspiration palette, webdesign color
– Tag = Wikipedia title: color, colour, palette, web color, colours, cor, rgb

http://www.youtube.com/watch?v=6gmP4nk0EOE (157 bookmarks)
– LM p(w|d): web2.0, video, youtube, web, internet, xml, community
– Tag = word: youtube, revver, vodcast, primer, comunidad, participation, ethnograpy
– Tag = bigram: xml youtube, web2.0 youtube, video web2.0, web2.0 xml, online presentation, social video, youtube video
– Tag = Wikipedia title: internet video, youtube, revver, research video, vodcast, primer, p2p TV

Annotations:
– LM p(w|d): too general, sometimes not relevant
– Tag = word: relevant, precise, but sometimes not meaningful
– Tag = bigram: meaningful, relevant, but overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real, but only partially covers good tags
Tagging Web Documents (Cont.)
http://pipes.yahoo.com (386 bookmarks)
– LM p(w|d): yahoo, rss, web2.0, mashup, feeds, programming, pipes
– Tag = word: pipes, feeds, yahoo, mashup, rss, syndication, mashups
– Tag = bigram: feeds mashups, mashup pipes, web2.0 yahoo, rss web2.0, mashup rss, api feeds, pipes programming
– Tag = Wikipedia title: pipes, yahoo, mashups, rss, syndication, mashups, blog feeds

http://www.miniajax.com/ (349 bookmarks)
– LM p(w|d): ajax, javascript, web2.0, webdesign, programming, code, webdev
– Tag = word: ajax, dhtml, javascript, moo.fx, dragdrop, phototype, autosuggest
– Tag = bigram: ajax code, code javascript, javascript ajax, javascript web2.0, css ajax, javascript programming
– Tag = Wikipedia title: ajax, dhtml, javascript, moo.fx, javascript library, javascript framework

Annotations:
– LM p(w|d): too general, sometimes not relevant
– Tag = word: relevant, precise, but sometimes not meaningful
– Tag = bigram: meaningful, relevant, but overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real
Tagging Web Users
User 1
– LM p(w|d): photography, art, portraits, tools, web, design, geek
– Tag = bigram: art photography, photography portraits, digital flickr, photoblog photography, art photo, flickr photography, weblog wordpress
– Tag = Wikipedia title: art photography, photoblog, portraits, photography, landscapes, flickr, art contest

User 2
– LM p(w|d): humor, programming, photography, blog, webdesign, security, funny
– Tag = bigram: geek hack, humor programming, hack hacking, networking programming, geek html, geek hacking, reference security
– Tag = Wikipedia title: network programming, tweak, hacking, security, geek humor, sysadmin, digitalcamera

Annotations:
– LM p(w|d): partially covers the interest
– Tag = bigram: overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real
Tagging Web Users (Cont.)
User 3
– LM p(w|d): games, arg, tools, programming, sudoku, cryptography, software
– Tag = bigram: arg games, games puzzles, games internet, arg code, games sudoku, code generator, community games
– Tag = Wikipedia title: arg, games research, games, puzzles, storytelling, code generator, community games

User 4
– LM p(w|d): web, reference, css, development, rubyonrails, tools, design
– Tag = bigram: rubyonrails web, css development, brower development, development editor, development forum, development firefox, javascript tools
– Tag = Wikipedia title: javascript, css, webdev, xhtml, dhtml, css3, dom

Annotation:
– Tag = Wikipedia title: missed many good tags
Discussions
• Using top tags: too general, sometimes not relevant
• Ranking tags by labeling language models:
– Candidate = social bookmarking words
  • Pros: relevant, compact
  • Cons: ambiguous, not so meaningful
– Candidate = social bookmarking bigrams
  • Pros: more meaningful, relevant
  • Cons: overfitting the data, sometimes not real phrases
– Candidate = Wikipedia titles
  • Pros: meaningful, relevant, real phrases
  • Cons: biased, missed potential good tags (Bias(t, C))
Summary
• Automatic tagging of web documents and web users
• A probabilistic approach based on labeling language models
• Effective when the candidate tags are of high quality
• Future work:
– A robust way of generating candidate tags
– Large-scale evaluation