Research © 2008 Yahoo!
Generating Succinct Titles for Web URLs
Kunal Punera
joint work with Deepayan Chakrabarti and Ravi Kumar
Yahoo! Research
Agenda
• Motivation
• Our Approach
• Comparison with Previous Work
• Experimental Results
Titles on Search Results Page
• HTML Titles
– Too long
– Can be missing
– Non-HTML results: pictures, video and audio clips
• Other Apps
– Site-map generation
Titles for “Quicklinks”
• Strict length restrictions
• Links displayed in context of home page
[Figure labels: Quicklink Titles; Homepage Context]
“Sources” of Information about URLs (URL: http://www.barackobama.com/issues/)
Source: Instances
• URL-Tokens: “barack obama issues”
• Web page content (HTMLTitle, KeyPhrases): “Barack Obama | Change We Can Believe In | Issues”; “Issues”, “Civil Rights”, “Defense”, “Economy”
• Anchor text on incoming links (IntrasiteAT, IntersiteAT, HomepageAT): “Issues”, “Economic Issues”; “Barack Obama’s Plan for America”
• Search engine queries (QueryView, QueryClick, QueryClickPos1): “obama issues”, “obama platform”, “obama campaign issues”, “barack obama platform”
• User generated tags (DeliciousTags): “obama campaign platform”, “cool”, “nice webpage”
Central Idea
Sources preferentially use words from the title and the context (if applicable) when constructing instances.
The degree of these preferences is source dependent.
Generation of Instances (URL: http://www.barackobama.com/issues/)
• Words drawn from: QuicklinkTitle, HomepageAbstract (Context), GeneralVocabulary
• QueryClick Source (0.5/0.4/0.1): “obama issues”, “obama campaign issues”, “barack obama platform”, “platform for obama campaign…”
• IntrasiteAT Source (0.8/0.1/0.1): “Issues”, “Foreign Policy”, “Economic Issues”, “Yes We Can”, …
• HTMLTitle Source (0.2/0.6/0.2): “Barack Obama | Change We Can Believe In | Issues”
• …
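The generation step can be illustrated with a minimal sketch. It assumes each triple is read as (title, context, general vocabulary) preferences, that each word of an instance is drawn independently, and that words within a pool are equally likely; none of these details are stated on the slide. The function name `generate_instance` and the toy word lists are hypothetical.

```python
import random

def generate_instance(title, context, vocabulary, params, length, rng):
    """Draw `length` words: pick a pool by the source's preferences, then a word from it."""
    pools = [title, context, vocabulary]
    words = []
    for _ in range(length):
        pool = rng.choices(pools, weights=params, k=1)[0]
        words.append(rng.choice(pool))
    return words

# Toy pools (assumed for illustration only).
title = ["issues"]
context = ["barack", "obama", "change", "campaign"]
vocabulary = ["platform", "for", "plan", "policy"]

# Triples from the slide, read here as (title, context, general) preferences.
source_params = {
    "QueryClick": (0.5, 0.4, 0.1),
    "IntrasiteAT": (0.8, 0.1, 0.1),
    "HTMLTitle": (0.2, 0.6, 0.2),
}

rng = random.Random(0)
samples = {
    name: generate_instance(title, context, vocabulary, params, 3, rng)
    for name, params in source_params.items()
}
for name, words in samples.items():
    print(name, words)
```

Under this reading, a source with a high first value, such as IntrasiteAT, mostly repeats title words, while HTMLTitle leans on context words.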
Learning Source Generation Parameters (URL: http://www.barackobama.com/issues/)
• GIVEN: QuicklinkTitle, HomepageAbstract (Context), GeneralVocabulary, and the source instances
– QueryClick Source: “obama issues”, “obama platform”, “obama campaign issues”, “barack obama platform”, …
– IntrasiteAT Source: “Issues”, “Foreign Policy”, “Economic Issues”, “Yes We Can”, …
– HTMLTitle Source: “Barack Obama | Change We Can Believe In | Issues”
• UNKNOWN: the generation parameters of each source (--/--/--)
• Learn parameter values that maximize the probability of generating the instances
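One standard way to maximize this kind of likelihood is expectation-maximization over the mixing weights. The sketch below is an illustration under assumptions, not the estimation procedure of the talk: it treats each pool as a uniform distribution over its words, fits one source at a time, and requires every observed word to appear in the general vocabulary. The names `word_prob`, `log_likelihood`, and `learn_params` are hypothetical.

```python
import math

def word_prob(word, pool):
    """Uniform probability of `word` within a pool of words (0.0 if absent)."""
    return pool.count(word) / len(pool) if pool else 0.0

def log_likelihood(words, pools, params):
    """Log-probability of the observed words under the mixture of pools."""
    return sum(
        math.log(sum(a * word_prob(w, pool) for a, pool in zip(params, pools)))
        for w in words
    )

def learn_params(words, pools, iterations=50):
    """EM for the mixing weights of one source; the pools themselves stay fixed."""
    k = len(pools)
    params = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for w in words:
            # E-step: share of this word attributed to each pool.
            joint = [a * word_prob(w, pool) for a, pool in zip(params, pools)]
            norm = sum(joint)
            for i in range(k):
                totals[i] += joint[i] / norm
        # M-step: new weights are the average attributions.
        params = [t / len(words) for t in totals]
    return params

# Toy data (assumed): words of the IntrasiteAT anchor texts for one URL.
title = ["issues"]
context = ["barack", "obama", "campaign"]
vocabulary = ["issues", "barack", "obama", "campaign",
              "platform", "economic", "foreign", "policy"]
pools = [title, context, vocabulary]
anchor_words = ["issues", "foreign", "policy", "economic", "issues"]

anchor_params = learn_params(anchor_words, pools)
print(anchor_params)
```

In this toy run the context weight drops to zero, because none of the anchor words occur in the context pool.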
Finding Best Quicklink Title (URL: http://www.barackobama.com/issues/)
• UNKNOWN: QuicklinkTitle
• GIVEN: HomepageAbstract (Context), GeneralVocabulary, and the source instances
– QueryClick Source: “obama issues”, “obama platform”, “obama campaign issues”, “barack obama platform”, …
– IntrasiteAT Source: “Issues”, “Foreign Policy”, “Economic Issues”, “Yes We Can”, …
– HTMLTitle Source: “Barack Obama | Change We Can Believe In | Issues”
• LEARNT: the generation parameters of each source (0.5/0.4/0.1, 0.8/0.1/0.1, 0.2/0.6/0.2)
• Select the title for which the probability of generating the instances is maximum
Objective Function
$$\lambda_{len} \cdot \log P(len) + \sum_{s \in sources} \lambda_s \cdot \frac{1}{\#\text{ instances in } s} \sum_{w \in instances(s)} \log P_s(w \mid title, context)$$
• Source specific Normalization (the factor $1 / \#\text{ instances in } s$)
– Sources have different numbers of instances: QueryClick vs. HTMLTitle
• Source specific Weights (written here as $\lambda_{len}$ and $\lambda_s$)
– Sources are associated with the target web object to different degrees: QueryClick vs. QueryView
– Comments on YouTube etc.
– Can account for dependent sources
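A small sketch of how such an objective could be evaluated and used to pick a title. Everything concrete here is an assumption made for illustration: the uniform word pools, the geometric length prior in `length_log_prob`, the source weights of 1.0, and the toy candidates. Only the overall shape (a weighted length term plus a weighted, per-source normalized sum of log-probabilities) follows the slide.

```python
import math

def word_prob(word, pool):
    """Uniform probability of `word` within a pool of words (0.0 if absent)."""
    return pool.count(word) / len(pool) if pool else 0.0

def instance_log_prob(words, title, context, vocabulary, params):
    """Log-probability of one instance under a three-way word mixture."""
    pools = (title, context, vocabulary)
    return sum(
        math.log(sum(a * word_prob(w, pool) for a, pool in zip(params, pools)))
        for w in words
    )

def length_log_prob(n, q=0.5):
    """Hypothetical geometric length prior: P(n) = (1 - q) * q**(n - 1)."""
    return math.log(1.0 - q) + (n - 1) * math.log(q)

def objective(title, context, vocabulary, sources, length_weight=1.0):
    """sources maps name -> (weight, generation params, list of instances)."""
    score = length_weight * length_log_prob(len(title))
    for weight, params, instances in sources.values():
        total = sum(
            instance_log_prob(inst, title, context, vocabulary, params)
            for inst in instances
        )
        score += weight * total / len(instances)  # source specific normalization
    return score

context = ["barack", "obama", "change", "we", "can", "believe", "in"]
vocabulary = ["obama", "issues", "platform", "campaign", "barack", "economic",
              "foreign", "policy", "change", "we", "can", "believe", "in"]
sources = {  # weights of 1.0 are placeholders, not learnt values
    "QueryClick": (1.0, (0.5, 0.4, 0.1), [
        ["obama", "issues"], ["obama", "platform"],
        ["obama", "campaign", "issues"], ["barack", "obama", "platform"]]),
    "IntrasiteAT": (1.0, (0.8, 0.1, 0.1), [
        ["issues"], ["foreign", "policy"], ["economic", "issues"]]),
    "HTMLTitle": (1.0, (0.2, 0.6, 0.2), [
        ["barack", "obama", "change", "we", "can", "believe", "in", "issues"]]),
}
candidates = [["issues"], ["obama", "issues"], ["economic", "issues"], ["platform"]]

best = max(candidates, key=lambda t: objective(t, context, vocabulary, sources))
print(best)
```

Because each source's sum is divided by its number of instances, a source with many instances (such as QueryClick) does not dominate a source with a single instance (such as HTMLTitle) merely by volume.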
Learning Source Weights
• With known source generation parameters, the objective is a linear function of the source weights
• We learn weights that rank the candidate titles correctly
– We use the linear ranking SVM described in Joachims, “Optimizing search engines using clickthrough data”, KDD 2002
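To make the ranking formulation concrete, here is a simplified stand-in for a linear ranking SVM: subgradient descent on a pairwise hinge loss with L2 regularization. It is not the solver from the cited paper, and the feature vectors are made-up numbers. Each vector is assumed to hold the per-term values of the objective for one candidate title (for example, the length term and one normalized source log-likelihood), and each pair lists a preferred candidate followed by a less preferred one.

```python
def train_ranking_weights(pairs, dim, epochs=2000, lr=0.01, reg=0.01):
    """Learn w so that w . f(better) exceeds w . f(worse) by a margin of 1."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(dim):
                # Hinge subgradient is active only when the margin is violated.
                grad = reg * w[i] - (diff[i] if margin < 1.0 else 0.0)
                w[i] -= lr * grad
    return w

def score(w, features):
    """Linear score of one candidate title."""
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical training pairs: (features of better title, features of worse title).
pairs = [
    ([-1.0, -2.0], [-3.0, -2.5]),
    ([-0.5, -3.0], [-2.0, -2.5]),
    ([-1.0, -1.0], [-1.5, -2.0]),
]

weights = train_ranking_weights(pairs, dim=2)
print(weights)
```

The learnt weights can then be plugged in as the source specific weights when scoring candidate titles for new URLs.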
Where do Title Candidates come from?
• Instances of some sources of information
• Not all sources used
– Ungrammatical (URL-Tokens)
– Misspellings (QueryView)
– Sometimes irrelevant (DeliciousTags)
• We clean some instances to obtain more candidates
– Removing website name
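As an illustration of the cleaning step, the sketch below splits an instance on common title separators and drops the segment that equals the website name. The slide only says that the website name is removed; the separator pattern, the function name `clean_instance`, and the assumption that the site name is known in advance are all hypothetical.

```python
import re

def clean_instance(instance, site_name):
    """Split on "|" or a spaced hyphen and drop segments equal to the site name."""
    segments = re.split(r"\s*\|\s*|\s+-\s+", instance)
    return [
        seg.strip()
        for seg in segments
        if seg.strip() and seg.strip().lower() != site_name.lower()
    ]

cleaned = clean_instance(
    "Barack Obama | Change We Can Believe In | Issues", "Barack Obama"
)
print(cleaned)  # ['Change We Can Believe In', 'Issues']
```

Each remaining segment can be added to the pool of title candidates alongside the original instance.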
Comparisons with Previous Work
• Our title generation is an “extractive” approach
– Avoids modeling grammatical correctness of titles
• Only learn parameters at the source level
– Less training data needed
• Combine information from external sources
– Can obtain titles for objects with no text content
• Respect constraints placed by context of title use
BMW: Banko et al., Headline Generation based on Statistical Translation, ACL 2000
• Rank headline candidates using 3 factors
– Likelihood of seeing candidate words in a title
– Likelihood of most likely sequence of the words in candidate
– Likelihood of length of candidate
• Lots of parameters
– to model word being in title
– to model bi-grams
– to combine the above 3 factors
Empirical Evaluation
• Two Tasks
– Generating Quicklink titles (manually judged data)
– Generating Web Page Titles
• Metrics
– F-measure, Jaccard, Exact Match, Longest Common Subsequence
• Baselines
– Sources of information our system uses
– BMW: Banko et al., ACL 2000
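The listed metrics can be sketched as follows. The slides do not give exact definitions, so this assumes word-level comparison after lowercasing, set-based overlap for F-measure and Jaccard, and longest common subsequence length measured in words; the function names are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokenization (an assumed preprocessing step)."""
    return text.lower().split()

def f_measure(candidate, reference):
    """Harmonic mean of word precision and recall."""
    c, r = set(tokens(candidate)), set(tokens(reference))
    common = len(c & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(c), common / len(r)
    return 2 * precision * recall / (precision + recall)

def jaccard(candidate, reference):
    """Size of the word intersection over the size of the word union."""
    c, r = set(tokens(candidate)), set(tokens(reference))
    return len(c & r) / len(c | r) if c | r else 0.0

def exact_match(candidate, reference):
    """1.0 if the token sequences are identical, else 0.0."""
    return 1.0 if tokens(candidate) == tokens(reference) else 0.0

def lcs_length(candidate, reference):
    """Length in words of the longest common subsequence (dynamic programming)."""
    c, r = tokens(candidate), tokens(reference)
    table = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            if cw == rw:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(c)][len(r)]

print(f_measure("obama issues", "Issues"))   # 2/3
print(jaccard("obama issues", "Issues"))     # 0.5
print(exact_match("Issues", "issues"))       # 1.0
print(lcs_length("economic issues", "obama economic policy issues"))  # 2
```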
Quicklinks Title Task
Approach       F-measure   Jaccard   Exact Match
Our Approach   0.81        0.75      0.63
HomepageAT     0.70        0.66      0.58
IntrasiteAT    0.43        0.41      0.35
IntersiteAT    0.36        0.32      0.25
HTMLTitle      0.37        0.27      0.05
KeyPhrases     0.25        0.19      0.07
• HomepageAT is a very competitive baseline
• IntrasiteAT better than IntersiteAT
• Our system’s performance approaches inter-judge agreement values
Quicklinks Title Task: Learning Rates
• Very few datapoints needed
– Learning parameters at source level helps
Quicklinks Title Task: Source Weights
• Having Source weights and normalization helps
Web Page Title Task
Approach       F-measure   Jaccard   LCS
Our Approach   0.53        0.41      3.44
HomepageAT     0.45        0.34      2.7
KeyPhrases     0.41        0.31      2.54
QueryClick     0.31        0.23      2.1
IntersiteAT    0.29        0.21      1.8
BMW            0.12        0.10      --
• Our approach beats the competition
– BMW not suited to this task
– Often page text doesn’t describe page well
• HomepageAT surprisingly effective
Conclusions
• Our approach combines various sources of information to select titles
• It selects titles that respect constraints of length and context
• We empirically showed the effectiveness of our approach
• Future Work
– Deeper language features in selecting titles
– Uniform quicklinks titles across websites
– Contexts of different types