Content Duplication
Priyank Garg
Yahoo! Web Search
November 13, 2008
Yahoo! Confidential
Content Duplication - Outline
• Where Yahoo! Search eliminates duplication
• Why should you, the webmaster, care?
• Reasons to preserve dupes
• Sources of duplication
• The abusive fringe
• What should you do?
Where does Y! Search eliminate dupes?
A: At every point in the pipeline, but as much as possible at query-time
• Crawl-time filtering
– Less likely to extract links from known duplicate pages
– Less likely to crawl new docs from duplicative sites
• Index-time filtering
– Less representation from dupes when choosing crawled pages to put in the index
• Query-time dup elimination
– Limits on URLs/host per SRP, domain restrictions as well
– Filtering of similar documents
• Duplication doesn’t have to be exact
– Approximate page-level dupes, site mirrors
Why should you, the webmaster, care?
• For each site, we allocate certain crawl and index resources based on many aspects:
– site importance
– content quality and uniqueness
– overall unique additional value to search users
• Let’s say you have a recipe site with 15k good recipes
http://recipe-site.com/yourbasicporkchop.html
http://recipe-site.com/lambvindaloo.html
http://recipe-site.com/sausagegumbo.html
http://recipe-site.com/spicyvegansurprise.html
http://recipe-site.com/whateversinthefridgeplusoregano.html [..]
Why should you, the webmaster, care?
• But unfortunately, all the pages Slurp found within your crawl quota look like this:
http://recipe-site.com/yourbasicporkchop.html?sessid=aba89s
http://recipe-site.com/yourbasicporkchop.html?sessid=acc90x
http://recipe-site.com/yourbasicporkchop.html?sessid=aff23f
http://recipe-site.com/yourbasicporkchop.html?sessid=ccr33a
[....]
The upshot may be that only one page of (unique) content survives, though we would have taken more
• Note that all the duplicate pages would probably get filtered later in the pipeline, wasting your chances to get referrals
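To see how much quota the session IDs waste, here is a minimal Python sketch: the `sessid` parameter and URLs come from the example above, while `strip_param` is an illustrative helper (not a Yahoo! tool) that collapses the crawled URLs to their underlying document.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, param):
    """Drop a single query parameter (e.g. a session ID) from a URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))

crawled = [
    "http://recipe-site.com/yourbasicporkchop.html?sessid=aba89s",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=acc90x",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=aff23f",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=ccr33a",
]
# Four crawled URLs collapse to a single unique document.
unique = {strip_param(u, "sessid") for u in crawled}
print(len(crawled), "crawled URLs ->", len(unique), "unique document")
```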
Accidental “duplication”
• Session IDs in URLs
– Remember, to engines a URL is a URL is a URL....
– Two URLs referring to the same doc look like dupes
– We can sort this out, but it may inhibit crawling
– Embedding session IDs in non-dynamic URLs doesn’t change the fundamental problem
• http://yoursite/yourpage/sessid489/a.html is still a dup of http://yoursite/yourpage/sessid524/a.html
• Soft 404s
– “Not found” error pages should return a 404 HTTP status code when crawled
– If not, we can crawl many copies of the same “not found” page
• Not considered abusive, but it can still hamper our ability to display your content
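A hedged sketch of the right behavior — the page store and `handle` function are hypothetical stand-ins for a real server, but the status codes are the point:

```python
# Hypothetical page store; on a real site this would be your CMS lookup.
PAGES = {"/lambvindaloo.html": "<html>Lamb vindaloo recipe</html>"}

def handle(path):
    """Return (HTTP status, body) for a request path.

    A missing page must get a real 404. Returning 200 with a
    'not found' message is the soft-404 anti-pattern: every bad URL
    then looks like a distinct, crawlable (duplicate) page.
    """
    if path in PAGES:
        return 200, PAGES[path]
    return 404, "<html>Not found</html>"

print(handle("/lambvindaloo.html")[0])    # 200
print(handle("/no-such-recipe.html")[0])  # 404
```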
Dodgy duplication
• Replicating content across multiple domains unnecessarily
• “Aggregation” of content found elsewhere on the web
– Ownership questions?
– Is there value added in the aggregation?
– Search engines are themselves aggregators, but shouldn’t necessarily point to other aggregations or search results pages
• Identical content repeated with minimal value added
– How much of the page is duplicated? Is what is new worth anything?
– May be handled by dup detection algorithms (if you’re OK with that)
– Particularly an issue with regionally-targeted content
Dodgy duplication, cont.
When repeated elements dominate, approximate dupes may be (appropriately) filtered out.
Real estate advice for FLORIDA homeowners:
Buy low, sell high! Don’t leave your home on the market for too long. Consider being your own agent! Price your home to move.

Real estate advice for TENNESSEE homeowners:
Buy low, sell high! Don’t leave your home on the market for too long. Consider being your own agent! Price your home to move.

Real estate advice for MONTANA homeowners:
Buy low, sell high! Don’t leave your home on the market for too long. Consider being your own agent! Price your home to move.
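Filtering like this can be approximated with word shingles and Jaccard similarity — a standard near-duplicate technique, not necessarily the algorithm Yahoo! uses. The template text is from the slide; `shingles` and `jaccard` are illustrative helpers:

```python
def shingles(text, k=3):
    """Set of k-word shingles from lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity (shared fraction) of two shingle sets."""
    return len(a & b) / len(a | b)

template = ("Real estate advice for {} homeowners: Buy low, sell high! "
            "Don't leave your home on the market for too long. "
            "Consider being your own agent! Price your home to move.")

florida = shingles(template.format("FLORIDA"))
montana = shingles(template.format("MONTANA"))
# Only the state name differs, so nearly all shingles are shared and
# the pages score well above a typical near-dup threshold.
print(round(jaccard(florida, montana), 2))
```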
The abusive fringe
• Scraper spammers
– Other people’s content + their ads, in bulk
• Weaving/stitching
– Mix-and-match content (at the phrase, sentence, paragraph, or section level) from different sources
– Often an attempt to defeat duplicate detection
• Bulk cross-domain duplication
– Often an attempt to get around hosts-per-SRP limits
• Bulk duplication with small changes
– Often an attempt to defeat duplicate detection
All of the above are outside our content guidelines, and may lead to unanticipated results for publishers.
What should you do?
• Avoid bulk duplication of underlying documents
– If there are only small variations, does a search engine need all versions?
– Use robots.txt to hide parts of the site that are duplicates (say, print versions of pages)
– Use 301s to redirect dups to the original
• Avoid accidental proliferation of many URLs for the same documents
– Session IDs, soft 404s, etc.
– Not abusive by our guidelines, but they impair effective crawling
• Avoid duplication of sites across many domains
• When importing content from elsewhere, ask:
– Do you own it (or have rights to it)?
– Are you adding value in addition, or just duplicating?
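The 301 advice above might be sketched like this — the duplicate-to-canonical mapping and the `respond` helper are illustrative; a real site would configure this in its web server or framework:

```python
# Illustrative map from duplicate URLs (e.g. print versions) to canonicals.
CANONICAL = {
    "/print/lambvindaloo.html": "/lambvindaloo.html",
}

def respond(path):
    """Return (status, Location header or None) for a request path."""
    if path in CANONICAL:
        # A 301 tells crawlers the dup permanently lives at the
        # original URL, so indexing signals consolidate there.
        return 301, CANONICAL[path]
    return 200, None

print(respond("/print/lambvindaloo.html"))  # (301, '/lambvindaloo.html')
```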
Tools from Yahoo!
• Yahoo! Slurp supports wildcards in robots.txt: http://www.ysearchblog.com/archives/000372.html
– Makes it easy to mark out areas of sites not to be crawled and indexed
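Assuming the `*` wildcard syntax Slurp supports, a robots.txt along these lines (the paths and parameter name are illustrative) could keep session-ID and print-version URLs out of the crawl:

```
User-agent: Slurp
# Block any URL carrying a session-ID query parameter
Disallow: /*?sessid=
# Block print versions of pages
Disallow: /print/
```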
• Site Explorer allows deleting a URL or an entire path from the index for authenticated sites: http://www.ysearchblog.com/archives/000400.html
• Use the robots-nocontent tag on non-relevant parts of a page: http://www.ysearchblog.com/archives/000444.html
– Can be used to mark out boilerplate content
– Or syndicated content that may be useful in context for the user but not for search engines
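The robots-nocontent marker is applied as a class attribute on an HTML element; a minimal illustration (the element contents are hypothetical):

```html
<div class="robots-nocontent">
  <!-- Boilerplate or syndicated block: shown to users,
       but excluded from what Yahoo! indexes for this page -->
  Syndicated weather widget, legal footer, etc.
</div>
```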
Tools from Yahoo! (contd.)
• Dynamic URL Rewriting in Site Explorer: http://www.ysearchblog.com/archives/000479.html
– Ability to indicate a parameter to remove from URLs across the site
– More efficient crawl, with fewer duplicate URLs
– Better site coverage, as fewer resources are wasted on duplicates
– More unique content discovered for the same crawl
– Fewer risks of crawler traps
– Cleaner URLs, easier for users to read and more likely to be clicked
– Better ranking due to reduced link-juice fragmentation
– Some sites have had 5M URLs cleaned with a single rule!
Check out
Site Explorer from Yahoo! Search
http://siteexplorer.search.yahoo.com
Yahoo! Search Blog
http://ysearchblog.com
Search Information and Contacts
Site Explorer:
http://siteexplorer.search.yahoo.com/
Search Happenings and People:
http://www.ysearchblog.com/
Search Information, Guidelines and FAQ:
http://help.yahoo.com/help/us/ysearch/
Search content guidelines:
http://help.yahoo.com/help/us/ysearch/basics/basics-18.html
To report spam:
http://add.yahoo.com/fast/help/us/ysearch/cgi_reportsearchspam
To check on a site, or to get a site reviewed:
http://add.yahoo.com/fast/help/us/ysearch/cgi_urlstatus
Search Support:
http://add.yahoo.com/fast/help/us/ysearch/cgi_feedback