
Transcript of getting_rid_of_duplicate_content_iss-priyank_garg.ppt

Page 1

Content Duplication

Priyank Garg

Yahoo! Web Search

November 13, 2008

Page 2

Content Duplication - Outline

• Where Yahoo! Search eliminates duplication
• Why should you, the webmaster, care?
• Reasons to preserve dupes
• Sources of duplication
• The abusive fringe
• What should you do?

Page 3

Where does Y! Search eliminate dupes?

A: At every point in the pipeline, but as much as possible at query-time

• Crawl-time filtering
– Less likely to extract links from known duplicate pages
– Less likely to crawl new docs from duplicative sites

• Index-time filtering
– Less representation from dupes when choosing crawled pages to put in the index

• Query-time dup elimination
– Limits on URLs per host per SRP (search results page), plus domain restrictions
– Filtering of similar documents

• Duplication doesn’t have to be exact
– Approximate page-level dupes, site mirrors
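
The deck doesn’t say how the approximate matching works internally. One classic textbook approach, shown here purely as an illustration and not as Yahoo!’s actual algorithm, is w-shingling with Jaccard similarity: two pages whose sets of overlapping word windows are mostly shared get treated as near-dupes.

def shingles(text, w=4):
    """Return the set of w-word shingles (overlapping word windows) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

page_a = "Buy low, sell high! Don't leave your home on the market for too long."
page_b = "Buy low, sell high! Don't leave your house on the market for too long."
# A single changed word perturbs only the shingles that cover it;
# identical pages score 1.0, and pages above a threshold are folded together.
print(round(jaccard(shingles(page_a), shingles(page_b)), 2))  # prints 0.47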

Page 4

Why should you, the webmaster, care?

• For each site, we allocate crawl and index resources based on many aspects:
– site importance
– content quality and uniqueness
– overall unique additional value to search users

• Let’s say you have a recipe site with 15k good recipes

http://recipe-site.com/yourbasicporkchop.html

http://recipe-site.com/lambvindaloo.html

http://recipe-site.com/sausagegumbo.html

http://recipe-site.com/spicyvegansurprise.html

http://recipe-site.com/whateversinthefridgeplusoregano.html [..]

Page 5

Why should you, the webmaster, care?

• But unfortunately, the URLs Slurp finds to fill your entire crawl quota all look like this:

http://recipe-site.com/yourbasicporkchop.html?sessid=aba89s
http://recipe-site.com/yourbasicporkchop.html?sessid=acc90x
http://recipe-site.com/yourbasicporkchop.html?sessid=aff23f
http://recipe-site.com/yourbasicporkchop.html?sessid=ccr33a

[....]

The upshot may be that only one page of unique content survives, even though we would have taken more

• Note that all the duplicate pages would probably get filtered later in the pipeline, wasting your chances to get referrals
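
One way to picture the waste: strip the session parameter and the whole set collapses. A minimal Python sketch, assuming the hypothetical sessid parameter from the example above (Slurp’s real handling is not public):

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def canonicalize(url, junk_params=("sessid",)):
    """Drop known session/tracking parameters from a URL."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in junk_params]
    return urlunparse(parts._replace(query=urlencode(query)))

variants = [
    "http://recipe-site.com/yourbasicporkchop.html?sessid=aba89s",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=acc90x",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=aff23f",
    "http://recipe-site.com/yourbasicporkchop.html?sessid=ccr33a",
]
print({canonicalize(u) for u in variants})  # all four collapse to one URL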

Page 6

Accidental “duplication”

• Session IDs in URLs
– Remember, to engines a URL is a URL is a URL....

– Two URLs referring to the same doc look like dupes

– We can sort this out, but it may inhibit crawling

– Embedding session IDs in non-dynamic URLs doesn’t change the fundamental problem

• http://yoursite/yourpage/sessid489/a.html is still a dup of http://yoursite/yourpage/sessid524/a.html

• Soft 404s
– “Not found” error pages should return a 404 HTTP status code when crawled.
– If not, we can crawl many copies of the same “not found” page

• Not considered abusive, but it can still hamper our ability to display your content.
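
To make the soft-404 point concrete, here is a minimal WSGI sketch; the PAGES dict is a hypothetical stand-in for a real content store. The key line is the status code: an unknown path must get “404 Not Found”, not a friendly error page served with “200 OK”.

from wsgiref.simple_server import make_server

PAGES = {"/yourbasicporkchop.html": b"<h1>Your Basic Pork Chop</h1>"}

def app(environ, start_response):
    body = PAGES.get(environ["PATH_INFO"])
    if body is None:
        # Real 404 status: crawlers drop the URL instead of indexing
        # many copies of the same "not found" page.
        start_response("404 Not Found", [("Content-Type", "text/html")])
        return [b"<h1>Page not found</h1>"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]

# make_server("", 8000, app).serve_forever()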

Page 7

Dodgy duplication

• Replicating content across multiple domains unnecessarily

• “Aggregation” of content found elsewhere on the web
– Ownership questions?
– Is there value added in the aggregation?
– Search engines are themselves aggregators, but shouldn’t necessarily point to other aggregations or search results pages

• Identical content repeated with minimal value added
– How much of the page is duplicated? Is what is new worth anything?
– May be handled by dup detection algorithms (if you’re OK with that)
– Particularly an issue with regionally-targeted content

Page 8

Dodgy duplication, cont.

When repeated elements dominate, approximate dupes may be (appropriately) filtered out.

Real estate advice for FLORIDA homeowners:
Buy low, sell high! Don’t leave your home on the market for too long. Consider being your own agent! Price your home to move.

Real estate advice for TENNESSEE homeowners:
Buy low, sell high! Don’t leave your home on the market for too long. Consider being your own agent! Price your home to move.

Real estate advice for MONTANA homeowners:
Buy low, sell high! Don’t leave your home on the market for too long. Consider being your own agent! Price your home to move.
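
How dominated are these pages by the repeated template? Comparing the word sets of two of them (a crude stand-in for a real duplicate detector) shows the overlap:

def jaccard_words(a, b):
    """Fraction of unique words shared between two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

template = ("Real estate advice for {state} homeowners: Buy low, sell high! "
            "Don't leave your home on the market for too long. "
            "Consider being your own agent! Price your home to move.")

fl = template.format(state="FLORIDA")
tn = template.format(state="TENNESSEE")
print(round(jaccard_words(fl, tn), 2))  # 0.93: only the state name differs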

Page 9

The abusive fringe

• Scraper spammers
– Other people’s content + their ads, in bulk

• Weaving/stitching
– Mix-and-match content (at the phrase, sentence, paragraph, section level) from different sources
– Often an attempt to defeat duplicate detection

• Bulk cross-domain duplication
– Often an attempt to get around hosts-per-SRP limits

• Bulk duplication with small changes
– Often an attempt to defeat duplicate detection

All of the above are outside our content guidelines, and may lead to unanticipated results for publishers.

Page 10

What should you do?

• Avoid bulk duplication of underlying documents
– If versions differ only in small ways, does the search engine need all of them?
– Use robots.txt to hide duplicate parts of the site (say, print versions of pages)
– Use 301s to redirect dups to the original, as in the sketch below
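
Two sketches of those remedies; all paths and filenames here are hypothetical. First a robots.txt rule that keeps print versions out of the crawl, then an Apache mod_alias directive that 301s a duplicate to the original:

# robots.txt: keep print-only duplicates out of the crawl
User-agent: *
Disallow: /print/

# .htaccess: permanently redirect a duplicate URL to the original
Redirect 301 /print/yourbasicporkchop.html http://recipe-site.com/yourbasicporkchop.html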

• Avoid accidental proliferation of many URLs for the same documents
– Session IDs, soft 404s, etc.
– Not abusive by our guidelines, but they impair effective crawling

• Avoid duplication of sites across many domains

• When importing content from elsewhere, ask:
– Do you own it (or have rights to it)?
– Are you adding value, or just duplicating?

Page 11

Tools from Yahoo!

• Yahoo! Slurp supports wildcards in robots.txt: http://www.ysearchblog.com/archives/000372.html
– Makes it easy to mark out areas of a site not to be crawled and indexed
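
A sketch of what such a file might look like; the paths and the sessid parameter here are hypothetical examples, not required names:

# robots.txt: wildcard rules for Slurp
User-agent: Slurp
# any URL carrying a session ID
Disallow: /*?sessid=
# print versions anywhere in the site
Disallow: /*/print/
# '$' anchors the pattern at the end of the URL
Disallow: /*.pdf$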

• Site Explorer allows deleting a URL or an entire path from the index for authenticated sites: http://www.ysearchblog.com/archives/000400.html

• Use the robots-nocontent tag on non-relevant parts of a page: http://www.ysearchblog.com/archives/000444.html
– Can be used to mark out boilerplate content
– Or syndicated content that may be useful in context for the user but not for search engines
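
The tag is applied as a class value on the markup that crawlers should skip when ranking and building abstracts. A minimal HTML sketch (the content is invented for illustration):

<!-- boilerplate repeated on every page: mark it robots-nocontent -->
<div class="robots-nocontent">
  Syndicated weather widget and site-wide legal disclaimer ...
</div>
<!-- the unique content of the page stays unmarked -->
<p>Lamb vindaloo: start by toasting the whole spices ...</p>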

Page 12

Tools from Yahoo! (contd.)

• Dynamic URL Rewriting in Site Explorer: http://www.ysearchblog.com/archives/000479.html
– Ability to indicate a parameter to remove from URLs across the site
– More efficient crawl, with fewer duplicate URLs
– Better site coverage, as fewer resources are wasted on duplicates
– More unique content discovered for the same crawl
– Fewer risks of crawler traps
– Cleaner URLs, easier for users to read and more likely to be clicked
– Better ranking due to reduced link-juice fragmentation
– Some sites have had 5M URLs cleaned with a single rule!
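
The scale claim in the last sub-bullet is easy to picture: one rule that strips one parameter collapses an arbitrarily large family of URL variants. A toy sketch, reusing the same hypothetical sessid parameter as earlier:

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def strip_param(url, param="sessid"):
    """Apply a single 'remove this parameter' rewrite rule to one URL."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunparse(parts._replace(query=urlencode(query)))

frontier = [f"http://recipe-site.com/lambvindaloo.html?sessid={i}" for i in range(100000)]
print(len(set(frontier)), "->", len({strip_param(u) for u in frontier}))  # 100000 -> 1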

Page 13

Check out

Site Explorer from Yahoo! Search

http://siteexplorer.search.yahoo.com

Yahoo! Search Blog

http://ysearchblog.com

Page 14

Search Information and Contacts

Site Explorer: http://siteexplorer.search.yahoo.com/

Search Happenings and People: http://www.ysearchblog.com/

Search Information, Guidelines and FAQ: http://help.yahoo.com/help/us/ysearch/

Search content guidelines: http://help.yahoo.com/help/us/ysearch/basics/basics-18.html

To report spam: http://add.yahoo.com/fast/help/us/ysearch/cgi_reportsearchspam

To check on a site, or to get a site reviewed: http://add.yahoo.com/fast/help/us/ysearch/cgi_urlstatus

Search Support: http://add.yahoo.com/fast/help/us/ysearch/cgi_feedback