
Building Googlebot

Youngjin Kim
October 15, 2013


From the web to your query

● Query processing (see the sketch below)
  1. Look up keywords in the index => every relevant page
  2. Rank pages and display the result

● Google's index of the web
  keyword => { page1, page2, ... }

● Building the index requires processing the current version of all of the pages on the web...
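A minimal sketch of the lookup-then-rank flow, with a toy index and hypothetical page names (real ranking uses many signals; the alphabetical sort here is only a placeholder):

```python
# Toy inverted index: keyword => set of pages containing it (hypothetical data).
index = {
    "crawler": {"page1", "page3"},
    "google": {"page1", "page2", "page3"},
}

def search(keywords, index):
    """1. Look up every keyword and intersect the page sets.
    2. "Rank" the surviving pages; sorting stands in for real ranking."""
    relevant = None
    for kw in keywords:
        pages = index.get(kw, set())
        relevant = pages if relevant is None else relevant & pages
    return sorted(relevant or [])

print(search(["google", "crawler"], index))  # ['page1', 'page3']
```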

All of the pages on the web!?!

60 Trillion Pages And Counting!

Our local copy of the web

● Crawling
  ○ Googlebot

● Storage
  ○ Google File System (GFS), BigTable

● Processing
  ○ MapReduce

● Data centers
  ○ Job control, fault tolerance, high-speed networking, power/cooling, etc.

Finding every page with Googlebot

● Basic discovery crawl (see the sketch below)
  1. Start with the set of known links
  2. Crawl every link (pages change!)
  3. Extract every new link, repeat

[Diagram: the crawl loop: Crawl Pages and Extract Links cycle between the CrawlStatus and WebPage tables]
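A toy version of that loop, assuming hypothetical fetch_page and extract_links helpers; the crawl_status set plays the role of the CrawlStatus table in the diagram:

```python
from collections import deque

def discovery_crawl(seed_links, fetch_page, extract_links):
    """Basic discovery crawl: start from the set of known links,
    crawl each one, extract every new link, and repeat."""
    frontier = deque(seed_links)      # links waiting to be crawled
    crawl_status = set(seed_links)    # every link ever discovered
    while frontier:
        link = frontier.popleft()
        page = fetch_page(link)       # pages change, so real crawlers also re-fetch
        if page is None:
            continue                  # fetch failed (broken link): skip
        for new_link in extract_links(page):
            if new_link not in crawl_status:
                crawl_status.add(new_link)
                frontier.append(new_link)
```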

Key considerations in crawling

● Polite crawling (see the sketch after this list)
  ○ Do not overload websites and DNS (no DoS!)
  ○ Understand web serving infrastructure

● Prioritize among discovered links
  ○ Crawl is a giant queuing system
  ○ Predicting serving capacity

● Do not waste resources
  ○ Ignore spam/broken links
  ○ Skip links with duplicate content
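One simple way to keep the crawl polite, sketched with an assumed fixed per-host delay (the talk's point is that production crawlers instead predict each site's serving capacity dynamically):

```python
import time
from urllib.parse import urlparse

class PoliteScheduler:
    """Release a link for crawling only if its host has had a quiet
    period since our previous request (fixed delay; assumed value)."""

    def __init__(self, min_delay_seconds=10.0):
        self.min_delay = min_delay_seconds
        self.last_hit = {}  # host => time of our last request to it

    def may_crawl(self, link):
        host = urlparse(link).netloc
        now = time.monotonic()
        if now - self.last_hit.get(host, float("-inf")) < self.min_delay:
            return False    # too soon: leave the link queued
        self.last_hit[host] = now
        return True
```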

Mirrors

● Hosts with exactly the same content
  deview.kr
  www.deview.kr

● Paths within hosts with the same content
  www.cs.unc.edu/Courses/comp426-f09/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
  www.cs.unc.edu/Courses/comp590-001-f08/docs/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
  www.cs.unc.edu/Courses/comp590-001-f08/tools/downloads/tomcat/jakarta-tomcat-4.1.29/webapps/tomcat-docs
  www.cs.unc.edu/Courses/jbs/tools/downloads/tomcat/jakarta-tomcat/4.1.29/webapps/tomcat-docs

● Unrestricted mirroring across hosts and paths
  ○ Distributed graph mining (see the sketch below)
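The talk does not spell out the mining algorithm; as a crude stand-in, here is a pairwise sketch that flags two URL prefixes as mirror candidates when they share many content checksums (the function name and min_shared threshold are made up for illustration):

```python
def mirror_candidates(checksums_by_prefix, min_shared=100):
    """checksums_by_prefix: URL prefix (host, or host + path prefix)
    => set of content checksums seen under it. Two prefixes sharing
    many checksums are likely mirrors, so only one needs crawling."""
    prefixes = list(checksums_by_prefix.items())
    candidates = []
    for i, (a, fps_a) in enumerate(prefixes):
        for b, fps_b in prefixes[i + 1:]:
            if len(fps_a & fps_b) >= min_shared:
                candidates.append((a, b))
    return candidates
```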

Optimizing our crawling

● Efficient crawling requires duplicate handling
  ○ Predict whether a newly discovered link points to duplicate content
  ○ Must happen before crawling

useful(link, status_table) => { yes, no }

Duplicates in Dynamic Pages

● Duplicates are most common in dynamic links
  http://foo.com/forum/viewtopic.php?t=3808&sid=126bc5f2
  http://foo.com/forum/viewtopic.php?t=3808&sid=d5b8483b
  http://foo.com/forum/viewtopic.php?t=3808&sid=3b1a8e27
  http://foo.com/forum/viewtopic.php?t=3808&sid=2a21f059
  ...

● Significance analysis
  ○ Parameter t is relevant
  ○ Parameter sid is irrelevant

● Duplicate prediction
  http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a => same content

Equivalence rules and class names

● Equivalence rule for a cluster
  ○ Set of relevant parameters
  ○ Set of irrelevant parameters

● Equivalence class name (sketched below)
  ○ Remove irrelevant parameters
    ECN(link1) = ECN(link2) => same content!
  ○ For the previous example
    ECN(http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a) = http://foo.com/forum/viewtopic.php?t=3808
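A sketch of ECN computation using Python's standard URL utilities; the rule (here, just the set of irrelevant parameter names) is taken as given:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def equivalence_class_name(link, irrelevant_params):
    """Remove irrelevant query parameters so that links differing
    only in those parameters map to the same name."""
    parts = urlparse(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in irrelevant_params]
    return urlunparse(parts._replace(query=urlencode(kept)))

ecn = equivalence_class_name(
    "http://foo.com/forum/viewtopic.php?t=3808&sid=ee5da24a", {"sid"})
print(ecn)  # http://foo.com/forum/viewtopic.php?t=3808
```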

Modified crawl algorithm

● Representative table
  ○ Equivalence class name => representative link

● Given a new link (see the sketch after this list)
  1. Identify cluster
  2. Look up equivalence rule
  3. Apply rule to determine equivalence class name
  4. Look up table of representatives
  5. Crawl link if no representative found
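A sketch tying the steps together, reusing equivalence_class_name from the previous sketch; the cluster keying below (host plus path) is an assumption, since the talk does not define clusters precisely:

```python
from urllib.parse import urlparse

def cluster_key(link):
    """Assumed clustering: links sharing a host and script path form
    one cluster (the talk does not define the keying exactly)."""
    parts = urlparse(link)
    return parts.netloc + parts.path

def should_crawl(link, rules, representatives):
    """rules: cluster key => set of irrelevant parameter names.
    representatives: equivalence class name => already-chosen link."""
    irrelevant = rules.get(cluster_key(link), set())   # no rule yet: keep all params
    ecn = equivalence_class_name(link, irrelevant)     # from the ECN sketch above
    if ecn in representatives:
        return False                # duplicate of a known representative: skip
    representatives[ecn] = link     # this link now represents its class
    return True
```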

Equivalence rule generation

● Find every crawled link under a cluster
  cluster = { link1 : content1, link2 : content2, ... }
● Study evidence
  1. Insignificance analysis
  2. Significance analysis
  3. Parameter classification
  4. Equivalence rule construction

rule(cluster) = { param1 : RELEVANT, param2 : IRRELEVANT, param3 : CONFLICT, ... }

1. Insignificance analysis

● Group links by content
  content1 = { link11, link12, ... }
  content2 = { link21, link22, ... }
  ...
● For each parameter (see the sketch after this list)
  ○ For each content group with this parameter
    ■ If the parameter's values within the group are not all the same, add the number of links in the group to the insignificance index
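A sketch of this step, assuming the cluster is represented as a mapping from link to its (parsed parameters, content) pair:

```python
from collections import defaultdict

def insignificance_index(cluster, param):
    """cluster: link => (params_dict, content). For each content group
    that uses `param`: if the group saw more than one value for it,
    every link in the group is evidence that `param` does not matter."""
    values_by_content = defaultdict(list)
    for params, content in cluster.values():
        if param in params:
            values_by_content[content].append(params[param])
    index = 0
    for values in values_by_content.values():
        if len(set(values)) > 1:    # same content despite different values
            index += len(values)
    return index
```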

2. Significance analysis

● For each parameter (see the sketch after this list)
  ○ Remove the parameter from every link
    ■ Group contents by the remainder of the link
      remainder1 = { content11, content12, ... }
      remainder2 = { content21, content22, ... }
      ...
    ■ Increment the significance index by the number of unique contents minus 1
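The matching sketch for step 2, with the same cluster representation as above:

```python
from collections import defaultdict

def significance_index(cluster, param):
    """cluster: link => (params_dict, content). Remove `param` from
    every link, group contents by the remaining parameters, and count
    the extra contents each remainder group produced."""
    contents_by_remainder = defaultdict(set)
    for params, content in cluster.values():
        remainder = tuple(sorted((k, v) for k, v in params.items() if k != param))
        contents_by_remainder[remainder].add(content)
    return sum(len(c) - 1 for c in contents_by_remainder.values())
```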

3. Parameter classification

● For each parameter○ Compute content relevance (or irrelevance) value

○ Sample criteria: 90/10 rule■ If relevance > 90 => parameter is RELEVANT■ If relevance < 10 => parameter is IRRELEVANT■ Otherwise, parameter is CONFLICT

Content_Relevance = Significance_Index / (Significance_Index + Insignificance_Index)

Content_Irrelevance = Insignificance_Index / (Significance_Index + Insignificance_Index)
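The classification step, computed directly from these formulas; the tie-break when both indices are zero is my assumption, not from the talk:

```python
def classify_parameter(sig, insig):
    """Apply the 90/10 rule to the two indices from steps 1 and 2."""
    total = sig + insig
    if total == 0:
        return "CONFLICT"           # no evidence either way (assumed tie-break)
    relevance = sig / total         # Content_Relevance from the formula above
    if relevance > 0.9:
        return "RELEVANT"
    if relevance < 0.1:
        return "IRRELEVANT"
    return "CONFLICT"
```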

Example: P is content-irrelevant

Cluster:
  Content A
    http://foo.com/directory?P=1&Q=3
    http://foo.com/directory?P=2&Q=3
  Content B
    http://foo.com/directory?P=1&Q=2
    http://foo.com/directory?P=2&Q=2
    http://foo.com/directory?P=3&Q=2
    http://foo.com/directory?P=4&Q=2

Insignificance analysis of P
  Content A: 2 links, different Ps; Content B: 4 links, different Ps
  P's insignificance index = 2 + 4 = 6
  P's content-irrelevance value = 100%

Significance analysis of P
  Remainder Q=3: 2 links, only Content A; remainder Q=2: 4 links, only Content B
  P's significance index = 0
  P's content-relevance value = 0%

Example: Q is content-relevant

Cluster (same as above):
  Content A
    http://foo.com/directory?P=1&Q=3
    http://foo.com/directory?P=2&Q=3
  Content B
    http://foo.com/directory?P=1&Q=2
    http://foo.com/directory?P=2&Q=2
    http://foo.com/directory?P=3&Q=2
    http://foo.com/directory?P=4&Q=2

Insignificance analysis of Q
  Content A: 2 links, same Q; Content B: 4 links, same Q
  Q's insignificance index = 0
  Q's content-irrelevance value = 0%

Significance analysis of Q
  Remainder P=1: 2 links, Contents A & B; remainder P=2: 2 links, Contents A & B
  Q's significance index = 1 + 1 = 2
  Q's content-relevance value = 100%
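Running the step 1-3 sketches from earlier on this cluster reproduces both slides' numbers (contents are reduced to the labels A and B):

```python
# The example cluster, with contents reduced to the labels A and B.
cluster = {
    "l1": ({"P": "1", "Q": "3"}, "A"),
    "l2": ({"P": "2", "Q": "3"}, "A"),
    "l3": ({"P": "1", "Q": "2"}, "B"),
    "l4": ({"P": "2", "Q": "2"}, "B"),
    "l5": ({"P": "3", "Q": "2"}, "B"),
    "l6": ({"P": "4", "Q": "2"}, "B"),
}
for param in ("P", "Q"):
    sig = significance_index(cluster, param)
    insig = insignificance_index(cluster, param)
    print(param, sig, insig, classify_parameter(sig, insig))
# P 0 6 IRRELEVANT
# Q 2 0 RELEVANT
```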

Facing the Real World

● Limitations
  ○ Co-changing parameters
  ○ Noisy data
  ○ Parameters not used in the standard way
  ○ Need for continuous validation

● State of the art
  ○ White-box vs. black-box

● Search is not solved
  ○ Not even crawling is solved!

Defining duplicates

● Identical pages
● Identical visible content
● Essentially identical visible content
  ○ Ignore page generation time
  ○ Ignore the breaking-news sidebar
  ○ etc.

● What is the right answer?
  Two pages should be considered duplicates if our users would consider them duplicates.

● How to translate this notion into a checksum?
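One naive, illustrative answer, not Google's actual method: normalize the visible text by stripping whatever volatile fragments we can recognize, then hash it. The regex rules below are made up; real systems lean on sturdier fingerprints (e.g. shingling or simhash):

```python
import hashlib
import re

def content_checksum(visible_text):
    """Checksum that survives 'essentially identical' variation:
    strip timestamps and collapse whitespace before hashing.
    The patterns here are illustrative, not Google's actual rules."""
    text = re.sub(r"\d{1,2}:\d{2}(:\d{2})?", "", visible_text)  # drop times
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = content_checksum("Welcome!  Generated at 10:42:01")
b = content_checksum("Welcome! Generated at 23:59:59")
print(a == b)  # True: generation time ignored
```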

Q & A

Thank You!