Negotiating crawl budget with googlebots


Transcript of Negotiating crawl budget with googlebots

Page 1: Negotiating crawl budget with googlebots

USING 'PAGE IMPORTANCE' IN ONGOING CONVERSATION WITH GOOGLEBOT TO GET JUST A BIT MORE THAN YOUR ALLOCATED CRAWL BUDGET

NEGOTIATING CRAWL BUDGET WITH GOOGLEBOTS

Dawn Anderson - @dawnieando

Page 2: Negotiating crawl budget with googlebots

Another Rainy Day In Manchester

@dawnieando

Page 3: Negotiating crawl budget with googlebots

WTF???

Page 4: Negotiating crawl budget with googlebots

1994 - 1998

"THE GOOGLE INDEX IN 1998 HAD 60 MILLION PAGES" (GOOGLE)

(Source: Wikipedia.org)

Page 5: Negotiating crawl budget with googlebots

2000

"INDEXED PAGES REACHES THE ONE BILLION MARK" (GOOGLE)

"IN OVER 17 MILLION WEBSITES" (INTERNETLIVESTATS.COM)

Page 6: Negotiating crawl budget with googlebots

2001 ONWARDS - ENTER WORDPRESS, DRUPAL, PHP-DRIVEN CMSs, ECOMMERCE PLATFORMS, DYNAMIC SITES, AJAX

WHICH CAN GENERATE 10,000s, 100,000s OR 1,000,000s OF DYNAMIC URLS ON THE FLY WITH DATABASE 'FIELD-BASED' CONTENT

DYNAMIC CONTENT CREATION GROWS

ENTER FACETED NAVIGATION (WITH MANY PATHS TO THE SAME CONTENT)

2003 - WE'RE AT 40 MILLION WEBSITES

Page 7: Negotiating crawl budget with googlebots

2003 ONWARDS - USERS BEGIN TO JUMP ON THE CONTENT GENERATION BANDWAGON

LOTS OF CONTENT - IN MANY FORMS

Page 8: Negotiating crawl budget with googlebots

WE KNEW THE WEB WAS BIG… (GOOGLE, 2008)

https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html

"1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (Jesse Alpert on Google's Official Blog, 2008)

2008 - EVEN GOOGLE ENGINEERS STOPPED IN AWE

Page 9: Negotiating crawl budget with googlebots

2010 - USER-GENERATED CONTENT GROWS

"Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003"

"The real issue is user-generated content." (Eric Schmidt, 2010, Techonomy Conference panel)

SOURCE: http://techcrunch.com/2010/08/04/schmidt-data/

Page 10: Negotiating crawl budget with googlebots

Indexed Web contains at least 4.73 billion pages (13/11/2015)

CONTENT KEEPS GROWING - Total number of websites

[Chart: total number of websites, 2000-2014, rising towards 1,000,000,000]

THE NUMBER OF WEBSITES DOUBLED BETWEEN 2011 AND 2012, AND GREW AGAIN BY A THIRD IN 2014

Page 11: Negotiating crawl budget with googlebots

EVEN SIR TIM BERNERS-LEE (INVENTOR OF THE WWW) TWEETED

2014 - WE PASS A BILLION INDIVIDUAL WEBSITES ONLINE

Page 12: Negotiating crawl budget with googlebots

2014 - WE ARE ALL PUBLISHERS

SOURCE: http://wordpress.com/activity/posting

Page 13: Negotiating crawl budget with googlebots

YUP - WE ALL 'LOVE CONTENT'

IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO? - A LOT

http://www.internetlivestats.com/total-number-of-websites/

Page 14: Negotiating crawl budget with googlebots

"As of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents" (MANY GOOGLE PATENTS)

CAPACITY LIMITATIONS - EVEN FOR SEARCH ENGINES

Source: Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)

Page 15: Negotiating crawl budget with googlebots

"So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-)"

(Jesse Alpert, Google, 2008)

Source: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html

NOT ENOUGH TIME

SOME THINGS MUST BE FILTERED

Page 16: Negotiating crawl budget with googlebots

A LOT OF THE CONTENT IS 'KIND OF THE SAME'

"There's a needle in here somewhere"

"It's an important needle too"

Page 17: Negotiating crawl budget with googlebots

WHAT IS THE SOLUTION?

There are capacity limits on Google's crawling system. How have search engines responded?

• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work 'schedules' for Googlebots

"To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling." - Scheduler for search engine crawler (Zhu et al)

Page 18: Negotiating crawl budget with googlebots

GOOGLE CRAWL SCHEDULER PATENTS

These include:
• 'Managing items in a crawl schedule'
• 'Scheduling a recrawl'
• 'Web crawler scheduler that utilizes sitemaps from websites'
• 'Document reuse in a search engine crawler'
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
• 'Scheduler for search engine crawler'

EFFICIENCY IS NECESSARY

Page 19: Negotiating crawl budget with googlebots

CRAWL BUDGET

1. Crawl Budget - "An allocation of crawl frequency visits to a host (IP LEVEL)"

2. Roughly proportionate to PageRank and host load / speed / host capacity

3. Pages with a lot of links get crawled more

4. The vast majority of URLs on the web don't get a lot of budget allocated to them (low to zero PageRank URLs)

https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/

Page 20: Negotiating crawl budget with googlebots

BUT… MAYBE THINGS HAVE CHANGED?

CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST LOAD AND PAGERANK ANY MORE

Page 21: Negotiating crawl budget with googlebots

STOP THINKING IT'S JUST ABOUT 'PAGERANK'

http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s

"You keep focusing on PageRank"…

"There's a shit-ton of other stuff going on" (Gary Illyes, Google, 2016)

Page 22: Negotiating crawl budget with googlebots

THERE ARE A LOT OF OTHER THINGS AFFECTING 'CRAWLING'

WEB PROMOS Q&A WITH GOOGLE'S ANDREY LIPATTSEV

Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/

Page 23: Negotiating crawl budget with googlebots

WHY? BECAUSE…

THE WEB GOT 'MAHOOOOOSIVE'

AND CONTINUES TO GET 'MAHOOOOOOSIVER'

SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED, BIGGER, BECAME PAGINATED AND SORTED

Page 24: Negotiating crawl budget with googlebots

WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING, SO WE CAN FIND IMPORTANT CHANGES QUICKLY

GOOGLEBOT'S TO-DO LIST GOT REALLY BIG

Page 25: Negotiating crawl budget with googlebots

FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED

• Hard and soft crawl limits
• Importance thresholds
• Min and max hints & 'hint ranges'
• Importance crawl periods
• Scheduling
• Prioritisation: tiered crawling buckets ('Real Time', 'Daily', 'Base Layer')

Page 26: Negotiating crawl budget with googlebots

SEVERAL PATENTS UPDATED (THEY SEEM TO WORK TOGETHER)

• 'Managing URLs' (Alpert et al, 2013) - page importance determining soft and hard limits on crawling
• 'Managing Items in a Crawl Schedule' (Alpert, 2014)
• 'Scheduling a Recrawl' (Auerbach, Alpert, 2013) - predicting change frequency in order to schedule the next visit, employing hints (min & max)
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' - includes employing hints to detect pages NOT to crawl

Page 27: Negotiating crawl budget with googlebots

MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)

3 layers / tiers / buckets for scheduling:

• Real Time Crawl - crawled multiple times daily
• Daily Crawl - crawled daily or bi-daily
• Base Layer Crawl (the most unimportant URLs) - crawled least, on a 'round robin' basis; split into segments on random rotation, with only the 'active' segment crawled

URLs are moved in and out of layers based on past visits data
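To make the three-tier idea concrete, here is a minimal Python sketch of how a scheduler might move URLs between layers based on importance and observed change. The thresholds, field names and scores are illustrative assumptions, not anything taken from the patent.

from dataclasses import dataclass

REAL_TIME, DAILY, BASE = "real-time", "daily", "base"

@dataclass
class UrlRecord:
    url: str
    importance: float            # query-independent page importance (assumed 0-1 scale)
    critical_change_rate: float  # observed rate of 'critical material change'
    layer: str = BASE

def assign_layer(rec: UrlRecord) -> str:
    """Illustrative promotion/demotion rules; the thresholds are made up."""
    if rec.importance > 0.8 and rec.critical_change_rate > 0.5:
        return REAL_TIME   # crawled multiple times daily
    if rec.importance > 0.5:
        return DAILY       # crawled daily or bi-daily
    return BASE            # crawled least, in rotating 'active' segments

# After each crawl cycle the records are re-scored and may move between layers.
records = [UrlRecord("https://www.example.com/", 0.9, 0.7),
           UrlRecord("https://www.example.com/blog/2009-old-post", 0.3, 0.0)]
for rec in records:
    rec.layer = assign_layer(rec)
    print(rec.url, "->", rec.layer)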

Page 28: Negotiating crawl budget with googlebots

CAN WE ESCAPE THE 'BASE LAYER' CRAWL BUCKET RESERVED FOR 'UNIMPORTANT' URLS?

Page 29: Negotiating crawl budget with googlebots

SOME OF THE MAJOR SEARCH ENGINE CHARACTERS

• The 10 types of Googlebot
• History Logs / History Server
• The URL Scheduler / Crawl Manager

Page 30: Negotiating crawl budget with googlebots

HISTORY LOGS / HISTORY SERVERS

HISTORY LOGS / HISTORY SERVER - builds a picture of historical data and past behaviour of the URL and its 'importance' score, to predict and plan for future crawl scheduling:

• Last crawled date
• Next crawl due
• Last server response
• Page importance score
• Collaborates with link logs
• Collaborates with anchor logs
• Contributes info to scheduling
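As a rough illustration only, the kind of per-URL record described above might look like the following Python dataclass. Every field name here is an assumption drawn from the bullet list, not from Google's actual schema.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class UrlHistoryRecord:
    """Illustrative per-URL history entry; all field names are assumptions."""
    url: str
    last_crawled: Optional[datetime]     # last crawled date
    next_crawl_due: Optional[datetime]   # next crawl due
    last_server_response: Optional[int]  # e.g. 200, 301, 404, 500
    page_importance: float               # query-independent importance score
    content_checksum: Optional[str]      # compared against past visits to detect change
    # Per the patents, this data is combined with link logs and anchor logs
    # when the scheduler plans the next visit.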

Page 31: Negotiating crawl budget with googlebots

'BOSS' - URL SCHEDULER / URL MANAGER

Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system.

JOBS:
• Schedules Googlebot visits to URLs
• Decides which URLs to 'feed' to Googlebot
• Uses data from the history logs about past visits (change rate and importance)
• Calculates the importance crawl threshold
• Assigns visit regularity of Googlebot to URLs
• Drops 'max and min hints' to Googlebot to guide on types of content NOT to crawl, or to crawl as exceptions
• Excludes some URLs from schedules
• Assigns URLs to 'layers / tiers' for crawling schedules
• Checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
• Budgets are allocated to IPs and shared amongst the domains there

Page 32: Negotiating crawl budget with googlebots

GOOGLEBOT - CRAWLER

JOBS:
• 'Ranks nothing at all'
• Takes a list of URLs to crawl from the URL Scheduler
• Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs
• Makes notes of outbound linked pages and additional links for future crawling
• Follows directives (robots) and takes 'hints' when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary-data equivalent of web content) for comparison with past visits by the history and link logs
• Will go beyond the crawl schedule if it finds something more important than the URLs scheduled

Page 33: Negotiating crawl budget with googlebots

WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND 'REAL TIME' SCHEDULE ALLOCATION?

Page 34: Negotiating crawl budget with googlebots

CONTRIBUTING FACTORS

1. Page importance (which may include PageRank)

2. Hints (max and min)

3. Soft limits and hard crawl limits

4. Host load capability & past site performance (speed and access) (IP level, and domain level within)

5. Probability / predictability of 'CRITICAL MATERIAL' change + importance crawl period

Page 35: Negotiating crawl budget with googlebots

1 - PAGE IMPORTANCE - page importance is the importance of a page independent of a query

• Location in site (e.g. the home page is more important than a third-level parameter output page)
• PageRank
• Page type / file type
• Internal PageRank
• Internal backlinks
• In-site anchor text consistency
• Relevance (content, anchors and elements) to a topic (similarity importance)
• Directives from in-page robots and robots.txt management
• Parent quality brushes off on child page quality - IMPORTANT PARENTS ARE LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES

Page 36: Negotiating crawl budget with googlebots

2 - HINTS - 'MIN' HINTS & 'MAX' HINTS

MIN HINT / MIN HINT RANGES
• e.g. programmatically generated content which changes the content checksum on load
• Unimportant duplicate parameter URLs
• Canonicals
• rel=next, rel=prev
• hreflang
• Duplicate content
• Spammy URLs?
• Objectionable content

MAX HINT / MAX HINT RANGES
• Change considered 'CRITICAL MATERIAL CHANGE' (useful to users, e.g. availability, price), and/or improved site sections, or change to IMPORTANT but infrequently changing content
• Important pages / page range updates

E.g. rel="prev" and rel="next" act as hints to Google, not absolute directives

https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741

Page 37: Negotiating crawl budget with googlebots

3 - HARD AND SOFT LIMITS ON CRAWLING

A 'soft' crawl limit is set (the original schedule), and a 'hard' crawl limit is set (e.g. 130% of the schedule) for important findings.

If URLs are discovered during crawling that are more important than those scheduled to be crawled, Googlebot can go beyond its schedule to include them, up to the hard crawl limit.
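A hedged sketch of the soft/hard limit idea: crawl everything that was scheduled (the soft limit), then admit newly discovered URLs that look more important than the least important scheduled URL, but never beyond a hard ceiling such as 130% of the schedule. The importance scores and the 1.3 factor below are illustrative assumptions.

def build_crawl_list(scheduled, discovered, importance, hard_limit_factor=1.3):
    """Illustrative only: scheduled URLs are the soft limit; discovered URLs that
    look more important may be added, up to a hard limit (e.g. 130% of schedule)."""
    soft_limit = len(scheduled)
    hard_limit = int(soft_limit * hard_limit_factor)

    crawl_list = list(scheduled)
    floor = min(importance.get(u, 0.0) for u in scheduled)

    for url in discovered:
        if len(crawl_list) >= hard_limit:
            break                              # hard limit reached
        if importance.get(url, 0.0) > floor:
            crawl_list.append(url)             # important enough to exceed the schedule
    return crawl_list

importance = {"/": 0.9, "/category/": 0.6, "/category/widgets/": 0.5,
              "/about/": 0.3, "/new-hub/": 0.8, "/old-tag/": 0.1}
scheduled = ["/", "/category/", "/category/widgets/", "/about/"]
print(build_crawl_list(scheduled, ["/new-hub/", "/old-tag/"], importance))
# -> ['/', '/category/', '/category/widgets/', '/about/', '/new-hub/']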

Page 38: Negotiating crawl budget with googlebots

4 - HOST LOAD CAPACITY / PAST SITE PERFORMANCE

• Googlebot has a list of URLs to crawl
• Naturally, if your site is fast, that list can be crawled more quickly
• If Googlebot experiences 500s, for example, she will retreat, and 'past performance' is noted
• If Googlebot doesn't get 'round the list', you may end up with 'overdue' URLs to crawl

Page 39: Negotiating crawl budget with googlebots

5 - CHANGE

• Not all change is considered equal
• There are many dynamic sites with low-importance pages changing frequently - SO WHAT
• Constantly changing your page just to get Googlebot back won't work if the page is low importance (crawl importance period < change rate) - POINTLESS
• Hints are employed to determine pages which simply change the content checksum with every visit
• Features are weighted for change importance to the user (e.g. price > colour)
• Change identified as useful to users is considered 'CRITICAL MATERIAL CHANGE'
• Don't just try to randomise things to catch Googlebot's eye
• That counter or clock you added probably isn't going to help you get more attention, nor will random or shuffled content
• Change on some types of pages is more important than on others (e.g. the CNN home page > an SME about-us page)
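A small sketch of why 'changes the checksum on every load' is not the same as critical material change: a checksum over the whole page flips whenever a random widget updates, while a checksum restricted to fields users actually care about (price, availability) stays stable. Which fields count as 'critical' here is my assumption, purely for illustration.

import hashlib

def checksum(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two snapshots of the same product page: only a random 'visitor counter' differs.
visit_1 = {"price": "19.99", "availability": "in stock", "counter": "Visitor #10401"}
visit_2 = {"price": "19.99", "availability": "in stock", "counter": "Visitor #10552"}

full_1 = checksum("".join(visit_1.values()))
full_2 = checksum("".join(visit_2.values()))

CRITICAL_FIELDS = ("price", "availability")   # assumption: the changes users actually care about
crit_1 = checksum("".join(visit_1[f] for f in CRITICAL_FIELDS))
crit_2 = checksum("".join(visit_2[f] for f in CRITICAL_FIELDS))

print("full-page checksum changed:", full_1 != full_2)            # True: cosmetic churn only
print("critical-material checksum changed:", crit_1 != crit_2)    # False: nothing users care about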

Page 40: Negotiating crawl budget with googlebots

FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY

• Current capacity of the web crawling system is high
• Your URL has a high 'importance score'
• Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or 'active' base layer segment (UNIMPORTANT BUT SELECTED)
• Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT)
• Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND THE URL IS IMPORTANT)
• Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs for that visit
• Your URL has been 'upgraded' to a daily or real time crawl layer as its importance is detected as raised
• History logs and the URL Scheduler 'learn' together

Page 41: Negotiating crawl budget with googlebots

FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY

• Current capacity of the web crawling system is low
• Your URL has been detected as a 'spam' URL
• Your URL is in an 'inactive' base layer segment (UNIMPORTANT)
• Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn't get the time to visit your URL
• Your URL has been 'downgraded' to an 'inactive' base layer (UNIMPORTANT) segment
• Your URL has returned an 'unreachable' server response code recently
• In-page robots management or robots.txt send the wrong signals

Page 42: Negotiating crawl budget with googlebots

GET MORE CRAWL BY 'TURNING GOOGLEBOT'S HEAD' - MAKE YOUR URLs MORE IMPORTANT AND 'EMPHASISE' IMPORTANCE

Page 43: Negotiating crawl budget with googlebots

GOOGLEBOT DOES AS SHE'S TOLD - WITH A FEW EXCEPTIONS

• Hard limits and soft limits
• Follows 'min' and 'max' hints
• If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (up to the HARD LIMIT)
• You need to IMPRESS Googlebot
• If you 'bore' Googlebot she will return to boring URLs less (e.g. pages that are all the same (duplicate content) or dynamically generated low-usefulness content)
• If you 'delight' Googlebot she will return to delightful URLs more (they became more important or they changed with 'CRITICAL MATERIAL CHANGE')
• If she doesn't get her crawl completed you will end up with an 'overdue' list of URLs to crawl

Page 44: Negotiating crawl budget with googlebots

GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE

• Your URL became more important and achieved a higher 'importance score' via increased PageRank
• Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN YOUR OWN SITE) relative to other URLs within your site (you emphasised importance)
• You made the URL content more relevant to a topic and improved the importance score
• The parent of your URL became more important (e.g. improved topic relevance (similarity), PageRank or a local (in-site) importance metric)
• The 'importance score' of some URLs exceeded the 'importance soft limit threshold', so they are included for crawling and visited up to the point of the 'hard limit' on crawling (e.g. 130% of scheduled crawling)

Page 45: Negotiating crawl budget with googlebots

HOW DO WE DO THIS?

Page 46: Negotiating crawl budget with googlebots

TO DO - FIND GOOGLEBOT: AUTOMATE SERVER LOG RETRIEVAL VIA A CRON JOB

grep Googlebot access_log > googlebot_access.txt

ANALYSE THE LOGS
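Once the grep above has produced a Googlebot-only log, something like the following Python sketch can summarise what she is actually fetching. It assumes the log is in the common Apache/Nginx 'combined' format; the optional reverse-DNS check is there because fake 'Googlebot' user agents are common.

import re
import socket
from collections import Counter

# Combined log format assumed: ip - - [time] "METHOD /path HTTP/1.1" status ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3})')

def is_real_googlebot(ip: str) -> bool:
    """Optional: genuine Googlebot IPs reverse-resolve to googlebot.com / google.com."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

hits = Counter()
with open("googlebot_access.txt") as fh:      # the output of the grep above
    for line in fh:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, method, path, status = m.groups()
        hits[(status, path)] += 1             # call is_real_googlebot(ip) on anything suspicious

for (status, path), count in hits.most_common(20):
    print(f"{count:>6}  {status}  {path}")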

Page 47: Negotiating crawl budget with googlebots

LOOK THROUGH SPIDER EYES - PREPARE TO BE HORRIFIED

• Incorrect URL header response codes
• 301 redirect chains
• Old files or XML sitemaps left on the server from years ago
• Infinite / endless loops (circular dependency)
• On parameter-driven sites, URLs crawled which produce the same output
• AJAX content fragments pulled in alone
• URLs generated by spammers
• Dead image files being visited
• Old CSS files still being crawled and loading EVERYTHING
• You may even see 'mini' abandoned projects within the site
• Legacy URLs generated by long-forgotten .htaccess regex pattern matching
• Googlebot hanging around in your 'ever-changing' blog but nowhere else

Page 48: Negotiating crawl budget with googlebots

URL CRAWL FREQUENCY 'CLOCKING'

Spreadsheet provided by @johnmu during a Webmaster Hangout - https://goo.gl/1pToL8

Identify your 'real time', 'daily' and 'base layer' URLs. Are they the ones you want there? What is being seen as unimportant?

NOTE GOOGLEBOT: Do you recognise all the URLs and URL ranges that are appearing? If not… why not?
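If you prefer scripting to the spreadsheet, a rough way to 'clock' crawl frequency from the same filtered log is to count, per URL, how many hits and how many distinct days Googlebot appears on. The layer labels and the 30-day window below are assumptions, used only to bucket URLs for inspection.

import re
from collections import defaultdict

# Pull the day (e.g. 10/Oct/2016) and the request path out of each combined-format log line.
LOG_LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4})[^\]]*\] "\S+ (\S+) ')

hits = defaultdict(int)
days = defaultdict(set)
with open("googlebot_access.txt") as fh:
    for line in fh:
        m = LOG_LINE.search(line)
        if not m:
            continue
        day, path = m.groups()
        hits[path] += 1
        days[path].add(day)

WINDOW_DAYS = 30   # assumed length of the log window
for path in sorted(hits, key=hits.get, reverse=True):
    per_day = hits[path] / WINDOW_DAYS
    if per_day > 1:
        layer = "real-time-ish (multiple crawls per day)"
    elif len(days[path]) >= WINDOW_DAYS * 0.5:
        layer = "daily-ish"
    else:
        layer = "base-layer-ish"
    print(f"{path}: {hits[path]} hits over {len(days[path])} days -> {layer}")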

Page 49: Negotiating crawl budget with googlebots

IMPROVE & EMPHASISE PAGE IMPORTANCE

• Cross-modular internal linking
• Canonicalization
• Important URLs in XML sitemaps
• Anchor text target consistency (but not spammy repetition of anchors everywhere - it's still output)
• Internal links in the right descending order - emphasise IMPORTANCE
• Reduce boilerplate content and improve the relevance of content and elements to the specific topic (if category) / product (if product page) / subcategory (if subcategory)
• Reduce duplicate-content parts of the page to allow primary targets to take 'IMPORTANCE'
• Improve parent pages to raise the IMPORTANCE reputation of the children, rather than over-optimising the child pages and cannibalising the parent
• Improve content to be more 'relevant' to a topic to increase 'IMPORTANCE' and get reassigned to a different crawl layer
• Flatten 'architectures'
• Avoid content cannibalisation
• Link relevant content to relevant content
• Build strong, highly relevant 'hub' pages to tie together strength & IMPORTANCE

Page 50: Negotiating crawl budget with googlebots

EMPHASISE IMPORTANCE WISELY

USE CUSTOM XML SITEMAPS (E.G. THE UNLIMITED XML SITEMAP GENERATOR)

PUT IMPORTANT URLS IN HERE

IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED
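A minimal sketch of a 'picky' custom sitemap: only URLs above your own importance cutoff get written out. The URLs and scores are placeholders for your own prioritisation (Google does not expose an importance score), and the script can be run from a cron or web cron job so the file stays current.

from xml.sax.saxutils import escape

# Your own prioritised URL list; the scores are an internal judgement, not a Google metric.
IMPORTANT_URLS = [
    ("https://www.example.com/", 1.0),
    ("https://www.example.com/category/widgets/", 0.8),
    ("https://www.example.com/category/widgets/blue-widget/", 0.6),
    ("https://www.example.com/tag/misc/", 0.2),   # below the cutoff: deliberately left out
]

def write_sitemap(urls, path="sitemap-important.xml", cutoff=0.5):
    """Write only the URLs at or above the cutoff into the custom sitemap."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, score in urls:
        if score >= cutoff:
            lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")

write_sitemap(IMPORTANT_URLS)   # schedule this from a cron / web cron job to keep it current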

Page 51: Negotiating crawl budget with googlebots

KEEP CUSTOM SITEMAPS 'CURRENT' AUTOMATICALLY

AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS

IT'S NOT AS TECHNICAL AS YOU MAY THINK - USE WEB CRON JOBS

Page 52: Negotiating crawl budget with googlebots

BE 'PICKY' ABOUT WHAT YOU INCLUDE IN XML SITEMAPS

EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE

Page 53: Negotiating crawl budget with googlebots

IF YOU CAN'T IMPROVE - EXCLUDE (VIA NOINDEX) FOR NOW

• You're out for now
• When you improve, you can come back in
• Tell Googlebot quickly that you're out (via temporary XML sitemap inclusion)
• But 'follow', because there will be some relevance within these URLs
• Include again when you've improved
• Don't try to canonicalize me to something in the index
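One hedged way to implement 'out for now, but still follow' is to send an X-Robots-Tag: noindex, follow header on the URLs you are excluding, so they drop out of the index while the links on them can still be crawled. The example assumes a Flask app; the route, the render() stub and the excluded path are hypothetical.

from flask import Flask, make_response

app = Flask(__name__)

# Hypothetical set of paths you have decided to pull from the index until they improve.
EXCLUDED_FOR_NOW = {"/category/near-duplicate-colour-facet/"}

def render(page_path: str) -> str:
    return f"<html><body>Content for {page_path}</body></html>"   # stand-in for your templates

@app.route("/<path:page>")
def serve(page):
    page_path = "/" + page
    resp = make_response(render(page_path))
    if page_path in EXCLUDED_FOR_NOW:
        # noindex: drop out of the index for now; follow: links on the page can still be crawled
        resp.headers["X-Robots-Tag"] = "noindex, follow"
    return resp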

Page 54: Negotiating crawl budget with googlebots

OR REMOVE - 410 GONE (IF IT'S NEVER COMING BACK)

EMBRACE THE '410 GONE'

There's even a song about it: http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo

Page 55: Negotiating crawl budget with googlebots

#BIGSITEPROBLEMS - LOSE THE INDEX BLOAT

LOSE THE BLOAT TO INCREASE THE CRAWL - the number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation

Page 56: Negotiating crawl budget with googlebots

#BIGSITEPROBLEMS - LOSE THE CRAZY TAG MAN

Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it

Creating 'thin' content and even more URLs to crawl

Image credit: Buzzfeed

Page 57: Negotiating crawl budget with googlebots

#BIGSITEPROBLEMS - INTERNAL BACKLINKS SKEWED

IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING - LOCAL IB(P) (INTERNAL BACKLINKS)

[Diagram: internal link distribution across 'Most Important Page 1', 'Most Important Page 2' and 'Most Important Page 3']

IS THIS YOUR BLOG?? HOPE NOT

Page 58: Negotiating crawl budget with googlebots

#BIGSITEPROBLEMS - WARNING SIGNS - LOSE THE 'MISTER OVER-OPTIMIZER'

'OPTIMIZE ALL THE THINGS'

Optimize Everything: I must optimize ALL the pages across a category's descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page and crawlers are confused as to which is the important one. I'll put them all in a sitemap as standard too, just for good measure.

HOW CAN SEARCH ENGINES KNOW WHICH IS MOST IMPORTANT TO A TOPIC IF 'EVERYTHING' IS IMPORTANT??

Image credit: Buzzfeed

Page 59: Negotiating crawl budget with googlebots

#BIGSITEPROBLEMS - WARNING SIGNS - LOSE THE 'MISTER DUPLICATER'

'DUPLICATE ALL THE THINGS'

Duplicate Everything: I must have a massive boilerplate area in the footer, identical sidebars and a massive mega menu with all the same output sitewide. I'll put very little unique content into the page body and it will also look very much like its parents and grandparents too. From time to time I'll outrank my parent and grandparent pages but 'Meh'…

HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME??

Image credit: Buzzfeed

Page 60: Negotiating crawl budget with googlebots

IMPROVE SITE PERFORMANCE - HELP GOOGLEBOT GET THROUGH THE 'BUCKET LIST' - GET FAST AND RELIABLE

Avoid wasting time on 'overdue-URL' crawling (e.g. send correct response codes, speed up your site, etc.)

Example: added to the Cloudflare CDN - ½ the time, > 2x page crawls per day (patent reference: US 8,666,964 B1)

Page 61: Negotiating crawl budget with googlebots

'GET FRESH' AND STAY 'FRESH' - 'BUT DON'T TRY TO FAKE FRESH & USE FRESH WISELY'

• GOOGLEBOT GOES WHERE THE ACTION IS
• USE 'ACTION' WISELY
• DON'T TRY TO TRICK GOOGLEBOT BY FAKING 'FRESHNESS' ON LOW-IMPORTANCE PAGES - GOOGLEBOT WILL REALISE
• UPDATE IMPORTANT PAGES OFTEN
• NURTURE SEASONAL URLs TO GROW IMPORTANCE WITH FRESHNESS (regular updates) & MATURITY (HISTORY)
• DON'T TURN GOOGLEBOT'S HEAD INTO THE WRONG PLACES

Image credit: Buzzfeed

Page 62: Negotiating crawl budget with googlebots

IMPROVE TO GET THE HARD LIMITS ON CRAWLING

CAN IMPROVING YOUR SITE HELP TO 'OVERRIDE' THE SOFT LIMIT CRAWL PERIODS SET?

By improving your URL importance on an ongoing basis - via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring - you can get to the 'hard limit' or simply get visited more generally.

Page 63: Negotiating crawl budget with googlebots

YOU THINK IT DOESN’T MATTER… RIGHT?

YOU SAY…

"GOOGLE WILL WORK IT OUT"

"LET'S JUST MAKE MORE CONTENT"

Page 64: Negotiating crawl budget with googlebots

WRONG - 'CRAWL TANK' IS UGLY

Page 65: Negotiating crawl budget with googlebots

WRONG - CRAWL TANK CAN LOOK LIKE THIS

SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW 'THIN' PARAMETER INTO A SITE, OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP))

WHAT'S WORSE THAN AN INFINITE LOOP? 'A LOGICAL INFINITE LOOP'

IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING 'JUNK', OR EVEN WORSE, PULLING LOGIC VISIBLE TO CRAWLERS BUT NOT TO HUMANS

Page 66: Negotiating crawl budget with googlebots

WRONG - SITE DROWNED IN ITS OWN SEA OF UNIMPORTANT URLS

Page 67: Negotiating crawl budget with googlebots

VIA 'EXPONENTIAL URL UNIMPORTANCE' - your URLs are repeatedly confirmed as unimportant with each iterative crawl visit to other similar or duplicate content-checksum URLs. Fewer and fewer internal links, and 'thinner and thinner' relevant content.

MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant single relevant IMPORTANT URL.

Page 68: Negotiating crawl budget with googlebots

WRONG - 'SENDING WRONG SIGNALS TO GOOGLEBOT' COSTS DEARLY

(Source: Sistrix)

"2015 was the year where website owners managed to be mostly at fault, all by themselves" (Sistrix 2015 Organic Search Review, 2016)

Page 69: Negotiating crawl budget with googlebots

WRONG - NO-ONE IS EXEMPT

(Source: Sistrix)

"It doesn't matter how big your brand is if you 'talk to the spider' (Googlebot) wrong" - you can still 'tank'

Page 70: Negotiating crawl budget with googlebots

WRONG - GOOGLE THINKS SEOS SHOULD UNDERSTAND CRAWL BUDGET

Page 71: Negotiating crawl budget with googlebots

SORT OUT CRAWLING

"EMPHASISE IMPORTANCE" - "Make sure the right URLs get on Googlebot's menu, and increase URL importance to build Googlebot's appetite for your site"

Dawn Anderson - @dawnieando

Page 72: Negotiating crawl budget with googlebots

TWITTER - @dawnieando
GOOGLE+ - +DawnAnderson888
LINKEDIN - msdawnanderson

THANK YOU
Dawn Anderson - @dawnieando

Page 73: Negotiating crawl budget with googlebots

UNDERSTAND GOOGLEBOT & URL SCHEDULER - LIKES & DISLIKES

LIKES
• Going 'where the action is' in sites
• The 'need for speed'
• Logical structure
• Correct 'response' codes
• XML sitemaps with important URLs
• Successful crawl visits
• 'Seeing everything' on a page
• Taking MAX 'hints'
• Clear, unique, single 'URL fingerprints' (no duplicates)
• Predicting the likelihood of 'future change'
• Finding 'more' important content worth crawling

DISLIKES
• Slow sites
• Too many redirects
• Being bored (Meh) (min 'hints' are built in by the search engine systems - takes 'hints')
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (infinite loops)
• Spam URLs
• Crawl-wasting minor-content-change URLs
• 'Hidden' and blocked content
• Uncrawlable URLs

CHANGE IS KEY
• Not just any change - critical material change
• Not just page change designed to catch Googlebot's eye with no added value
• Predicting future change
• Dropping 'hints' to Googlebot
• Sending Googlebot where 'the action is'

Page 74: Negotiating crawl budget with googlebots


CRAWL OPTIMISATION - STAGE 1: UNDERSTAND GOOGLEBOT & URL SCHEDULER - LIKES & DISLIKES (CHANGE IS KEY)

Page 75: Negotiating crawl budget with googlebots

FIX GOOGLEBOT'S JOURNEY

SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE

TECHNICAL 'FIXES'
• Speed up your site
• Implement compression, minification and caching
• Fix incorrect header response codes (see the sketch after this list)
• Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
• Use absolute rather than relative internal links
• Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
• Ensure no CSS or JavaScript files are blocked from crawlers
• Unpick 301 redirect chains (see the sketch after this list)
• Consider using a CDN such as Cloudflare

IMPLEMENTATION OF A CONTENT DELIVERY NETWORK
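For the two response-code items above, here is a small sketch (assuming the requests library is available) that walks a URL hop by hop, so 301 chains, wrong status codes and loops become visible.

from urllib.parse import urljoin
import requests   # assumed to be installed

def trace_redirects(url: str, max_hops: int = 10):
    """Follow a URL hop by hop so redirect chains, loops and wrong codes are visible."""
    hops, seen = [], set()
    while url and len(hops) < max_hops:
        if url in seen:
            hops.append((url, "LOOP"))        # circular dependency / infinite loop
            break
        seen.add(url)
        resp = requests.head(url, allow_redirects=False, timeout=10)
        hops.append((url, resp.status_code))
        location = resp.headers.get("Location")
        url = urljoin(url, location) if resp.is_redirect and location else None
    return hops

for url, status in trace_redirects("http://www.example.com/old-page"):
    print(status, url)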

Page 76: Negotiating crawl budget with googlebots

FIX GOOGLEBOT'S JOURNEY - SAVE BUDGET / EMPHASISE IMPORTANCE

• Minimise 301 redirects
• Minimise canonicalisation
• Use 'if modified' headers on low-importance 'hygiene' pages (see the sketch after this list)
• Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
• Noindex low-search-volume or near-duplicate URLs (e.g. via a meta robots or X-Robots-Tag noindex directive)
• Use 410 'gone' headers on dead URLs liberally
• Revisit the .htaccess file and review legacy pattern-matched 301 redirects
• Combine CSS and JavaScript files
• Use minification, compression and caching

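For the 'if modified' item above, a hedged Flask sketch of a conditional GET on a low-importance page: if the crawler sends If-Modified-Since and nothing has genuinely changed, answer 304 with no body. The LAST_CHANGED store and the /terms route are hypothetical; Werkzeug 2.x parses If-Modified-Since into a timezone-aware datetime, which the comparison below relies on.

from datetime import datetime, timezone
from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical record of when each low-importance 'hygiene' page last genuinely changed.
LAST_CHANGED = {"/terms": datetime(2016, 1, 4, tzinfo=timezone.utc)}

@app.route("/terms")
def terms():
    changed = LAST_CHANGED["/terms"]
    ims = request.if_modified_since          # parsed If-Modified-Since header, or None
    if ims and ims >= changed:
        return Response(status=304)          # nothing new: no body for the crawler to re-fetch
    resp = Response("<html><body>Terms and conditions</body></html>")
    resp.last_modified = changed             # sets the Last-Modified header for the next visit
    return resp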

Page 77: Negotiating crawl budget with googlebots

TRAIN GOOGLEBOT - 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)

EMPHASISE PAGE IMPORTANCE
• Revisit 'votes for self' via internal links in GSC
• Clear 'unique' URL fingerprints
• Improve whole site sections / categories
• Use XML sitemaps for your important URLs (don't put everything in them)
• Use 'mega menus' (very selectively) to key pages
• Use 'breadcrumbs'
• Build 'bridges' and 'shortcuts' via HTML sitemaps and 'cross-modular', 'related' internal linking to key pages
• Consolidate (merge) important but similar content (e.g. merge FAQs or 'low search volume' content into other relevant pages)
• Consider flattening your site structure so 'importance' flows further
• Reduce internal linking to lower-priority URLs

BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES

TRAIN ON CHANGE
• Not just any change - critical material change
• Keep the 'action' in the key areas - NOT JUST THE BLOG
• Use relevant 'supplementary content' to keep key pages 'fresh'
• Remember min crawl 'hints'
• Regularly update key IMPORTANT content
• Consider 'updating' rather than replacing seasonal content URLs (e.g. annual events). Append and update.
• Build 'dynamism' and 'interactivity' into your web development (sites that 'move' win)
• Keep working to improve and make your URLs more important

GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE (AS LONG AS THOSE URLS ARE NOT UNIMPORTANT)

Page 78: Negotiating crawl budget with googlebots

SAVINGS, CHANGE & SPEED TOOLS

SAVINGS & CHANGE
• GSC index levels (over-indexation checks)
• GSC crawl stats
• Last-accessed tools (versus competitors)
• Server logs
• Keyword tools

SPEED
• YSlow
• Pingdom
• Google PageSpeed tests
• Minification - JS Compress and CSS Minifier
• Image compression - compressjpeg.com, tinypng.com
• Content delivery networks (e.g. Cloudflare)

Page 79: Negotiating crawl budget with googlebots

URL IMPORTANCE & CRAWL FREQUENCY TOOLS

• GSC internal links report (URL importance)
• Link Research Tools (strongest sub-pages reports)
• GSC internal links (add site categories and sections as additional profiles)
• PowerMapper
• XML sitemap generators for custom sitemaps
• Crawl frequency clocking (@Johnmu)

Page 80: Negotiating crawl budget with googlebots

SPIDER EYES TOOLS

• GSC crawl stats
• URL Profiler
• DeepCrawl
• Screaming Frog
• Server logs
• SEMrush (auditing tools)
• Webconfs (header responses / similarity checker)
• PowerMapper (bird's-eye view of a site)
• Lynx browser
• Crawl frequency clocking (@Johnmu)

Page 81: Negotiating crawl budget with googlebots

REFERENCES

Efficient Crawling Through URL Ordering (Cho, Garcia-Molina & Page) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
Crawl Optimisation (Blind Five Year Old - A J Kohn - @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
Web Promo Q&A with Google's Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Google Number 1 SEO Advice - Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html

Page 82: Negotiating crawl budget with googlebots

REFERENCES

Internet Live Stats - http://www.internetlivestats.com/total-number-of-websites/
Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313
Managing items in a crawl schedule - Google Patent (Alpert) - http://www.google.ch/patents/US8666964
Document reuse in a search engine crawler - Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897

Page 83: Negotiating crawl budget with googlebots

REFERENCEShttps://www.sistrix.com/blog/how-­‐nordstrom-­‐bested-­‐zappos-­‐on-­‐google/https://www.xml-­‐sitemaps.com/generator-­‐demo/