Negotiating crawl budget with googlebots
NEGOTIATING CRAWL BUDGET WITH GOOGLEBOTS
USING 'PAGE IMPORTANCE' IN ONGOING CONVERSATION WITH GOOGLEBOT TO GET JUST A BIT MORE THAN YOUR ALLOCATED CRAWL BUDGET
Dawn Anderson @ dawnieando
Another Rainy Day In Manchester
@dawnieando
WTF???
1994 - 1998
“THE GOOGLE INDEX IN 1998 HAD 60 MILLION PAGES” (GOOGLE)
(Source: Wikipedia.org)
2000
“INDEXED PAGES REACHES THE ONE BILLION MARK” (GOOGLE)
“IN OVER 17 MILLION WEBSITES” (INTERNETLIVESTATS.COM)
2001 ONWARDS – ENTER WORDPRESS, DRUPAL CMS', PHP DRIVEN CMS', ECOMMERCE PLATFORMS, DYNAMIC SITES, AJAX
WHICH CAN GENERATE 10,000S, 100,000S OR 1,000,000S OF DYNAMIC URLS ON THE FLY WITH DATABASE 'FIELD BASED' CONTENT
DYNAMIC CONTENT CREATION GROWS
ENTER FACETED NAVIGATION (WITH MANY # PATHS TO SAME CONTENT)
2003 – WE’RE AT 40 MILLION WEBSITES
2003 ONWARDS – USERS BEGIN TO JUMP ON THE CONTENT GENERATION BANDWAGON
LOTS OF CONTENT – IN MANY FORMS
WE KNEW THE WEB WAS BIG… (GOOGLE, 2008)
https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
“1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!”(Jesse Alpert on Google’s Official Blog, 2008)
2008 – EVEN GOOGLE ENGINEERS STOPPED IN AWE
2010 – USER GENERATED CONTENT GROWS
“Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003”
“The real issue is user-generated content.” (Eric Schmidt, 2010 – Techonomy Conference Panel)
SOURCE: http://techcrunch.com/2010/08/04/schmidt-data/
Indexed Web contains at least 4.73 billion pages (13/11/2015)
CONTENT KEEPS GROWING
[Chart: total number of websites, 2000–2014, rising towards 1,000,000,000]
THE NUMBER OF WEBSITES DOUBLED BETWEEN 2011 AND 2012, AND GREW AGAIN BY 1/3 IN 2014
EVEN SIR TIM BERNERS-LEE (Inventor of the www) TWEETED
2014 – WE PASS A BILLION INDIVIDUAL WEBSITES ONLINE
2014 – WE ARE ALL PUBLISHERS
SOURCE: http://wordpress/activity/posting
YUP – WE ALL 'LOVE CONTENT'
IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO?
– A LOT
http://www.internetlivestats.com/total-number-of-websites/
“As of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents” (MANY GOOGLE PATENTS)
CAPACITY LIMITATIONS – EVEN FOR SEARCH ENGINES
Source: Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
“So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-)”
(Jesse Alpert, Google, 2008)
Source: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
NOT ENOUGH TIME
SOME THINGS MUST BE FILTERED
A LOT OF THE CONTENT IS ‘KIND OF THE SAME’
“There’s a needle in here somewhere”
“It’s an important needle too”
WHAT IS THE SOLUTION? HOW HAVE SEARCH ENGINES RESPONDED TO THE CAPACITY LIMITS ON GOOGLE'S CRAWLING SYSTEM?
• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work 'schedules' for Googlebots
“To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling”. – Scheduler for search engine crawler (Zhu et al)
GOOGLE CRAWL SCHEDULER PATENTS INCLUDE:
• 'Managing items in a crawl schedule'
• 'Scheduling a recrawl'
• 'Web crawler scheduler that utilizes sitemaps from websites'
• 'Document reuse in a search engine crawler'
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
• 'Scheduler for search engine'
EFFICIENCY IS NECESSARY
CRAWL BUDGET
1. Crawl Budget – “An allocation of crawl frequency visits to a host (IP LEVEL)”
2. Roughly proportionate to PageRank and host load / speed / host capacity
3. Pages with a lot of links get crawled more
4. The vast majority of URLs on the web don't get a lot of budget allocated to them (low to 0 PageRank URLs).
https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
BUT… MAYBE THINGS HAVE CHANGED?
CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST-LOAD AND PAGERANK ANY MORE
STOP THINKING IT’S JUST ABOUT ‘PAGERANK’
http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s
“You keep focusing on PageRank”…
“There's a shit-ton of other stuff going on” (Gary Illyes, Google, 2016)
THERE’S A LOT OF OTHER THINGS AFFECTING ‘CRAWLING’
Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
WEB PROMOS Q & A WITH GOOGLE'S ANDREY LIPATTSEV
WHY? BECAUSE…
THE WEB GOT ‘MAHOOOOOSIVE’
AND CONTINUES TO GET ‘MAHOOOOOOSIVER’
SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED, BIGGER, BECAME PAGINATED AND SORTED
WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING SO WE CAN FIND IMPORTANT CHANGES QUICKLY
GOOGLEBOT’S TO-DO LIST GOT REALLY BIG
FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED
• Hard and Soft Crawl Limits
• Importance Thresholds
• Min and Max Hints & 'Hint Ranges'
• Importance Crawl Periods
• Scheduling Prioritization
• Tiered Crawling Buckets ('Real Time', 'Daily', 'Base Layer')
SEVERAL PATENTS UPDATED
• 'Managing URLs' (Alpert et al, 2013) (PAGE IMPORTANCE DETERMINING SOFT AND HARD LIMITS ON CRAWLING)
• 'Managing Items in a Crawl Schedule' (Alpert, 2014)
• 'Scheduling a Recrawl' (Auerbach, Alpert, 2013) (PREDICTING CHANGE FREQUENCY IN ORDER TO SCHEDULE NEXT VISIT, EMPLOYING HINTS (MIN & MAX))
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' (INCLUDES EMPLOYING HINTS TO DETECT PAGES 'NOT' TO CRAWL)
(THESE SEEM TO WORK TOGETHER)
MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers / buckets for scheduling:
• Real Time Crawl – crawled multiple times daily
• Daily Crawl – crawled daily or bi-daily
• Base Layer Crawl (most unimportant) – crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled
URLs are moved in and out of layers based on past visits data
CAN WE ESCAPE THE ‘BASE LAYER’ CRAWL BUCKET RESERVED FOR ‘UNIMPORTANT’ URLS?
10 types of Googlebot
SOME OF THE MAJOR SEARCH ENGINE CHARACTERS
History Logs / History Server
The URL Scheduler / Crawl Manager
HISTORY LOGS / HISTORY SERVERS
HISTORY LOGS / HISTORY SERVER – Builds a picture of historical data and past behaviour of the URL and 'importance' score to predict and plan for future crawl scheduling
• Last crawled date
• Next crawl due
• Last server response
• Page importance score
• Collaborates with link logs
• Collaborates with anchor logs
• Contributes info to scheduling
'BOSS' – URL SCHEDULER / URL MANAGER
Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system
JOBS
• Schedules Googlebot visits to URLs
• Decides which URLs to 'feed' to Googlebot
• Uses data from the history logs about past visits (change rate and importance)
• Calculates the importance crawl threshold
• Assigns visit regularity of Googlebot to URLs
• Drops 'max and min hints' to Googlebot to guide on types of content NOT to crawl or to crawl as exceptions
• Excludes some URLs from schedules
• Assigns URLs to 'layers / tiers' for crawling schedules
• Checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
• Budgets are allocated to IPs and shared amongst the domains there
GOOGLEBOT – CRAWLER JOBS
• 'Ranks nothing at all'
• Takes a list of URLs to crawl from the URL Scheduler
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling
• Follows directives (robots) and takes 'hints' when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links and collects content checksums (binary data equivalent of web content) for comparison with past visits by the history and link logs
• Will go beyond the crawl schedule if it finds something more important than the URLs scheduled
WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND ‘REAL TIME’ SCHEDULE ALLOCATION?
CONTRIBUTING FACTORS
1. Page Importance (which may include PageRank)
2. Hints (max and min)
3. Soft limits and hard crawl limits
4. Host load capability & past site performance (speed and access) (IP level and domain level within)
5. Probability / predictability of 'CRITICAL MATERIAL' change + importance crawl period
1 - PAGE IMPORTANCE – Page importance is the importance of a page independent of a query
• Location in site (e.g. home page more important than parameter 3 level output)
• PageRank
• Page type / file type
• Internal PageRank
• Internal backlinks
• In-site anchor text consistency
• Relevance (content, anchors and elements) to a topic (similarity importance)
• Directives from in-page robots and robots.txt management
• Parent quality brushes off on child page quality – IMPORTANT PARENTS ARE LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES
2 - HINTS – 'MIN' HINTS & 'MAX' HINTS
MIN HINT / MIN HINT RANGES
• e.g. Programmatically generated content which changes the content checksum on load
• Unimportant duplicate parameter URLs
• Canonicals
• Rel=next, rel=prev
• HReflang
• Duplicate content
• Spammy URLs?
• Objectionable content
MAX HINT / MAX HINT RANGES
• Change considered 'CRITICAL MATERIAL CHANGE' (useful to users, e.g. availability, price) and/or improved site sections or change to IMPORTANT but infrequently changing content
• Important pages / page range updates
E.G. rel="prev" and rel="next" act as hints to Google, not absolute directives
https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741
3 - HARD AND SOFT LIMITS ON CRAWLING
• A 'soft' crawl limit is set (the original schedule)
• A 'hard' crawl limit is set (e.g. 130% of the schedule) FOR IMPORTANT FINDINGS
If URLs are discovered during crawling that are more important than those scheduled to be crawled, then Googlebot can go beyond its schedule to include these, up to the hard crawl limit
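As a toy illustration of the relationship between the two limits (the 130% figure is the example used above; the variable names and the 1,000-URL schedule are mine, purely for illustration):

```shell
# Toy sketch of the soft/hard crawl limit idea (figures are illustrative).
# The scheduler plans 'soft_limit' URLs; if Googlebot discovers more
# important URLs mid-crawl, it may exceed the schedule up to 'hard_limit'.
soft_limit=1000                        # URLs in the original schedule
hard_limit=$((soft_limit * 130 / 100)) # e.g. 130% of the schedule
echo "$hard_limit" > hard_limit.txt    # up to this many URLs may be crawled
cat hard_limit.txt
```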
4 – HOST LOAD CAPACITY / PAST SITE PERFORMANCE
Googlebot has a list of URLs to crawl
Naturally, if your site is fast, that list can be crawled more quickly
If Googlebot experiences 500s, for example, she will retreat & 'past performance' is noted
If Googlebot doesn't get 'round the list' you may end up with 'overdue' URLs to crawl
5 - CHANGE
• Not all change is considered equal
• There are many dynamic sites with low importance pages changing frequently – SO WHAT
• Constantly changing your page just to get Googlebot back won't work if the page is low importance (crawl importance period < change rate) – POINTLESS
• Hints are employed to determine pages which simply change the content checksum with every visit
• Features are weighted for change importance to the user (e.g. price > colour)
• Change identified as useful to users is considered 'CRITICAL MATERIAL CHANGE'
• Don't just try to randomise things to catch Googlebot's eye
• That counter or clock you added probably isn't going to help you get more attention, nor will random or shuffle
• Change on some types of pages is more important than on other pages (e.g. CNN home page > SME about us page)
FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is high
• Your URL has a high 'importance score'
• Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or 'active' base layer segment (UNIMPORTANT BUT SELECTED)
• Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT)
• Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND THE URL IS IMPORTANT)
• Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs
• Your URL has been 'upgraded' to a daily or real time crawl layer as its importance is detected as raised
• History logs and the URL Scheduler 'learn' together
FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY
• Current capacity of the web crawling system is low
• Your URL has been detected as a 'spam' URL
• Your URL is in an 'inactive' base layer segment (UNIMPORTANT)
• Your URLs are 'tripping hints' built into the system to detect non-critical change dynamic content
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn't get the time to visit your URL
• Your URL has been 'downgraded' to an 'inactive' base layer (UNIMPORTANT) segment
• Your URL has returned an 'unreachable' server response code recently
• In-page robots management or robots.txt send wrong signals
GET MORE CRAWL BY ‘TURNING GOOGLEBOT’S HEAD’ – MAKE YOUR URLs MORE IMPORTANT AND ‘EMPHASISE’ IMPORTANCE
GOOGLEBOT DOES AS SHE'S TOLD – WITH A FEW EXCEPTIONS
• Hard limits and soft limits
• Follows 'min' and 'max' hints
• If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (TO HARD LIMIT)
• You need to IMPRESS Googlebot
• If you 'bore' Googlebot she will return to boring URLs less (e.g. pages that are all the same (duplicate content) or dynamically generated low usefulness content)
• If you 'delight' Googlebot she will return to delightful URLs more (they became more important or they changed with 'CRITICAL MATERIAL CHANGE')
• If she doesn't get her crawl completed you will end up with an 'overdue' list of URLs to crawl
GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE
• Your URL became more important and achieved a higher 'importance score' via increased PageRank
• Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN OWN SITE) relative to other URLs within your site (you emphasised importance)
• You made the URL content more relevant to a topic and improved the importance score
• The parent of your URL became more important (e.g. improved topic relevance (similarity), PageRank or local (in-site) importance metric)
• The 'importance score' of some URLs exceeded the 'importance soft limit threshold', so they are included for crawling and visited up to the point of 'hard limit' crawling (e.g. 130% of scheduled crawling)
HOW DO WE DO THIS?
TO DO - FIND GOOGLEBOT: AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB
grep Googlebot access_log > googlebot_access.txt
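A minimal sketch of that step (log path, log format and output locations are assumptions; the sample lines below stand in for your real access log, so point the command at your server's log instead):

```shell
# Sample access-log lines in combined log format (stand-in for a real log).
cat > access_log <<'EOF'
66.249.66.1 - - [10/May/2016:06:25:24 +0000] "GET /category/shoes HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.9 - - [10/May/2016:06:25:30 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
66.249.66.1 - - [10/May/2016:06:26:02 +0000] "GET /old-page HTTP/1.1" 404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Keep only the Googlebot lines for analysis.
grep 'Googlebot' access_log > googlebot_access.txt

# To automate, a crontab entry along these lines (paths are hypothetical):
# 0 2 * * * grep 'Googlebot' /var/log/apache2/access_log > /home/user/googlebot_access.txt
```

Note that anything can claim to be Googlebot in the user-agent string, so for serious analysis verify the requesting IPs as well.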
ANALYSE THE LOGS
LOOK THROUGH SPIDER EYES – PREPARE TO BE HORRIFIED
• Incorrect URL header response codes
• 301 redirect chains
• Old files or XML sitemaps left on the server from years ago
• Infinite / endless loops (circular dependency)
• On parameter driven sites, URLs crawled which produce the same output
• AJAX content fragments pulled in alone
• URLs generated by spammers
• Dead image files being visited
• Old CSS files still being crawled and loading EVERYTHING
• You may even see 'mini' abandoned projects within the site
• Legacy URLs generated by long-forgotten .htaccess regex pattern matching
• Googlebot hanging around in your 'ever-changing' blog but nowhere else
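One way to start quantifying those horrors is to tally the response codes Googlebot is actually receiving. A sketch (sample log lines inline; field positions assume common/combined log format, where the status code is field 9):

```shell
# Stand-in googlebot_access.txt -- replace with your real filtered log.
cat > googlebot_access.txt <<'EOF'
66.249.66.1 - - [10/May/2016:06:25:24 +0000] "GET /category/shoes HTTP/1.1" 200 5120
66.249.66.1 - - [10/May/2016:06:26:02 +0000] "GET /old-page HTTP/1.1" 404 512
66.249.66.1 - - [10/May/2016:06:27:13 +0000] "GET /old-url HTTP/1.1" 301 0
66.249.66.1 - - [10/May/2016:06:28:44 +0000] "GET /category/shoes?sort=price HTTP/1.1" 200 5120
EOF

# Count each status code Googlebot received (field 9 in common log format).
awk '{codes[$9]++} END {for (c in codes) print c, codes[c]}' googlebot_access.txt \
  | sort > status_summary.txt
cat status_summary.txt
```

A large share of 301s, 404s or 5xx in the summary is crawl budget being burned on anything but your important content.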
URL CRAWL FREQUENCY 'CLOCKING'
Spreadsheet provided by @johnmu during a Webmaster Hangout – https://goo.gl/1pToL8
Identify your 'real time', 'daily' and 'base layer' URLs – ARE THEY THE ONES YOU WANT THERE? WHAT IS BEING SEEN AS UNIMPORTANT?
NOTE GOOGLEBOT
Do you recognise all the URLs and URL ranges that are appearing? If not… why not?
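A rough shell approximation of that 'clocking' idea, counting how often each URL is fetched per day from the filtered log (sample data inline; a URL fetched several times a day is behaving like a 'real time' or 'daily' layer URL, one fetched rarely like a 'base layer' URL):

```shell
# Stand-in filtered log -- replace with your real googlebot_access.txt.
cat > googlebot_access.txt <<'EOF'
66.249.66.1 - - [10/May/2016:06:25:24 +0000] "GET / HTTP/1.1" 200 5120
66.249.66.1 - - [10/May/2016:18:11:09 +0000] "GET / HTTP/1.1" 200 5120
66.249.66.1 - - [10/May/2016:06:26:02 +0000] "GET /about HTTP/1.1" 200 2048
EOF

# $4 holds "[day/month/year:hh:mm:ss"; keep the date part. $7 is the URL.
awk '{gsub(/\[/, "", $4); split($4, t, ":"); print t[1], $7}' googlebot_access.txt \
  | sort | uniq -c | sort -rn > crawl_frequency.txt
cat crawl_frequency.txt
```

The most frequently 'clocked' URLs rise to the top; anything important that never appears near the top deserves investigation.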
IMPROVE & EMPHASISE PAGE IMPORTANCE
• Cross modular internal linking
• Canonicalization
• Important URLs in XML sitemaps
• Anchor text target consistency (but not spammy repetition of anchors everywhere (it's still output))
• Internal links in right descending order – emphasise IMPORTANCE
• Reduce boiler plate content and improve relevance of content and elements to the specific topic (if category) / product (if product page) / subcategory (if subcategory)
• Reduce duplicate content parts of the page to allow primary targets to take 'IMPORTANCE'
• Improve parent pages to raise the IMPORTANCE reputation of the children rather than over-optimising the child pages and cannibalising the parent
• Improve content to be more 'relevant' to a topic to increase 'IMPORTANCE' and get reassigned to a different crawl layer
• Flatten 'architectures'
• Avoid content cannibalisation
• Link relevant content to relevant content
• Build strong highly relevant 'hub' pages to tie together strength & IMPORTANCE
EMPHASISE IMPORTANCE WISELY
USE CUSTOM XML SITEMAPS
E.G. XML UNLIMITED SITEMAP GENERATOR
PUT IMPORTANT URLS IN HERE
IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED
KEEP CUSTOM SITEMAPS ‘CURRENT’ AUTOMATICALLY
AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS
IT’S NOT AS TECHNICAL AS YOU MAY THINK – USE WEB CRON JOBS
BE ‘PICKY’ ABOUT WHAT YOU INCLUDE IN XML SITEMAPS
EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE
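A hand-rolled sketch of such a custom sitemap (the file names and example.com URLs are hypothetical): generators do this for you, but the output is just XML that you can also build from a curated URL list and refresh on a cron schedule:

```shell
# Curated list of IMPORTANT URLs only -- not every URL on the site.
cat > important_urls.txt <<'EOF'
https://www.example.com/
https://www.example.com/category/shoes
https://www.example.com/category/boots
EOF

# Emit a minimal sitemap with today's date as <lastmod>.
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while read -r url; do
    printf '  <url><loc>%s</loc><lastmod>%s</lastmod></url>\n' "$url" "$(date +%Y-%m-%d)"
  done < important_urls.txt
  echo '</urlset>'
} > sitemap-important.xml
```

Run from cron (or a web cron service) so the `<lastmod>` values stay honest, and the 'important URLs' sitemap stays current automatically.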
IF YOU CAN'T IMPROVE – EXCLUDE (VIA NOINDEX) FOR NOW
• You're out for now
• When you improve you can come back in
• Tell Googlebot quickly that you're out (via temporary XML sitemap inclusion)
• But 'follow', because there will be some relevance within these URLs
• Include again when you've improved
• Don't try to canonicalize me to something in the index
OR REMOVE – 410 GONE (IF IT'S NEVER COMING BACK)
http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo
EMBRACE THE ‘410 GONE’
There's Even A Song About It
#BIGSITEPROBLEMS – LOSE THE INDEX BLOAT
LOSE THE BLOAT TO INCREASE THE CRAWL
The number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Image Credit: Buzzfeed
Creating 'thin' content and even more URLs to crawl
#BIGSITEPROBLEMS - LOSE THE CRAZY TAG MAN
Most Important Page 1
Most Important Page 2
Most Important Page 3
IS THIS YOUR BLOG?? HOPE NOT
#BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED
IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING – LOCAL IB(P) – INTERNAL BACKLINKS
Optimize Everything: I must optimize ALL the pages across a category's descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page, and confuse crawlers as to which is the important one. I'll put them all in a sitemap as standard too, just for good measure.
Image Credit: Buzzfeed
HOW CAN SEARCH ENGINES KNOW WHICH IS MOST IMPORTANT TO A TOPIC IF 'EVERYTHING' IS IMPORTANT??
#BIGSITEPROBLEMS - WARNING SIGNS – LOSE THE ‘MISTER OVER-OPTIMIZER’
‘OPTIMIZE ALL THE THINGS’
Duplicate Everything: I must have a massive boiler plate area in the footer, identical sidebars and a massive mega menu with all the same output sitewide. I'll put very little unique content into the page body, and it will also look very much like its parents and grandparents too. From time to time I'll outrank my parent and grandparent pages but 'Meh'…
Image Credit: Buzzfeed
HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME??
#BIGSITEPROBLEMS - WARNING SIGNS – LOSE THE ‘MISTER DUPLICATER’
‘DUPLICATE ALL THE THINGS’
IMPROVE SITE PERFORMANCE - HELP GOOGLEBOT GET THROUGH THE ‘BUCKET LIST’ – GET FAST AND RELIABLE
Avoid wasting time on ‘overdue-‐URL’ crawling (E.G. Send correct response codes, speed up your site, etc)
US 8,666,964 B1
Added to Cloudflare CDN: ½ the load time, > 2 x page crawls p/day
GOOGLEBOT GOES WHERE THE ACTION IS
USE ‘ACTION’ WISELY
DON’T TRY TO TRICK GOOGLEBOT BY FAKING ‘FRESHNESS’ ON LOW IMPORTANCE PAGES – GOOGLEBOT WILL REALISE
UPDATE IMPORTANT PAGES OFTEN
NURTURE SEASONAL URLs TO GROW IMPORTANCE WITH FRESHNESS (regular updates) & MATURITY (HISTORY)
DON’T TURN GOOGLEBOT’S HEAD INTO THE WRONG PLACES
Image Credit: Buzzfeed
’GET FRESH’ AND STAY ‘FRESH’
‘BUT DON’T TRY TO FAKE FRESH & USE FRESH WISELY’
IMPROVE TO GET THE HARD LIMITS ON CRAWLING
By improving your URL importance on an ongoing basis via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring, you can get to the 'hard limit' or get visited more generally
CAN IMPROVING YOUR SITE HELP TO ‘OVERRIDE’ SOFT LIMIT CRAWL PERIODS SET?
YOU THINK IT DOESN’T MATTER… RIGHT?
YOU SAY…
” GOOGLE WILL WORK IT OUT”
”LET’S JUST MAKE MORE CONTENT”
WRONG – ‘CRAWL TANK’ IS UGLY
WRONG – CRAWL TANK CAN LOOK LIKE THIS
SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT (EITHER DUMPING A NEW ‘THIN’ PARAMETER INTO A SITE OR INFINITE LOOP (CODING ERROR) (SPIDER TRAP))
WHAT’S WORSE THAN AN INFINITE LOOP?
‘A LOGICAL INFINITE LOOP’
IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING ‘JUNK’ OR EVEN WORSE PULLING LOGIC TO CRAWLERS BUT NOT HUMANS
WRONG – SITE DROWNED IN ITS OWN SEA OF UNIMPORTANT URLS
VIA 'EXPONENTIAL URL UNIMPORTANCE'
Your URLs are exponentially confirmed unimportant with each iterative crawl visit to other similar or duplicate content checksum URLs. Fewer and fewer internal links and 'thinner and thinner' relevant content.
MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all competing in-site URLs, with no dominant single relevant IMPORTANT URL
WRONG – ‘SENDING WRONG SIGNALS TO GOOGLEBOT’ COSTS DEARLY
(Source: Sistrix)
“2015 was the year where website owners managed to be mostly at fault, all by themselves” (Sistrix 2015 Organic Search Review, 2016)
WRONG - NO-ONE IS EXEMPT
(Source: Sistrix)
“It doesn’t matter how big your brand is if you ‘talk to the spider’ (Googlebot) wrong ” – You can still ‘tank’
WRONG – GOOGLE THINKS SEOS SHOULD UNDERSTAND CRAWL BUDGET
”EMPHASISE IMPORTANCE”
“Make sure the right URLs get on Googlebot's menu and increase URL importance to build Googlebot's appetite for your site more”
Dawn Anderson @ dawnieando
SORT OUT CRAWLING
TWITTER – @dawnieando
GOOGLE+ – +DawnAnderson888
LINKEDIN – msdawnanderson
THANK YOU
Dawn Anderson @dawnieando
CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER – LIKES & DISLIKES
LIKES
• Going 'where the action is' in sites
• The 'need for speed'
• Logical structure
• Correct 'response' codes
• XML sitemaps with important URLs
• 'Successful' crawl visits
• 'Seeing everything' on a page
• Taking MAX 'hints'
• Clear unique single 'URL fingerprints' (no duplicates)
• Predicting likelihood of 'future change'
• Finding 'more' important content worth crawling
DISLIKES
• Slow sites
• Too many redirects
• Being bored (Meh) (MIN 'hints' are built in by the search engine systems – takes 'hints')
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (infinite loops)
• Spam URLs
• Crawl wasting minor content change URLs
• 'Hidden' and blocked content
• Uncrawlable URLs
CHANGE IS KEY
• Not just any change – critical material change
• Predicting future change
• Dropping 'hints' to Googlebot
• Sending Googlebot where 'the action is'
• Not just page change designed to catch Googlebot's eye with no added value
FIX GOOGLEBOT'S JOURNEY
SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
TECHNICAL 'FIXES'
• Speed up your site
• Implement compression, minification, caching
• Fix incorrect header response codes
• Fix nonsensical 'infinite loops' generated by database driven parameters or 'looping' relative URLs
• Use absolute versus relative internal links
• Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
• Ensure no CSS or JavaScript files are blocked from crawlers
• Unpick 301 redirect chains
• Consider using a CDN such as Cloudflare
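On the CSS/JavaScript point, a hypothetical robots.txt fragment making rendering assets explicitly crawlable might look like this (Google supports `*` wildcards in `Allow`/`Disallow` rules; the paths here are illustrative, not a template for any real site):

```
User-agent: Googlebot
# Make sure rendering assets are not blocked -- Googlebot needs them
# to 'see everything' on a page.
Allow: /*.css
Allow: /*.js
# Example of a genuinely low-value path to keep out of the crawl:
Disallow: /search-results/
```

The safer check is the other direction: audit your existing robots.txt for `Disallow` rules that accidentally catch CSS or JS directories.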
IMPLEMENTATION OF CONTENT DELIVERY NETWORK
Minimise 301 redirects
Minimise canonicalisation
Use ‘if modified’ headers on low importance ‘hygiene’ pages
Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
Noindex low search volume or near duplicate URLs (use noindex directive on robots.txt)
Use 410 ‘gone’ headers on dead URLs liberally
Revisit .htaccess file and review legacy pattern matched 301 redirects
Combine CSS and javascript files
Use minification, compression and caching
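For the '410 Gone' advice above, a hypothetical Apache `.htaccess` fragment might look like this (paths are illustrative; `Redirect gone` is mod_alias, and the `[G]` flag is mod_rewrite):

```
# Single dead URL: serve 410 Gone rather than 404 or a soft redirect.
Redirect gone /discontinued-product

# Whole retired section, pattern matched:
RewriteEngine On
RewriteRule ^old-season-2014/ - [G,L]
```

A 410 tells crawlers the removal is deliberate and permanent, which is a clearer signal than a 404 for URLs that are never coming back.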
FIX GOOGLEBOT’S JOURNEY
SAVE BUDGET / EMPHASISE IMPORTANCE
£
Revisit ‘Votes for self ’ via internal links in GSC
Clear ‘unique’ URL fingerprints
Improve whole site sections / categories
Use XML sitemaps for your important URLs (don’t put everything on it)
Use ‘mega menus’ (very selectively) to key pages
Use ‘breadcrumbs’
Build ‘bridges’ and ‘shortcuts’ via html sitemaps and ‘cross modular’ ‘related’ internal linking to key pages
Consolidate (merge) important but similar content (e.g. merge FAQs or ‘low search volume’ content into other relevant pages)
Consider flattening your site structure so ‘importance’ flows further
Reduce internal linking to lower priority URLs
BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES
Not just any change – Critical material change
Keep the ‘action’ in the key areas -‐ NOT JUST THE BLOG
Use ‘relevant ‘supplementary content to keep key pages ‘fresh’
Remember min crawl ‘hints’
Regularly update key IMPORTANT content
Consider ‘updating’ rather than replacing seasonal content URLs (e.g. annual events). Append and update.
Build ‘dynamism’ and ‘interactivity’ into your web development (sites that ‘move’ win)
Keep working to improve and make your URLs more important
GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE (AS LONG AS THOSE URLS ARE NOT UNIMPORTANT)
TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
EMPHASISE PAGE IMPORTANCE – TRAIN ON CHANGE
SAVINGS, CHANGE & SPEED TOOLS
• GSC Index levels (over indexation checks)
• GSC Crawl stats
• Last Accessed Tools (versus competitors)
• Server logs
• Keyword Tools
SAVINGS & CHANGE
SPEED
• Yslow
• Pingdom
• Google Page Speed Tests
• Minification – JS Compress and CSS Minifier
• Image Compression –Compressjpeg.com, tinypng.com
• Content Delivery Networks (e.g. Cloudflare)
URL IMPORTANCE & CRAWL FREQUENCY TOOLS
• GSC Internal links Report (URL importance)
• Link Research Tools (Strongest sub pages reports)
• GSC Internal links (add site categories and sections as additional profiles)
• Powermapper
• XML Sitemap Generators for custom sitemaps
• Crawl Frequency Clocking (@Johnmu)
URL IMPORTANCE
SPIDER EYES TOOLS
• GSC Crawl Stats
• URL Profiler
• Deepcrawl
• Screaming Frog
• Server Logs
• SEMRush (auditing tools)
• Webconfs (header responses / similarity checker)
• Powermapper (bird's eye view of site)
• Lynx Browser
• Crawl Frequency Clocking (@Johnmu)
SPIDER EYES
REFERENCES
Efficient Crawling Through URL Ordering (Page et al) – http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
Crawl Optimisation (Blind Five Year Old – A J Kohn – @ajkohn) – http://www.blindfiveyearold.com/crawl-optimization
Scheduling a recrawl (Auerbach) – http://www.google.co.uk/patents/US8386459
Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) – http://www.google.co.uk/patents/US8042112
Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) – https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
Crawl Data Aggregation Propagation (Mueller) – https://goo.gl/1pToL8
Matt Cutts Interviewed By Eric Enge – https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
Web Promo Q and A with Google's Andrey Lipattsev – https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Google Number 1 SEO Advice – Be Consistent – https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
Internet Live Stats – http://www.internetlivestats.com/total-number-of-websites/
Managing items in crawl schedule, Google Patent (Alpert) – http://www.google.ch/patents/US8666964
Document reuse in a search engine crawler, Google Patent (Zhu et al) – https://www.google.com/patents/US8707312
Web crawler scheduler that utilizes sitemaps (Brawer et al) – http://www.google.com/patents/US8037054
Distributed crawling of hyperlinked documents (Dean et al) – http://www.google.co.uk/patents/US7305610
Minimizing visibility of stale content (Carver) – http://www.google.ch/patents/US20130226897
How Nordstrom Bested Zappos On Google (Sistrix) – https://www.sistrix.com/blog/how-nordstrom-bested-zappos-on-google/
XML Sitemaps Generator – https://www.xml-sitemaps.com/generator-demo/