Heritrix Mobile


Keith Enlow

Introduction
- Heritrix 3.1
- Mobile Finder Web Service
- 2 Options
  - Crawl desktop web pages (default)
  - Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries
- Experiment

Decision Making
- Heritrix
- Web Service (Mobile Finder)

Heritrix
- Modified Heritrix 3.1 to include two options for crawling
  - Option 0: Crawl with desktop user agent
  - Option 1: Crawl with mobile user agent using Mobile Finder
- Added a built-in mobile user agent adapted from Googlebot
- Crawled a small set of URLs
- Used Mobile Finder to find whether the given URL has a mobile version
- Wrote a small script to discover differences between the mobile and desktop versions
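The two crawl options above can be pictured with a minimal sketch. It is not the actual Heritrix modification: the user-agent strings, the `plan_crawl` name, and the `find_mobile` callable (standing in for the Mobile Finder Web Service) are all hypothetical, and falling back to the original URL when no mobile version is reported is an assumption the slides do not confirm.

```python
from typing import Callable, Optional, Tuple

# Hypothetical user-agent strings. The slides only say the mobile agent was
# adapted from Googlebot; these exact values are placeholders.
DESKTOP_UA = "Mozilla/5.0 (compatible; heritrix/3.1 +http://example.org/crawler)"
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) Mobile "
    "(compatible; heritrix/3.1 +http://example.org/crawler)"
)

OPTION_DESKTOP = 0  # Option 0: crawl with desktop user agent
OPTION_MOBILE = 1   # Option 1: crawl with mobile user agent using Mobile Finder


def plan_crawl(
    seed_url: str,
    option: int,
    find_mobile: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (url_to_crawl, user_agent) for one seed URL.

    find_mobile is a stand-in for the Mobile Finder lookup: it returns the
    mobile URL for a desktop URL, or None when no mobile version is known.
    """
    if option == OPTION_DESKTOP:
        return seed_url, DESKTOP_UA
    if option != OPTION_MOBILE:
        raise ValueError(f"unknown crawl option: {option}")

    mobile_url = find_mobile(seed_url)
    # Assumption: with no mobile version reported, keep the original URL
    # but still present the mobile user agent.
    return (mobile_url or seed_url), MOBILE_UA


if __name__ == "__main__":
    lookup = {"www.nasa.gov": "mobile.nasa.gov"}.get
    print(plan_crawl("www.nasa.gov", OPTION_MOBILE, lookup))
```

Passing the lookup in as a callable keeps the sketch independent of how the real web service is queried.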

URLs Crawled

Desktop URL               Mobile URL
www.huffingtonpost.com    m.huffpost.com
www.foxnews.com           foxnews.mobi
www.nbcnews.com           www.nbcnews.com
www.whitehouse.gov        m.whitehouse.gov
www.nasa.gov              mobile.nasa.gov
www.ssa.gov               www.ssa.gov/mobile
www.cornell.edu           m.cornell.edu/#home
www.stanford.edu          m.stanford.edu
www.mit.edu               m.mit.edu / mobile.mit.edu

Redirection/Delivery
- 200 Response (server side redirect)
- 302 Temporary relocation
- 301 Permanent relocation
- JavaScript Redirection (client side redirect)
- Media Queries
- Style Sheets
- Tiny Limits

No JavaScript Engine
- Heritrix is unable to execute JavaScript code
- It is unable to catch client-side redirection and will instead continue to crawl the desktop version of the web page
- Note: The Mobile Finder Web Service will find the mobile page, and therefore Heritrix will continue the crawl
- www.nasa.gov, www.ssa.gov, www.cornell.edu

Desktop vs Mobile

Total Link Count
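The delivery mechanisms listed above can be told apart, roughly, from the HTTP status and a static look at the response body. The sketch below is an illustration only: the `classify_delivery` name and the regular expressions are assumptions, and the text matching is a heuristic that inspects the page source without executing any JavaScript, so it will miss redirects built dynamically.

```python
import re

# Heuristic patterns (assumptions, not taken from the slides).
# Matches assignments such as `window.location = ...` or `location.replace(...)`.
JS_REDIRECT = re.compile(
    r"(?:window\.|document\.)?location(?:\.href)?\s*=[^=]|location\.replace\s*\(",
    re.IGNORECASE,
)
# Matches width-based media queries inside a style sheet.
CSS_MEDIA_QUERY = re.compile(
    r"@media\b[^{]*\(\s*(?:max|min)-(?:device-)?width", re.IGNORECASE
)
# Matches width-based media queries on a <link> element.
LINK_MEDIA_QUERY = re.compile(
    r"<link\b[^>]*\bmedia\s*=\s*[\"'][^\"']*\(\s*(?:max|min)-(?:device-)?width",
    re.IGNORECASE,
)


def classify_delivery(status: int, body: str = "") -> str:
    """Guess how a mobile page is delivered from status code and page source."""
    if status == 301:
        return "301 permanent relocation"
    if status == 302:
        return "302 temporary relocation"
    if status == 200:
        if JS_REDIRECT.search(body):
            return "javascript redirection (client side)"
        if CSS_MEDIA_QUERY.search(body) or LINK_MEDIA_QUERY.search(body):
            return "media queries"
        return "200 response"
    return "other"


if __name__ == "__main__":
    page = '<script>window.location = "http://m.example.org/";</script>'
    print(classify_delivery(200, page))
```

A crawler without a JavaScript engine can flag a page this way but cannot follow the redirect itself, which is where a separate lookup service helps.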

[Chart: total link count, desktop vs. mobile, for Huffington Post, Fox News, NBC News, NASA, SSA, White House, Stanford, Cornell, MIT]

HTML Distribution
[Chart: HTML link count, desktop vs. mobile, for the same nine sites]

JavaScript Distribution
[Chart: JavaScript link count, desktop vs. mobile, for the same nine sites]

CSS Distribution
[Chart: CSS link count, desktop vs. mobile, for the same nine sites]

Image Distribution
[Chart: image link count, desktop vs. mobile, for the same nine sites]
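The per-type distributions above (HTML, JavaScript, CSS, image) could be tallied by a comparison script along the following lines. This is a sketch, not the script used in the experiment: it assumes each crawl is available as a list of (URL, MIME type) pairs, and the function names and category rules are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

CATEGORIES = ("html", "javascript", "css", "image", "other")


def categorize(mime_type: str) -> str:
    """Map a MIME type (optionally with parameters) to a coarse category."""
    mime = mime_type.split(";", 1)[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"
    if mime.endswith("javascript") or mime.endswith("ecmascript"):
        return "javascript"
    if mime == "text/css":
        return "css"
    if mime.startswith("image/"):
        return "image"
    return "other"


def tally(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count crawled resources per category; records are (url, mime_type)."""
    counts = Counter(categorize(mime) for _url, mime in records)
    return {category: counts.get(category, 0) for category in CATEGORIES}


def compare(
    desktop: Iterable[Tuple[str, str]],
    mobile: Iterable[Tuple[str, str]],
) -> Dict[str, Tuple[int, int]]:
    """Return {category: (desktop_count, mobile_count)}."""
    d, m = tally(desktop), tally(mobile)
    return {category: (d[category], m[category]) for category in CATEGORIES}


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    desktop = [("/", "text/html; charset=utf-8"), ("/a.js", "application/x-javascript")]
    mobile = [("/", "text/html"), ("/m.css", "text/css")]
    for category, (d_count, m_count) in compare(desktop, mobile).items():
        print(f"{category:<12}{d_count:>6}{m_count:>6}")
```

Summing the two columns across categories gives the total link count for each version of a site.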

FIN