Web Archive Analysis - Rutgers...
![Page 2: Web Archive Analysis - Rutgers Universitywp.comminfo.rutgers.edu/.../07/Vinay_WebResearchIA_June2014_Vinay … · 21 Archive Analysis Workshop Generate derivatives: CDX, WAT, Parsed](https://reader034.fdocuments.in/reader034/viewer/2022052006/601a6b927541750d7c0f2f46/html5/thumbnails/2.jpg)
Access Web Archive Data
Wayback Machine
Search
Enable Research & Analysis
Analysis Workflow
Data: WARC
Data written by web crawlers
Web ARchive Container File - WARC (ISO standard)
Revision of the ARC file format
Each file contains a series of concatenated records
– Full HTTP request/response records
– Metadata records (links, crawler path, encoding etc.)
– Records to store duplicate detection events
– Records to support segmentation and conversion
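The "series of concatenated records" layout can be sketched as follows. This is a minimal illustration, not a full WARC reader: it assumes an uncompressed stream in which each record is a `WARC/` version line, named header fields, a blank line, and a body of `Content-Length` bytes. The `make_record` helper and the sample records are hypothetical, and real files are typically gzip-compressed.

```python
import io

def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The body is exactly Content-Length bytes long
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body

def make_record(record_type, target, payload):
    """Hypothetical helper that builds one sample record for this sketch."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"

sample = (
    make_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("metadata", "http://example.org/", b"outlink: http://example.org/a")
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, body in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```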
Derived Data: CDX
Index for Wayback Machine
Space delimited text file
Contains only essential fields needed by Wayback
– URL, Timestamp, Content Digest
– MIME type, HTTP Status Code
– Redirect URL, meta tags, size
– WARC filename and file offset of record
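A CDX line can be split into the fields listed above with a few lines of Python. The sketch below assumes the common 11-column layout (`N b a m s k r M S V g`); real CDX files declare their own column order in a header line, so the field order here is an assumption. The sample line, including its digest and filename, is made up.

```python
from collections import namedtuple

# Assumed column order; check the header line of the actual CDX file.
CDX_FIELDS = [
    "urlkey", "timestamp", "url", "mime", "status", "digest",
    "redirect", "meta", "size", "offset", "filename",
]
CdxRecord = namedtuple("CdxRecord", CDX_FIELDS)

def parse_cdx_line(line):
    """Split one space-delimited CDX line into named fields."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != len(CDX_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(CDX_FIELDS), len(parts)))
    return CdxRecord(*parts)

# Hypothetical sample line; "-" marks an empty field.
sample_line = (
    "org,example)/ 20140601120000 http://example.org/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example-00000.warc.gz"
)
rec = parse_cdx_line(sample_line)
print(rec.url, rec.timestamp, rec.status, rec.filename, rec.offset)
```

The filename and offset together locate the record inside the WARC, which is what lets a replay tool seek straight to it.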
Wayback Machine
Growth of content
Growth of content
Rate of duplication
Breakdown by Year-First-Crawled
Log Analysis (Hive/Pig/Giraph)
CDX Warehouse
Crawl Log Warehouse
Distribution of HTTP status codes, MIME types
Find timeout errors, duplicate content, crawler traps, robots exclusions
Trace path of the crawler
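The kind of aggregation listed above can be illustrated on a handful of rows. The slides name Hive, Pig and Giraph for this work at scale; plain Python is used here only as a small stand-in, and the rows, their column order and the short digests are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical index rows: (url, timestamp, mime, status, digest)
rows = [
    ("http://example.org/", "20130101000000", "text/html", "200", "AAA"),
    ("http://example.org/", "20140101000000", "text/html", "200", "AAA"),
    ("http://example.org/a", "20140101000000", "text/html", "404", "BBB"),
    ("http://example.org/logo.png", "20140101000000", "image/png", "200", "CCC"),
    ("http://example.org/copy", "20140102000000", "text/html", "200", "AAA"),
]

# Distribution of HTTP status codes and MIME types
status_counts = Counter(status for _, _, _, status, _ in rows)
mime_counts = Counter(mime for _, _, mime, _, _ in rows)

# Duplicate content: captures that share a content digest
captures_by_digest = defaultdict(list)
for url, timestamp, _, _, digest in rows:
    captures_by_digest[digest].append((url, timestamp))
duplicates = {d: caps for d, caps in captures_by_digest.items() if len(caps) > 1}

print(status_counts)
print(mime_counts)
print(duplicates)
```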
Derived Data: Parsed Text
Input to build text indexes for Search
Text is extracted from (W)ARC files
HTML boilerplate is stripped out
Also contains metadata for each record
– URL, Timestamp, Content Digest, Record Length
– MIME type, HTTP status code
– Title, description and meta keywords
– Links with anchor text
Derived Data: WAT
Extensible Metadata format
Essential metadata for many types of analyses
Avoids barriers to data exchange: copyright, privacy
Less data than WARC, more than CDX
WAT records are WARC metadata records
Contains, for every HTML page in the WARC:
– Title, description and meta keywords
– Embeds and outgoing links with alt / anchor text
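Pulling the title, meta tags and links out of one WAT record payload might look like the sketch below. It assumes the payload is a JSON envelope with the key names shown (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Head`, `Links`); treat that layout as an assumption to verify against real WAT files. The sample payload is hand-written.

```python
import json

# Hand-written sample payload; the key layout is an assumption about WAT files.
sample_wat_payload = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {
                        "Title": "Example",
                        "Metas": [
                            {"name": "description", "content": "An example page"},
                            {"name": "keywords", "content": "example, demo"},
                        ],
                    },
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about", "text": "About"},
                        {"path": "IMG@/src", "url": "logo.png", "alt": "Logo"},
                    ],
                }
            }
        },
    }
})

def extract_page_metadata(payload_text):
    """Return title, description, keywords and links from one WAT JSON payload."""
    envelope = json.loads(payload_text)["Envelope"]
    response = envelope["Payload-Metadata"].get("HTTP-Response-Metadata", {})
    html = response.get("HTML-Metadata", {})
    head = html.get("Head", {})
    metas = {m["name"]: m.get("content") for m in head.get("Metas", []) if m.get("name")}
    # Outgoing links carry anchor text; embeds such as images carry alt text
    links = [(l.get("path"), l.get("url"), l.get("text") or l.get("alt"))
             for l in html.get("Links", [])]
    return {
        "url": envelope["WARC-Header-Metadata"].get("WARC-Target-URI"),
        "title": head.get("Title"),
        "description": metas.get("description"),
        "keywords": metas.get("keywords"),
        "links": links,
    }

page = extract_page_metadata(sample_wat_payload)
print(page["title"], page["links"])
```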
Text Analysis (Pig/Mahout)
Text extracted from WARC / Parsed Text / WAT files
Use curated collections to train classifiers
Cluster documents in collections
Topic Modeling
– Discover topics
– Study how topics evolve over time
Compare how a page describes itself (meta text) vs. how other pages linking to it describe it (anchor text)
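One simple way to compare meta text with anchor text is term overlap. The sketch below uses Jaccard similarity over word sets; this measure is an illustrative choice, not something the slides prescribe, and the sample texts are made up.

```python
import re

def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical inputs: a page's own meta description and the anchor
# texts of links pointing at it.
meta_text = "Internet Archive: digital library of free books, movies, music and the Wayback Machine"
anchor_texts = ["Wayback Machine", "the Internet Archive", "old versions of web pages"]

meta_terms = tokens(meta_text)
anchor_terms = set().union(*(tokens(t) for t in anchor_texts))

overlap = jaccard(meta_terms, anchor_terms)
only_in_anchors = anchor_terms - meta_terms  # how others describe the page, but it does not

print(round(overlap, 3), sorted(only_in_anchors))
```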
Link Analysis (Pig/Giraph)
Links extracted from crawl logs / WARC metadata records / Parsed Text / WAT files
Indegree and Outdegree information
Inter-host and Intra-host link information
Study how linking behavior changes over time
Rank resources by PageRank
– Identify important resources
– Prioritize crawling of missing resources
Find possible spam pages by running biased PageRank algorithms
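Degree counts and PageRank can be sketched on a toy edge list. This is a plain power-iteration PageRank with uniform teleportation, written in Python for illustration; the slides run these analyses with Pig and Giraph. The graph, the damping factor of 0.85 and the iteration count are assumptions.

```python
from collections import defaultdict

# Hypothetical directed link graph: (source, destination)
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

def degrees(edges):
    """Return (indegree, outdegree) dictionaries."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return dict(indeg), dict(outdeg)

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank with uniform teleportation."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

indeg, outdeg = degrees(edges)
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

A biased variant would replace the uniform `(1 - damping) / n` term with a teleport distribution concentrated on a chosen set of seed pages.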
Completeness
PageRank over Time
PageRank over Time
PageRank over Time
Archive Analysis Workshop
Generate derivatives: CDX, WAT, Parsed Text
Set up CDX Warehouse using Hive
Extract links from WARCs / WAT / Parsed Text
Extract text from WARCs / WAT / Parsed Text
Generate Archival web graphs
– Assign integer / fingerprint ID to URLs
– Represent graph as an adjacency list using these IDs and the timestamp info
Generate host and domain graphs
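The graph-building steps above can be sketched as follows: assign integer IDs to URLs, store a timestamped adjacency list over those IDs, and collapse URLs to hosts for a host graph. The link tuples are hypothetical. A domain graph would additionally need a public-suffix lookup, which is not shown.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical extracted links: (capture timestamp, source URL, destination URL)
links = [
    ("20130101", "http://a.example/", "http://b.example/x"),
    ("20130101", "http://a.example/", "http://a.example/about"),
    ("20140101", "http://a.example/", "http://b.example/x"),
    ("20140101", "http://b.example/x", "http://c.example/"),
]

url_ids = {}

def url_id(url):
    """Assign the next integer ID the first time a URL is seen."""
    if url not in url_ids:
        url_ids[url] = len(url_ids)
    return url_ids[url]

# adjacency[source_id][timestamp] -> list of destination IDs
adjacency = defaultdict(lambda: defaultdict(list))
for timestamp, src, dst in links:
    src_id = url_id(src)
    dst_id = url_id(dst)
    adjacency[src_id][timestamp].append(dst_id)

# Host graph: count links between distinct hosts
host_edges = defaultdict(int)
for _, src, dst in links:
    src_host, dst_host = urlsplit(src).hostname, urlsplit(dst).hostname
    if src_host != dst_host:
        host_edges[(src_host, dst_host)] += 1

print(url_ids)
print({s: dict(by_ts) for s, by_ts in adjacency.items()})
print(dict(host_edges))
```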
Archive Analysis Workshop
Text Analysis using Pig / Mahout
– Extract top terms using TF-IDF
– Prepare text for analysis with Mahout
Link Analysis with Pig / Giraph
– Degree Analysis
– PageRank
– Find common links between entities
Data extraction
– Repackage subset of data into new (W)ARCs
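Extracting top terms with TF-IDF can be illustrated on a tiny corpus. The sketch uses one common variant (term frequency as count divided by document length, inverse document frequency as the natural log of N over document frequency); the workshop itself uses Pig and Mahout, and the documents here are made up.

```python
import math
import re
from collections import Counter

# Hypothetical extracted text, keyed by document name
docs = {
    "page1": "web archive crawl data web",
    "page2": "archive of movies and music",
    "page3": "crawl logs and crawl errors",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def top_terms(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each document."""
    term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    result = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output
        result[name] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result

top = top_terms(docs)
print(top)
```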
Archive Analysis Workshop
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop