Australian web domain harvests 2005, 2006 & 2007.
![Page 1: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/1.jpg)
Australian web domain harvests 2005, 2006 & 2007
Unique Hosts Collected (bar chart)
• AusCrawl 2005: 811,523
• AusCrawl 2006: 1,260,533
• AusCrawl 2007: 1,247,614
• PANDORA all: 42,936
• PANDORA (.au): 10,037
![Page 2: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/2.jpg)
Igor Ranitovic, Internet Archive engineer, with Petabox rack for Australian domain harvest
![Page 3: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/3.jpg)
PANDORA : Domain Harvesting
• Australian domain harvest
– .au domain, located on Australian servers
– Internet Archive
• 1st harvest June/July 2005
– 4 weeks, 185m files, 6.69 TB
• 2nd harvest Aug/Sept 2006
– 5 weeks, 596m files, 19.04 TB
• 3rd harvest Aug/Sept 2007
– 4 weeks, 516m files, 18.47 TB
![Page 4: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/4.jpg)
Comparative statistics
PANDORA
• Files: 51 million
• Size: 2.12 TB

| Domain Harvest | 2005 | 2006 | 2007 |
|---|---|---|---|
| Unique files | 185,549,662 | 596,238,990 | 516,064,820 |
| Hosts crawled | 811,523 | 1,046,038 | 1,247,614 |
| Size | 6.69 TB | 19.04 TB | 18.47 TB |

Domain Harvests
• Files: 1,297 million
• Size: 44.2 TB
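
The combined Domain Harvests figures can be cross-checked against the per-year rows. The following is a minimal sketch in Python that sums the unique files and sizes for the three harvests; the variable names are illustrative only, and the figures are copied from the comparative statistics on this slide.

```python
# Per-harvest figures copied from the comparative statistics slide.
harvests = {
    2005: {"unique_files": 185_549_662, "size_tb": 6.69},
    2006: {"unique_files": 596_238_990, "size_tb": 19.04},
    2007: {"unique_files": 516_064_820, "size_tb": 18.47},
}

# Sum the three domain harvests.
total_files = sum(h["unique_files"] for h in harvests.values())
# Round to two decimals to avoid floating-point noise in the sum.
total_size_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

print(f"Total unique files: {total_files:,}")
print(f"Total size: {total_size_tb} TB")
```

The sum of unique files is 1,297,853,472, so the slide's figure of 1,297 million appears to be that total truncated to whole millions, and the sizes add up to the stated 44.2 TB.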
![Page 5: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/5.jpg)
PANDORA : Domain Harvesting
Size in Terabytes (chart)
• PANDORA: 1.73
• AusCrawl 05: 6.69
• AusCrawl 06: 19.04
• AusCrawl 07: 18.47
![Page 6: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/6.jpg)
PANDORA : Domain Harvesting
• Some pros
– Retains linkages and context
– Large scale: more bytes for the buck
– Less selectively discriminating
• Some cons
– High dependence on the crawler technology
– Domain and geo-location bias (.au, geoIP)
– Limitations in timeliness, quality assurance, scoping, site complexity, deep web
– Legal and access issues to resolve
![Page 7: Australian web domain harvests 2005, 2006 & 2007.](https://reader035.fdocuments.in/reader035/viewer/2022072010/56649dd95503460f94ace4d5/html5/thumbnails/7.jpg)
PANDORA : Australia’s Web Archive
• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)