Australian web domain harvests 2005, 2006 & 2007.

Transcript of Australian web domain harvests 2005, 2006 & 2007.

Page 1: Australian web domain harvests 2005, 2006 & 2007.

Australian web domain harvests 2005, 2006 & 2007

Unique Hosts Collected (bar chart, axis 0 to 1,400,000)

AusCrawl 2005: 811,523
AusCrawl 2006: 1,260,533
AusCrawl 2007: 1,247,614
PANDORA all: 42,936
PANDORA (.au): 10,037

Page 2: Australian web domain harvests 2005, 2006 & 2007.

Igor Ranitovic, Internet Archive engineer, with Petabox rack for Australian domain harvest

Page 3: Australian web domain harvests 2005, 2006 & 2007.

PANDORA : Domain Harvesting

• Australian domain harvest
  – .au domain, located on Australian servers
  – Internet Archive

• 1st harvest June/July 2005
  – 4 weeks, 185m files, 6.69 TB

• 2nd harvest Aug/Sept 2006
  – 5 weeks, 596m files, 19.04 TB

• 3rd harvest Aug/Sept 2007
  – 4 weeks, 516m files, 18.47 TB

Page 4: Australian web domain harvests 2005, 2006 & 2007.

Comparative statistics

PANDORA

Files: 51 million
Size: 2.12 TB

Domain Harvest    2005           2006           2007
Unique files      185,549,662    596,238,990    516,064,820
Hosts crawled     811,523        1,046,038      1,247,614
Size              6.69 TB        19.04 TB       18.47 TB

Domain Harvests

Files: 1,297 million
Size: 44.2 TB
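
The combined domain harvest totals follow from the per-year figures on this slide. A minimal Python sketch of that arithmetic is given here; the variable names are illustrative only and the PANDORA figures are the rounded values quoted on the slide:

```python
# Per-year figures copied from the comparative statistics slide.
harvests = {
    2005: {"files": 185_549_662, "size_tb": 6.69},
    2006: {"files": 596_238_990, "size_tb": 19.04},
    2007: {"files": 516_064_820, "size_tb": 18.47},
}

# PANDORA figures as quoted on the slide (rounded).
pandora_files = 51_000_000
pandora_size_tb = 2.12

# Sum the three harvests to get the combined totals.
total_files = sum(h["files"] for h in harvests.values())
total_size_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

print(f"Total files: {total_files:,}")
print(f"Total size:  {total_size_tb} TB")

# Rough scale comparison against the selective PANDORA archive.
print(f"Files ratio: {total_files / pandora_files:.1f}x")
print(f"Size ratio:  {total_size_tb / pandora_size_tb:.1f}x")
```

The file total comes to 1,297,853,472, which the slide states as 1,297 million, and the sizes sum to 44.2 TB.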

Page 5: Australian web domain harvests 2005, 2006 & 2007.

PANDORA : Domain Harvesting

Size in Terabytes (bar chart)

PANDORA: 1.73
AusCrawl 05: 6.69
AusCrawl 06: 19.04
AusCrawl 07: 18.47

Page 6: Australian web domain harvests 2005, 2006 & 2007.

PANDORA : Domain Harvesting

• Some pros
  – Retains linkages and context
  – Large scale: more bytes for the buck
  – Less selectively discriminating

• Some cons
  – High dependence on the crawler technology
  – Domain and geo-location bias (.au, geoIP)
  – Limitations in timeliness, quality assurance, scoping, site complexity, deep web
  – Legal and access issues to resolve

Page 7: Australian web domain harvests 2005, 2006 & 2007.

PANDORA : Australia’s Web Archive

• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)