NAS_qual reports


NAS_qual - 1

• Java batch that works on Heritrix reports (extracted from metadata W/ARC files)
• Compiles a large set of figures and lists and stores them in text files
• 21 figures:
– processed URLs
– harvested URLs
– harvested seeds
– non-harvested seeds
– harvested hosts
– harvested domains
– non-harvested domains
– TLDs
– MIME types
– harvest duration
– average URL/s
– average Kb/s
– average job size in URLs
– average seeds per job
– average job size
– URLs not harvested because of robots exclusion
– total raw size
– number of W/ARC files
– size of W/ARC files
– number of processed jobs
– list of processed jobs


NAS_qual - 2

• 01-codehttp_url.txt : URL distribution per HTTP response code.
• 02-typemime_url_octets.txt : URL and bytes distribution per MIME type.
• 03-tld_url_octets.txt : URL and bytes distribution per TLD.
• 04-tld-hotes.txt : host distribution per TLD.
• 05-tld-domaines.txt : domain distribution per TLD.
• 06-tranches_hotes_url.txt : number of hosts in a given slice of harvested URLs.
– <=10; 11-100; 101-1000; 1001-10000; 10001-50000; 50001-100000; >=100001
• 07-tranches_domaines_url.txt : same with domains.
• 08-tranches_domaines_hotes.txt : same with hosts on domains.
• 09-tld2ndniveau_url_octets.txt : URL and bytes distribution per second-level TLD.
• 10-tld2ndniveau_hotes.txt : host distribution per second-level TLD.
• 11-top_domaines_url_octets.txt : URL and bytes distribution for the N largest domains.
• 12-top_hotes_url_octets.txt : URL and bytes distribution for the N largest hosts.
• 13-top_domaines_hotes.txt : list of domains having the largest number of hosts.
• 14-codereponse_seeds.txt : distribution of seeds per response code.
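To illustrate how a slice report such as 06-tranches_hotes_url.txt could be computed, here is a minimal sketch in Java. It is not the actual NAS_qual code: the class name `HostSlices`, the method names, and the input map of harvested URL counts per host are all hypothetical. Only the slice boundaries come from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NAS_qual: counts hosts per slice of harvested URLs.
class HostSlices {
    // Inclusive upper bounds of the slices listed for 06-tranches_hotes_url.txt;
    // the last slice (>=100001) is open-ended and therefore has no bound here.
    static final long[] UPPER_BOUNDS = {10, 100, 1000, 10000, 50000, 100000};
    static final String[] LABELS = {
        "<=10", "11-100", "101-1000", "1001-10000",
        "10001-50000", "50001-100000", ">=100001"
    };

    // Returns the index of the slice that a given URL count falls into.
    static int sliceIndex(long urlCount) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (urlCount <= UPPER_BOUNDS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS.length; // open-ended last slice
    }

    // Assumed input: harvested URL count per host (how it is read from the
    // Heritrix reports is outside the scope of this sketch).
    static Map<String, Integer> countHostsPerSlice(Map<String, Long> urlsPerHost) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String label : LABELS) {
            result.put(label, 0); // keep empty slices so the report always has 7 rows
        }
        for (long urlCount : urlsPerHost.values()) {
            result.merge(LABELS[sliceIndex(urlCount)], 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        Map<String, Long> urlsPerHost = Map.of(
            "a.example", 7L,
            "b.example", 250L,
            "c.example", 120000L
        );
        countHostsPerSlice(urlsPerHost)
            .forEach((slice, hosts) -> System.out.println(slice + "\t" + hosts));
    }
}
```

The same bucketing would apply to 07-tranches_domaines_url.txt and 08-tranches_domaines_hotes.txt by changing what the map keys and values represent (URLs per domain, hosts per domain).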