27 APRIL 2015

STUDYING A NATION’S WEB DOMAIN OVER TIME: ANALYTICAL AND METHODOLOGICAL CONSIDERATIONS

NIELS BRÜGGER, ASSOCIATE PROFESSOR, HEAD OF CENTRE FOR INTERNET STUDIES AND NETLAB, AARHUS UNIVERSITY

DITTE LAURSEN, SENIOR RESEARCHER AND CURATOR, THE DANISH NETARCHIVE

JANNE NIELSEN, RESEARCH ASSISTANT, NETLAB, AARHUS UNIVERSITY


STUDYING A NATION’S WEB DOMAIN OVER TIME

Niels Brügger, Ditte Laursen & Janne Nielsen, 27 APRIL 2015


OVERVIEW OF PRESENTATION

1. The project
› Why study the development of a nation’s web domain?
› How to study the development of a nation’s web domain? An outline of an analytical design
2. Methodological challenges
3. Solutions
4. Results
› Registry of .dk domains
› Corpus creation
5. Next steps


THE PROJECT

What has the entire Danish web looked like in the past, and how has it developed?

What are the methodological challenges in conducting such a study?

What kind of research infrastructure do we need to conduct such a study?


WHY STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN?

› It is an important part of a nation’s cultural heritage

› It is a back cloth for all other types of web entities and activities

› It can identify some of the patterns of the developments of the web and relate them to the web of today


HOW TO STUDY THE DEVELOPMENT OF A NATION’S WEB DOMAIN?

An outline of an analytical design. A gross list of possible ’probes’:
› Size
› Space
› Structure
› Aliveness
› Content


HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? SIZE — BYTES

› How small/big is a nation’s web domain?
› The size of different file types and of file types in general
› How big/small are websites?


HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? SPACE – GEOLOCATION

› Where are websites located?
› Search the text for geographic references, e.g. postcodes in footers
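One way to approach the postcode idea is a simple pattern match on page text. The following is a minimal sketch, assuming that a Danish postcode appears as four digits followed by a capitalised town name; the function name and the sample footer are hypothetical, and a real study would need to validate matches against an official postcode list.

```python
import re

# Assumption: a postcode is four digits followed by a capitalised town name.
# This is a heuristic; it also matches any other four-digit number that
# happens to precede a capitalised word.
POSTCODE_RE = re.compile(r"\b(\d{4})\s+([A-ZÆØÅ][a-zæøå]+)")

def find_postcodes(text):
    """Return (postcode, town) pairs found in a piece of page text."""
    return POSTCODE_RE.findall(text)

# Hypothetical footer text.
footer = "Eksempel A/S, Hovedgaden 1, 8000 Aarhus C. Tlf. 12 34 56 78"
print(find_postcodes(footer))
```

The extracted postcodes could then be mapped to regions to give a rough geographic distribution of websites.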


HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? NETWORKS

Website internal/external hyperlinks
› Are websites closed or open towards the web?
› How flat/deep are websites?

Web domain internal/external hyperlinks
› Centrality based on in-links
› How well-linked is the national web domain to the rest of the web?
› Which other domain names are the most linked-to?
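Centrality based on in-links can be approximated by counting how many links point to each host from other hosts. The sketch below assumes the link structure is available as (source URL, target URL) pairs; the data and function names are hypothetical, and host names stand in for registered domain names (so `www.a.dk` and `a.dk` would be counted separately).

```python
from collections import Counter
from urllib.parse import urlsplit

def host(url):
    """Return the lower-cased host name of a URL ('' if missing)."""
    return (urlsplit(url).hostname or "").lower()

def inlink_counts(links):
    """Count links pointing to each host from other hosts."""
    counts = Counter()
    for source, target in links:
        src, dst = host(source), host(target)
        if src and dst and src != dst:  # skip website-internal links
            counts[dst] += 1
    return counts

# Hypothetical link pairs (source page, link target).
links = [
    ("http://a.dk/index.html", "http://b.dk/"),
    ("http://a.dk/about.html", "http://a.dk/contact.html"),  # internal
    ("http://c.dk/", "http://b.dk/news"),
    ("http://c.dk/", "http://example.com/"),
]
counts = inlink_counts(links)
print(counts.most_common())
```

Filtering the targets by top-level domain would separate links inside the national web domain from links to the rest of the web.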


HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? ALIVENESS – UPDATING

› Domain names: number of new/inactive/disappeared domain names

› Updating: number of web objects that have changed since the last archiving


HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? CONTENT 1

Closedness
› How many websites are password protected?

File and software types
› Which file types are the most prevalent?
› Which software types are the most widespread?

Language
› Does the national language prevail, or do foreign languages?


HOW CAN WE STUDY THE DEVELOPMENT OF A NATION'S WEB DOMAIN? CONTENT 2

Textual elements on webpage
› Background color
› Most used fonts
› Length of webpages
› Placing of menu items (left-aligned and vertical, or top-aligned and horizontal)

Semantics
› Word frequencies
› Where specific issues or topics are to be found, and how they spread


METHODOLOGICAL CHALLENGES

The web of the past is gone

Possible solution: using (national) web archives
› DK: Legal Deposit law effective July 2005
› DK: web material within the ccTLD .dk and websites on other domains aimed at a Danish audience
› DK: 2015: approx. 1 million active domain names within the ccTLD .dk; 583 terabytes

No 1:1 relation between archive and the Danish web domain


METHODOLOGICAL CHALLENGES

No 1:1 relation between Danish national archive and the Danish national web domain

› Not everything has been archived
› Unsystematic, no register, no original to compare with
› Archiving takes time, e.g. the link structure becomes inconsistent
› Deduplication may affect the subsequent use of the archived material
› Archiving strategies may be changed between two archivings
› Parts of domains may be harvested more than once

NETLAB WORKSHOP ON WEB ARCHIVING, 18 MARCH 2015

PARTS OF DOMAINS MAY BE HARVESTED MORE THAN ONCE

[Diagram. Labels: start url; urls at levels 0, 1, 2 and 3; harvester (web crawler/spider); domain A, domain B, domain C …, each shown as a tree of urls.]


METHODOLOGICAL CHALLENGES

› Main harvest: objects within a domain which have been harvested in the job to which the harvest of the domain was assigned

› By-harvest: objects within a domain which have been harvested in another job than the one to which the harvest of the domain was assigned

[Diagram: three harvest jobs (JOB 1, JOB 2, JOB 3). Main harvests (MH) shown for Domains A, B, C, D, E and F; by-harvests (BH) shown as B1, B2 and D1.]
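The two definitions above amount to a simple rule per harvested object. The sketch below assumes that each crawl-log record carries the domain of the object and the ID of the job that fetched it, and that a lookup table from domain to assigned job is available; all names and data are hypothetical.

```python
def classify(record_domain, record_job, assigned_job):
    """Label one harvested object as main harvest ('MH') or by-harvest ('BH').

    assigned_job maps a domain to the job its harvest was assigned to.
    Domains without an assigned job are treated as by-harvest here.
    """
    return "MH" if assigned_job.get(record_domain) == record_job else "BH"

# Hypothetical assignment of domains to jobs.
assigned_job = {"a.dk": 1, "b.dk": 1, "e.dk": 2}

print(classify("b.dk", 1, assigned_job))  # fetched by its own job
print(classify("b.dk", 2, assigned_job))  # fetched while job 2 crawled other domains
```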


SOLUTIONS

Not to use the archive after all
› Use the registry of .dk domains

Corpus creation
› Selection of harvests
› Selection of one version of each domain (consisting of the main harvest and possibly by-harvests)


REGISTRY OF .DK DOMAINS

Size and aliveness: 2006, 2009, 2012, 2015
› What is the total number of domain names over time?
› How many domain names have disappeared compared to the previous year? (and which ones)
› How many domain names have been created compared to the previous year? (and which ones)
› How many domain names have changed hands compared to the previous year? (and which ones)
› What is the relationship between ownership and domains over time? (cf. long tail)
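Most of these questions reduce to set comparisons between two registry snapshots. A minimal sketch, assuming each snapshot can be loaded as a mapping from domain name to owner name; the snapshots and the function name are hypothetical.

```python
def compare_registries(earlier, later):
    """Compare two snapshots mapping domain name -> owner name."""
    created = later.keys() - earlier.keys()
    disappeared = earlier.keys() - later.keys()
    changed_hands = {
        d for d in earlier.keys() & later.keys() if earlier[d] != later[d]
    }
    return created, disappeared, changed_hands

# Hypothetical snapshots.
reg_2012 = {"a.dk": "Owner 1", "b.dk": "Owner 2", "c.dk": "Owner 3"}
reg_2015 = {"a.dk": "Owner 1", "b.dk": "Owner 9", "d.dk": "Owner 4"}

created, disappeared, changed = compare_registries(reg_2012, reg_2015)
print(sorted(created), sorted(disappeared), sorted(changed))
```

Because the comparison is on owner-name strings, a respelled name counts as a change of hands, so the result is an upper bound on real ownership changes.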


RESULTS: REGISTRY OF .DK DOMAINS

Number of domain names over time

Year  Domain names
2005  629,344
2009  973,456
2012  1,163,250
2015  1,277,035


RESULTS: REGISTRY OF .DK DOMAINS

New and disappearing domain names from 2005 to 2015

Period     Created  Disappeared
2005-2009  470,925  126,813
2009-2012  416,081  226,287
2012-2015  369,002  255,217
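The figures in the two charts can be cross-checked against each other: the total at the end of a period should equal the total at the start plus created minus disappeared domain names. The sketch below assumes that the chart values pair with periods and with the Created/Disappeared legend in the order in which they appear on the slides; the variable and function names are illustrative.

```python
# Figures read from the two charts above.
totals = {2005: 629_344, 2009: 973_456, 2012: 1_163_250, 2015: 1_277_035}
created = {(2005, 2009): 470_925, (2009, 2012): 416_081, (2012, 2015): 369_002}
disappeared = {(2005, 2009): 126_813, (2009, 2012): 226_287, (2012, 2015): 255_217}

def net_change_matches(start, end):
    """True if created minus disappeared equals the change in the total."""
    net = created[(start, end)] - disappeared[(start, end)]
    return totals[end] - totals[start] == net

for period in created:
    print(period, net_change_matches(*period))
```

The check holds for all three periods, which supports this reading of the charts.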


RESULTS: REGISTRY OF .DK DOMAINS

Number of domain names which have changed hands over time

• In 2015, 14% of the domains from 2012 had a different owner name

• In both 2012 and 2015, just under 10% of the owners owned 50% of the Danish domains

• An observation: If you own more than three domains you are part of the top 10% of domain owners

Year  Domains    Owners   Anonymous
2012  1,163,250  513,326  46,727
2015  1,277,035  549,978  58,710


RESULTS: REGISTRY OF .DK DOMAINS

Relationship of ownership and domain names over time. Anonymous registrants removed. The chart shows 2012; there is no visual difference between 2012 and 2015.

Parameter  2012   2015
Max        3422   3786
Mean       2.175  2.215
Median     1      1
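The mean values appear consistent with dividing the number of non-anonymous domains by the number of owners. This is a reading of the numbers, not something the slides state: the sketch assumes that the "Anonymous" column in the table of domains, owners and anonymous registrations counts domains with anonymous registrants, and that "Owners" counts only named owners.

```python
# Figures from the table of domains, owners and anonymous registrations.
registry = {
    2012: {"domains": 1_163_250, "owners": 513_326, "anonymous": 46_727},
    2015: {"domains": 1_277_035, "owners": 549_978, "anonymous": 58_710},
}

def mean_domains_per_owner(row):
    """Mean number of non-anonymous domains per owner (assumed definition)."""
    return (row["domains"] - row["anonymous"]) / row["owners"]

for year, row in registry.items():
    print(year, round(mean_domains_per_owner(row), 3))
```

Under these assumptions the computation gives 2.175 for 2012 and 2.215 for 2015, matching the reported means. A median of 1 alongside a maximum in the thousands is the long-tail pattern referred to earlier.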


PRE/POST-STEPS: REGISTRY OF .DK DOMAINS

Pre-steps
› DK Hostmaster has shifted from ISO-8859 to UTF-8
› Earlier attempts at handling the data assumed space-separated data sets when in fact they are fixed-width fields
› Data from DK Hostmaster contains dirt, e.g. tab characters and, in one year, some sort of header

Post-steps
› Ask the same questions for more years (all years, up to four times a year)
› Further investigation of which domains have disappeared
› New questions emerged in the process
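The pre-steps can be illustrated with a small parser for fixed-width records. This is a sketch under stated assumptions, not the project's actual cleaning code: the column widths, the sample record and the function name are hypothetical, and the older files are assumed to be encoded as ISO-8859-1 (Latin-1).

```python
def parse_line(raw, widths, encoding="iso-8859-1"):
    """Split one fixed-width record (bytes) into stripped fields.

    widths is a list of column widths in characters (hypothetical layout).
    Stray tab characters are replaced by spaces so that columns stay aligned.
    """
    text = raw.decode(encoding).replace("\t", " ").rstrip("\r\n")
    fields, start = [], 0
    for width in widths:
        fields.append(text[start:start + width].strip())
        start += width
    return fields

# Hypothetical layout: 20 characters for the domain, 30 for the owner name.
WIDTHS = [20, 30]
raw = ("eksempel.dk".ljust(20) + "Søren Jensen ApS\t\n").encode("iso-8859-1")
print(parse_line(raw, WIDTHS))
```

Splitting the same record on spaces would break the owner name into three fields, which is the kind of error the fixed-width reading avoids. Files from later years would be decoded with `encoding="utf-8"` instead.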


CORPUS CREATION

Collaboration between researchers, curators, developers and management at the archive
› How is a broad crawl performed? (i.e. in several ”steps”)
› When were broad crawls performed?
› How to find the most complete version of a domain within a certain timespan within a broad crawl?
› What do we mean when we talk about a ”web element”, a ”web page”, a ”version” etc.?
› What could a corpus creation algorithm look like?
› How many resources are needed to test and implement a creation of a corpus?


CORPUS CREATION

Use of broad crawls
› Internationally recognized as a suitable web harvesting strategy for national archives
› 2-4 broad crawls each year of all domains from .dk as well as Danish websites published under other extensions
› Comprehensive in nature and consistent over time


CORPUS CREATION

Selection of broad crawls
› Four broad crawls, one from each of the years 2006, 2009, 2012 and 2015 (first crawl of the year)

[Timeline: 2006, 2009, 2012, 2015]


CORPUS CREATION

Selection of one harvested version of each domain
› Domain version from ’main harvest’
› Inclusion of unique materials from the ’by-harvest’ if the material is within our selected time span

[Diagram: three harvest jobs (JOB 1, JOB 2, JOB 3). Main harvests (MH) shown for Domains A, B, C, D, E and F; by-harvests (BH) shown as B1, B2 and D1.]


CORPUS CREATION

Test of the algorithm
› Tested on the first broad crawl from January 2006 (1TB, only websites <10MB)
› This harvest consists of 127 jobs
› Each job consists of several domains
› We produce an 18GB crawl log enhanced with job IDs


CORPUS CREATION

Test of the algorithm
› Using IBM BigInsights we can run the algorithm on this large spreadsheet
› The algorithm locates the objects that are not included in a main harvest (’by-harvests’)
› There might be duplicates; in these cases, the algorithm identifies and selects the objects closest to the time of the main harvest
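The selection step described above could look roughly like the following. This is a sketch only, not the project's implementation (which ran on IBM BigInsights): the record layout (URL, job ID, timestamp), the dates and the function name are hypothetical.

```python
from datetime import datetime

def select_by_harvest(records, main_job, main_time):
    """Pick by-harvested objects to add to a domain's main harvest.

    records: (url, job_id, timestamp) tuples for one domain.
    Returns {url: (job_id, timestamp)} for URLs absent from the main
    harvest, keeping the copy fetched closest in time to main_time.
    """
    in_main = {url for url, job, _ in records if job == main_job}
    chosen = {}
    for url, job, ts in records:
        if url in in_main:
            continue  # already covered by the main harvest
        best = chosen.get(url)
        if best is None or abs(ts - main_time) < abs(best[1] - main_time):
            chosen[url] = (job, ts)
    return chosen

# Hypothetical crawl-log rows for b.dk, whose main harvest ran in job 1.
main_time = datetime(2006, 1, 10)
records = [
    ("http://b.dk/", 1, datetime(2006, 1, 10)),
    ("http://b.dk/logo.gif", 2, datetime(2006, 1, 12)),
    ("http://b.dk/logo.gif", 3, datetime(2006, 1, 25)),
    ("http://b.dk/", 2, datetime(2006, 1, 12)),
]
selected = select_by_harvest(records, main_job=1, main_time=main_time)
print(selected)
```

Here the logo is taken from job 2 because that copy is nearer to the main harvest than the one from job 3, while the front page fetched again in job 2 is skipped. A check that the by-harvest falls within the selected time span would be a further filter on `ts`.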


NEXT STEPS

From test to implementation
› How to get from crawl logs to the material that the crawl logs refer to and that we want to analyze? Should WARC files be opened? Should a subset of an index be used?
› Start making some of the analyses

Dissemination and networking
› Book chapters and papers
› An open workshop in Aarhus, Denmark in 2016 for other national web archives and scholars wanting to do similar projects, aiming at establishing transnational ’best practice’ and analytical design
