Aarhus. BnF main topics – 2013 – crawling side Keep crawling –Broad and focused crawls...

BnF main topics – 2013 – crawling side

• Keep crawling
  – Broad and focused crawls
  – Limit of 100 TB

• Crawl of password-protected content
  – “Press project”: PDFs of daily newspapers
  – Tests with other kinds of content

• Work on direct deposit of e-books

BnF main topics – 2013 – access and preservation sides

• Merging professional and public WB
  – Various optimizations
  – Clickable permalink…

• Draw links between web archives and BnF indexing and promotion tools
  – General catalogue, data.bnf.fr…

• Open access to web archives in regional libraries
  – Legal and technical aspects

• Start ingesting our web archives in our digital repository

Direct deposit for e-books?

• High-level discussions between the National Publishers Union and BnF
  – A better international framework: IFLA statement on legal deposit, FEP/CENL declaration…

• Why not crawl?
  – Better individual indexing of each e-book
  – No DRM problems
  – Direct discussion with publishers

Direct deposit for e-books? / technical side

• A technical layer is available: the extranet for publishers
  – 2011: digital legal deposit forms
  – 2012/3: direct transfer of metadata (ONIX)
  – 2013/4: e-books?

• What do we need to decide?
  – Who will be the main interlocutor?
  – How many and what kind of formats? What validation? Is it possible to refuse?
  – What link between the paper and digital version in the catalogue?
  – What access tool? Gallica or web archives?

RESAW project: some keywords

• Networking (researchers and heritage institutions)

• Standards and collection quality

• Shared tools and services (storage infrastructure, analysis tools, portal)

• Methods and training

RESAW project: interest for BnF

• Promote the use of web archives among researchers

• Help launch international and national research programs

• Offer groundbreaking tools and services

• Get feedback about our collection development policies

• Promote the building and use of web archives among high-level decision makers

Current situation at BnF

• No current research project
  – But the web legal deposit team is involved in research frameworks: “Labex” (“excellence laboratories”)
  – Participation in the “Hypertext corpus initiative framework” (lead: Medialab)

• Relationships with researchers
  – Political sciences (political science institutes in Paris and Grenoble, universities of Nancy and Cergy)
  – Social sciences (university of Paris 1, Grenoble)
  – Net art (Avignon)
  – Web metrics (AFNIC)?

• Relationships with associations (literature, sustainable development…)

International initiatives to follow up

• Collaborative web harvesting
  – EU elections, “Olympics” project, Vaclav Havel collection
  – Use of the “nomination tool” provided by the University of North Texas

• Portal and shared access
  – IIPC website, Memento

• Research project
  – BL/IA/JISC project on .uk analysis
  – 80 TB of data provided by IA
  – Common Crawl project (?)

• Training
  – PhD sponsorship (UNT)

Questions and comments

• The network we dream about!

• Some objectives already (partially) covered by IIPC
  – Standards, interoperability, shared portal

• Legal issues will be very difficult to solve

• Be cautious with the term “quality” (prefer relevance for specific goals?)

• What will you ask for?
  – Money, doctoral students, engineers…