Post on 04-Jan-2016
Web Archiving in the Audiovisual Domain
Julia Vytopil, Nederlands Instituut voor Beeld en Geluid (Netherlands Institute for Sound and Vision)
Web Archiving in the Audiovisual Field
Studiedag webarchivering in Nederland (study day on web archiving in the Netherlands), Hilversum, October 30, 2014
Chloé Martin
chloe@internetmemory.net
http://archivethe.net
What? & Why?
What is a Web archive?
• A copy of a website
• Recorded by a crawler
• At a specific date and time
• Looks and feels like a real website
For Whom?
Any institution whose aim is to collect & preserve web/media material for historical, cultural, heritage or legal (compliance) purposes
Why?
Web content is:
• Pervasive
• Dynamic
• Valuable
• Available in a variety of formats
• Ephemeral
Web Archiving Team
• Put in place a cross-disciplinary team
‣ Curator / Librarian / Archivist
‣ Information system technician
• Train a team
‣ Web archivist / Project manager
‣ Engineer(s) to design & monitor the whole process (for an in-house solution)
• Web archiving requires critical skills and experience, especially from engineers in the case of an in-house solution
How to Improve Selection Policy
IMR value propositions:
• [Topic crawls] Percolable, a tool to discover relevant sources
• [Crawl of active sources] Automated refresh rate
• [Large Crawls] Smart discovery crawl based on topic or language
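One way to picture an automated refresh rate is as an adaptive revisit interval: a source that changed since its last capture is crawled again sooner, an unchanged one later. The sketch below is illustrative only. The function names, the halve/double rule and the bounds are assumptions made for this example, not a description of how IMR's tooling works.

```python
import hashlib

# Assumed bounds for the revisit interval, in hours (hypothetical values).
MIN_INTERVAL_H = 1.0
MAX_INTERVAL_H = 24.0 * 30


def content_digest(body: bytes) -> str:
    """Fingerprint a captured page so two captures can be compared."""
    return hashlib.sha256(body).hexdigest()


def next_interval(current_h: float, old_digest: str, new_digest: str) -> float:
    """Halve the revisit interval if the page changed, double it otherwise."""
    changed = old_digest != new_digest
    proposed = current_h / 2 if changed else current_h * 2
    # Clamp so a source is never hammered and never forgotten.
    return max(MIN_INTERVAL_H, min(MAX_INTERVAL_H, proposed))
```

A source crawled daily that keeps changing would drift toward the hourly bound, while a static page would be revisited less and less often, up to the monthly cap.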
Challenges: Technical issues
• Deep & hidden Web
• Web spam and crawler traps
• Dynamic websites
• Social Web (Twitter, Facebook, YouTube, Flickr, ...)
• Video
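To make the "traps" challenge concrete: a crawler trap is typically a site that generates an endless supply of URLs (calendars, repeating path segments, ever-growing query strings). A minimal sketch of URL-based heuristics follows. The function name and all thresholds are assumptions chosen for illustration; production crawlers combine many more signals.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical thresholds, to be tuned per crawl.
MAX_URL_LENGTH = 512
MAX_PATH_DEPTH = 12
MAX_SEGMENT_REPEATS = 3
MAX_QUERY_PARAMS = 10


def looks_like_trap(url: str) -> bool:
    """Return True if the URL shows a common crawler-trap pattern."""
    if len(url) > MAX_URL_LENGTH:
        return True
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return True
    # The same path segment repeated many times, e.g. /next/next/next/next/
    if any(segments.count(s) > MAX_SEGMENT_REPEATS for s in set(segments)):
        return True
    # An unusually large number of query parameters
    if len(parse_qsl(parts.query, keep_blank_values=True)) > MAX_QUERY_PARAMS:
        return True
    return False
```

A crawler would call such a check before queueing a newly discovered link and simply drop the URLs it flags.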
Access & Search
• Browsing in the archive
• URL
• Full-text search with Elasticsearch
+
• Branding (search, web archive)
• Automatic redirection
• Automated categorization
• Semantic expansion
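The two basic access paths, lookup by URL and full-text search, can be illustrated with a toy in-memory index. This is a sketch under stated assumptions: the class `TinyArchiveIndex` and its methods are hypothetical, and a plain Python inverted index stands in for a real search engine such as Elasticsearch. Timestamps are assumed to be 14-digit strings (YYYYMMDDhhmmss), so they sort chronologically.

```python
import re
from collections import defaultdict


class TinyArchiveIndex:
    """Toy index: captures by URL, plus an inverted index over their text."""

    def __init__(self):
        self.by_url = defaultdict(list)   # url -> [(timestamp, text), ...]
        self.by_term = defaultdict(set)   # term -> {(url, timestamp), ...}

    def add(self, url, timestamp, text):
        """Register one capture of a URL at a given date and time."""
        self.by_url[url].append((timestamp, text))
        for term in set(re.findall(r"\w+", text.lower())):
            self.by_term[term].add((url, timestamp))

    def captures(self, url):
        """URL access: list all capture timestamps for a URL, oldest first."""
        return sorted(ts for ts, _ in self.by_url.get(url, []))

    def search(self, query):
        """Full-text access: captures containing every term of the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.by_term.get(t, set()) for t in terms))
        return sorted(hits)
```

Because every hit is a (URL, timestamp) pair, a search result points to a specific capture rather than to the live page.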
Extract valuable information
From your large corpus, for users / researchers
• Cleaned text
• Keywords, to build a keyword cloud
• Outlinks, to analyze link graphs
• Structure for unstructured data (forums, ...)
• Named entities
• More coming soon...
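Three of the extractions listed above (cleaned text, outlinks, keywords) can be sketched with the Python standard library alone. The names `PageExtractor` and `top_keywords` are hypothetical, and counting word frequencies is a deliberately naive stand-in for real keyword extraction; named entities and forum structuring are not covered here.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect visible text and absolute outlinks from one archived page."""

    SKIPPED = {"script", "style"}  # elements whose content is not visible text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.outlinks.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    @property
    def text(self):
        """Cleaned text: visible text chunks joined by single spaces."""
        return " ".join(self._chunks)


def top_keywords(text, n=3, min_length=4):
    """Most frequent words of at least min_length letters (naive keywords)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_length]
    return [w for w, _ in Counter(words).most_common(n)]
```

The outlinks collected per page are the edges of the link graph, and the keyword counts are the raw material for a keyword cloud.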
About IMR
Internet Memory Research
✓ Spin-off of the Internet Memory Foundation; French start-up founded in 2011
✓ 20+ engineers actively engaged in the Web archiving and information mining field
✓ EU projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP
✓ High-performance, large-scale crawler
✓ Scalable platform based on a distributed architecture and Big Data components (Hadoop, HBase, HDFS, ...)
✓ Innovative infrastructure with low consumption
Any questions?
http://archivethe.net
chloe@internetmemory.net
Twitter: ArchiveTheNet