The Elephant in the Library - Integrating Hadoop
SCAPE
Clemens Neudecker (@cneudecker), Sven Schlarb (@SvenSchlarb)
The Elephant in the Library: Integrating Hadoop
Contents
1. Background: Digitization of cultural heritage
2. Numbers: Scaling up!
3. Challenges: Use cases and scenarios
4. Outlook
1. Background
“The digital revolution is far more significant than the invention of writing or even of printing”
Douglas Engelbart
Then
Our libraries
• The Hague, Netherlands
• Founded in 1798
• 120,000 visitors per year
• 6 million documents
• 260 FTE
www.kb.nl
• Vienna, Austria
• Founded in the 14th century
• 300,000 visitors per year
• 8 million documents
• 300 FTE
www.onb.ac.at
Digitization
Libraries are rapidly transforming from physical…
to digital…
Transformation
Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk
Now
Digital Preservation
Our data – cultural heritage
• Traditionally
  • Bibliographic and other metadata
  • Images (portraits/pictures, maps, posters, etc.)
  • Text (books, articles, newspapers, etc.)
• More recently
  • Audio/Video
  • Websites, blogs, Twitter, social networks
  • Research data/raw data
  • Software? Apps?
2. Numbers
“A good decision is based on knowledge and not on numbers”
Plato, 400 BC
Numbers (I) National Library of the Netherlands
• Digital objects
  • > 500 million files
  • 18 million digital publications (+ 2M/year)
  • 8 million newspaper pages (+ 4M/year)
  • 152,000 books (+ 100k/year)
  • 730,000 websites (+ 170k/year)
• Storage
  • 1.3 PB (currently 458 TB used)
  • Growing approx. 150 TB a year
Numbers (II) Austrian National Library
• Digital objects
  • 600,000 volumes being digitised during the next years (currently 120,000 volumes, 40 million pages)
  • 10 million newspapers and legal texts
  • 1.16 billion files in the web archive, from > 1 million domains
  • Several 100,000 images and portraits
• Storage
  • 84 TB
  • Growing approx. 15 TB a year
Numbers (III)
• Google Books Project
  • 2012: 20 million books scanned (approx. 7,000,000,000 pages)
  • www.books.google.com
• Europeana
  • 2012: 25 million digital objects
  • All metadata licensed CC0
  • www.europeana.eu/portal
Numbers (IV)
• HathiTrust
  • 3,721,702,950 scanned pages
  • 477 TB
  • www.hathitrust.org
• Internet Archive
  • 245 billion web pages archived
  • 10 PB
  • www.archive.org
Numbers (V)
• What can we expect?
  • ENUMERATE 2012: only about 4% digitised so far
  • Strong growth of born-digital information
Sources: security.networksasia.net, www.idc.com
3. Challenges
“What do you do with a million books?” Gregory Crane, 2006
Making it scale
Scalability in terms of …
• size
• number
• complexity
• heterogeneity
SCAPE
• SCAPE = SCAlable Preservation Environments
• €8.6M EU funding, Feb 2011 – July 2014
• 20 partners from the public sector, academia and industry
• Main objectives:
  • Scalability
  • Automation
  • Planning
www.scape-project.eu
Use cases (I)
• Document recognition: from image to XML
• Business case:
  • Better presentation options
  • Creation of eBooks
  • Full-text indexing
Use cases (II)
• File type migration: JP2k ↔ TIFF
• Business case:
  • Originally migration to JP2k to reduce storage costs
  • Reverse process used in case JP2k becomes obsolete
Use cases (III)
• Web archiving: characterization of web content
• Business case:
  • What is in a top-level domain?
  • What is the distribution of file formats?
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
xkcd.com/688
Use cases (IV)
• Digital Humanities: making sense of the millions
• Business case:
  • Text mining & NLP
  • Statistical analysis
  • Semantic enrichment
  • Visualizations
Source: www.open.ac.uk/
Enter the Elephants…
Source: Biopics
Experimental Cluster
Components: Apache Tomcat web application hosting the Taverna Server (REST API), Hadoop JobTracker, file server, and the Hadoop cluster as execution environment.
Scenarios (I) Log file analysis
• Metadata log files generated by the web crawler during the harvesting process (no MIME type identification – just the MIME types returned by the web server)
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
→ Run file type identification on archived web content
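Before running any identification tool, the server-reported types can be tallied directly from such crawl-log lines to get a first (unreliable) profile. A minimal sketch in Java; the field position is read off the sample above, and the class name CrawlLogMimeCount is purely illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Tally the MIME types reported by the web servers in a crawl log.
// Assumes the format shown above: the 6th whitespace-separated field
// holds the server-reported MIME type.
public class CrawlLogMimeCount {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length > 5) {
                    counts.merge(fields[5], 1, Integer::sum);
                }
            }
        }
        counts.forEach((mime, n) -> System.out.println(mime + "\t" + n));
    }
}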
Scenarios (II) Web archiving: File format identification
Workflow: the (W)ARC container (holding e.g. JPG, GIF, HTM, HTM and MID records) is read by a (W)ARC RecordReader based on the Heritrix web crawler's (W)ARC read/write code; a MapReduce job then runs Apache Tika on each record to detect its MIME type (e.g. JPG → image/jpg) and aggregates the counts:
image/jpg 1
image/gif 1
text/html 2
audio/midi 1
→ Using MapReduce to calculate statistics
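A minimal sketch of what this detection step can look like in Java MapReduce, assuming a record reader (such as the Heritrix-based one above) hands each map call the record URL as key and the raw payload as value; Apache Tika does the MIME identification and the reducer sums the per-type counts. This illustrates the pattern and is not the project's actual code:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.tika.Tika;

// Map: detect the MIME type of one archived record with Apache Tika.
public class MimeDetectMapper
        extends Mapper<Text, BytesWritable, Text, LongWritable> {
    private final Tika tika = new Tika();
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(Text recordUrl, BytesWritable payload, Context context)
            throws IOException, InterruptedException {
        String mime = tika.detect(new ByteArrayInputStream(payload.copyBytes()));
        context.write(new Text(mime), ONE);
    }
}

// Reduce: sum the per-record counts into one total per MIME type.
class MimeCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text mime, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        context.write(mime, new LongWritable(sum));
    }
}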
Scenarios (II) Web archiving: File format identification
Tool comparison chart: TIKA 1.0 vs. DROID 6.01
Scenarios (III) File format migration
• Risk of format obsolescence
• Quality assurance:
  • File format validation
  • Original/target image comparison
• Imagine a runtime of 1 minute per image for 200 million pages ...
• Parallel execution of file format validation using a Mapper (see the sketch after this list)
  • Jpylyzer (Python)
  • Jhove2 (Java)
• Feature extraction requires sharing resources between processing steps
• Challenge to model more complex image comparison scenarios, e.g. book page duplicates detection or digital book comparison
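As a rough illustration of running an existing validation tool inside the map phase, the sketch below wraps jpylyzer in a Java mapper. It assumes jpylyzer is installed on every worker node and that each map input line is the path to one JP2 file; the real SCAPE wrappers are more elaborate:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: run jpylyzer on one JP2 file and emit whether it validated.
public class JpylyzerValidationMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text jp2Path, Context context)
            throws IOException, InterruptedException {
        String path = jp2Path.toString().trim();
        Process p = new ProcessBuilder("jpylyzer", path)
                .redirectErrorStream(true)
                .start();
        StringBuilder xml = new StringBuilder();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                xml.append(line);
            }
        }
        p.waitFor();
        // jpylyzer reports validity in its XML output, e.g. <isValidJP2>True</isValidJP2>
        boolean valid = xml.toString().contains("<isValidJP2>True</isValidJP2>");
        context.write(new Text(path), new Text(valid ? "valid" : "invalid"));
    }
}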
Scenarios (IV) Book page analysis
Reading image metadata (Jp2PathCreator → HadoopStreamingExiftoolRead)
Create a text file containing the JPEG2000 input file paths and read the image metadata using Exiftool via the Hadoop Streaming API.
Path list (1.4 GB), created with find, reading the files from NAS:
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...
Output per page (1.2 GB):
Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...
Runtime: ~5 h + ~38 h = ~43 h (60,000 books, 24 million pages)
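The Streaming step above boils down to: take one JP2 path per input line, call Exiftool, and emit "bookId/pageId <tab> width". The same idea expressed as a plain Java mapper, as a sketch only; it assumes exiftool is available on the workers, and the class name ExiftoolWidthMapper is illustrative rather than the project's HadoopStreamingExiftoolRead:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: read one JP2 path per input line, ask exiftool for the image width,
// and emit "bookId/pageId <tab> width" as shown on the slide.
public class ExiftoolWidthMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text pathLine, Context context)
            throws IOException, InterruptedException {
        String path = pathLine.toString().trim();   // e.g. /NAS/Z119585409/00000001.jp2
        Process p = new ProcessBuilder("exiftool", "-s3", "-ImageWidth", path).start();
        String width;
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            width = out.readLine();                 // -s3 prints the bare tag value
        }
        p.waitFor();
        // Derive the "bookId/pageId" key from the path, dropping /NAS/ and .jp2
        String key = path.replaceFirst("^/NAS/", "").replaceFirst("\\.jp2$", "");
        if (width != null) {
            context.write(new Text(key), new Text(width.trim()));
        }
    }
}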
SequenceFile creation (HtmlPathCreator → SequenceFileCreator)
Create a text file containing the HTML input file paths and create one sequence file in HDFS holding the complete file content.
Path list (1.4 GB), created with find, reading the files from NAS:
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...
SequenceFile keys (997 GB of uncompressed content):
Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712
Runtime: ~5 h + ~24 h = ~29 h (60,000 books, 24 million pages)
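Packing millions of small HTML files into a single SequenceFile avoids HDFS's small-file overhead. A minimal sketch of such a creator; the key convention follows the slide, and the real HtmlPathCreator/SequenceFileCreator tools may differ in detail:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Read a list of HTML paths (one per line) and write one HDFS SequenceFile
// whose key is "bookId/pageId" and whose value is the complete file content.
public class HtmlSequenceFileCreator {
    public static void main(String[] args) throws IOException {
        String pathListFile = args[0];   // e.g. the output of the "find" step
        Path target = new Path(args[1]); // e.g. an HDFS path for the SequenceFile

        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            List<String> paths = Files.readAllLines(Paths.get(pathListFile));
            for (String p : paths) {
                byte[] content = Files.readAllBytes(Paths.get(p.trim()));
                String key = p.trim().replaceFirst("^/NAS/", "").replaceFirst("\\.html$", "");
                writer.append(new Text(key), new BytesWritable(content));
            }
        }
    }
}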
HTML parsing (HadoopAvBlockWidthMapReduce): SequenceFile → text file
Execute a Hadoop MapReduce job on the sequence file created before in order to calculate the average paragraph block width per page.
Input keys (from the SequenceFile):
Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...
Map output (one width per paragraph block):
Z119585409/00000001 2100, 2200, 2300, 2400
Z119585409/00000002 2100, 2200, 2300, 2400
Z119585409/00000003 2100, 2200, 2300, 2400
Z119585409/00000004 2100, 2200, 2300, 2400
Z119585409/00000005 2100, 2200, 2300, 2400
Reduce output (average width per page):
Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250
Runtime: ~6 h (60,000 books, 24 million pages)
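A minimal sketch of the map and reduce sides of this job, assuming the page HTML is hOCR (as the later HiveLoadHocrData step suggests) so paragraph widths can be derived from the bbox coordinates; the project's HadoopAvBlockWidthMapReduce may extract the widths differently:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for one page (key = "bookId/pageId", value = hOCR content from the
// SequenceFile), emit one (page, width) pair per paragraph block.
public class BlockWidthMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
    // hOCR paragraph blocks carry coordinates like: title="bbox x0 y0 x1 y1"
    private static final Pattern BBOX =
            Pattern.compile("class=['\"]ocr_par['\"][^>]*bbox (\\d+) (\\d+) (\\d+) (\\d+)");

    @Override
    protected void map(Text pageId, BytesWritable hocr, Context context)
            throws IOException, InterruptedException {
        String html = new String(hocr.copyBytes(), StandardCharsets.UTF_8);
        Matcher m = BBOX.matcher(html);
        while (m.find()) {
            int width = Integer.parseInt(m.group(3)) - Integer.parseInt(m.group(1));
            context.write(pageId, new IntWritable(width));
        }
    }
}

// Reduce: average all block widths of one page, e.g. 2100, 2200, 2300, 2400 -> 2250.
class AvgBlockWidthReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pageId, Iterable<IntWritable> widths, Context context)
            throws IOException, InterruptedException {
        long sum = 0, n = 0;
        for (IntWritable w : widths) {
            sum += w.get();
            n++;
        }
        if (n > 0) {
            context.write(pageId, new IntWritable((int) (sum / n)));
        }
    }
}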
Loading into Hive (HiveLoadExifData & HiveLoadHocrData)
Create the Hive tables and load the generated data into the Hive database (~6 h, 60,000 books, 24 million pages).

CREATE TABLE jp2width(jid STRING, jwidth INT)
CREATE TABLE htmlwidth(hid STRING, hwidth INT)

Table jp2width:
jid                  jwidth
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250

Table htmlwidth:
hid                  hwidth
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700
Analytic Queries (HiveSelect)
Perform a simple Hive query to test whether the database has been created successfully (~6 h, 60,000 books, 24 million pages):

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jid                  jwidth  hwidth
Z119585409/00000001  2250    1870
Z119585409/00000002  2150    2100
Z119585409/00000003  2125    2015
Z119585409/00000004  2125    1350
Z119585409/00000005  2250    1700
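To run such queries programmatically rather than from the Hive shell, the standard Hive JDBC driver can be used. A minimal sketch, assuming a HiveServer2 instance is reachable on the cluster; host, port and credentials below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Run the join from the slide against HiveServer2 via JDBC.
public class HiveWidthJoin {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder connection string; adjust host, port and database.
        String url = "jdbc:hive2://hiveserver.example.org:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "scape", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select jid, jwidth, hwidth from jp2width " +
                     "inner join htmlwidth on jid = hid")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t"
                        + rs.getInt(2) + "\t" + rs.getInt(3));
            }
        }
    }
}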
Outlook
“Progress generally appears much greater than it really is”
Johann Nestroy, 1847
What have WE learned?
• We need to carefully assess the efforts for data preparation vs. the actual processing load
• HDFS prefers large files over many small ones, and is basically “append-only”
• There is still much more the Hadoop ecosystem has to offer, e.g. YARN, Pig, Mahout
What can YOU do?
• Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013 in Vienna (see http://www.scape-project.eu/events)
• Check out some tools from our github at https://github.com/openplanets/ and help us make them better and more scalable
• Follow us at @SCAPEProject and spread the word!
What’s in it for US?
• Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere
• Ensuring our cultural history is not lost
• New innovative applications using cultural heritage data (education, creative industries)
Thank you! Questions? (btw, we’re hiring)
www.kb.nl
www.onb.ac.at
www.scape-project.eu
www.openplanetsfoundation.org