BHL Technology Overview

Post on 11-Jun-2015

Presentation to Smithsonian's Office of the Chief Information Officer.

© 2008 Biodiversity Heritage Library www.biodiversitylibrary.org

Biodiversity Heritage Library (BHL): Technology Overview

Chris Freeland

Director, Bioinformatics, Missouri Botanical Garden

Technical Director, Biodiversity Heritage Library

chris.freeland@mobot.org

www.biodiversitylibrary.org

BHL Partners

Museums
– American Museum of Natural History (New York)
– Natural History Museum (London)
– Smithsonian Institution (Washington)
– The Field Museum (Chicago)

Botanical Gardens
– Missouri Botanical Garden
– New York Botanical Garden
– Royal Botanic Gardens, Kew

University Libraries
– Botany Libraries, Harvard University
– Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
– University of Illinois

Bioinformatics Institutes
– MBL/WHOI
– uBio.org

Why have BHL?

In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned. One never knows wherein one edition differs from or supplements the other and unless these are on the same table at the same time it is not possible to collate them properly. Moreover for accurate work it is necessary for the student to verify every reference he may find; it is not enough to copy from a previous author; he must verify each reference itself from the original.

Charles Davies Sherborn, Epilogue to Index Animalium, March 1922

Charles Davies Sherborn (1861-1942)

Unique Components of BHL

• Combining metadata records from multiple libraries (similar, but different) and representing through a shared portal

• Use of JPEG2000

• Web 2.0 Mashups

• Taxonomic data mining

• Services

• Rare & novel content

Scanning process

1. Select book
2. Pull from shelf
3. Send to IA scanning center
4. Book is scanned & QA
5. Page images loaded on IA cluster
   – Derivatives created
6. Book returned to library
7. Files harvested from IA portal
8. Books available for display within BHL portal

Mushrooms of America, edible and poisonous. Ed. by Julius A. Palmer, Jr., 1885.

Scan & Store: Internet Archive

Scanning on Scribes

Storage in Petaboxes

Scanning & Derivatives

Master
• XML
• JP2

Derivatives
• PDF
• JPG
• TXT
• DjVu

Harvest from IA

Extract, Transform, Load (ETL)

• Custom scripts to extract content via IA’s APIs

• Database scripts to transform to relational data structure

• Load into database
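The three ETL steps above can be sketched roughly as follows. This is a minimal illustration, not BHL's actual scripts: the record layout, the table names, and the page list are assumptions, and the extract step returns canned data where a real script would call IA's APIs. Only the item identifier, title, and year are taken from the example book used elsewhere in these slides.

```python
import sqlite3

# Hypothetical record standing in for what an extract step might return.
# The real IA API responses and the BHL schema are not shown in the slides.
RAW_ITEMS = [
    {
        "identifier": "mushroomsofameri00palm",
        "title": "Mushrooms of America, edible and poisonous",
        "year": "1885",
        "pages": ["0001", "0002", "0003"],  # illustrative page suffixes
    },
]


def extract():
    """Extract: return raw item records (a real script would call IA's APIs)."""
    return RAW_ITEMS


def transform(items):
    """Transform: flatten nested records into relational rows."""
    item_rows, page_rows = [], []
    for item in items:
        item_rows.append((item["identifier"], item["title"], int(item["year"])))
        for sequence, suffix in enumerate(item["pages"], start=1):
            page_rows.append((item["identifier"], sequence, suffix))
    return item_rows, page_rows


def load(conn, item_rows, page_rows):
    """Load: create the (assumed) tables and insert the rows."""
    conn.execute(
        "CREATE TABLE item (identifier TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE page (identifier TEXT, sequence INTEGER, file_suffix TEXT)"
    )
    conn.executemany("INSERT INTO item VALUES (?, ?, ?)", item_rows)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", page_rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, *transform(extract()))
page_count = conn.execute("SELECT COUNT(*) FROM page").fetchone()[0]
item_year = conn.execute("SELECT year FROM item").fetchone()[0]
print(page_count, item_year)
```

Keeping the three stages as separate functions means the extract step can be swapped for a real API client without touching the transform or load code.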

Stable URL

Attribution

Name Finding

Page Turning

Zoom/Pan

Download/View

Browse

Search

Filter

Target/Object

JPEG2000 (*.jp2) display

• RAW original => 85% .jp2

• LuraTech encoder
  – Wavelet compression

• LizardTech decoder
  – Tiled on the fly, cached for performance

• GSIV browser-based client viewer
  – ‘AJAXian’

BHL/IA architecture

A user requests Mushrooms of America, edible and poisonous, Plate X:
http://www.biodiversitylibrary.org/page/1274907

[Diagram. Components: Browser (GSIV.js); www.biodiversitylibrary.org; BHLdb; IA; LizardTech ExpressServer; images.mobot.org. Labels: /page/1274907; locate: pageid: 1274907; http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2; .jp2; .jpg]

Transfer: 5.0+ sec

Time to deliver image: 8+ sec
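The lookup step in this architecture (stable page URL, then page ID, then the location of the JP2 file at IA) might look something like the sketch below. The dictionary is a stand-in for BHLdb and the function names are hypothetical; the one row of data comes from the slide, and the "..." in the archive.org path is elided in the source and is kept as-is.

```python
from urllib.parse import urlparse

# Illustrative stand-in for the BHLdb lookup. The "..." in the path is elided
# in the original slide and is deliberately left unfilled here.
PAGE_LOCATIONS = {
    1274907: "http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2",
}


def page_id_from_url(url):
    """Pull the numeric page ID out of a stable /page/<id> URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "page" or not parts[1].isdigit():
        raise ValueError(f"not a page URL: {url}")
    return int(parts[1])


def locate_jp2(page_id):
    """Return the recorded JP2 location for a page ID (KeyError if unknown)."""
    return PAGE_LOCATIONS[page_id]


page_id = page_id_from_url("http://www.biodiversitylibrary.org/page/1274907")
jp2_url = locate_jp2(page_id)
print(page_id, jp2_url)
```

The sketch stops at the lookup; fetching the JP2 and decoding it to JPG tiles is where the slide reports the bulk of the delivery time.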

Reuse, don’t rebuild

1. TIF image from scanner
2. Converted to text via PrimeOCR
3. Name finding via TaxonFinder: extract names
4. Submit to NameBank
5. SOAP response

Name Finding in action

with Taxonomic Intelligence…
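To give a feel for the name-extraction step, here is a deliberately naive sketch that pulls binomial-looking strings out of OCR text. It is not TaxonFinder's algorithm and does not call NameBank: the regular expression, the known-genus filter, and the sample text are all assumptions made for illustration.

```python
import re

# Naive pattern: a capitalized word followed by a lowercase word.
# This is NOT TaxonFinder's algorithm; it only illustrates the idea.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text, known_genera):
    """Return binomial-looking strings whose first word is a known genus."""
    found = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if genus in known_genera and name not in found:
            found.append(name)
    return found


# Hypothetical OCR text and genus list, for illustration only.
text = (
    "The great white shark, Carcharodon carcharias, was seen. "
    "Later the Agaricus campestris was collected."
)
names = find_candidate_names(text, known_genera={"Carcharodon", "Agaricus"})
print(names)
```

The genus filter is what keeps ordinary sentence openers such as "The great" out of the result; a real name finder relies on much larger dictionaries and handles OCR errors, abbreviations, and ranks below species.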

Names data mining

Tag cloud from LCSH

Subject heading from library catalog

Expressed as MARCXML

Tag Cloud
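One way to turn subject headings expressed as MARCXML into tag-cloud weights is sketched below. It assumes the headings sit in MARC field 650, subfield a (topical terms), and scales font size linearly with frequency; the two sample records and the size range are made up, and the slides do not say how BHL actually weights its cloud.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Two tiny made-up MARCXML records, for illustration only.
SAMPLE = """
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Mushrooms</subfield></datafield>
    <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Botany</subfield></datafield>
  </record>
  <record>
    <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Botany</subfield></datafield>
  </record>
</collection>
"""


def subject_counts(marcxml):
    """Count how often each 650$a subject term occurs across all records."""
    root = ET.fromstring(marcxml)
    counts = Counter()
    path = ".//marc:datafield[@tag='650']/marc:subfield[@code='a']"
    for subfield in root.findall(path, MARC_NS):
        counts[subfield.text.strip().rstrip(".")] += 1
    return counts


def font_sizes(counts, smallest=10, largest=24):
    """Map each term to a font size, scaled linearly between the two bounds."""
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        term: smallest + (largest - smallest) * (n - low) // span
        for term, n in counts.items()
    }


counts = subject_counts(SAMPLE)
sizes = font_sizes(counts)
print(counts, sizes)
```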

Geocoding LCSH

RSS Feeds

Specific: Last 25 books published in German from NYBG
RSS feed location: http://www.biodiversitylibrary.org/RecentRss/25/GER/NYBG

1. Allgemeine deutsche Garten-Zeitung, 7, 1829 (added: 04/03/2008)
2. Zeitschrift für wissenschaftliche Mikroskopie und für mikroskopische Technik. 2, 1885 (added: 03/28/2008)
3. Zeitschrift für technische Biologie. 7, 1919 (added: 03/27/2008)
4. …

General: Last 25 books from all libraries
RSS feed location: http://www.biodiversitylibrary.org/RecentRss/25

1. Summa plantarum : v.1 (added: 05/01/2008)
2. Vegetable materia medica of the United States (added: 04/30/2008)
3. The family herbal; (added: 04/30/2008)
4. …
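The two feed locations above suggest a URL pattern of count, then optional language code and institution code. A small helper built on that inference might look like this; the function name is hypothetical, and only the two URL shapes shown on the slide are supported because the slides do not say whether other combinations are valid.

```python
BASE = "http://www.biodiversitylibrary.org/RecentRss"


def recent_rss_url(count, language=None, institution=None):
    """Build a recent-items feed URL in one of the two shapes shown above."""
    # Only "count" and "count/language/institution" appear in the slides,
    # so a language without an institution (or vice versa) is rejected.
    if (language is None) != (institution is None):
        raise ValueError("give both language and institution, or neither")
    parts = [BASE, str(count)]
    if language is not None:
        parts += [language, institution]
    return "/".join(parts)


general_feed = recent_rss_url(25)
german_nybg_feed = recent_rss_url(25, "GER", "NYBG")
print(general_feed)
print(german_nybg_feed)
```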

Services

• Names
  – v.1 released
  – http://www.biodiversitylibrary.org/services/name/NameService.asmx

• Stable URLs
  – http://www.biodiversitylibrary.org/bibliography/1652
  – http://www.biodiversitylibrary.org/name/Carcharodon_carcharias

• Future:
  – Citation Resolver
  – Titles Resolver
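The stable URL examples above can be generated with a couple of one-line helpers. The function names are hypothetical, and the rule that spaces in a scientific name become underscores is inferred from the single Carcharodon_carcharias example rather than from any documented specification.

```python
BASE = "http://www.biodiversitylibrary.org"


def bibliography_url(title_id):
    """Stable URL for a bibliographic record, given its numeric ID."""
    return f"{BASE}/bibliography/{title_id}"


def name_url(scientific_name):
    """Stable URL for a name; spaces become underscores (inferred rule)."""
    return f"{BASE}/name/" + "_".join(scientific_name.split())


print(bibliography_url(1652))
print(name_url("Carcharodon carcharias"))
```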

BHL Name Services

http://www.biodiversitylibrary.org/services/name/NameService.asmx

Provider Integration

• Encyclopedia of Life

• Atrium Andes Biodiversity

• Wikipedia

• EDIT Scratchpads

• More to come…

Hardware Infrastructure

• Distributed

• Partially redundant
  – Work needed

• Mixed platforms

• Mixed app frameworks

MOBOT

Petabox cluster

Internet Archive

File Storage Estimates

• 4MB per page including derivatives

• 1 million pages = 4TB storage

• Expected output: 60–100 million pages
  – 240–400 TB for files
  – 10–20 GB for db
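The file-storage figures follow directly from the 4 MB per page estimate. The short calculation below reproduces them, assuming decimal units (1 TB = 1,000,000 MB), which is the convention the slide's own "1 million pages = 4TB" implies.

```python
MB_PER_PAGE = 4          # from the slide: 4 MB per page including derivatives
MB_PER_TB = 1_000_000    # decimal units, matching "1 million pages = 4TB"


def storage_tb(pages):
    """Estimated file storage in terabytes for a given number of pages."""
    return pages * MB_PER_PAGE / MB_PER_TB


for pages in (1_000_000, 60_000_000, 100_000_000):
    print(f"{pages:>11,} pages -> {storage_tb(pages):,.0f} TB")
```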

Future Work

• Services
  – Citation Resolver
  – Titles Resolver

• Interfaces

• Editing
  – Authoritative
  – Community

• Backend

Fedora

• Funded by Gordon and Betty Moore Foundation to adopt Fedora Commons

• Working with Internet Archive to define use and practice

• Project completion: December 2009

Thank You

Chris Freeland

chris.freeland@mobot.org

BHL Portal

www.biodiversitylibrary.org

BHL Blog

biodiversitylibrary.blogspot.com

BHL collection at Internet Archive

www.archive.org/details/biodiversity