Repository Development at LC - Access 2009
Repository Development at LC
Daniel Chudnov - 2009-10-01 - dchud at loc gov
Access 2009 - Charlottetown, PEI
who we are
what we do
what’s next
who we are
30ish people
dev, QA, PM, ops
from libs, uni, industry, etc.
OSI
Office of Strategic Initiatives
“...capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.”
(search for “LC21”)
or, to be precise
“capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.”
(search for “LC21”)
what we do
“capture the digital artifact”
at scale
world scale
then
web scale
wdl.org
partners all over the world
content from all over the world
users all over the world
wdl.org/ru/
wdl.org/zh/
wdl.org/ar/
launched April 2009
lots of press
9,026 req/s
1.25 Gbit/s
on day one
no crash
just bugs
(yay!)
that was new for LC
how?
solaris, apache, nginx, mysql, solr
django, jquery
clean URIs
static pages
global edge caching
what we do
capture the artifact
pass it along
cataloging, indexing
chroniclingamerica.loc.gov
139,582 title records
1,442,462 pages
freely available now
download whole issues - tell friends - mash it up
100+ TB
16 of 50+ states/terr.
and growing quickly
how?
solaris, apache, mysql, solr
django
clean URIs
page caching
capture the artifact
pass it along
cataloging, indexing, preservation
preservation
storage
“movage”
capture the artifact
BagIt
packing slip for data
data in a Bag
.
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- batch.xml
|   |-- batch_1.xml
|   |-- batch_ne_dewitt_rework
|   |   |-- 00206538016_batch.xml
|   |   |-- 00206538028_batch.xml
|   |   `-- sn99021999
|   `-- sn99021999
|       |-- 00206538016
|       |   |-- 0000.jp2
|       |   |-- 0000.pdf
|       |   |-- 0000.tif
|       |   |-- 0000.xml
|       |   |-- 0001.jp2
|       |   |-- 0001.pdf
|       |   |-- 0001.tif
|       |   |-- 0001.xml
identifies a bag
where the data starts
packing slip
.
|-- bag-info.txt
|-- bagit.txt
|-- data
|   |-- batch.xml
|   |-- batch_1.xml
|   |-- batch_ne_dewitt_rework
|   |   |-- 00206538016_batch.xml
|   |   |-- 00206538028_batch.xml
|   |   `-- sn99021999
|   `-- sn99021999
|       |-- 00206538016
|       |   |-- 0000.jp2
|       |   |-- 0000.pdf
|       |   |-- 0000.tif
|       |   |-- 0000.xml
|       |   |-- 0001.jp2
|       |   |-- 0001.pdf
|       |   |   ...
|-- manifest-md5.txt
`-- tagmanifest-md5.txt
71607ad119be88c842268a76f0b6b9e9 data/sn99021999/00206538107/1884091301/0621.pdf
c602d2ac07508059ce5f5597e239b97f data/sn99021999/00206538120/1885100601/0831.xml
a59795bd1584532d5cbc0b1d82f75cf8 data/sn99021999/00206538016/1880061401/0593.pdf
3c64fac7e2d49671e0d93908ae42a779 data/sn99021999/00206539616/1888101801/0905.xml
03158a560baa7479b3805d2b45ee02cd data/sn99021999/00206538028/1880111501/0405.tif
fa56ea18580e1446939ed62709e5b2db data/sn99021999/00206538077/1883061901/1145.pdf
bf4fb83ff8305e8256970a3466c1a12d data/sn99021999/00206538120/1885061501/0043.pdf
8f3649fc812de74b9d9443ee90a8ac9c data/sn99021999/00206538120/1885111101/1109.tif
e0b83a7f9ca228271fdaecf6348e1cec data/sn99021999/00206538120/1885101201/0871.xml
1c2f84e12792c123ba0aabedd0c0bbad data/sn99021999/00206538107/1884071401/0197.xml
080e557fe9f68037605e5b80df4bc4ac data/sn99021999/0020653820A/1888050701/0543.tif
532efe32c156459d9d9589caf618f502 data/sn99021999/00206538120/1885071401/0250.tif
ce607af59a96f2656d9448f38ffda072 data/sn99021999/0020653820A/1888052801/0731.pdf
60b626d8fd40aca1b425e86a004bb055 data/sn99021999/00206539628/1888111801/0088.xml
a467cd62350334c7aa83cf1e9056c1c6 data/sn99021999/00206539616/1888091701/0629.jp2
1a434f7a4d843a2c8ffe8d0824fafc3f data/sn99021999/00206538028/1880120801/0482.jp2
22996d89b4a3334256afaddcaa0238d8 data/sn99021999/00206538016/1874102001/0259.jp2
36f550da273ad4c592fee1761c98322a data/sn99021999/00206538016/1880052201/0518.jp2
7f7ccec3f2afae896338498372fd476e data/sn99021999/00206539616/1888080101/0200.pdf
c247a5d74d0e7f857c534d935661adbe data/sn99021999/00206538107/1884072601/0286.jp2
4d497a18a154adcc8636239378ab340b data/sn99021999/00206539628/1889021101/0868.pdf
2e8ca2558b54b5c49b2f20a355a60895 data/sn99021999/00206538065/1882092001/0136.xml
fb71493048e5010100f18012f5060d42 data/sn99021999/00206538028/1880123001/0569.xml
40b100432890b055a5defbfbea815d57 data/sn99021999/00206538107/1884090901/0590.xml
46f6d61480dadc1c988b0baa4de8b6c4 data/sn99021999/00206539628/1888122801/0463.pdf
1cb8af0648e8c9df395b63226fe7371f data/sn99021999/00206538016/1874101501/0244.pdf
9257834023c683b02f354888b2740b8f data/sn99021999/00206539616/1888102301/0956.xml
0d52b3b2b1c5459b7e8d500a8566b0bf data/sn99021999/00206538120/1885080801/0425.tif
defines two things
1. what i think i’m sending you
2. whether you received it
just like a packing slip
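The "whether you received it" half of the packing slip can be sketched in a few lines of Python. This is a minimal sketch, not the BagIt Library's code: it assumes a bag directory with a `manifest-md5.txt` whose lines are an MD5 digest, whitespace, then a path relative to the bag root, as in the listing above. The names `md5_of` and `check_manifest` are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(bag_dir):
    """Compare every manifest-md5.txt entry with the file on disk.

    Returns a dict mapping each listed path to "ok", "missing", or "mismatch".
    (Hypothetical helper; a malformed manifest line raises ValueError.)
    """
    bag_dir = Path(bag_dir)
    results = {}
    for line in (bag_dir / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag_dir / rel_path
        if not target.is_file():
            results[rel_path] = "missing"
        elif md5_of(target) != expected.lower():
            results[rel_path] = "mismatch"
        else:
            results[rel_path] = "ok"
    return results


if __name__ == "__main__":
    # Build a tiny throwaway bag and check it.
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        (bag / "data" / "page.txt").write_bytes(b"hello")
        (bag / "manifest-md5.txt").write_text(
            md5_of(bag / "data" / "page.txt") + " data/page.txt\n"
        )
        print(check_manifest(bag))
```

The sketch only checks files the manifest lists. A fuller validity check would also flag files under `data/` that the manifest does not mention.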
works across space
works across systems
works across orgs
works across time
easy to make
md5deep
BIL
BagIt Library
bvar@sun9 /ingest/bvar/test $ bag create --dest new_bag test_data/*
12:08:47,044 [main] INFO CommandLineBagDriver : Performing operation: create
2.301112941466272:2.3
12:08:47,141 [main] INFO ManifestImpl : Creating manifest for manifest-md5.txt
12:09:09,493 [main] INFO ManifestImpl : Creating manifest for tagmanifest-md5.txt
12:09:09,511 [main] INFO AbstractBagImpl : Writing bag
12:09:41,507 [main] INFO CommandLineBagDriver : Operation completed.
12:09:41,508 [main] INFO CommandLineBagDriver : Returning 0
bvar@sun9 /ingest/bvar/push/test_bag $ bag isvalid .
11:55:45,582 [main] INFO CommandLineBagDriver : Performing operation: isvalid
11:55:46,378 [main] INFO ManifestImpl : Creating manifest for manifest-md5.txt
11:55:46,458 [main] INFO ManifestImpl : Creating manifest for tagmanifest-md5.txt
11:55:46,540 [main] INFO AbstractBagImpl : Completion check: Result is true.
11:56:21,273 [main] INFO AbstractBagImpl : Validity check: Result is true.
11:56:21,273 [main] INFO CommandLineBagDriver : Result is true.
11:56:21,274 [main] INFO CommandLineBagDriver : Returning 0
bvar@sun9 /ingest/bvar/push/test_bag $
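To show why a bag is "easy to make", here is a rough Python sketch of the same idea as `bag create`. It is not the BagIt Library and does not reproduce its behavior: it copies a directory into `data/`, writes a `bagit.txt` declaration, and writes `manifest-md5.txt`. The function name `make_bag` and the default version string are assumptions (the deck predates the current version of the spec), and the sketch skips `bag-info.txt` and `tagmanifest-md5.txt`.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def make_bag(source_dir, bag_dir, version="1.0"):
    """Copy source_dir into bag_dir/data and write bagit.txt plus manifest-md5.txt.

    Hypothetical helper for illustration. Files are read whole, which is fine
    for a demo but not for multi-gigabyte payloads.
    Returns the manifest lines that were written.
    """
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    bag_dir.mkdir(parents=True, exist_ok=True)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # fails if data/ already exists

    # The bag declaration: what identifies this directory as a bag.
    (bag_dir / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The packing slip: one "digest  path" line per payload file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "test_data"
        src.mkdir()
        (src / "0000.txt").write_bytes(b"hello")
        for line in make_bag(src, Path(tmp) / "new_bag"):
            print(line)
```

For real transfers the released tools are the better choice; the point here is only that the format needs nothing beyond a directory layout and checksums.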
Bagger
free/open source releases from LC
sf.net/projects/loc-xferutils/
get yours today - tell friends - start trading bags
that was new for LC
pass it along
transfer
inventory
workflow
transfer UI - inventory - workflow
how?
apache, spring/mvc, hibernate
mysql
and other automation strategies
lots of work still to do
lots of integration still to do
register/deposit for Copyright
not my area, but
we hope to support eDeposit
with these tools
“Deposit Demand”
June 2009 Federal Register
Proposed Rulemaking
stay tuned
or ask my colleagues :)
(ask me whom to ask)
but, not my area
“allow it to be...incorporated digitally
in the collection”
how?
traditional approach:
catalog records
exhibit sites
cost of integrating everything
is high
cost of updating everything
is high
cost of consistent web strategies
is low
for example
Linked Data
use URIs as names for things
use HTTP URIs
provide useful information
include links to other URIs
http://www.w3.org/DesignIssues/LinkedData.html
id.loc.gov
LCSH on the web
free
clean URIs
follow your nose
formats
view source
<link rel="alternate" type="application/rdf+xml" href="/authorities/sh00009460.rdf" />
<link rel="alternate" type="text/plain" href="/authorities/sh00009460.nt" />
<link rel="alternate" type="application/json" href="/authorities/sh00009460.json" />
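"Follow your nose" from the HTML to the data can be sketched with Python's standard library. This is a minimal sketch: the class and function names are made up for illustration, and the page URL passed to `urljoin` is an assumption based on the concept URI shown on these slides. It collects the `<link rel="alternate">` elements and resolves each `href` against the page address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinks(HTMLParser):
    """Collect <link rel="alternate"> elements as {media type: href}."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag != "link":
            return
        attr = dict(attrs)
        if attr.get("rel") == "alternate" and attr.get("type") and attr.get("href"):
            self.links[attr["type"]] = attr["href"]


def alternate_links(html, page_url=None):
    """Return {media type: href}; hrefs are made absolute if page_url is given."""
    parser = AlternateLinks()
    parser.feed(html)
    if page_url is None:
        return parser.links
    return {kind: urljoin(page_url, href) for kind, href in parser.links.items()}


HEAD = """
<link rel="alternate" type="application/rdf+xml" href="/authorities/sh00009460.rdf" />
<link rel="alternate" type="text/plain" href="/authorities/sh00009460.nt" />
<link rel="alternate" type="application/json" href="/authorities/sh00009460.json" />
"""

if __name__ == "__main__":
    # The page URL below is an assumption, not taken from the slides.
    for kind, url in alternate_links(HEAD, "http://id.loc.gov/authorities/sh00009460").items():
        print(kind, url)
```

A client that wants JSON instead of RDF/XML only has to pick a different key; nothing else about the page needs to be known in advance.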
<rdf:RDF>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh00009460#concept">
    <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-11-27T10:39:57-04:00</dcterms:modified>
    <skos:prefLabel xml:lang="en">National parks and reserves--Prince Edward Island</skos:prefLabel>
    <owl:sameAs rdf:resource="info:lc/authorities/sh00009460"/>
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme"/>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#topicalTerms"/>
    <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-10-17T00:00:00-04:00</dcterms:created>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2002010534#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2008004743#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2003002637#concept"/>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh00009458#concept"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh2002010534#concept">
    <skos:prefLabel xml:lang="en">Prince Edward Island National Park (P.E.I.)</skos:prefLabel>
  </rdf:Description>
explicit concepts, schema, meaning
a web of data...
...with precise meaning
at this URI
is this concept
with this meaning
a standard way to refer to a heading
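"At this URI is this concept with this meaning" can be read straight out of the RDF/XML. The sketch below is a minimal one under stated assumptions: it uses plain XML parsing rather than a real RDF library, so it only handles documents shaped like the slide excerpt. The `SAMPLE` string is a shortened copy of that excerpt with namespace declarations added (the slide omits them), and the function name `concepts` is made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Shortened from the slide; the xmlns declarations are added so it parses.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh00009460#concept">
    <skos:prefLabel xml:lang="en">National parks and reserves--Prince Edward Island</skos:prefLabel>
    <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2002010534#concept"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh2002010534#concept">
    <skos:prefLabel xml:lang="en">Prince Edward Island National Park (P.E.I.)</skos:prefLabel>
  </rdf:Description>
</rdf:RDF>"""


def concepts(rdf_xml):
    """Map each described URI to its preferred label and narrower concept URIs."""
    root = ET.fromstring(rdf_xml)
    found = {}
    for desc in root.findall(f"{{{RDF}}}Description"):
        uri = desc.get(f"{{{RDF}}}about")
        label = desc.findtext(f"{{{SKOS}}}prefLabel")
        narrower = [
            node.get(f"{{{RDF}}}resource")
            for node in desc.findall(f"{{{SKOS}}}narrower")
        ]
        found[uri] = {"label": label.strip() if label else None, "narrower": narrower}
    return found


if __name__ == "__main__":
    for uri, info in concepts(SAMPLE).items():
        print(uri, "->", info["label"], info["narrower"])
```

Each narrower URI is itself a key that can be looked up or dereferenced, which is what makes the heading usable as a name outside the catalog.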
freely available now
download the whole thing - tell friends - amaze enemies
that was new for LC
another example
<link rel="resourcemap" type="application/rdf+xml" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.rdf" />
<link rel="alternate" type="image/jp2" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2" />
<link rel="alternate" type="application/pdf" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf" />
<link rel="alternate" type="application/xml" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.xml" />
<link rel="alternate" type="text/plain" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt" />
<rdf:Description rdf:about="/lccn/sn83030214/1905-01-15/ed-1/seq-25#page">
  <ore:isDescribedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.rdf"/>
  <foaf:depiction rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/thumbnail.jpg"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.xml"/>
  <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/thumbnail.jpg"/>
  <rdf:type rdf:resource="http://chroniclingamerica.loc.gov/terms#Page"/>
  <ore:isAggregatedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1#issue"/>
  <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1905-01-15</dcterms:issued>
  <ndnp:sequence rdf:datatype="http://www.w3.org/2001/XMLSchema#long">25</ndnp:sequence>
  <dcterms:title>New-York tribune. - 1905-01-15 - 25</dcterms:title>
</rdf:Description>
OAI-ORE aggregation
this is a page
it has these files in these formats
it is this sequence number
it is part of this issue
it has this issue date
it has this title
all explicit concepts
all exposed in the app
on the web
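The statements above (this is a page, it has these files, it is part of this issue) can be pulled from the aggregation with a few lines of Python. This is a rough sketch, again using plain XML parsing rather than an RDF library. `SAMPLE` is a shortened copy of the slide excerpt wrapped in an `rdf:RDF` element with namespace declarations added as an assumption, and `describe_page` is a made-up name.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS = "http://purl.org/dc/terms/"

# Shortened from the slide; the wrapper element and xmlns declarations are added.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:ore="{ORE}" xmlns:dcterms="{DCTERMS}">
  <rdf:Description rdf:about="/lccn/sn83030214/1905-01-15/ed-1/seq-25#page">
    <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2"/>
    <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt"/>
    <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf"/>
    <ore:isAggregatedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1#issue"/>
    <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1905-01-15</dcterms:issued>
    <dcterms:title>New-York tribune. - 1905-01-15 - 25</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""


def describe_page(rdf_xml):
    """Summarize the first rdf:Description: page URI, issue, date, title, files."""
    root = ET.fromstring(rdf_xml)
    desc = root.find(f"{{{RDF}}}Description")
    resource = f"{{{RDF}}}resource"
    issue = desc.find(f"{{{ORE}}}isAggregatedBy")
    return {
        "page": desc.get(f"{{{RDF}}}about"),
        "issue": issue.get(resource) if issue is not None else None,
        "issued": desc.findtext(f"{{{DCTERMS}}}issued"),
        "title": desc.findtext(f"{{{DCTERMS}}}title"),
        "files": [node.get(resource) for node in desc.findall(f"{{{ORE}}}aggregates")],
    }


if __name__ == "__main__":
    for key, value in describe_page(SAMPLE).items():
        print(key, value)
```

The `issue` value is another URI to follow, so a client can walk from page to issue without knowing the URL scheme ahead of time.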
that was new for LC
the web is the API
there’s an API doc...
...it’s just a bunch of links
“...make resources available and useful...”
from the mission of the Library
“allow it to be... incorporated digitally in the collection”
from the LC21 report
“...sustain and preserve a universal collection...”
from the mission of the Library
each app consistent about meaning
follow your nose to concept definitions
in our apps and in yours
distributed conceptual integration
the web is a universal collection
this is a way to incorporate digitally
our digital artifacts on our web
your digital artifacts in your web
our digital artifacts in your web
your digital artifacts in our web
available & useful &c.
summary
content that scales on the way in
apps that scale on the way out
movage movage movage
transfer
inventory
workflow
all in active development
the BagIt spec
try it - it works
free/open source software releases
free data you can use
web of data available and useful
view source:
wdl.org
chroniclingamerica.loc.gov
id.loc.gov
sf.net/projects/loc-xferutils/
dchud at loc gov - @dchud