ECHO DEPository Technical Architecture Phase 1 Final Report

Post on 20-Oct-2021

3 views 0 download

Transcript of ECHO DEPository Technical Architecture Phase 1 Final Report

ECHODEPositoryTechnicalArchitecturePhase1FinalReportReportofprojectactivitiesfromFall2004through2007

[AuthorName]

UniversityofIllinoisatUrbanaChampaigninpartnershipwithOCLCContributors:MattCordial,DavidDubin,JanetEke,JosephFutrelle,ThomasHabing,LeahHouser,PatriciaHswe,WilliamIngram,JoanneKaczmarek,RobertManaster,JoelPlutchak,BethSandore,JohnUnsworthDecember2008RevisedJuly2009ReDraft2.3–notfinalized

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

2

TableofContents

1. Preface.................................................................................................................................41.1. AbouttheECHODEPositoryProject(Phase1)............................................................41.2. AboutThisDocument ..........................................................................................................41.3. ReviewofProjectObjectivesandDeliverables...........................................................41.3.1. ModelsandToolstoSupportWebArchiving .....................................................................41.3.2. RepositoryEvaluationandInteroperability .......................................................................51.3.3. Long‐termSemanticPreservationResearch.......................................................................6

2. ArchivingtheWeb:theWebArchivesWorkbench .............................................82.1. Overview ..................................................................................................................................82.2. TheWebArchivingProblem..............................................................................................82.2.1. TheUbiquitousWeb ......................................................................................................................82.2.2. VolumeandSelectionofWebContent...................................................................................92.2.3. TheImportanceofContext .........................................................................................................9

2.3. TheArizonaModel:AnArchivalApproachtoWebArchiving ............................ 102.3.1. Background.....................................................................................................................................102.3.2. AnArchivalApproach ................................................................................................................112.3.3. ArizonaModelSummary ..........................................................................................................11

2.4. TheWebArchivesWorkbench:ImplementingtheArizonaModel ................... 122.4.1. DevelopmentConsiderations..................................................................................................122.4.2. OverviewoftheWebArchivesWorkbench......................................................................142.4.3. ATouroftheWebArchivesWorkbench ...........................................................................152.4.4. WebArchivesWorkbenchToolsSummary......................................................................202.4.5. BehindtheScenes:OCLC’sTechnicalImplementationoftheWebArchivesWorkbench......................................................................................................................................................21

2.5. Findings­UserFeedback................................................................................................. 242.5.1. LimitedResourcesandLimitedTime .................................................................................252.5.2. ComplexityoftheTools.............................................................................................................252.5.3. WebContentDelivery ................................................................................................................25

2.6. ConclusionsandNextSteps............................................................................................. 263. RepositoryEvaluationandInteroperability ....................................................... 273.1. RepositoryEvaluation ...................................................................................................... 273.1.1. BuildinganEvaluationFramework:ApplyingtheTrustedDigitalRepositoryChecklisttoRepositoryEvaluation.......................................................................................................273.1.2. RepositoryTesting:IngestandExportTestsOnFourKeyOpen‐sourceRepositories ....................................................................................................................................................283.1.3. TestingApproachandMethodology....................................................................................323.1.4. RepositoryTestingFindings:NarrativeReports,andAnnotatedAuditChecklistCommentary ...................................................................................................................................................343.1.5. ConclusionandNextSteps.......................................................................................................35

3.2. HubandSpokeArchitecture(HandS):SupportingRepositoryInteroperabilityandEmergingPreservationStandards.................................................. 363.2.1. HandSOverview ...........................................................................................................................363.2.2. TheNeedforInteroperabilityandPreservationSupport ..........................................363.2.3. HubandSpokeKeyPrinciples................................................................................................37

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

3

3.2.4. METSProfile...................................................................................................................................393.2.5. HandSWorkflowCycle ..............................................................................................................443.2.6. HandSTechnicalImplementation.........................................................................................503.2.7. LessonsLearned...........................................................................................................................553.2.8. NextSteps:theHubandSpoke ..............................................................................................563.2.9. Conclusion.......................................................................................................................................57

4. PreservingMeaning,NotJustObjects:SemanticsandDigitalPreservation 584.1. Introduction:TheNeedforaSemanticsofPreservationApproach ................. 584.1.1. ThePreservationSemanticsProblem.................................................................................584.1.2. OurGoal............................................................................................................................................59

4.2. TheProblems:UnderstandingSemanticPreservation ......................................... 594.2.1. ProblemsPosedbyDescriptivePracticeandStructures............................................594.2.2. UnderstandingtheSemanticPreservationProblem:Summary..............................64

4.3. TowardMoreCapableArchivesandRepositories .................................................. 644.3.1. Recap:Theneedforautomatedinferencecapability...................................................644.3.2. BECHAMELandBuildingaMetadataOntology ..............................................................654.3.3. OvercomingSemanticProblemsinMetadataEncoding:AResourceandDescriptionVocabulary .............................................................................................................................654.3.4. ResolvingSemanticAmbiguity:anInferenceExample ...............................................664.3.5. AutomatedInferenceasaPreservationService.............................................................68

4.4. SystemArchitecture .......................................................................................................... 694.4.1. Architecture:Overview .............................................................................................................69

4.5. LessonsLearnedandNextSteps ................................................................................... 714.6. Conclusion............................................................................................................................. 71

5. ANotefromthePIs ...................................................................................................... 736. References....................................................................................................................... 746.1. ArchivingtheWeb:theWebArchivesWorkbench................................................. 746.1.1. Resources ........................................................................................................................................74

6.2. RepositoryEvaluationandInteroperability ............................................................. 756.2.1. RepositoryEvaluation................................................................................................................756.2.2. HandSToolsSuite ........................................................................................................................75

6.3. PreservingMeaning,NotJustObjects:SemanticsandDigitalPreservation.. 777. Appendices...................................................................................................................... 807.1. WebArchivesUserGuide................................................................................................. 807.2. WebArchivesWorkbenchImplementationGuide ................................................. 807.3. AnnotatedTrustedDigitalRepositoryChecklist ..................................................... 807.4. UsingtheAuditChecklistfortheCertificationofaTrustedDigitalRepositoryasaFrameworkforEvaluatingRepositorySoftwareApplications(DLibarticle)... 807.5. RepositoryTestingFindings:Narrative...................................................................... 807.6. RepositoryFindingsCommentaryUsingtheAnnotatedTrustedDigitalRepositoryChecklist ..................................................................................................................... 807.7. ResourceDescriptionVocabulary:AnOntologyofMetadataDescriptions ... 807.8. SustainedAccesstoEjournals:ContextValue,andFutureProspectus............ 80

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

4

1. Preface

1.1. AbouttheECHODEPositoryProject(Phase1)TheECHODEPository(Phase1)isanNDIIPP‐partnerresearchanddevelopmentprojectattheUniversityofIllinoisatUrbana‐Champaign(UIUC)inpartnershipwithOCLC,theNationalCenterforSupercomputingApplications);theMichiganStateUniversityLibrary;andanallianceofstatelibrariesfromArizona,Connecticut,Illinois,NorthCarolinaandWisconsin.OuraimistosupportthedigitalpreservationeffortsoftheLibraryofCongressbyaddressingissuesofhowwecollect,manage,preserve,andmakeusefultheenormousamountofdigitalinformationourcultureisnowproducing.Phase1projectactivities(Fall2004through2007)includeddevelopingwebarchivingtools,evaluatingexistingrepositorysoftware,developinganarchitecturetoenhanceexistingrepositories’interoperabilityandpreservationfeatures,andmodelingnext‐generationrepositoriesforsupportinglong‐termpreservation.

1.2. AboutThisDocumentThisnarrativereportprovidesadetailedoverviewofeachoftheareasofworkdescribedbelow.Theattachedappendicesprovidespecificadditionalprojectdeliverables.Acollectedarchiveofallprojectdeliverables,includingposters,presentationsandpublications,isforthcoming.SeveralsectionsincludematerialcontributedbythesameauthorswhowrotearticlesonECHODEPprojectsforanissueofLibraryTrends,guest‐editedbyPatriciaCruseandBethSandore,toappearintheWinter2009issue(specifically,LibraryTrends,Volume57,Number3).

1.3. ReviewofProjectObjectivesandDeliverables

1.3.1. ModelsandToolstoSupportWebArchivingGoals:

• Articulateamethodologyforselectingdigitalmaterialsattheaggregatelevelbasedonarchivalprinciples,anduseprovenance,functionalanalysis,andcontextanalysistofacilitatemeta‐taggingforretrieval.

• Buildasuiteofopensourcesoftwaretoolsthatsupportidentification,capture,anddescriptionofwebsites.

Deliverables:

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

5

• TheArizonaModel,anarchivalapproachtowebarchiving(developedbytheArizonaStateLibraryandArchives)

• TheWebArchivesWorkbenchsuiteoftools(developedbyOCLC)Overview

Traditionally,webarchivingmethodshavefocusedoneithermanualorautomatedcaptureapproaches,bothproblematic.Manualitem‐levelselectionfailsduetotheenormousnumberofresourcesontheweb,whilefullyautomatedweb‐captureapproachesriskburyingsubstantivematerialsunderamountainofirrelevantinformation.Toaddressthisfundamentalproblem,OCLCbuiltasuiteofopen‐sourcewebarchivingtoolsthatbridgethegapbetweenmanualselectionandautomatedcapture.BasedontheArizonaModel,whichprovidesforintegrationofbothhumanandmachineprocesses,theWebArchivesWorkbench(WAW)comprisesfourtoolstohelparchivistsidentify,describe,selectandharvestweb‐basedcontentforstorageinanyrepository.

DetailsSeeSection2ofthisreport,andAppendixitems6.1and6.2.

1.3.2. RepositoryEvaluationandInteroperabilityGoals:

• Install,configure,test,andevaluateexistingopen‐sourcedigitalrepositorysystems,withparticularregardtosupportforinteroperabilityandemergingpreservationstandards.

Deliverables:

• Arepositoryevaluationframeworkbasedonthe2005RLG/NARAAuditChecklistfortheCertificationofaTrustedDigitalRepository,DraftforPublicComment,withmappingtocurrentversion

• Repositorytestingfindings• TheHubandSpoke(HandS)toolssuitesupportingrepository

interoperabilityandpreservationmetadata• PREMIS‐basedMETSprofiles

Overview

Tohelpunderstandhowwellexistingrepositorysystemssupporttoday’sdigitalpreservationefforts,weevaluatedfourexistingopen‐sourcerepositorysystems(DSpace,ePrints,FedoraandGreenstone).Evaluationactivitiesincludedtheingestionandmanipulationofhalfaterabyteof

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

6

heterogeneouscontentineachrepositorysystem,andthedevelopmentofapreservation‐focusedrepositoryevaluationframeworkbasedonemergingstandardsfortrusteddigitalrepositories.

EarlyevaluationfindingsledtothedevelopmentoftheHubandSpoke(HandS)Architecture,aproof‐of‐conceptsuiteoftoolstoenhancetheinteroperabilityandpreservationfeaturesofsystemstested.TheHandSsuitesupportstoday’slibraries’effortstomanagecontentinmultiplerepositorysystemsandtopreservevaluablepreservationdata.Itincludesthedevelopmentofacommonstandards‐basedmethod(aPREMIS‐basedMETSprofile)forpackagingcontentthatallowsdigitalobjectstobemovedinandoutofmorerepositoriesmoreeasilywhilesupportingthecollectingoftechnicalandprovenanceinformationcrucialforlong‐termpreservation.Thismodelhaspotentialwideapplicability,andisalreadyinuseinseveralreal‐worldarchivingprojects.

DetailsSeeSection3ofthisreport,andAppendixitems6.3,6.46.5and6.6.

1.3.3. Long‐termSemanticPreservationResearch

Goals:• Researchtechniquestomigratethesemanticcontentofdocumentsand

documentstructuresacrossgenerationsofencodingschemes.

Deliverables:• Articulationofsemanticpreservationproblemsposedbycurrentmetadata

practice• Developmentofaformalmetadatadescriptionontology• Demonstrationofautomatedinferenceusingreasoningsoftware

Overview

Currentfirst‐generationrepositorysystemspreservethestructureofinformation,notitsmeaningorsemantics.Whenwemovecontentfromonesystemtoanother,thisstructuremaybesubtlyorunsubtlytransformed.Tomeaningfullypreserveourdigitalcontentovertime,wethereforehavetoinfermeaningorsemanticsfromstructuresthatchangeovertime.Becauseofthevolumeofinformationtobepreserved,weneedtobeabletodothiswithautomatedtools.

TheUIUCGraduateSchoolofLibraryandInformationScience(GSLIS)collaboratedwiththeNationalCenterforSupercomputingApplications

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

7

(NCSA)tocontributetothedevelopmentofnext‐generationarchiveswithsemanticanalysiscapabilitiestoreducelong‐termpreservationrisks.UsingrepositorytechnologydevelopedatNSCAandautomatedreasoningtoolsdevelopedatGSLIS,wemodelhowsemanticinferencecapabilitymayhelpnext‐generationarchivesheadofflong‐termpreservationrisks.Thisworkincludesarticulatingsemanticpreservationproblemsposedbycurrentpractice,andanalyzingreal‐worlddatamigrationexamplestodevelopaformalunderstandingofhowdescriptiveinformationaboutarchiveddigitalresourcesisstructured.Thisunderstandingispresentedinaformalmetadatadescriptionontology.

DetailsSeeSection4ofthisreport,andAppendixitem6.7.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

8

2. ArchivingtheWeb:theWebArchivesWorkbench

2.1. OverviewAcoredeliverableoftheECHODEPositoryProject'sfirstphasewasOCLC'sdevelopmentoftheWebArchivesWorkbench(WAW),anopen‐sourcesuiteofWebarchivingtoolsforidentifying,describing,andharvestingWeb‐basedcontentforingestintoanexternaldigitalrepository.ReleasedinOctober2007,thesuiteisdesignedtobridgethegapbetweenmanualselectionandautomatedcapturebasedonthe"ArizonaModel,"whichappliesatraditionalaggregate‐basedarchivalapproachtoWebarchiving.(By“aggregate‐basedarchiving,”wemeanarchivingitemsbygrouporinseries,ratherthanindividually.)CorefunctionalityofthesuiteincludestheabilitytoidentifyWebcontentofpotentialinterestthroughcrawlsof"seed"URLsandthedomainstheylinkto;toolsforcreatingandmanagingmetadataforassociationwithharvestedobjects;websitestructuralanalysisandvisualizationtoaidhumancontentselectiondecisions;andpackagingusingaPREMIS‐basedMETSprofiledevelopedbytheECHODEPositorytosupporteasieringestionintomultiplerepositories.ThenextsectionsprovideanoverviewoftheWebarchivingproblem;backgroundontheArizonaModel;anoverviewofhowthetoolsworkandtheirtechnicalimplementation;andabriefsummaryofuserfeedbackfromtestingandimplementingthetools.AppendixitemsincludetheWebArchivesWorkbenchUserGuide(6.1),whichprovidesdetailedscreen‐by‐screendocumentationofthetoolsuite’sfunctionality.TheWAWImplementationGuideisprovidedinAppendix6.2.

2.2. TheWebArchivingProblem

2.2.1. TheUbiquitousWebForabroadrangeoforganizations,Websitesarenowthedeliverymechanismofchoicefornearlyanytypeofinformationcontent.Muchofthiscontentiscreatedanddisseminatedinelectronicformatsonly,withprinted(hard)copiesconsideredjustacourtesyorconvenience.Theelectronicformatenvironment,whileexpedientforcurrentaccesspurposes,presentschallengesforanyonechargedwithpreservingtheinformationovertime.ThesechallengesincludethesheervolumeofWeb‐publishedinformation,traditionalissuesofselectionanddescription,aswellthetechnicalchallengesassociatedwithlong‐termpreservationofdigitalobjects.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

9

2.2.2. VolumeandSelectionofWebContentAnimmediatechallengeofWebarchivingisassuringthatallcontentoflong‐termrelevancedeliveredthroughtheWebisidentifiedandcollected(i.e.,harvested).DifficultiesarisefirstfromthetaskofselectingpertinentcontentforpreservationfromtheenormousvolumeofinformationstreamingfromWebserversatanygivenpointintime.Selectiondecisionswillbeinfluencedbythechargeoftheindividualresponsibleforcapturingspecificcontenttypes(suchasalibrarianorarchivist)basedonappraisal,orcollectiondevelopment;onpoliciescreatedinconcertwiththemissionoftheinstitutionororganization;andontheaudience,orusercommunity,beingserved.ThesheervolumeofcontentpublishedontheWebmakesafullymanualperusalofonlineresourcesinfeasible.VolumeisstillafactorevenwhenWebcrawlers—asexplainedbelow—areengaged.ThedynamicnatureoftheWebalsocreatesproblemsforselectionandharvestingofcontent.URLscanchangeovernight;resourcescanbetakenoff‐linewithlittleornonotice;andnew,relatedcontentcanbeaddedinnewordifferentdirectoriesthanthosevisitedpreviouslybyaWebcrawlerharvestinganorganization'swebsite.AlthoughWebcrawlingautomatesarchivingofawebsite,itisquitepossibleforWebcrawlerssimplytomisscontentbecauseofa“robotsexclusionprotocol”(activatedbythesitecreatortomakepartsofasite“uncrawlable”)orbecauseoftheimpenetrablecharacteroftheDeepWeb(wherecontent,suchasaresultspagetoaWebform,isinaccessibletoaWebcrawlerorWebspider).1Inaddition,thevastmeasureoftheWebrendersscalableWebcrawlinganalmostintractabletechnicalchallenge.KnowingwheretofindallcontenteligibleforharvestingaccordingtocollectiondevelopmentandappraisalpoliciesbecomesnearlyimpossiblewithoutintentionalcoordinationorwithoutWebcrawlingtoolsandresourcesthataredesignedfor,andtakeaccountof,thefluidnatureofwebsitecontentandthemassivescaleoftheWeb.

2.2.3. TheImportanceofContextContextisaboutunderstandingrelationshipsbetweendifferentanddiscretepiecesofinformation.Itisaboutunderstandingwhytheinformationwascreated,bywhichindividualororganization,andatwhatpointintime.Contextualinformationcanhelpdefinetheboundariesandthescopeofharvestedcontent.Aswithanalogobjects,muchoftheusefulnessofdigitalobjectswhichmakeupourculturalrecorddependsonourhavingdescriptiveandcontextualinformationaboutthem.Oncecontentisidentifiedandharvested,itisnecessarytoprovideaccesstothedigitalobject.Suchcontentaccessmeansthatattentionshouldbepaidtocapturingaccuratemetadataalongwiththecontentitself.Thiscontextualmetadata1Webarchiving.(2009,March2).InWikipedia,TheFreeEncyclopedia.Retrieved21:56,March2,2009,fromhttp://en.wikipedia.org/w/index.php?title=Web_archiving&oldid=274363238

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

10

willhelpdescribetheorigin(or"provenance")oftheresource,aswellaswhyandwhenitwascreated.(Forexample,isthediscoveredresourceoneinaseriesofannualreportsfromaparticularstateagency?Isitasinglepublicationsummarizingresearchfindings?Ordoesitencompassresultsfromaspecificsurveytakenaspartofalargerefforttorevampcommunityservices?)Inthecaseofadigitalobject,metadatanotonlysupportshumaninterpretationofcontent,itisneededtoprovidecrucialtechnicalinformationformaintaininglong‐termviabilityoftheobjectitself.

2.3. TheArizonaModel:AnArchivalApproachtoWebArchivingTheWebArchivesWorkbenchtoolsuiteispremisedontheprinciplesofthe“ArizonaModel,”anaggregate‐basedapproachtoWebarchivingdesignedtobridgethegapbetweenhumanselectionandautomatedcapture.“Aggregate‐based”meansthatratherthanarchiveitemssingly,orindividually,theyareorganized(grouped)inseries,orinaggregates.TheArizonaModelwasdevelopedin2003byRichardPearce‐MosesoftheArizonaStateLibraryandArchives.

2.3.1. BackgroundMoststatelibrariesandarchiveshavemandatestocollectstateagencypublicationsandmakethemavailabletothepublic.Tothisend,therearewell‐establisheddepositorysystemsthathaveworkedwithpaperpublicationsformanyyears.InaWebenvironmentthenuancesofdeterminingwhatapublicationis,orwhoisresponsibleforselectionandcollectionofparticularinformationresources,becomeslessclear.Nonetheless,tomeetthesemandateslibrariansandarchivistsmuststillidentify,select,acquire,describe,andprovideaccesstostateagencyinformation"published"onwebsites.Inearlyattemptstodevelopacollectionofstateagencyelectronicpublications,twoapproachescameabout.AccordingtoPearce‐Moses,Cobb,andSurface(2005),thefirstapproachhasitspremisein“traditionallibraryprocessesofselectingdocumentsonebyone,identifyingappropriatedocumentsforacquisition;electronicallydownloadingthedocumenttoaserverorprintingittopaper;thencataloging,processing,anddistributingitlikeanyotherpaperpublication.”(175)Whilethisapproachensuresthatvaluabledocumentswillbegathered,itsdependenceonmanualselectionlimitsarchivingtoonlyaveryfewitems.ScalingthisprocessinaccordancewiththevastnessofWeb‐baseddocumentswouldnecessitateanexpansioninpersonnelthatfewstatelibrarieshavethefundingtoaddress.(Pearce‐Moses,Cobb&Surface,2005)Alternatively,intheotherapproach,softwaretoolsthatautomateregularlyoccurringWebcrawlsareengaged.AsPearce‐Moses,Cobb,andSurfaceassert,thismodel“tradeshumanselectionofsignificantdocumentsforthehopethatfull‐textindexingandsearchengineswillbeabletofinddocumentsoflastingvalueamongtheclutterofother,ephemeralWebcontentcapturedintheprocess.”(176)Yet,whilethismodelrelieveslibrariansand

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

11

archivistsoftheupfrontonusofselectionandorganization,atthesametimeitmayundulyburdenfuturesearchers,iffull‐textindexingandsearchcapabilitiesdonotevolveasanticipated.TheArizonaModel,explainedindetailbelow,constitutesathirdapproachtoWebarchiving,incorporatingbothhumanassessmentandautomatedtools.

2.3.2. AnArchivalApproachTheArizonaModelappliesanarchivalperspectivetocuratingcollectionsofWebpublications.Itexploitscertaintellingparallelsbetweenwebsitesandarchives:namely,theconceptofprovenance(i.e.,documentsclassedtogetherstemfromthesamesource)andtheorganizationalstructureinherentinboththesekindsofcollections—directoriesandsubdirectoriesforwebsites,andseriesandsubseriesforarchives.(Pearce‐Moses,Cobb&Surface,2005)Intheory,ifwebsitesorganizeWebpublicationsusingcommonfiledirectorystructures,informationaboutindividualdocumentswithinsub‐directoriescouldbeinheritedfromparentdirectories.IntheArizonaModel,whichdrawsonbasicarchivalpractice,websitesarehandledashierarchicalaggregatesratherthanasindividualitems,andtheoriginalorderofthedocuments(theorderinwhichthecreatingagencyoversawthem)ismaintained.Provenanceandoriginalorderareconsideredimportantcontextualpiecesofinformation.Retainingdocumentsintheorderinwhichtheywereoriginallymanagedandkeepingthemclusteredtogetherbasedontheoriginatingagencyenhanceone’sknowledgeofthecreationandoriginaluseofthedocuments.Provenanceandoriginalorderalsoallowfor"inheritance"ofhigher‐levelmetadatameanttodescribethehomeagencyfromwhichthedocumentscameandthewaythedocumentswereoriginallyarranged.Finally,anarchivalapproachtocuratingacollectionofWebdocuments—focusingfirstonaggregates(collectionsandseries),ratherthanonindividualdocuments—trimsthenumberofitemsthatneedtobeappraisedbyahumandowntoamoremanageablenumber.

2.3.3. ArizonaModelSummaryTheArizonamethodologyisbasedonanarchivalapproachtotheWebthatincorporatesbothhumanselectionandautomatedcapture.Inthisapproach,Webmaterialsaremanagedinawaysimilartotheorganizationofmaterialsinpaper‐basedarchives:asahierarchyofaggregatesratherthanasindividualitems.ThisapproachreducestoamorepracticalsizethesheervolumeproblemofpreservingWebmaterials,whilemaintainingascalabledegreeofhumaninvolvement.ItistheguidingmodelforOCLC’sWebArchivesWorkbench.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

12

2.4. TheWebArchivesWorkbench:ImplementingtheArizonaModel

TheArizonaModelisparticularlyinstructiveinitsevocationofwhere,inthepracticeofarchivalmanagement,automationcanbeconsideredmostuseful.Thatis,whiletechnologymaybeappliedforinformationprocessingactivitiessuchasdatasearchingandtracking,andlistconstructionandclassification,tasksfordistinguishingwhethercontentisin‐scopeorisvaluablearebestreservedforhumans.TheoverallgoalbehinddevelopingasuiteoftoolsbasedontheArizonaModelistoachieveaproductivecomplementbetweenautomatedprocessingandhumandecision‐making,allthewhileadheringtoestablishedarchivalprinciples.ThesoftwarethatOCLCcreated,theWebArchivesWorkbench,comprisesfivetoolstoidentify,select,describe,andharvestWeb‐basedmaterials,aswellastokeeptrackof,orlog,theseactivitiesandtogeneratereportsaboutthem.Indoingso,theyserveasaconduitbetweenhumaninvolvement(viamanualselection)andcomputerizedcaptureofWebcontent:theyconvertthearchivist'spoliciesforcollectingcontentcreatedontheWebtosoftware‐centeredrulesandconfigurations.Theyalsoassistinformationprofessionalsbyprovidingthemeanstoaddmetadatatoharvestedobjectsasaggregates.Inaddition,thetoolsimplementthePREMIS‐basedMETSprofilesdevelopedbyECHODEP(attheUniversityofIllinois)forpackagingcontent;bydesigntheseprofilesfacilitateingestionintomultipleexternalrepositoriesandsupportlong‐termpreservation.2PackagingisthelaststepintheWAWworkflow,afterwhichtheobjectsarereadyforingestintoanexternaldigitalrepository.

2.4.1. DevelopmentConsiderationsOCLCledthedevelopmentofthetoolsuite.Priortotooldesignanddevelopment,OCLCcarefullyconsideredtheusercommunity,whichitidentifiedasablendoflibrariansandarchivists.Significanttoitsconsiderationwastheissueofterminology:howshouldtoolsandfeaturesintheWebArchivesWorkbenchbenamed,orcalled,ifamixedcommunityoflibrariansandarchivistswastoserveasitsuserbase?The word “series,” for example, might invoke semantics and usage for an archivist that is different, even unfamiliar, for a librarian. Thus,inexploringtheusercommunity,OCLChadarchivistslookatnewtypesofmetadataandaskedlibrarianstothinkaboutprinciplesofarchiving,suchasarchivalseriesandthecurationof

2TwoMETSprofilesdevelopedbyECHODEPareatworkhere:theECHODepGenericMETSProfileforPreservationandDigitalRepositoryInteroperability(accessibleathttp://www.loc.gov/standards/mets/profiles/00000015.html)andtheECHODepMETSProfileforWebSiteCaptures(accessibleathttp://www.loc.gov/standards/mets/profiles/00000016.html).Theformeristhe'toplevel'format‐genericprofile,whichfocusesonimplementingPREMIS.Thelatter,awebcaptureprofile,isanexampleofa'sub‐profile,’whichisusedwiththefirstonetoprovideastructureformoreformat‐specificinformation.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

13

documentsinaggregateratherthanasindividualitems.Eventually,OCLCelectednottodevisenewterminologyfortheconceptsatissue;notonlydidtheteamconcludethatterminologywas,inessence,atrainingmatter,italsosawthattheworkoflibrariansandarchivistsoftenoverlap—i.e.,eachisfrequentlyengagedinthemilieuoftheother.Indoinghigh‐levelanalysisfortheuserinterface,OCLCarrivedatseveralworkingassumptionsthathadsomebearingonthedesignofthetoolsuite.OneassumptionwasthatbecausethetoolsintheWebArchivesWorkbenchmightchangeovertime,theyneededtobe“aware”ofeachotherandenablethesharingofdata,but—asimportant—theusershouldhavetheabilitytooptnottouseatoolintheWorkbench.Throughinterviewswithlibrariansandarchivists,OCLCalsolearnedthatharvestingresponsibilitiesoftenweresharedamongindividuals;asaconsequence,datageneratedbyatoolhadtoberenderedshareablebymultipleusers—andsimultaneouslyso.Thisfeaturewouldallowausertoviewtheworkofanother.Inaddition,ratherthantryingtointegratetheWorkbenchintoaninstitution’smanyauthenticationschemes,OCLCincorporatedasimplescheme,allowingtheWorkbenchtorunwithjustbasicadministration.Intermsofharvesting,OCLCdesignedmorethanoneharvestingworkflow,sothatausercouldselecttheappropriatelevelofanalysisandsophisticationforatask.Forinstance,theQuickHarvestfeatureisasingle‐screenlaunchpointthatrunsaharvestimmediately.TheAnalysistool,whichispartofanextendedharvestingworkflow,requiresmoreset‐up,butitresultsinabigger“pay‐off”intermsofthewebsitechangeobservationsithandlesautomaticallyfortheuser(thisisexplainedbelowmoreformallyinthedescriptionoftheAnalysistool).Finally,wherethedepositofharvestedinformationisconcerned,OCLCknewthatingesttoavarietyofrepositories,includingitsownDigitalArchiveaswellasDSpacerepositories,wouldneedtobeaccommodated.Aclean,simpleinterfacewascreatedbetweenthepointwheretheWorkbenchendsandarepositorysoftwareapplicationwouldbegin;thatis,theWorkbenchgeneratesharvestedpackagesofcontentinafilesystemthattherepositorythenpicksupandprocesses.(Thisisthepointintheworkflowatwhichtheabove‐mentionedPREMIS‐basedMETSprofilesdevelopedbyECHODEPisimplemented.)

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

14

2.4.2. OverviewoftheWebArchivesWorkbenchTheWebArchivesWorkbenchisasuiteofwebarchivingtoolsforidentifying,selecting,describingandharvestingweb‐basedcontentbasedonlibraryandarchivalpractice.Itbridgesthegapbetweenmanualselectionandautomatedcapturebytransformingcollectionpoliciesintosoftware‐basedrulesandconfigurations.Itaccommodatesavarietyofwebharvestingapproaches,includingmassharvesting,selectiveharvesting,andindividualdocumentharvesting.ContentispackagedusingtheECHODEPMETSprofile,whichisdesignedtosupportthecollectionofPREMISpreservationmetadata,andtofacilitateingestionintoavarietyofexternalrepositories.ThefivetoolsintheWorkbencharetheDiscovery,Properties,Analysis,Harvest,andSystemtools.Below(Fig.1)isanoverviewoftheWorkbenchWorkflow,followedbyamoredetailedtourofthefunctionalityofeachtool.

Figure1:OverviewofWAWWorkflow

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

15

2.4.3. ATouroftheWebArchivesWorkbenchThescreenshotinbelowdisplaysthemainWAWtoolsscreenaftertheuserhasloggedon.Thefivetoolsareexemplifiedbythetopmostrowoftabs.(ThoughtheAlertstabsitsinthisrow,itislessatoolthanafeatureoftheWorkbench.ItenablesuserstoaccessacollectionofreportsandalertsfortheDiscovery,Properties,Analysis,andHarvestTools.)IntheinterfacefortheWAWtools,atabiscoloredintosignifywhichtoolisopen,oractive,atthatparticularmoment.InFigure2,forexample,theDiscoverytabisshaded,becausetheDiscoverytooliscurrentlyactive.Similarly,theEntryPointstabisshaded,becauseitisactiveasacomponentoftheDiscoverytool.

Figure2:ScreenshotofWAWinterfacehomescreenAkeyadvantagetotheWorkbenchtoolsisthatharvestingofWebcontentmaybescheduledsothatitoccursonaregularbasis.However,theWorkbenchtoolsalsoofferusersthealternativeofrunningaone‐timeharvest.ThisisknownastheQuickHarvest,accessibleviatheHarvesttab.QuickHarvestisaddressedbrieflyinthediscussionbelowoftheHarvesttool.

2.4.3.1. TheDiscoveryTool:FindingWebContentofInterestThefirststepinconstructinganarchiveofWeb‐basedresourcesistodeterminewhichpartsoftheWebholddesirable,andthuscollection‐worthy,content.ThisstepliesatthecruxoftheDiscoveryTool.TheDiscoveryToolaidsinidentifyingpotentiallyrelevantwebsitesbycrawlingrelevant"seed"entrypointstogeneratealistofdomainstowhichthe"seed"siteslink.(Note:AnentrypointisaspecificwebsiteURLwheretheDiscoveryToolwillbegintosearchfordomainsorcollectWebcontent.AdomainisaserverontheInternetthatmaycontainWebcontentandisidentifiedbyahigh‐leveladdress.Forexample,

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

16

http://www.illinois.gov/news/isawebsite,anditsdomainis"Illinois.gov".DomainsdoNOTinclude“http://”.)3Inanapproachthateffectivelyborrowsfromcitationanalysis,theDiscoveryToolisdesignedontheideathaton‐topicsiteslikelypointtoothersitesaddressingasimilartopic.Thedomainsinthegeneratedlistarethenmanuallyevaluatedasin‐scopeorout‐of‐scope,basedonsubjectinterestandcollectingpolicies.(SeeFigure3,whichshowsalistofdomainsreturnedafterentrypointshavebeencrawled,aswellasradiobuttonsthatnotethescopeforeachdomain.)Thisprocessresultsinalistofdomainsdefiningasub‐setoftheWebthatisrelevantfortheuser'sarchivingpurposes.Domainsmarkedasin‐scopecanbeassociatedwithanEntity(i.e.,creator,oragency,ororganizationresponsiblefortheWebcontent).Later,inthePropertiesandAnalysisTools,metadataassociatedwithentities(creatorssuchasagenciesororganizations)canbeinheritedbycontentharvestedfromaparticularwebsite.

Figure3:ScreenshotoftheinterfacefortheDomainsfeatureoftheDiscoveryToolInsum,theDiscoveryToolisusedto:

• Generatealistofpotentiallyrelevantdomainsbycrawlingseedsites.• Assigndomainsasin‐scopeorout‐of‐scope.• AdddomainsmanuallytotheDomainslist.• Associatedomainswithentities(creatingagenciesororganizations).

3Anoteaboutcapitalizationinthissectionthatprovidesatourofthesoftware:here,entrypointsanddomainsarenotcapitalized,becausewearespeakingoftheminthegeneraluseoftheDiscoveryTool.However,theyarealsofeaturesoftheDiscoveryTool.Whenwediscussthemassuch,theywillbecapitalized.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

17

2.4.3.2. ThePropertiesTool:EnteringMetadatatoDescribeContentCreators(Entities)

AnotherpremiseoftheArizonaModelisthat,asmuchaspossible,metadatashouldbeenteredonlyonceandbeinheritedbyassociatedharvestedobjects.AftertheEntryPointsandDomainfeaturesoftheDiscoveryToolarerun,andentities(i.e.,contentcreators)havebeenassociatedwithdomains,metadataabouttheresultingentitiesmaybeenteredviathePropertiesTool.Besidesenablingthemanagementofinformationaboutentities,thePropertiesToolalsoallowstheusertodescribetherelationships(e.g.,parent/child)ofentitieswithoneanother,aswellasenterotherinformationsuchascontactinformation.Importantly,inaddition,thePropertiesToolalsocanbeeasilyengagedtocreateanalysesandseriesfromentities'websites.Thepurposeofenablinganalysisofawebsiteistoexamineitsstructure—i.e.,thedirectoriesthatmakeupthewebsite.(FormoreontheAnalysisTool,seebelow.)Insum,thePropertiesToolisusedto:• Createandmanagealistofcontentcreators(entities).• Assignmetadataandotherpropertiestoentities.• Specifywebsitesthatentitiesareresponsiblefor,andcreateanalysesandseries

(explainedbelow)basedonthosewebsites.

2.4.3.3. TheAnalysisTool:VisualizingtheStructureofaWebsiteThroughtheAnalysisToolitispossibletodiscernwhetherthereisvaluablecontentinthedirectoriesthatcompriseawebsiteand,ifso,toidentifythosechunksofcontent."Series"referstoflexibleaggregatesofcontentthatareanalogoustoarchivalseries—whichmaybeawholewebsiteoraportionofit(e.g.,onlyPDFsofannualreports),orevenoneindividualpageordocumentfromwebsites.Looselydefined,aseriesisanycollectionofWebmaterialthatauserchoosestocollectinone"bucket."Inaddition,seriesareusedinordertodrivetheWorkbenchharvestoperations.WhileseriesmaybeestablishedwithinthePropertiesTool,theycanalsobeestablishedandmanagedusingtheAnalysisTool,thenharvestedandpackagedintheHarvestTool.TheAnalysisToolhastwofunctionalareas:

• theAnalysisscreen,whichprovidesvisualizationtoolstoaidincontentselectiondecision‐makingandinseriesstructuredecisions.Here,too,abaselineanalysiscanbecreatedagainstwhichtomeasurefuturewebsiteanalyses;

• theSeriesscreen,whereseriesarecreated,edited,andmanaged;Seriesobjectsarekept;andSeriesharvestsareregulated.

TheAnalysisToolisusedto:

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

18

• Analyzethestructureofawebsite.• Enterassociatedentities.• Setabaselineanalysisforcomparisonwithfutureanalyses.• Adjustsettings,suchasspidersettingsandchangenotificationthreshold

settings.• Definea"series"forharvesting(e.g.,harvestasanindividualobject),with

optiontoassociateitwithanentity.• Holdseriesobjectspriortoharvest.• Scheduleharvestsofseries.

Inaddition,operationsforholdingseriesobjectsandharvestingthemmaybeaccessedviathePropertiesTool.

2.4.3.4. TheHarvestTool:Reviewing,Packaging,andIngestingHarvestedContent

AlltheharvestsintheWorkbench,includingseriesharvests(viatheAnalysisTool)andquickharvests,arelistedintheHarvestTool.TheHarvestToolisusedtomonitorthestatusofharvestsandtoprovideanopportunitytoreviewandmodifytheharvestbeforepackagingitupandingestingitintoarepository.Theremaybesingle‐objectharvestsormultiple‐objectharvests,dependingonwhethertheoptiontoharvestcontentasindividualobjectswasselectedintheSeriesdetailsscreenofanAnalysis‐basedSeries(i.e.,intheAnalysisTool).TheQuickHarvestfeatureschedulesone‐timeharvestsofcontentbasedonaURLinputteddirectlyintotheHarvestTool.Afterharvestsarecompletetheymaybereviewed,atwhichtimeadditionalmetadatamaybeassigned.Theusercanrender,ordisplay,theharvestedcontentwithintheWAWtool,offtheHarvestResultspage.Theusercanactually"stepinto"theharvestedcontentatboththeharveststartingpointandatanyotherpointinthewebsite(viathewebsitefilestructuredisplay),andthesoftwarewillrenderthewebsiteappropriately.ThepurposeofthedisplayfeatureintheWebArchivesWorkbenchistoallowtheusertoverifythecorrectnessofwhatwasharvested—“correctness”meaningthatalltheinformationexpectedtohavebeencollectediscollected.Oncetheharvestedcontentisconfirmedascorrect,itthencanbeingestedintotheuser'slocalrepository.Insum,theHarvestToolisusedto:

• MonitorthestatusofharvestsscheduledintheAnalysisTool.• Deletecompletedharvests.• Reviewcompletedharvestcontent,whethersingle‐objectormulti‐object,

priortoingest.• Reviewcompletedharvests;ifdesired,editmetadataand/orinclude/exclude

content.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

19

• Ingestharvestedcontentintoarepository.• Launchaone‐timequickharvestusingtheQuickHarvestTool.

2.4.3.5. TheAlertsTab:WorkbenchNotificationsAsmentionedabove,the“Alerts”tabisnotatoolbut,rather,afeaturefornotifyingtheuserofavarietyofsystemsinformation.Thisinformationincludesnotificationabouterrors,incompleteprocesses,completedprocesses,andnewinformation(suchasthediscoveryofanewdomain,oranewfolderencounteredduringanalysis).Inshort,theAlertsTabisusedtoreviewreportsandalertsaboutWorkbenchfunctions.

2.4.3.6. TheSystemTools:MonitoringandManagingWorkbenchActivitiesTheSystemToolstabcontainsanumberofbehind‐the‐scenesfunctionsthataffectandreportonactivitiesofthefivemaintoolsoftheWorkbench.TheSystemToolsaredividedintofourfunctionalareas:

• theAuditLogpage,whichdisplaysrecentWorkbenchactivitiesandevents;• theSpiderSettingspage,wheretheusercanconfiguredefaultDomain,

Analysis,andHarvestspidersettings,aswellascreateadditionalDomain,Analysis,andHarvestspiderswithcustomsettings.Specifically,typesofspidersettingsinclude—butarenotlimitedto—depth(howdeeplyawebsiteshouldbecrawled,orspidered)andparametersoftime(when,howfrequently,andforhowlong);

• theImport/Exportpage,throughwhichtheusercanimportorexportavarietyofmetadatacommonlyusedintheWorkbench.Theseincludeentities,domains,andsubjectheadings.

• theReportspage,whichgeneratesprintablereportsonactivitiesofthemainfiveWorkbenchtools.Itoffersaviewofin‐developmententityandseriesreports.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

20

2.4.4. WebArchivesWorkbenchToolsSummaryTheWebArchivesWorkbenchimplementsanarchivalapproachtotheselectionandpreservationofdigital(Web‐based)content.TheWorkbenchautomatesmuchofthemethodologyembracedbytheArizonaModel,particularlybeyondtheinitialselectiondecisionsmadebythearchivist(e.g.,decidingatthestartofthearchivingprocesswhichwebsite,orwhichpartofawebsite,tocaptureandpreserve).Afterselectionparametersareset,theWorkbenchfacilitatesthecaptureandmanagementofthedigitalmaterialsinhierarchicalaggregates‐‐notunlikethearchivingofprint‐basedmaterials.

OVERVIEW OF WEB ARCHIVES WORKBENCH TOOLS

Discovery Tool

• discover domains

• group and prioritize domains

Comprising the Entry Points and Domains tabs, the Discovery Tool helps to identify potentially relevant web sites by crawling relevant “seed” Entry Points to generate a list of domains that they link to. At the end of this process the users have a list of domains that defines the sub-set of the web relevant for their archiving purposes. From here, the Properties and Analysis Tools are used to manage creator information about domains, and associate this information with harvests of content.

Properties Tool

• organize collection space

• create metadata

Comprising the Entities tab, the Properties Tool is used to maintain information about content creators or ‘Entities’ (e.g., government agencies), and associate them with the domains and web sites they are responsible for. The Properties Tool also allows users to describe the relationships (e.g., parent/child) of Entities with one another, as well enter high-level metadata about them that may be inherited by content harvested from their web sites. Importantly, the Properties tool can also be used to create and associate Series with Entities’ web sites. Series and harvests are then further managed using the Analysis and Harvest/Package Tool.

Analysis Tool

• visualize site structure

• associate metadata

• schedule harvests

Comprising the Analysis tab and the Series tab, the Analysis Tool provides website structure visualization tools to aid content selection decisions, and allows users to define archival Series, associate metadata with these series, and schedule recurring harvests of Web content. Harvesting activities are then monitored and managed in the Harvest Tool..

Harvest Tool

• review content

• package for ingest in external repository

Comprising the Harvester and Quick Harvest tabs, the Harvest Tool lists all harvests within the Workbench, including Series harvests scheduled using the Analysis Tool as well as Quick Harvests. It is used to monitor their status, initiate the final harvesting and ingest steps for the completed harvests tracked in the Harvest Tool, including reviewing harvest contents and metadata before ingest. This is the final step in the Web Archives Workbench workflow. It also offers a separate Quick Harvest feature.

Systems Tools

• reports and settings

The System Tools manage and monitor Workbench activities, reporting on operations undertaken in the four other tools. It has four functional sections: an Audit Log page (shows recent Workbench activities); a Spider Settings page (parameters for spidering may be set here); an Import/Export page (for moving metadata); and a Reports page (for producing printable reports about activities performed by the other tools).

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

21

2.4.5. BehindtheScenes:OCLC’sTechnicalImplementationoftheWebArchivesWorkbench

AnISO9001company,OCLChasanexternallyauditedQualitySystembasedontherequirementsofISO9001asanaidforensuringthatproductsmeetuserexpectationsandspecifiedrequirements.OCLC'sprojectdevelopmentlifecycleisaprocessthatspecifieshowOCLCservicesaremarketedanddeveloped.Thisprocessincludeslifecycledocumentssuchasprojectplans,requirements,design,testplans,operationssupportplansandpost‐projectreviews.TheWebArchivesWorkbenchprogramfollowedthislifecycle.TheWAWprogramwasdividedintothreemainprojectsandmanysmallerreleasesinordertoreduceriskandtocreateafeedbackloopallowingrefinementoftherequirementsbasedonpreviousreleases.Therewerethreemajorsoftwarereleases,plusapproximately20additionalreleasesoverthecourseofthethree‐yearprogram.Thethreemaindevelopmentprojectswerebasedonthemainareasoffunctionalityofthetoolsuite:(1)DomainandEntity,(2)AnalysisandPackager,and(3)SiteAnalysisandChangeManagement.ThoughtheDomainandEntityfeaturesinWAWweresomewhatfunctionallysimple,theDomainandEntityprojectcarriedasignificantamountofriskbecauseitbuiltthetechnicalfoundationonwhichtherestoftheprojectwouldrest.TheSiteAnalysisandChangeManagementtoolswereriskyduetotheusabilityissuesinvolvedinclearlyrepresentingtotheusertheprocessofharvestingandevaluatingchangestowebsites.ThroughouttheprojectoneofourmainconcernswashowtorepresenttheArizonaModelinaclearandusablewayinsoftware.(Thisconcernisaddressedinthesection“TheWebArchivesWorkbenchWorkflow.”)Basedonearlydiscussions,thesystembegantobeseenasa“workbench,”intowhichcomponentsandsystemswouldbeincorporatedanddroppedovertime—perhapsbecauseuserswouldprefertoapplysomeoftheirlocaltoolsorperhapsbecausetheywouldhavemultipletoolsforagiventask.Additionally,eachcomponentwouldgrowitsdataqualityovertime,thereforeforcingtherestofthesystemtoadapteasilytoevolvingspecificationsanddataversions.Therefore,thearchitectureisdesignedforlocation,interface,anddata‐exchangetransparencies,whichmeansthatchangesinthosethreemainareasareexpectedtodriveallothersystemcharacteristics.ThehighleveltechnicalarchitectureofthesystemwasspecifiedusingtheReferenceModel‐OpenDistributedProcessing(RM‐ODP).4Thisframeworkusesvarious

4Forthespecification,see“TheISOReferenceModelforOpenDistributedProcessing–AnIntroduction,”athttp://www.enterprise‐architecture.info/Images/Documents/RM‐ODP2.pdf.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

22

viewsofasystem,includingadomainmodelview,aninformationview,anapplicationview,andatechnologyanddeploymentview.5Usingthisframework,OCLCcreatedthefollowingearlydomainmodelofthesystem.(SeeFig.4onnextpage.)Someoftheboxesinthisdomainmodelwerelaterremovedfromtherequirements,asourunderstandingofthesystemtobebuiltchangedovertime.Thearchitectureconsistedofseverallayers:client,integration,service,andpersistence.TheclientlayerconsistedofauserinterfaceimplementedusingtheStrutsframeworkasamodel‐view‐controllertostructuretothecode.ThesecondlayerisaWebserviceslayerthatprovidesthehooksforaclienttotalktotheapplication(althoughthecodewasnotusedinthisway).Thislayeralsoprovidesintegrationbetweentoolsandtranslationbetweentheinternalandexternalrepresentationsofthedata.EachdevelopingWAWtool(Entity,Analysis,Domain,etc.)implementedaconsistentHelperAPItoallowtheuserinterfacelayertoAdd/Update/Delete/Searchsingleormultipleobjects.TheOracledatabaseprovidedapersistencelayer.Oncethehighleveldesignwasproduced,adetaileddesignwasproducedforeachtool.OCLCcreatedusecasesforallmainactivitiesineachofthetools.

5InRM‐ODPthearchitectureofasystemisdescribedby5views(essentially5differentpointsofview)reflectingtheseparationofresponsibilitiesbetweenbusinesssponsors,developers,andsupportstaff.Thoseviewsare:

1. Enterprise‐community,enterpriseobjects(domainmodel),objectives(requirements/usecases),roles

2. Information‐schemas,objectattributes,databoundaries,constraints,semantics3. Computational‐components,interfaces,interactions,contracts4. Engineering‐transparencies(location,access,failure,persistence),nodes,channels5. Technology‐technologies&products(theonlydependenceonspecificproductsand

implementationpackages)

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

23

Figure4:DiagramshowingOCLC'searlydomainmodelofthesystemthateventuallydevelopedintotheWAWsuiteoftools.Eachdeveloperworkedinhisown“sandbox,”whereaWAWinterfacewassetupforhisexclusiveuse.Theworkofmultipledeveloperswasintegratedintoa

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

24

developmenttestenvironmentcalled“Baseline.”Thisway,productmanagersandtestanalystscouldreviewworkinprogressinBaseline.WhenBaselinewasreadyitwasmigratedintoaQualityAssuranceenvironment,whereformalizedtestingwasdoneagainstatestplan.Formajorinstalls,BaselinewasalsoinstalledatUIUCforadditionaltesting.Thefinalstepofthedevelopmentprocesswastodeploythesoftwareintoaproductionenvironment.TheWebArchivesWorkbenchwasreleasedasanopen‐sourcepackageonSourceForgeinOctober2007.ReleasedocumentationincludesdetailedinstallationinstructionsandadetailedUserGuideforunderstandingandusingthetools.• WAWReleasehomepage:

https://sourceforge.net/projects/webarchivwkbnch/• AdministrationGuide:

https://sourceforge.net/project/showfiles.php?group_id=205495(alsoprovidedasanAppendixiteminthisreport)

• UserGuide:https://sourceforge.net/project/showfiles.php?group_id(alsoprovidedasanAppendixiteminthisreport)

• WAWsoftwarepackage:http://webarchivwkbnch.cvs.sourceforge.net/webarchivwkbnch/webarchivwkbnch/

TheAdministrationGuidehasruntimeenvironmentrequirementsforWAW.Italsohasalistofall3rd‐partysoftwareusedbyWAWintheIncorporatedCodesectionofthedocument.Thethird‐partysoftwareisincludedintheWAWdistribution(refertolinkforWAWsoftwarepackage).AnOCLCsubscriptionisnotrequiredtouseWAWortousethisthird‐partysoftware.PleaseseetheHOWTO‐build‐install‐locally.txtfileintheWAWdeploymentforadditionalinformation.TheWAWtools,asdevelopedbythisproject,willcontinuetobemadepubliclyavailableindefinitelythroughSourceForge.Inaddition,in2008OCLCreleasedanewarrayofservicesincorporatingcomponentsoftheWAWtoolsintoaworkflowwithCONTENTdm,WorldCAT,andtheOCLCDigitalArchive.

2.5. Findings‐UserFeedbackTestingoftheWAWtoolswasundertakeninvaryingdegreesbytheoriginalprojectcontentpartners,aswellasbyseveralvolunteerorganizations.Feedbackabouttheirexperiencesworkingwiththetoolswasgatheredduringlarge‐groupprojectmeetingsatOCLC,aswellasthroughphoneconversationsande‐mailexchanges.TheoverallresponseindicatesthattheWebarchivingapproachoftheWAWtoolswas“elegant”andworthconsideration,butinpracticecontentpartnersgenerallydidnotimplementthefullfunctionalityofthetools.Thus,thepotentialbenefitsof

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

25

applyinganarchivalapproachtotheWebwerenotrealizedcompletely.Reasonsforthispartialimplementationhavetodowithinadequateresourcesandtimetowardtrainingfortheuseofthetools,whichalsopointsuptheircomplexity(explainedinfurtherdetailbelow).TheWebArchivingWorkbenchispowerfulandextensiveintermsofwebharvestingandcontent,orseries,analysis,but—accordingtothefeedbackfromourcontentpartners—atacostofheuristicsandusability.Notsurprising,theQuickHarvestfunctionality(which,becauseseriesanalysisisnotanoptioninit,involvedfewerstepsandthuslessmanagementthantheregularHarvesttool)wasengagedmostoften;forsome,theQuickHarvestfeaturebecameamuch‐valuedcomponentoftheirdailyworkflows.Changesincontentdeliveryapproaches—suchasfromstaticWebpagestodatabase‐drivenpages—constitutedanotherreasonfornotapplyingthefullfunctionalityofthetools.

2.5.1. LimitedResourcesandLimitedTimeDuringtheirparticipationintheECHODEPositoryproject,statelibraryandarchivespartnersremainedundercontinualoperationalpressurestorespondtotheneedforcapturingcontentfromagencywebsites.SomepartnerstestedtheWAWtoolswhilecontinuingtouseotherWebcontentcaptureapproachesinordertomeettheirimmediateobligations,leavingfewerresourcestofocusontheWAWtools.Becausethetoolswerestillunderdevelopment,testingofthevariousphasedreleasesmayalsohavebeendifficulttoincorporateintodailyworkflows.Supportfromtheproject(intheformofinterns)hadbeenplannedbutwasgearedtotheearlyreleasesoftheWorkbench,beforethefullfunctionalityofthetoolswasimplemented.Inhindsight,puttingprojectresourcestowarddirectworkwithcontentpartners,asoriginallyintended,mighthaveresultedinmoreuseofthefullfunctionalityofthetools,especiallyiftimedmorespecificallytocoincidewithlater,morefullyfunctional,softwarereleases.

2.5.2. ComplexityoftheToolsAccordingtouserfeedback,theQuickHarvestandDiscoverytoolswereeasiesttouse,becausetheycouldbesetupquicklyandincorporatedintoexistingworkflowswithoutincreasingtheneedfornewresources.ThefullfunctionalityofthetoolsinvolvesunderstandingaprocesswithagreaterlevelofcomplexitythanthatpresentedbytheQuickHarvestoption;partnersreportedthatitwaseasiertousetheQuickHarvestandDiscoverytools,ratherthanexpendtimeandresourcesforlearning,ortesting,thetoolssuiteasawhole.Further,somecontentpartnersreportthatthecomplicatedinterfaceofthetoolswasabarriertousingthemtotheirfullestpotential.

2.5.3. WebContentDeliveryTheassumptionproposedbythearchivalmodel—thatawebsiteanditsdirectoriesaresimilartoanarchivalrecordcollectionandsetofrecordseries—doesnotapply

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

26

todayaseasilyasitdidwhenthemodelwasfirstproposedin2003.Anincreasingamountofcontentisnowdeliveredthroughdatabase‐drivenwebsitesratherthanthroughstaticWebpages.Therelationshipsbetweencontentitemsthatmayhavebeenobviouswhenstoredinafiledirectoryarenotalwaysapparentwhenstoredinadatabase.Therefore,crawlingdomainstofindpotentialcontenttoharvestandapplyinginheritedmetadataaccordingtoadirectorystructurearenowlessusefulapproachesthantheywerejustafewyearsago.Nonetheless,despitethisshiftinhowinformationisdeliveredviawebsites,theconceptofcontentinheritingmetadatafrompreviouslyharvestedcontent,andthenassociatingthatcontentwithanexistingaggregatecollection,continuestobeusefulformakingautomatedharvestprocessesmoreeffective.

2.6. ConclusionsandNextStepsStatelibrariansandarchivistscontinuetosearchforthebestmethodsforcapturingWebcontentbasedontheirspecificmandatesandtheresourcestheyhaveavailabletothem.RecentdevelopmentsinWebarchivingservicesandtoolsprovidenewopportunitiesforpartneringwithothersandforexploringnewworkflows.TheWebArchivesWorkbenchtoolsareoneoptionamongmany.TheyautomatethemethodologyprescribedbytheArizonaModel,whichispremisedonkeyarchivalpractices,suchasobservationofprovenanceandadherencetooriginalorder.Thefourmaintools(Discovery,Properties,Analysis,andHarvest)enabletheidentification,selection,description,andpackagingofdigitalcontent.Inaddition,theWAWsuiteincludesfunctionalitiesforerrornotification,aswellasSystemtoolsforoverseeingandreviewingWorkbenchactivities(intheformofauditlogs,spidersettings,metadataimport/exportoptions,andreportsontheactivitieslaunchedbyothertoolsinthesuite).ThelessonslearnedfromdevelopingtheWorkbench,andtheunderlyingarchivalmodelusedtodirectitsdevelopment,underscorethemergingrolesandresponsibilitiesofarchivistsandlibrariansinthedigitalenvironmentandtheneedtore‐evaluateandre‐envisionworkflows.Moreover,thecontinuingmissionandsignificanceofthisworkhavebeenaffirmedinthesecondphaseofNDIIPP.Forexample,theUniversityofIllinois,OCLC,andtheUniversityofMarylandhavepartneredtodevelopastand‐alone,open‐sourcemetadataextractiontoolintendedtoprovideaccesstoarchivedcontent–akindofnextstepfortheWebArchivesWorkbench.Inaddition,intheStateInitiativescomponentofNDIIPP,aselectionofstatelibrariesacrossthenationarecollaboratingtodeveloptoolsandservicemodelsforthemanagementandpreservationofstategovernmentdigitalmaterials.Theseprojectsaddressdigitalpreservationinavarietyofcontexts,includingdisasterreadinessandtherecoveryofdata.ThroughtheStateInitiativeswork,NDIIPPisaddressingthefundamentalissueofkeepingat‐riskstategovernmentresourcesviableaspartofournationalheritageandrecord.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

27

3. RepositoryEvaluationandInteroperabilityAnothercomponentoftheECHODEPositoryprojectistheevaluationofvariousopensourcerepositorysoftwareapplications,withafocusonhowtheseapplicationssupportactivitiesofaninstitutionororganizationinterestedinprovidingservicesassociatedwithatrustworthydigitalrepository.Thissectiondescribesthedevelopmentofanevaluationframeworkbasedonthefirstdraftofthe2005RLG/NARAAuditChecklistfortheCertificationofaTrustedDigitalRepository,DraftforPublicComment(AuditChecklist),ourrepositorytestingandfindings,andhowtheseactivitiesledtothedevelopmentofatoolsuite(theHubandSpoke)forsupportingrepositoryinteroperabilityandthecollectionofmetadataimportantforpreservation.

3.1. RepositoryEvaluation

3.1.1. BuildinganEvaluationFramework:ApplyingtheTrustedDigitalRepositoryChecklisttoRepositoryEvaluation

Ourgoalistoprovideanevaluationframeworkthatreflectscurrentthinkingondigitalpreservationstandardstohelpcuratorsofdigitalcollectionslibrariansandarchivistsassessdigitalrepositorysystems,withafocusontheirabilitytosupportlong‐termpreservation.The2005RLG/NARAAuditChecklistfortheCertificationofaTrustedDigitalRepository,DraftforPublicCommentprovidedavitalstartingpoint.TheAuditChecklistwasdevelopedbyajointtaskforcefromRLGandtheNationalArchivesandRecordsAdministration(NARA).Itprovidesameansbywhichaninstitutioncanperformaself‐evaluationtodeterminehowwellitispositionedatanorganizationalleveltoprovideanexpectedleveloftrustworthinessasadigitalrepository.Weconsideredittobeastate‐of‐the‐artarticulationofwhatitmeans,atanorganizationallevel,tobeasuccessfulcuratorofdigitalresources.Wethereforedecidedtousethisdocumentasastartingpoint,andprovidesupportforusingthoseportionsthatarerelevanttosoftwareasa‘lens’forconsideringrepositorysoftwaresystems.OurprojectteamreviewedeachAuditChecklistitemwiththequestioninmind,“Howmightarepositorysoftwaresystemsupportanorganizationinmeetingthiscriterion?”SomeChecklistitemsareapplicabletosoftwareapplications;othersarenot.Weisolatedtherelevantitems,and,throughmuchdiscussion,testingandreview,developedasystemofannotationstodescribehoweachparticularrelevantChecklistitemmightbeappliedtoassessmentofrepositorysoftwaresystems.ThisadaptedChecklistisourAnnotatedAuditChecklist.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

28

OurAnnotatedAuditChecklistevaluationframeworkisprovidedinthisreportinAppendix6.3.Astheversionweused,the2005RLG/NARAAuditChecklistfortheCertificationofaTrustedDigitalRepository,DraftforPublicCommenthasnowbeensucceededbytheTrustworthyRepositoriesAuditandCertification:CriteriaandChecklist(TRAC),wehaveincludedmappingtoequivalentsectionsofthenewversion.AdditionaldetailsandexamplesofannotationsareprovidedinAppendix6.4.Thefollowingoverviewisextractedfromthissource.

3.1.1.1. Findings:theAnnotatedAuditChecklistAsaRepositoryEvaluationFramework(fromKaczmareketal,2006)

WefoundtheprocessitselfofadaptingtheAuditChecklistasaframeworkforourrepositorysoftwareapplicationevaluationtobeausefullearningexperience.SituatingourevaluationwithintheoriginalAuditChecklistprovidedaframeworktodiscussrepositorysoftwareapplicationswithoutlosingsightofthelargerorganizationalcontext.Asweusedittodocumentourrepositoryinstallationandexperimentationexperiences,wefoundtheannotatedAuditChecklistprovidedagoodframeworkforlookingatrepositorysoftwareapplicationswithinthecontextofdigitalpreservation.However,informationaboutotheraspectsofsoftwarenotdirectlyrelatedtopreservation(e.g.,easeofinstallation,easeofmaintenance,programminglanguageused)donotfitwellintothisframework.Importantly,wehavealsofoundthattheprocessitselfofannotatingtheoriginalAuditChecklistprovidedaforumfortheprojectteammemberstobegindiscussionsthathaveopenedupopportunitiestoexploreourindividualassumptionsaboutvariouschecklistitemsandourinterpretationsofterminology.Throughthesediscussionsweestablisheddirectionstotakeourevaluationactivitiesfurther.

3.1.2. RepositoryTesting:IngestandExportTestsOnFourKeyOpen‐sourceRepositories

Thefouropen‐sourcerepositorysoftwareapplicationsthatweretestedwereDSpace,Eprints,Fedora,andGreenstone.Thecollectionitemsusedastestdataaredescribedindetailbelow.OuttestingapproachandmethodologyareexplainedinSection3.1.3,alsobelow.

3.1.2.1. TestData:aHeterogeneousCanonicalSetInordertotesttherepositories,anumberofheterogeneouscollectionsofdigitalitemswereidentified.Eachofthesecollectionshadvaryingstructures,formats,andexistingmetadata.Anoverviewofeachcollectionfollowsbelow.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

29

3.1.2.1.1. AerialPhotosThisisarelativelysmallcollectionofscannedaerialphotographsfromacoupleofIllinoiscounties.Itconsistsof1,021distinctscannedphotographsandtheiraccompanyingmetadata,whichaccountsfor2,042distinctfilesforatotalof235megabytes.ThemetadataarebasedonFederalGeographicDataCommittee(FGDC)GeospatialMetadataStandard.ThescannedimageswereJPEGsofscreenresolutionquality;thearchivalqualityimageswerenotavailableforthisproject.

3.1.2.1.2. DLI(DigitalLibraryInitiative)JournalArticlesTheDLI(DigitalLibraryInitiative)collectionconsistsof85,650distinctjournalarticlesonthesubjectsofscience,technology,andengineeringfromfivedifferentpublishers.ThiscollectionwascreatedaspartoftheGraingerLibrariesearlierNSFandCNRIfundedDigitalLibrarytestbed.Thiswasbyfarthelargestcollectionwith2,247,455filesforatotalof76,148megabytes.Eachjournalarticletypicallyconsistedofseveralinstantiations,typicallyincludingXMLandSGMLconformingtooneofseveraldifferentDTDsplusaPDFversion,butinsomecasesalsoPostscriptorTeXversions,plusalloftheassociatedfilessuchasimagesandmetadata,whichalsooccurredinseveraldifferentformats.

3.1.2.1.3. WILLPublicRadioBroadcasts

WILListhelocalPublicbroadcastingstation,andthiscollectionconsistsoftheaudiorecordingsforaselectionofitsFocus580,dailytalkradioprograms.Atotalof310programsareincluded.EachprogramhasaWAVaudiofileplustwoXMLmetadatafilesforatotalof930filesand82,456megabytes.ThemetadatawasoriginallyinaMicrosoftAccessdatabase.

3.1.2.1.4. VincentVoiceAudioCollectionThiscollectionisaselectionofaudiorecordingsfromtheVincentVoiceLibraryatMichiganStateUniversity.Itconsistsof209recordings,manyofwhicharecomposedofseveralaudioWAVfiles.ThereisanEncodedArchivalDescription(EAD)fileassociatedwitheachrecordingforatotalof3,515filesand110,186megabytes.

3.1.2.1.5. DOQ(DigitalOrthophotoQuadrangles)DataThisisacollectionof1,073highresolution,DigitalOrthophotoQuadranglesoftheChicagoarea.EachDOQconsistsofsixfiles:theimageTIFFfile,the‘worldfile’usedforgeo‐referencingtheTIFF,anFGDC1XMLmetadatafile,plusatext‐onlyversionofthemetadata,andtheDTDforthemetadatafile,andanXSLTstylesheetforthemetadata.

3.1.2.1.6. “TheCanonicalSet”ThefirstrepositorytestedwasDSpace.Foreachcollection,specializedscriptsandXSLTswerewrittentoarrangetheitemsandmetadatainsuchawaythattheycould

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

30

beingestedintoDSpaceusingtheItemImportutility.Thisprocessincludedmovingthefilesassociatedwitheach“item”intoasingledirectory,creatingdescriptivemetadatainDSpace’sidiosyncraticQualifiedDublinCoreformat,andcreatingacontentmanifest.TheDSpacebulkingestpackageformatwasacceptedasthebaselineconfigurationfromwhichallotherprocessingwouldoccur.Thecollectionofallthedigitalpackagesinthisformatbecameour“canonicalset.”Seefiguresbelowforvariousbreakdownsofallthefilesinallthisset:

CanonicalTestSet::NumberofPackagesbyCollection

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

31

CanonicalTestSet:TotalMegabytesbyCollection

CanonicalTestSet:NumberofFilesbyCollection

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

32

CanonicalTestSet:NumberofFilesbyFilenameExtension

3.1.3. TestingApproachandMethodologyInanutshell,ourprocesswastoinstallaparticularrepository,dowhateverwasnecessarytoingestourcanonicaldatasetintotherepository,dowhateverwasnecessarytoexportourcanonicaldatasetbackoutoftherepository,andrecordinanarrativefashionourfindings,especiallyinthecontextofdigitalpreservationand,also,ofourAnnotatedTrustedDigitalRepositoryChecklist.ItneedstobementionedthattheAnnotatedTrustedDigitalRepositoryChecklistandourconceptofwhatwasactuallyrequiredforlong‐termdigitalpreservationwasbeingcontinuallyrevisedinparallelwiththerepositoryevaluationprocess.Moredetailsofthetestingapproachandmethodologyareprovidedbelow.First,differentprojectstaffmembers,consistingprimarilyofgraduateresearchassistants,wereassignedtoeachofthedifferentrepositoriestobeevaluated.Cooperationbetweenevaluatorswasencouraged,especiallywhendifferentskillsetsmightbeneededinperforminganevaluation.Someoftheevaluatorshadlibrarybackgroundswhileothershadtechnicalcomputerbackgrounds.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

33

Theapproachwasfairlyfreeformandtheevaluatorshadsomeleewayinthedetails,butingeneraltheyfollowedthisroughoutline:

• Thefirststepwastobecomefamiliarwiththerepositorytobeevaluated.Thiscouldinvolvereviewofanypreviousevaluationsorotherwrittenmaterialabouttherepository,includingthedocumentationprovidedbytherepositoryitself.Thisstepculminatedintheinstallationoftherepositoryonourtestserver.

• Thenextstepwastodevelopaprocessforingestingourcanonicaldatasetintotherepository.Thisrequiredtheevaluatortogainagoodunderstandingofthedetailsoftherepository’ssupportedmetadataformatsandsupportedfilestructures,aswellastherepository’sprogramminginterfacesorbatchprocessingtoolsthatcouldbeusedtofacilitatetheingest.Theevaluatorwouldalsoneedtobecomefamiliarwithourcanonicaldatasetatthispoint,iftheywerenotalreadyfamiliarwithit.Theingestprocessgenerallyconsistedofthesesteps:

o Developmappingsbetweenthevariousmetadataformatsrepresentedinthecanonicaldatasetandthemetadataschemasrequiredbytherepository.ThesemappingscouldbeimplementedusingXSLTtransformationsorinsomecasesbywritingcustomizedcomputerprograms.

o Packagethemetadataandfilesinaformatthatisdigestiblebytherepository.Thiscouldbeassimpleascreatingatextfilemanifest,ornamingfilesaccordingtosomestandardandputtingthemallinacertaindirectorystructure,orascomplexascreatingMETSorFoxMLXMLpackages.Similartothemetadata,thesepackagesareusuallyimplementedusingsomecombinationofXSLTandcustomizedcomputerprograms.

o Finally,theactualingestneededtooccur.Onceagain,thiscouldbeassimpleasrunningoractivatingtherepository’snativeingesttool,orascomplexaswritingacustomingestprogramthatusestherepository’slow‐levelprogramminginterfaces.Iftherepositorysupportedanativebatchingestmechanismeveryattemptwasmadetouseitasisbeforeresortingtothecreationofanycustomizedingestprograms.

• Oftendevelopmentoftheingestprocesswasiterative,consistingofdevelopingandimplementingaprocess,testingit,andrefiningituntiltheentirecanonicalsetcouldbereliablyingested.

• Afterthecanonicaldatasethadbeeningested,theprocesswasreversed,andthedatawasexportedordisseminatedbackoutoftherepository.Similartoingest,thedisseminationcouldbeassimpleasinvokingnativerepositoryfunctions,orascomplexaswritingacustomprogram—althoughwe

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

34

preferredtousenativebatchexportcapabilitiesiftherepositoryhadsupportforany.Thisprocesswasalsoofteniterative.

• Evaluatorswereencouragedtorecordtheirprocessesandfindingsabouttherepositoriesthroughoutthisprocess.

• Finally,reuseoftransformationsandcomputerprogramsbetweendifferentrepositorieswasencouraged.

Inparallelwiththerepositoryevaluationprocessesdescribedabove,teammembersalsoparticipatedinthereviewoftheTrustedDigitalRepositoryChecklist,sothatbothofthesetasksinformedtheotherinaniterativefashion.ThesimultaneousTrustedDigitalRepositoryChecklistreviewandtherepositoryevaluationsculminatedintheevaluatorsbeingaskedtoapplyourAnnotatedTrustedDigitalRepositoryChecklisttotheirrepositoryevaluationfindings,whichareprovidedintheappendicesofthisreport.Unfortunately,oneofthepitfallsofemployinggraduateassistantsonalong‐termprojectlikethisisthattheyeventuallygraduateormoveontootherassistantshipsastheireducationalgoalsprogressorchange.Whilewestronglyencouragedourgraduateassistantstodocumenttheirwork,wefoundinsomecasesthattheirnoteswerenotalwaysdetailedenoughforustoaccuratelyreflecttheirfindingsasweappliedtheirevaluationstoourAnnotatedTrustedDigitalRepositoryChecklist.Thissometimesrequiredustorevisit,orrecreate,atestforagivenrepositoryinordertoaddressoneormoreofthechecklistitems.Anotheroutcomeoftherepositoryevaluationsisthatasweprogressedwithingestandexporttestingofthedifferentrepositories,oneofourgoalsbecametobeabletoreliablymoveacollectionofdigitalobjectsbetweenanytwooftherepositoriesthatwerebeingevaluatedandbackagain(roundtripping).ThiswasthegenesisforourcurrentlydevelopingHubandSpokerepositoryinteroperabilityarchitecture.

3.1.4. RepositoryTestingFindings:NarrativeReports,andAnnotatedAuditChecklistCommentary

Thefollowingopensourcerepositorieswereevaluated:• DSpace:Version1.2.2withlaterupgradeto1.3.1• Eprints:Version2.3.13• Fedora:Version2.0,withlaterupgradesto2.1,2.1.1and2.2• Greenstone:Version2.6,withupgradeto7.7

Overviewreportswereproducedforeachrepository,containingthefollowingsections.TheseareprovidedinAppendix6.5,RepositoryTestingFindings:Narrative.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

35

• RepositoryOverview• TestingandTechnicalEnvironment• Methodology• Findings• Conclusion

WealsodocumentedourexperiencesusingourAnnotatedChecklistevaluationframework.ThisdocumentationisprovidedinAppendix6.6.

3.1.5. ConclusionandNextStepsTwokeyoverallfindingsweretypicallylowout‐of‐the‐boxsupportforinteroperabilityandlowsupportforemergingpreservationstandards.Duringthedevelopmentofourtestbedwefoundourselvesdevelopinganumberofdifferentthoughsimilarcustomizedscriptsandprogramsforexportingdigitalpackagesfromonerepositorysystemandimportingthosedigitalpackagesintoanotherrepositorysystem.Therepositorysystemsthemselveshadverylittleincommonthatwouldfacilitatethistask.Theytypicallysupporteddifferentdescriptivemetadataformats,hadnosupportforprovenancemetadata,offeredlittleornosupportfortechnicalmetadata,andemployeddifferentmeansofidentifyingthefilesconstitutingapackage.Thedevelopmentofanin‐housetooltofacilitatedatainteroperabilitybetweenmultiplerepositorieswithouttheneedtodevelopcustomizedmechanismsforeachrepositorycombinationthereforesoonemergedasakeytasktosupportourrepositoryevaluationactivities.Atthesametime,wewerealsocomingtoamorestructuredunderstandingofemergingdigitalpreservationstandards,specificallyearlydraftsofAnAuditChecklistfortheCertificationofTrustedDigitalRepositories(RLG,2005;Kaczmarek,Hswe,Eke,&Habing,2006;Kaczmarek,Habing,&Eke,2006)andthePREMISDataDictionaryforPreservationMetadata(PREMISWorkingGroup,2005).Webegantoseethataformally‐developedinteroperabilityarchitecturedesignedwithafocusonprovidingadditionalsupportforretentionofprovenanceandtechnicalmetadatacouldbeavaluableandpracticalprojectdeliverable,andonewithimmediateapplicationinourownlibrariesandinotherinstitutionsthatcommonlyimplementmultiplerepositorysystemstomanageandpreservedigitalcollections.ThesefindingsledourproposingtheHubandSpoketoolsuiteasanadditionalprojectdeliverable.Thisworkisdescribedindetailinthenextsection(Section3.2).

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

36

3.2. HubandSpokeArchitecture(HandS):SupportingRepositoryInteroperabilityandEmergingPreservationStandards

ThissectiondescribesthedevelopmentoftheHubandSpoke(HandS)toolsuite,builttohelpcuratorsofdigitalobjectsmanagecontentinmultiplerepositorysystemswhilepreservingvaluablepreservationmetadata.ImplementingMETSandPREMIS,HandSprovidesastandards‐basedmethodforpackagingcontentthatallowsdigitalobjectstobemovedbetweenrepositoriesmoreeasilywhilesupportingthecollectionoftechnicalandprovenanceinformationcrucialforlong‐termpreservation.(Notethatrelatedprojectwork,investigatingthemorefundamentalsemanticissuesunderlyingthepreservationofthemeaningofdigitalobjectsovertime,isprofiledinSection4.)

3.2.1. HandSOverviewTheHandSisasuiteoftoolsbuilttosupportmovingcontentbetweenrepositorieswhilegeneratingandmaintainingPREMIS‐basedtechnicalandpreservationmetadata.Itemergedoutofprojectactivitiestoevaluateopen‐sourcerepositories(seeSection3.1)inwhichwefoundtypicallylowout‐of‐the‐boxsupportforinteroperabilityandlowsupportforemergingpreservationstandards.ThenextsectiondescribestheimpetusandrationalebehindtheHandSdevelopmentinmoredetail.

3.2.2. TheNeedforInteroperabilityandPreservationSupport

3.2.2.1. InstitutionsCommonlyRelyonMultipleRepositoriesTherearecurrentlymanydifferentdigitalrepositoriesinwidespreaduse,includingDSpace,Greenstone,Fedora,EPrints,andCONTENTdm,alongwithdigitalarchiveserviceslikethosefromOCLCandCDL.Therearealsomanydifferentsourcesofinputintothesesystems,suchasfromwebcrawlerslikeHeritrixorpackagedcontentfromOCLC'sWebArchivesWorkbench,aswellasnumerousdigitizationandscanningservices.Itisalsonotuncommonforseveralofthesesystemstobeinusewithinasingleinstitution.Ifcuratorswishtosharedatainternally,orwithotherinstitutionsorconsortia,thenmultiplerepositorysystemsverylikelywillcomeintoplay.Repositoryinteroperabilityissuesalsoemergeasinstitutionsupdateorreplacetheirrepositorysystems,andmustmigratecontentfromanexistingrepositorysystemtoitsreplacement.

3.2.2.2. Out‐of‐the‐boxRepositoryInteroperabilityisLowOurrepositoryevaluationexperimentsandourexperienceswithrepositoriesinproductionatourowninstitutionsshowthatthenativeabilityforrepositoriestointeroperateistypicallyverybasic.Almostnoneofthesystemswetestedwereable

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

37

tooperatewithoneanotherbeyondarudimentarylevel,usuallyrestrictedtotheOAIProtocolforMetadataHarvesting(OAI‐PMH)forDublinCore.IfanyOAISconceptsareimplemented(andfeware),suchastheuseofsubmissionordisseminationinformationpackages(SIPsandDIPs),theseimplementationsvarygreatly.InanidealOAIS‐compliantworld,aDIPfromonerepositoryshouldbeaSIPtoanother.However,inreality,adisseminationpackageproducedbyDSpacecannotbeusedforsubmissionintoEprints.Becauseoftheseinconsistencies,achievinganyrealinteroperabilitybetweenrepositorysystemsusuallyentailssomelevelofcustomsoftwaredevelopment.Further,anytimeanewrepositoryisaddedtothemix,newsoftwarewillneedtobedevelopedinordertoaccommodatetheaddedrepository.

3.2.2.3. SupportforEmergingPreservationStandardsisLowFewofthecurrentrepositorieshaveanyexplicitsupportforpreservation,suchasforcollectingpreservationmetadataasarticulatedbyPREMIS,oractivitiestosupportpreservationsuchasformatmigrationsorchecksumvalidationsasoutlinedintheTrustedDigitalRepositoryChecklist.Foraninstitutionthatdeploysseveralrepositorysystems,ataskassimpleasperformingconsistentbackupstooff‐linestoragecanbecomecomplicatedbythefactthatthesystemsstoretheirunderlyingdatadifferently.Theremaybedatastoredinrelationdatabases,XMLdatabases,RDFtriplestores,andvariousfilesystems–allofwhichmustbebackedup,andmayrequiredifferentbackuptechniques.Insummary,thegenerallackofrepositorysupportforinteroperabilityandforemergingrepositorystandardsatatimewhenlibrariesandotherinstitutionscommonlyrelyonmultiplerepositorysystemstomanage,shareandpreservecontent,isthefundamentalimpetusbehindthedevelopmentoftheHandStoolsuite.Thekeyprinciplesofinteroperabilityandpreservation,andtheapproachesimplementedintheHandStosupportthem,areexaminedmorecloselyinthenextsection,followedbyafunctionalandtechnicaloverviewoftheHandStools.

3.2.3. HubandSpokeKeyPrinciplesTheHubandSpokeapproachisbasedonthetwokeyprinciplesofinteroperabilityandpreservation,withtheunderstandingthatinteroperabilityisnotonlyanenduntoitself,butitisalsocriticalforpreservation.

3.2.3.1. InteroperabilityToreducethecomplexityofinteroperability,theHubandSpokeusesacommonpackagingformatwhichisusedforinterchangeofdigitalresourcesbetweendifferentrepositories.Digitalpackagescomingfromarepositoryaretransformedintothiscommonformatbeforeanyfurtherprocessing,anddigitalpackagesaretransformedfromthiscommonformatintothenativerepositoryformatwhenbeing

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

38

placedintoarepository.TheideaistoreduceanN2problemintoa2NproblemasshowninFigure5below.

Figure5:InteroperabilityStandards:aSimpleIdea

3.2.3.2. PreservationThesecondkeyprincipleisthatthecommonpackagingformataswellastheprocessesthatactonthatpackagingformatshouldnotonlysupportinteroperability,butshouldalsopromotepreservation.Thisprincipletreatsthecommonpackagingformatasanarchivalinformationpackage(AIP)intheOAISmodel.Theassumptionbeingthatonereasonpackagesarebeingmovedbetweenrepositoriesisforpreservation.ThereareseveralfeaturesoftheHubandSpokearchitecturethatpromotepreservation.Onekeypreservationfeatureistherelianceoncurrentbestpracticesregardingpreservationmetadata,primarilyinformedbyPREMIS.TheHubandSpokeisespeciallyconcernedwithtechnicalmetadataaboutthefilesandbitstreamswhichcompriseadigitalpackage,andalsowithprovenancemetadataabouttheeventsthatoccurduringthepackage'slifetime,includingeventspertainingnotonlytothefilesandbitstreams,butalsotothemetadataitself.Thetechnicalmetadataisusedtovalidatethefilesandbitstreamsthroughoutadigitalobject'slifetimeandareupdatedasrequired,forexamplewhenaformattransformationoccurs.Theprovenancemetadataisalsoupdatedthroughoutanobject'slifetime.ThetoolsthatimplementtheHandSarchitectureperformthese

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

39

actionsautomaticallyasrequiredduringprocessing,butthedataarealwayspresentinthepackagessothatothersystemscanalsoperformtheseactionsasneeded.Anotherkeypreservationfeatureisthetreatmentofthepackagesthemselves.Forpurposesofrepositoryinteroperabilityandalsotosupportpreservation,theHubandSpokeframeworktreatstheinstantiationsofthepackagesasfirstclassdigitalobjects.ThismeansthatwhenaHandSpackageistransformedforingestionintoaspecificdigitalrepository,notonlyarethemetadata,files,andbitstreamsthatcomprisethepackagedecomposedasappropriatefortherepositoryanduploaded,butthepackageitself(inourcaseasuiteofMETSfiles)isalsotreatedasadigitalobjecttobeuploadedtotherepository.Laterwhenthedigitalpackageneedstobedisseminatedfromtherepository,notonlyarethemetadata,files,andbitstreamsavailablefordownload,butalsotheoriginalHandSpackage.ThisallowstheHandSsystemtocomparethepackageasitwasoriginallyingestedtohowitnowappearsasdisseminatedfromtherepository.Thisprocess,wefeel,iscriticalforpreservationinanenvironmentofheterogeneousandchangingrepositories.AnotheraspectofthistreatmentoftheHubandSpokepackagesasfirstclassdigitalobjectsisthatwecancreatesnapshotsofindividualpackagesatpointsintimeandalsorecordpreservationmetadatadataaboutthepackagesnapshots.TheHandStoolsuitecurrentlyimplementsthisconceptasamasterpackagewhichreferencestime‐stampedsnapshotsofthemainpackage.Themasterpackagealsorecordspreservationmetadataaboutthesnapshots.Thisapproachisexplainedinmoredetailinthenextsection,whichdescribestheconcreteimplementationoftheHandSpackagesusingMETS.

3.2.4. METSProfileTorealizetheaboveprinciples,wewantedtoutilizetheprevailingdigitallibrarystandardsasmuchaspossible.Tothatend,weadoptedMETSasthepackagingstandard,PREMISasthepreservationmetadatastandard,andMODSasthedescriptivemetadatastandard.Wealsooptionallyutilizeseveralformat‐specifictechnicalmetadatastandardssuchasMIXandtextMDforimageandtextobjectsrespectively,amongothers.OurMETSprofile,theECHODEPGenericMETSProfileforPreservationandDigitalRepositoryInteroperability(Habing,2005),iscurrentlyregisteredwiththeLibraryofCongress.Asalreadydescribed,theprimaryfocusoftheHandSMETSprofileistoenablerepositoryinteroperabilityandtosupportpreservationofrepositorycontent.Becauseofthestrongfocusonpreservationratherthanaccess,theHandSprofileisrelativelynoncommittalregardingfileformatsorstructures;instead,specialattentionisgiventoadministrativeandtechnicalmetadata,particularlytointegratingthePREMISdatamodelandschemaintoMETS.Weanticipatethatourfileformat‐agnosticHandSprofilemaybeoverlaidontopof,orinheritedby,otherprofilesthatbetterdefineaparticularfileformatorstructure,providingthemwith

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

40

addedsupportforpreservationorinteroperability.Anexampleofthisarrangement,whereaformat‐specificMETSprofileisimplementedasasubclassofthePREMIS‐focusedHandSprofile,istheECHODEPMETSProfileforWebSiteCaptures(Habing,2006),alsoregisteredwiththeLibraryofCongress.ThustheECHODEPGenericMETSProfileforPreservationandDigitalRepositoryInteroperabilityisgenerallynotconcernedwithrenderingormakingaccessibleanyparticularrepresentationofanobject,butitisconcernedwithpreservingtheobjectanditsrepresentations,includingthehistoryofhowthosehavechangedoverperiodsoftime.Inthiscontext,preservationreferstoshort‐terminteroperability,preservingtherepresentationsandmetadataasadigitalpackageismovedbetweentwodifferentrepositories.Italsoreferstothelong‐termpreservationofthepackageanditshistoryasitexistsinvariousrepositoriesforlongperiodsoftimeandundergoesvarious"preservationactions"suchasfixitychecks,normalizations,orformatmigrations.Notethatthoughtheprofileisgenerallyagnosticaboutalmostallaspectsofadigitalobject'srepresentation,suchasstructureorfileformats,wehavemadesomepragmaticconcessions,suchasmandatingatleastMODSfortheprimarydescriptivemetadata(dmdSec)whileatthesametimeallowingmultiplealternativedescriptivemetadatasections.Thealternativedescriptivemetadatasectionsareusedasameanstorecordvariousversionsofthesemetadataastheyhaveexistedindifferentrepositoriesoratdifferentpointsintime.ApotentialusagescenariocanbeillustratedinthefollowingmigrationexampleusingourMETSprofile:

1. WestartwithadigitalobjectwhoseoriginalsourcedescriptivemetadataisintheMARCXMLformat.BecauseourprofilerequiresMODSastheprimarydescriptivemetadata,theMARCXMLwillbetransformedintoMODS,andtheMODSwillbestoredintheMETSdocumentalongwithaprovenancestatementwithsomedetailsaboutthetransformation,especiallyidentifyingthesourcemetadataformat.However,becausedescriptivemetadataareconsideredtobeasignificantpartoftherepresentationofanentityandbecausetransformationsbetweenmetadataformatsareoftenimperfect,theoriginalMARCXMLformatisalsostoredintheMETSdocumentasanalternatemetadataformat.

2. NowsupposethatthedigitalobjectistobeingestedintoDSpace.DSpace,however,doesnothavenativesupportfortheMODSorMARCXMLmetadataformats;therefore,aspartoftheingestprocess,theMODSmustbetransformedintotheidiosyncraticDublinCore(DC)metadataformatthatissupportedbyDSpace.ThismetadataformatisalsoaddedasanotheralternatedescriptivemetadataformattotheMETSdocument,alongwithaprovenancestatementdescribinghowthisnewDCformatwasderivedfromtheprimaryMODSformat.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

41

3. NextimaginethattheobjectexistsinDSpaceforsomeperiodoftimeduringwhichthedescriptivemetadataundergoessomerevision,suchastheadditionofnewsubjecttermsortheadditionofanabstract.NowtheobjectistobedisseminatedfromDSpaceforingestintosomenewrepository.ThiscouldtriggertheadditionofanotheralternatedescriptivemetadatasectiontotheMETSdocument.ThisalternateformatwouldconformtotheidiosyncraticDSpaceDublinCoreformat,buttheprovenancestatementwouldspecifythatthisDCformatrepresentsanewerversionofthedescriptivemetadatathanwasoriginallyingestedintoDSpace.

Theabovescenariowouldproduceachainofdescriptivemetadataformats,suchasMARCXML(original)→MODS(primary)→DC(version1)→DC(version2),withprovenancePREMISeventstatementsadequatetodeterminethesequenceofeventsthatledtothischain.AspartofthisprofilewealsoenvisionfutureprocessesthatmightreconcilelatermetadatarevisionsandmergethoserevisionsbackintoanewprimaryMODSdescriptivemetadatasection.Thepreservationofsemanticsduringthesetypesofmigrationsisoneoftheconcernsofsemanticpreservationdescribedinthefinalsectionofthisreport(Section4).Becausewefeelthatadministrativemetadataareimportantforpreservation,thisprofileisfairlyprescriptivewhenitcomestotheadministrativemetadata,whichcanbeassociatedwithalmostallofthesectionsthatmakeuparepresentation:structures,filesandbitstreams,anddescriptivemetadata.ParticularattentionispaidtothetechnicalandprovenancemetadataassociatedwiththeseMETSsections.

3.2.4.1. MasterMETSProfileAnotherkeyideabehindourMETSprofileistheideaofaMasterMETSdocument.EachpackageintheHandSarchitectureconsistsofasingleMasterMETSdocument,oneormoreMETSSnapshotdocuments,plusallthefilesandbitstreamsthatarereferencedfromtheMETSSnapshots,asshownbelowinFigure6.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

42

Figure6:MasterMETSShowingMultipleSnapshotsandAssociatedFilesEachSnapshotrepresentsaversionofthedigitalpackageatapointintime,usuallywhenthepackageiseitherretrievedfromorplacedintoagivenrepository.Nearlyanyaspectofadigitalobject'srepresentationmaychangewithtime,includingdescriptivemetadata,structure,and,asillustratedabove,eventhefilesreferencedfromapackagemaychangeovertime,perhapsasformatmigrationsoccur.ThesechangesarerecordedasprovenancestatementsintheMETSSnapshotinwhichthe

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

43

changeismanifest.Forexample,METSSnapshot2intheabovediagramwouldhaveaPREMISeventdescribingthatFile1wasdeletedfromthepackageandFile4wasaddedtothepackage.Inmostcases,theHandSsystemcanautomaticallydetectwhenthesechangesoccurandwillautomaticallyaddtheappropriateprovenancestatementsorembellishthetechnicalmetadataasrequired.However,itmaynotbeabletodeterminewhythechangesoccurredwithoutsomesortofintelligentintervention.HandSisabletodetectthechangesbecauseithasaccesstothepreviousSnapshotsandcancomparetheSnapshotofthepackageasitwentintoarepositorytothepackagethatisretrievedfromtherepository.ThisisoneoftheprimaryreasonsthattheMETSdocumentsthemselvesarealsoplacedintoarepositoryalongwiththeotherfilesthatareactuallypartofthepackage.TheMETSprofileimplementationsdescribedaboveareanintegralpieceoftheHandSarchitecture,usedasframeworkforgeneratingandmaintainingPREMIS‐basedmetadataovertimetosupportlongtermpreservation.ThenextsectionlooksinmoredetailatothermechanismsoftheHandStoolsuite,andillustratesitsoverallworkflowcycle.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

44

3.2.5. HandSWorkflowCycle

Figure7:HandSWorkflowCycleAsdescribedintheprecedingsections,theHandSToolSuiteprovidesaframeworkforsustainingandenrichingpreservationmetadatafordigitalobjectsastheyaremovedinto,outof,andbetweendigitalrepositorysystems.Digitalobjectsorpreservationpackagestypicallyrefertoasetoffilesthatrepresentsasingleintellectualentity,includingmetadataabouttheentityoraboutthefilesthemselves.IntheHubandSpokeworkflowcycle(seeFigure3),digitalobjectsareretrieved,convertedtoacommonprofile,validated,enrichedwithmetadata,transformedintoarepository‐compatibleform,andingestedintoadigitalrepository.

3.2.5.1. WorkflowOverview:GET,PROCESS,PUTPreservationpackagesmayentertheHandSworkflowinvariousways:somemaycomefromthird‐partyapplicationsliketheOCLCWebArchivesWorkbench,othersmaybedisseminatedfromadigitalrepositorylikeDSpaceorEPrints,andsomewilloriginatesimplyasdirectoriesoffilesonacomputerfilesystem.Inanycase,thesetoffilesthatmakeupthepreservationpackagemustfirstbegatheredandorganizedforprocessing.Objectsenteringtheworkflowfromadigitalrepositorysystemmustfirstbefetchedfromtherepositorybyinteractingwithitsnativedissemination

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

45

routine,whichwillvaryfromrepositorytorepository.Thisinteractionwiththerepositorysystemisfacilitatedbyour“LightweightRepositoryCreate,Retrieve,Update,andDelete”Service—affectionatelynamedLRCRUD.LRCRUDismadeupoftwomodules:theLRCRUDClient,whichrunsonthesamemachineastheotherHandStools,andtheLRCRUDService,whichrunsalongsideadigitalrepositorysystem.Toretrieveapackagefromtherepository,theLRCRUDClientmakesarequesttotheLRCRUDService.TheLRCRUDService,inturn,communicatesdirectlywiththerepositorysystemandretrievesthepackageviatherepository’snativedisseminationroutine.TheLRCRUDServicezipsupthepackageandsendsitoverthenetworktotheClient.OncethepackagehasbeenreceivedbytheLRCRUDClientandverified,itscontentsareunzippedontothelocalfilesystem.Fromthere,theTo‐HubPackagertoolconvertsthedigitalobjectintowhatwecallaHubPackage.AHubPackageismadeupofthecontentfilesthatconstitutethedigitalobject;METSdocumentscontainingdescriptive,administrative,andstructuralmetadataabouttheobjectatvariouspointsintime;andasingleMasterMETSdocumentthatcompriseschronologicalandstructuralinformationabouttheotherMETSdocuments.TheMasterMETSfilewillcontainapointertoatleastone,butpotentiallyseveralotherECHODEPMETSdocuments,eachofwhichservesasasnapshotoftheHubPackageatsomepointinitslifecycle.TheECHODEPMETSdocumentistheheartofaHubPackage;itholdstogetherallthefilesandvariousmetadatathatmakeupthepackage.WhenaHubPackageiscreated,anewECHODEPMETSdocumentisgeneratedforthepackage.IfthepackagealreadycontainsanolderECHODEPMETSdocument(generatedpriortoingestionintotherepository),thenewMETSdocumentiscomparedtotheolderonetodiscoveranychangesordamagestothepackagethatmighthaveoccurredwhileinthecustodyoftherepository.TheECHODEPMETSdocumentisthenenrichedwithtechnicalmetadataandvalidatedagainsttheECHODEPMETSProfileregisteredwiththeLibraryofCongress(Habing,2005)].TheTechMDAugmentortoolenrichestheMETSdocumentwithformat‐specifictechnicalmetadatafoundbyanalyzingeachofthepackage'scontentfiles,andconvertingtheresultintoPREMISObjectmetadata.Oncethepackagehasbeenanalyzedandenriched,theProfileValidatorcloselyinspectstheconstituentfilesthatmakeupthepackage,bothdataandmetadata,andverifiestherearenoerrorsorinconsistencies.Atthispoint,theHubPackageisreadytobesentontoanotherrepository.Butfirst,ithastobeconvertedintoaformcompatiblewithingestionintothetargetrepository,whichagainwillvaryfromrepositorytorepository.ThisfinalconversioniscarriedoutbyaFrom‐HubPackagertool,builtspecificallyforthetargetrepository.Fromthere,thepackageishandedofffromtheLRCRUDClienttotheLRCRUDServiceforthetargetrepositoryandingested.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

46

3.2.5.2. WorkflowExampleTheworkflowcyclemightbemoreeasilyunderstoodbyfollowinganexamplepreservationpackageasitmakesitswaythroughtheprocess.Forthisexample,wewillusethreesmallfilesthatmakeupasinglewebpage:anHTMLfile,aCascadingStyleSheet(CSS),andaJPEGimage.Thesethreefilescomposeasinglepreservationpackage,oritem,whichhasbeensubmittedtoaDSpacedigitalrepository.UsingtheHandStoolsuite,wewilltransfertheitemfromDSpacetoanEPrintsrepository,whilegeneratingpreservationandtechnicalmetadataalongtheway.1.RetrieveRepositoryXdisseminationpackageviaLRCRUDInourexample,supposewehaveanLRCRUDServicerunningalongsideaDSpacerepositoryonaremoteserver.TheLRCRUDClientapplicationsendsarequesttotheLRCRUDServicetoretrieve(GET)apackagefromtherepository.TheLRCRUDServicerelaystheretrievalrequesttoDSpaceusingtherepository’snativedisseminationmethod.Theoutputofarepository'sdisseminationwilltypicallybemadeupofanynumberofmetadatastreamsandothersupportingartifactsinadditiontotheitem’scontentfiles.InthecaseofDSpace,thepackagewillincludeaDSpaceMETSfilethatencompassesMODSdescriptivemetadataaboutthepackageandPREMIStechnicalmetadatapertainingtoeachoftheconstituentbitstreams.Inourexample,thepackagereturnedbyDSpacenowcontainsfourfiles:theHTML,CSS,andJPEGfileswebeganwithandaDSpaceMETSfile.TheLRCRUDServicereceivestheDSpacedisseminationandpackagesitscontentsintoaziparchive,whichwillbetransmittedoverHTTPtotheLRCRUDClient.TheLRCRUDServicealsocalculatesfilesizeandachecksumvalueforthezipfilebeforesendingit,andtransmitsthesevaluesasContent‐MD5andContent‐LengthHTTPheaderfieldsalongwiththepackagezipfile.AstheLRCRUDClientreceivesthepackagezipfile,ittoocalculatesfilesizeandchecksumvalues,whicharevalidatedagainsttheHTTPheaderfieldstoensurethepackagewasunharmedduringthefiletransfer.Assumingthevaluesagree,thepackageisunzippedandsavedtodisk.2.CreateHubPackagefromrepositorydisseminationfilesTocreateaHubPackagefromtherepositorydisseminationpackage,theTo‐HubpackagerneedstoproduceanewECHODEPMETSdocumentforthepackage.Thepackagerbeginsbysearchingtheretrievedfilesforanymetadataincludedbytherepository.Inourexample,thepackagerlocatestheDSpaceMETSdocumentandretrievesitsMODSdescriptivemetadatastream.ThisDSpaceMODSmetadatawillbetransformedintoAquiferMODSandinsertedintothenewECHODEPMETSdocument’sdescriptivemetadatasection.Otherrepositoriesexportmetadataindifferentformats(e.g.,DublinCore),butinallcasesthepackagemetadataareultimatelytransformedtoAquiferMODSbytheTo‐HubPackager.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

47

TheTo‐HubPackagerthencreatesanentryinthefilesectionoftheECHODEPMETSdocumentforeachofthepackage’sconstituentfiles.Inourexample,thenewECHODEPMETSdocumentwillcontainafileelementforeachofourthreecontentfiles.TheTo‐HubPackagerwillalsocreateinthenewECHODEPMETSdocumentthreePREMIStechnicalmetadataobjectstocorrespondtothreefileelements.Eachwillcontainbasictechnicalmetadataaboutoneofthefiles,includingchecksumvalues,filesize,andMIME‐type.Finally,anyleftoverdescriptiveandtechnicalmetadataelementsfromtheDSpaceMETSdocumentareinsertedintotheECHODEPMETSdocumentasalternatemetadatasothatitisneverlost.IfthepackagecontainsolderECHODEPMETSdocuments(becauseithadbeenpackagedbyHandSbeforeenteringtherepository),themostrecentECHODEPMETSdocumentiscomparedtothejust‐generatedECHODEPMETSdocumenttoexposeanychangesthepackagemayhaveundergonesinceitwaslastanalyzed.ThesedataarerecordedinthenewECHODEPMETSdocument’sprovenancemetadataasPREMISevents.IfthepackagecontainsaMasterMETSdocument,apointertothenewMETSdocumentiscreatedanddesignatedasthemost‐currentECHODEPMETSdocumentforthepackage.IfnoMasterMETSdocumentcanbefound,theTo‐HubPackagercreatesonefromscratch.OncethenewMETSdocumenthasbeencreatedandtheMasterMETSdocumentisupdated,HubPackagecreationiscomplete.OurexampleHubPackagenowconsistsofaMasterMETSdocument,whichpointstoasingleECHODEPMETSdocument.ThisECHODEPMETSdocumentcontainsdescriptivemetadataaboutthepackage;technicalmetadataabouteachofthethreecontentfiles,alongwithpointerstothosefilesandthetechnicalmetadataleftbyDSpace;andprovenancemetadatadocumentingthepackage’sexportfromtherepositoryanditsHubPackagetransformation.3.Generatetechnicalmetadata;augmentHubPackageMETSDocumentUsingtoolsfromtheJSTOR/HarvardObjectValidationEnvironment(JHOVE),theHandSTechMDAugmentermoduleanalyzeseachoftheHubPackage'scontentfilesandgeneratesformat‐specific,technicalmetadataforeach.TheJHOVE‐generatedmetadataistransformedusingformat‐specificXSLTstylesheets,andinsertedintothetechnicalmetadatasectionoftheECHODEPMETSdocumentastechnicalmetadata.AnyinconsistenciesbetweenthetechnicalmetadatacurrentlyheldintheMETSdocumentandthosegeneratedbyJHOVEarerecordedintheprovenancesectionoftheECHODEPMETSdocumentasPREMISvalidationevents.ThetechnicalmetadatastoredintheECHODEPMETSdocumentisformattedincompliancewiththefollowingmetadatapreservationstandards:AudioMDforaudiofiles;TextMDfortext,XML,andHTML;andMIXforimages.InourexampletheHTML,CSS,andJPEGfileswilleachbeanalyzedbyJHOVE.TheJHOVEoutputforboththeHTMLandCSSfileswillbeformattedasTextMD,andtheoutputfortheJPEGimagewillbeformattedasMIX.Eachwillbeinsertedintothe

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

48

ECHODEPMETSdocumentinatechnicalmetadataelementcorrespondingtotheappropriatefileelement.TheJHOVEanalysisitselfisalsodocumentedandrecordedintheECHODEPMETSdocumentasavalidationevent.OneofthelimitationsofJHOVEisitssmallnumberofsupportedmediatypes.InitscurrentreleaseJHOVEoffersnosupportforclosedformatssuchasMicrosoftOfficefiles.AnotherdrawbackofusingJHOVEisthatitonlyreportstheMIME‐typecorrectlyforHTMLorXMLfilesiftheyarewellformed;otherwiseitreportsthemasplaintext,causingdiscrepancieswithintheECHODEPMETSdocumentandvalidationwarnings.Nevertheless,wefoundJHOVEtobeausefultoolforanalyzingfilesandgeneratingtechnicalmetadata.FormoreonJHOVEvisithttp://hul.harvard.edu/jhove/.4.ValidateHubPackageMETSDocumentagainstMETSProfileTheProfileValidatorexaminesthecurrentECHODEPMETSdocumentfortheHubPackageagainsttherequirementsofourMETSprofilescurrentlyregisteredwiththeLibraryofCongress(Habing2005,2006).KeyvalidationpointsincludecheckingtomakesurethattheprimarydescriptivemetadataelementcontainsaMODSobjectthatconformstotheAquiferMODSprofile;thateveryfilereferencedbythefilesectionhasassociatedtechnicalmetadataPREMISobjects;andthatallprovenancemetadataassociatedwithafilecontainvalidPREMISeventelements.TheProfileValidatoralsochecksthatthepackagecontentfilesreferencedbytheECHODEPMETSdocumentareaccountedfor,andthattheirchecksum,file‐size,andmime‐typevaluesarecorrect.OurexampleECHODEPMETSdocumentpassesvalidationforthefollowingreasons:itcontainsvalidAquiferMODSinitsprimarydescriptivemetadataelement;eachofitsfileelementsreferencetechnicalmetadataelementscontainingvalidandcompletePREMISobjectmetadata;anditconformsstructurallytoourMETSprofilerequirements.Oncethevalidationhascompleted,thevalidationeventitselfisdocumentedandrecordedintheECHODEPMETSasaPREMISvalidationevent.5.CreateRepositoryPackagefromHubPackageBeforearepositorycanacceptapackageforsubmission,itmustfirstreceiveadescriptionofthepackage’scontents.TheFrom‐HubPackagermoduleusesdescriptivemetadataextractedfromtheHubPackageECHODEPMETSdocumenttogeneratetherepository‐specificmetadataneedforpackagesubmission.ThisprocessusuallyinvolvestransformingtheAquiferMODSmetadatafoundintheECHODEPMETSdocumentintoametadataformatrequiredforrepositorysubmission,andwillvaryfromrepositorytorepository.Inourexample,wearesendingthepackagetoanEPrintsrepository,whichmeansthepackagerwillgenerateanEPrints‐specificmetadatafilefromtheAquiferMODSstream.ThetransformationeventisrecordedintheECHODEPMETSdocumentasa

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

49

PREMISmetadata‐transformationevent,andthenewly‐generatedmetadataisaddedtotheMETSdocumentasalternatedescriptivemetadata.ARepositoryPackagezipfileisthencreated,consistingoftheMasterMETSdocument,allthesubordinateMETSsnapshotdocumentsandthecontentdatafiles,aswellasanyrepository‐specificmetadatafiles.6.SendRepositoryPackagetoRepositoryYviaLRCRUDAtthislaststepinourexample,wehaveanEPrints‐specificLRCURDServicerunningonaremoteserverwithanEPrintsrepository.TheLRCRUDClientsendsarequesttotheLRCRUDServicetocreate(POST)anewpackage.TheLRCRUDServicerelaysthecreaterequesttotherepositoryand,usingtherepository’snativemethods,createsanemptyrecord.TheLRCRUDServicereceivesanewlocationidentifier,orhandle,correspondingtothenewlycreatedlocationintherepository,whichitsendsbacktotheLRCRUDClient.Thislocationidentifierisinsertedintothepackage’sECHODEPMETSdocumentastheprimaryIDfortheMETSdocument.TheLRCRUDClientthensendsarequesttotheLRCRUDServicetoupdate(PUT)thenewpackageatthatlocation.TheLRCRUDClientcalculatesfilesizeandchecksumvaluesforthepackagezipfilebeforesendingittotheService,andittransmitsthesevaluesasContent‐MD5andContent‐LengthHTTPheaderfieldsalongwiththepackage.AstheLRCRUDServicereceivesthepackagezipfilefromtheClient,itcalculatesitsownfilesizeandchecksumvaluesandvalidatesthemagainsttheHTTPheaderfieldstoensurethepackagewasunharmedduringthefiletransfer.OncetheLRCRUDServicehasvalidatedthefiletransfer,itunzipsthepackageandingestseachofitscontents—includingthepackageMETSfiles—intotherepositoryusingtherepository’snativesubmissionroutine.Therepository‐specificdescriptivemetadatathatwasgeneratedinStep5aboveissubmittedtotherepositoryaswell.Oncethepackagehasbeenfullyingested,theLRCRUDServicereturnsanupdateresponsemessagetotheLRCRUDClientconfirmingthesuccessfulsubmission,oranerrorifthesubmissionfailed.Somerepositoriesallowforcertainbitstreamstobegivenprivilegedstatus.InsuchcasestheMasterMETSandECHODEPMETSfilesmayreceivespecialstatus;butinallcasestheMETSfilesarepreservedalongwiththeotherpackagecontentfilesandaretreatedasfirstclassobjectswithregardtotherepository.Thatway,whenthepackageisretrievedfromtherepository,allthemetadatapertainingtothestateofthepackagebeforeitwassubmittedtotherepositoryisnotlost.

3.2.5.3. WorkflowRecapThroughtheworkflowprocessdescribedabove,HandSprovidestoolstofacilitatemovingdigitalobjectsbetweenmultiplerepositorieswhilegeneratingandmaintainingimportantPREMIS‐basedtechnicalandprovenancepreservationmetadata.Digitalobjectsareretrieved,convertedtoacommonprofile,validated,

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

50

enrichedwithmetadata,transformedintoarepository‐compatibleform,andingestedintothetargetrepository.

3.2.6. HandSTechnicalImplementationThekeytechnicalcomponentsoftheHandSimplementationaretheHubandSpokeMETSprofileJavaclasses,providinganextensibleJavaAPItoourMETSprofilewithApacheXMLBeans;theTo‐andFrom‐HubPackagermodules,facilitatinginteroperabilitythroughpluggableinterfaces;andtheLightweightRepositoryCRUDService(LRCRUD),supportingthedisseminationandsubmissionofobjectsbydefiningaprotocolfortransmittingdigitalobjectstoandfromrepositorysystemsoverHTTP.

3.2.6.1. HubandSpokeMETSProfileAPIThecoreoftheHandSToolSuiteisourMETSProfileAPI,aJavacoderepresentationofaMETSXMLdocumentcompiledfromourMETSprofile.ThebulkofourMETSclasseswerecreatedwithApacheXMLBeans(http://xmlbeans.apache.org/),atoolforgeneratingJavaclassesfromXMLschemafiles(XSDfiles).WithXMLBeans,weareabletocompileXMLschemadocumentstoproduceaJavacodestructure,allowingustoworkwithXMLdatathroughourownJavaclassesandmethods.TocreateourMETSprofileAPI,wecombinemethodsfromXMLBeans‐generatedclassesfromtheMETS,MODS,andPREMISschemas,alongwithformat‐specificpreservationmetadataschemaslikeMIX,TextMD,andAudioMD.Wealsolayercustom‐builtconveniencemethodsontopoftheXMLBeans‐generatedmethodstofacilitateadditionalmanipulationoftheMETSdocumentinafashionuniquetoourMETSprofile.AnewHandSProfileJavaobjectcanbecreatedfromscratchgivenasetofcontentfilesandaccompanyingmetadata,orbyinstantiatinganexistingXMLMETSdocumentthatconformstoourprofile.Onceinstantiated,theunderlyingMETSdocumentobjectcanbeoperateduponprogrammaticallythroughAPIcalls.InworkingwiththeAPI,weareassuredthatanymanipulationoftheMETSdocumentwillalwaysbeconsistentwithourMETSprofile.Forinstance,toaddanewfiletothepreservationpackage,acallismadetothetotheaddFile()method,whichinturntriggerscallstoothermethodsthatensuretheMETSobjectremainsconsistentwithourprofile—suchasaddinganewPREMISObjecttechMDsectionassociatedwiththenewfile,andgeneratingchecksum,MIME‐type,andfilesizevalues.AtanytimetheMETSobjectcanbevalidatedagainstourprofile,orre‐serializedasXMLandsavedtothefilesystem.

3.2.6.2. To‐andFrom‐HubPackagersTofacilitaterepositoryinteroperability,theHandSToolSuiteincludesasetofpackagerclassesfortransformingacollectionofpreservationitemsintoaHubPackage,andfortransformingaHubPackageintoaformrequiredforsubmission

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

51

intoagivenrepository.ForitemsenteringtheHubandSpokefromadigitalrepository,therepository‐specificTo‐Hubpackagertakesthenativerepositorydissemination,unpacksit,andinstantiatesanewHandSMETSProfileobjectfromitscontents.Goingtheotherway,arepository‐specificFrom‐HubpackagerpreparesaHubPackageforsubmissionintotheparticularrepository.Currently,wehaveTo‐HubpackagersforprocessingitemscomingfromDSpace,EPrints,OCLC’sWebArchivesWorkbench,orfromadirectoryoffiles.OurcurrentlistofFrom‐HubpackagertargetsincludesDSpace,EPrints,andtheLibraryofCongressarchivestandardBagit.To‐andFrom‐HubpackagermodulesfortheFedorarepositoryarecurrentlyindevelopment.Wehaveemployedapluggablearchitectureforcreatingpackagermodules.BaseTo‐andFrom‐HubclassesareimplementedinJavaasabstractclasseswiththeintentionthattheywillbeoverriddenandextendedbyotherprogrammersneedingtotailortheHandSToolstotheirspecificrepositoryorarchivingstandard.Thismodulararchitectureallowsotherdeveloperstocreatepackagerplug‐insfortheirownrepositorysystemswithouthavingtorecompileorre‐factortheexistingHandScodebase.

3.2.6.3. LightweightRepositoryCRUDService(LRCRUD)TheLightweightRepositoryCRUDservicespecificationdefinesdisseminationandsubmissionweb‐serviceinterfacestodigitalrepositorysystemsforusewiththeECHODEPHubandSpokeToolSuite.TheLRCRUDspecificationdefinesaprotocolfortransmittingdigitalobjectstoandfromrepositorysystemsoverHTTP.ItenablesuserstoobtainobjectsinaformatexpectedbytheHandSprocessingscriptsandsuppliesdigitalobjectstorepositoriesinaformatexpectedbytheirnativeingestionmechanisms.Thespecificationisimplementation‐agnostic:itsimplydefinestheparametersandresponsesrequiredtoenableaserviceimplementationtocommunicatewiththeLRCRUDclientapplication.ThisallowsLRCRUDimplementerstochoosethemostappropriateenvironmentandprogramminglanguagesforinteractingwiththeirchosenrepository.TheHandSToolSuitecurrentlyhasLRCRUDimplementationsforDSpace,EPrints,andFedora.TheLRCRUDServicefollowsRepresentationalStateTransfer(REST)conventions.ItexposesCRUDactionsonrepositorycontentovertheHTTPprotocol.Asmentionedabove,CRUDisanacronymforCreateRetrieveUpdateandDelete–thebasicoperationsthatapplicationsshouldimplementwhenactinguponpersistentstoragelikerelationaldatabasemanagementsystems,filesystems,andthelike.TheLRCRUDclientcommunicateswiththeLRCRUDserviceviaHTTPmethods,statuscodes,andheaders.ThelistbelowshowshowtheCRUDactionsaremappedtotheHTTPmethods:

• Create==POST• Retrieve==GET• Update==PUT

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

52

• Delete==DELETE

InmostcasestheLRCRUDservicewillresideonthesamehostastherepositoryitservessothatithasaccesstotherepository'sAPI.LRCRUDisessentiallya"dumb"packager;itissimplyawaytosupplyfilestotheremoterepositoryinanyformat/configurationthatitcannativelyingest.InthisitissimilartoprotocolsliketheSimpleWebServiceOfferingRepositoryDeposit,orSWORD(Allinson,François,&Lewis,2008),whicharebeingadoptedbyrepositories‐‐andwhichmaymakethesubmissionfunctionofLRCRUDultimatelyunnecessary.Itmaybebeneficialtopresentsomedescriptivestep‐by‐stepexamplesinordertoclarifythefunctionsoftheLRCURDcomponentswithintheHubandSpokeToolSuite.Theseexamplesdescribeindetailtheinteractionsbetweentheclientandtheservice.

3.2.6.4. LRCRUDFunctions‐‐Examples

3.2.6.4.1. Dissemination

Figure8:LRCRUDDissemination

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

53

Dissemination(seeFigure8)istheactofretrievinganitemfromarepository,wherebytheitemisdefinedasanintellectualentitycomprisinganynumberofcontentstreams,metadatastreams,andothersupportingartifacts.ItemsdisseminatedfromarepositoryusingtheLRCRUDservicearemostlikelyboundforprocessingandtransformationbytheHandSToolSuiteto‐hubpackager.ThepackagercreatesMETSfilesconformanttotheHandSprofile,extractsandaugmentstechnicalmetadata,andrecordsprovenanceinformation.Describedbelowarethefourmajorstepsinnegotiatingdissemination:

1. TheLRCRUDclientsubmitsanHTTPGETrequesttotheLRCRUDservice.TheGETrequestprovidestheIDoftheitemdesiredviatheLRCRUDserviceURLsyntax.

2. Theservicecallstherepository'snativedisseminationroutinefortheIDindicated.

3. Theservicereceivestheoutputfromthedisseminationandaddstheentirecontentintoazip‐file.

4. Theservicereturnsthezip‐filecontainingthe"package"totheclient.

3.2.6.4.2. Submission

Figure9:LRCRUDSubmission

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

54

Submission(seeFigure9)istheactofeither1)addinganitemtoarepositoryforthefirsttime;or2)updatinganitemalreadyintherepository.Describedbelowistheprocessofaddinganitemtoarepositoryforthefirsttime.Thisisatwo‐stageprocess;thefirststagereservesanidentifierinthesystem,whilethesecondactuallyplacescontentintherepository.Stage1­Createstubrecordtoreserveanidentifier:ItiscriticaltonotethatthepackageitselfisnotuploadedaspartofthePOSTrequest;rather,thePOSTrequestcreatesonlyastuborplaceholderrecord.ThereasonthattheactualpackageisnotuploadedaspartofthePOSTisthattheidentifierassignedtothepackagebytherepositoryneedstobeembeddedintheMETSfilewhichispartofthepackage.ThetypicalsequenceofoperationstoingestanewpackageistousePOSTtocreateanewplaceholderrecordandgettheidentifierforthatrecord.Thatidentifieristhenusedtoupdateprovenanceandothermetadatathatispartofthepackage,andthentheplaceholderrecordisupdatedoroverwrittenwiththeactualpackageusingthePUTaction.Themajorstepsinthisprocessare:

1. TheLRCRUDclientissuesaPOSTrequesttotheLRCRUDservicespecifyingtheIDof"where"tocreatetherecord(e.g.inaspecificcollection)ifneeded.

2. Theservicecallstherepository'snativeitemorIDcreationroutine.3. TherepositorysuppliestheservicewiththeIDforthenewly‐createdrecord.4. TheservicerespondstotheclientwithanHTTP201"Created"messageand

returnstheIDintheLocation:header.Stage2–Uploadandingesttheitem:Inthisstage,theitemisuploadedandplacedintherepository.Thisistheexactprocessforupdatinganexistingitem:

1. TheLRCRUDclientissuesaPUTrequesttotheLRCRUDservicetoreplacethepackageidentifiedbythesuppliedURI.Theentitybodyoftherequestwillcontainazip‐filecontainingthe"package"tobeingested.

2. Theserviceunpacksthefilesandcallstherepository'snativeingestionroutine.

3. TheservicerespondstotheclientwithanHTTP204"NoContent"messageindicatingthattherequestwassuccessful.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

55

3.2.7. LessonsLearnedBelow,innoparticularorder,areseveralkeylessonslearnedduringthecourseofdevelopingtheHandSarchitecture.

3.2.7.1. Mergingmetadataispotentiallyrisky.Afterexpendingmucheffortexploringhowwemightmergedifferentversionsofmetadatafilessoastomaintainasinglemastermetadatafile,wereachedtheconclusionthatthiswasverydifficultproblem,andpotentiallydangerousintermsofdataloss.Thisrealizationledustoourcurrentarchitecture,whichskirtstheissueofmergingmetadataintoasinglefilebymaintainingmultipleSnapshotMETSfilesallreferencedfromacommonMasterMETSfile.

3.2.7.2. METSsupportsmultiplemetadataformatswell.CombiningPREMIS,MODS,andotherXML‐basedtechnicalmetadataformatsintoasingleMETSdocumentworkedwellforthisparticularproject.ThegeneralstructureofMETSseemedtolenditselftoconstructingpreservationpackages.Ourconceptualmodel,whichwasdirectlyinfluencedbytheMETSandPREMISstructures,consistedatahighleveloftheintellectualentityhavingoneormorerepresentations.Theserepresentationsandalltheircomponentpartsweretheprimaryfocusofthepreservationefforts.TheMETSfileitselfistreatedastheabstractparentrepresentationoftheintellectualentity.However,therearealsooneormoreconcreterepresentationsconsistingofeachstructMapwithintheMETSfile.TheserepresentationsconsistoftherelationshipsembodiedinthestructMap(andpossiblytherelatedstructLinksections);thefilesandbitstreamsreferencedfromthestructMap;andtheassociateddescriptivemetadata(dmdSec),whichcouldbereferencedviathestructMaporviaindividualfilesorbitstreams.AllremainingpartsoftheMETSdocument,primarilytheheader(metsHdr)andadministrativemetadata(amdSec)sections,arenotconsideredpartoftheintellectualentity’srepresentationsbutare,instead,metadataabouttheserepresentations‐‐mostlyconcernedwithpreservationandthushavingastrongfocusontechnicalandprovenancemetadata.Therewerepragmaticchallengesingettingthesedisparatemetadatastandardstoworktogether,however,andthenextparagraphconveysonesuchexample.

3.2.7.3. ImplementingPREMISinMETSrequireshigh‐levelstructuraldecisions.EmbeddingPREMISmetadatawithinaMETSpackagewasnotanintuitiveundertaking.Therewereseveralreasonsforthis.Amongthesewerevariousoverlapsinthemetadatafieldssupportedbyeachstandard.Whenfacedwiththeseoverlapsourgeneralapproachwastoprovidethemetadatainbothplaces.Althoughthisapproachintroducedduplicationandtheopportunityforinconsistenciesintothemetadata,wefeeltheaddedflexibilityinprocessing

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

56

compensatedfortheseshortcomings.Moreover,theHandStoolsuitevalidationstepsensuredthatthesetypesofinconsistencieswerenotpresent.DecisionswerealsorequiredastowheretoembedthePREMISentitieswithintheMETSfile.Whiletheseentitiesareclearlyalladministrativemetadata,theydonotalwaysfitneatlywithinoneofthefoursubgroups,techMD,digiprovMD,sourceMD,orrightsMD,providedbyMETS.RefertotheECHODEPMETSprofiles(Habing,2005)fordetails.ProjectstaffalsoparticipatedinaworkinggroupchairedbyRebeccaGuentherattheLibraryofCongresstoaddressthisissue.Theworkinggroupproducedareport,GuidelinesforusingPREMISwithMETSforexchange(Guenther,2008).

3.2.8. NextSteps:theHubandSpokeDevelopmentoftheHubandSpoketoolsuiteisongoing.Thelatestversionsofthesourcecodecanbedownloadedfromtheproject’sSourceForgewebsite:http://sourceforge.net/projects/echodep/.RecentdevelopmentsincludetheadditionofaFrom‐SpokefortheBagItspecification(Boyko,Kunze,Littman,&Madden,2008)andmodificationstosupportversion1.5ofDSpace.WorkiscontinuingapaceonbothFrom‐andTo‐SpokesfortheFedorarepositorywithparticularattentionbeingpaidtohowourMETSprofilemightbeaccuratelymappedtoaFedoracontentmodel,reducingtheneedforpotentiallylossymappingsashavethusfarbeenrequiredforotherrepositorysystems.Theprojectisalsolookingatotherpotentialrepositories,suchasLOCKSSorCONTENTdm,forSpokedevelopment.InadditiontodevelopingnewSpokes,wearealsomonitoringdevelopmentswiththenextversionofJHOVE,aswellaswiththeGlobalDigitalFormatRegistry(GDFR),toexplorehowthesetoolsmightbeusedtoenhancetheformat‐specifictechnicalmetadatawearecurrentlygeneratingfordifferentfiletypes.

3.2.8.1. SupportingPreservationNowandintheFutureTheHubandSpoke(HandS)frameworkenhancestheinteroperabilityandpreservationfeaturesofexistingopen‐sourcerepositorysystems.Itprovidesasuiteoftoolstofacilitatemovingdigitalobjectsbetweenrepositoriesmoreeasilywhilesupportingthecollectionoftechnicalandprovenanceinformationcrucialforlong‐termpreservation.Itisintendedtosupportcurators’effortstodaytomanagecontentinmultiplerepositorysystemsandtopreservevaluablepreservationdatainaccordancewithemergingdigitalpreservationstandards.Inthelongterm,however,weseetheneedforthenextgenerationofdigitalrepositoriestodomoreinordertosupportourabilitytopreservethemeaningofthedigitalobjectsmaintainedinrepositories.Currentrepositorysystemspreservethestructuresofdigitalobjects,fromwhichmeaningorsemanticsmustbeinferred.Learningfromreal‐worlddatamigrationexamplesfromtheHandSefforts,GSLISandNCSAresearchersareworkingtomodelhowsemanticinferencecapabilitymayhelpnext‐generationarchivespreservethemeaning(notjustthestructures)of

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

57

digitalobjectsandheadofflonger‐termpreservationrisks.Specifically,wearedevelopingautomatedreasoningtechniquestargetedatidentifying,andeventuallycorrecting,problematicmetadatadescriptions.Thisworkisprofiledseparatelyinthenextsection(Section4).

3.2.9. ConclusionWithdigitalpreservationstillinitsinfancy,manychangestoemergingstandards,strategies,andmethodologiescanbeexpectedinthecomingyears.TheHubandSpokeframeworkprovidesamodelthatattemptstoincorporatecurrenttechnologiesandbestpracticesfromthefieldtosupportdigitalpreservationincurrentrepositoryenvironments.ItimplementsMETSandPREMIStoprovideastandards‐basedmethodforpackagingcontentthatallowsdigitalobjectstobemovedbetweenrepositoriesmoreeasilywhilesupportingthecollectionoftechnicalandprovenanceinformationcrucialforlong‐termpreservation.HandSisintendedtohelpcuratorsofdigitalobjectstodaybyprovidingimprovedsupportforpreservationandinteroperabilitytoexistingrepositorysystems.Ultimately,inordertomeaningfullypreserveourdigitalcontentovertime,wewillneedthenextgenerationofpreservationtoolstosupportautomaticinferenceofmeaning,orsemantics,fromchanged—andthuspotentiallyambiguous—informationstructures.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

58

4. PreservingMeaning,NotJustObjects:SemanticsandDigitalPreservation

AkeygoaloftheECHODEPositoryprojectistoinvestigatebothpracticalsolutionsforsupportingdigitalpreservationactivitiestoday,andthemorefundamentalresearchquestionsunderlyingthedevelopmentofthenext‐generationofdigitalpreservationsystems.Earlierinthisreport,wereviewedtwoareasofactivitythataimtosupporton‐the‐groundpreservationeffortsinexistingtechnicalandorganizationalenvironments:theWebArchivesWorkbench,asuiteoftoolstohelpcuratorscollectandmanageweb‐baseddigitalresources;andtheHandStoolssuite,whichaimstoenhanceexistingrepositories’supportforinteroperabilityandemergingpreservationstandards.Inthelongerterm,however,werecognizethatsuccessfuldigitalpreservationactivitieswillrequireamorepreciseandcompleteaccountofthemeaningofrelationshipswithinandamongdigitalobjects.Thissectiondescribesprojecteffortstoidentifythecoreunderlyingsemanticissuesaffectinglong‐termdigitalpreservation,andtomodelhowsemanticinferencemayhelpnext‐generationarchivesheadofflong‐ternpreservationrisks.

4.1. Introduction:TheNeedforaSemanticsofPreservationApproach

4.1.1. ThePreservationSemanticsProblemLikeanyinformationmanagementactivity,digitalpreservationeffortsareguidedbyhumanunderstanding.Decisionsaboutdocumentingafileformat,emulatinganenvironment,ormigratingfromonesystemtoanotheraremadewithanunderstandingofhowlevelsofdigitalexpressioncascadeandinterrelate:voltage,bit,octet,pointer,integer,grapheme,pixel,polygon,color,pitch,textstring,tree,image,tuple,file,andsoon.Thecomplexityoftheserelationshipsposesfewseriousproblemsforhumanbeings‐‐infact,theproblemsliepreciselyintheeasewithwhichourmindsinterpretthoserelationships.Long‐termpreservationisdistributednotonlyovertimebutalsoacrosstheresponsibilitiesofmanydifferentpeople.Itisdirectedatcollectionsmuchtoolargetoallowthoughtfulattentiontoindividualresources.Wemustthereforebuildintoourtoolsamorecarefulandpreciseencodingoftheknowledgethatguidesoureffortlessmentaldeductions.Thepreservationhazardsthatresultfromcurrentdescriptivepracticeandourexperimentswithautomatedtoolstoamelioratethoserisksaredescribedinthesectionsthatfollow.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

59

4.1.2. OurGoalOurgoalistounderstandbetterthesemanticproblemsarisingindigitalpreservation,andhowwemightapplythatunderstandingtothedevelopmentofresourcesandtools.Specifically,weareexperimentingwithautomatedinferencesaboutentities,theirproperties,therelationshipswithinandbetweenthem,andhowthesefactsareexpressedinmetadatadescriptions.Enrichingthatmetadatawithnewdeducedassertionsisonestepinheadingoffdigitalpreservationrisks.Weareworkingtowardadeductivesystemforreasoningaboutanomalousorincompletemetadata.Theaimisnottoautomaticallydeduceallmissinginformationortocorrectmalformedrecords,buttocallhumanattentiontodescriptionsthatareproblematicorsuspicious.Ourworkbeginswithananalysisofthekindsofsemanticproblemsposedbycurrentdescriptivepracticeandmetadataschemas,informedbyanalysesofreal‐worlddatamigrationexamples.Wehaveappliedtheunderstandinggainedinthisanalysistothedevelopmentofadraftmetadataontology(discussedinSectionC),whichmovesustowardamoreformalunderstandingofhowdescriptiveinformationaboutarchiveddigitalresourcesisstructured.Thismetadataontologyiskeytoaproof‐of‐conceptexperimentalsystemcomposedoftheRDFrepositoryTupeloandtheBECHAMELreasoningsoftware.

4.2. TheProblems:UnderstandingSemanticPreservation

4.2.1. ProblemsPosedbyDescriptivePracticeandStructuresInmanypreservationeffortsmetadatadescriptionmayseemstraightforward,butcrucialinformation‐–includingfactsthatseemobviousatfirstglance‐‐isleftunstated,andmustbeinferredbyhumanreaders.(Anexampleisprovidednext.)Asdiscussedabove,thissituationmaynotberiskywhenpeopleareavailabletoreasonaboutindividualrecords,butahuman‐basedmanualapproachdoesnotscaleoverlargecollectionsizesorovertime.Thesheervolumeofdigitalinformationmeansweincreasinglyrelyonautomatedmachineprocessingofrecords.Butsoftwaretoolsexecutetransactionsusingonlyknowledgethathasbeenexplicitlyrepresentedforthem.Ouraimthereforeistomakethoseunstatedfactsavailableinaformthatsoftwarecanuse.Thisworkbeginswithaninvestigationofthekindsofsemanticproblemsposedbycurrentinformationstructuresandimplementations.Theseproblemsbreakdownintothreebasiccategories:

• Semanticproblemsrelatingtodescriptivepractice• Semanticproblemsrelatingtoencodingstandards

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

60

• Semanticproblemsrelatingtometadataschemadesign

4.2.1.1. Semanticproblemsrelatingtodescriptivepractice.Someoftheproblemswefacearearesultofhowresourcesaredescribedusingmetadata,whileotherproblemsarisebywayofhowthosedescriptionsareexpressed,andwhathappenstothemovertimeastheyaremigratedfromonesystemtoanother.OnesemanticproblemofparticularinteresttousiswhatRenearetal(2002)describeas"ontologicalvariationinreference."Essentially,metadatacanfailtomakecriticaldistinctionsinwhat,precisely,itisdescribing.Theproblemisillustratedinthemetadataexamplebelow,whichshowspropertiesassertedatanumberofdifferentlevelsofabstraction.

Figure10:ExampleofMultipleLevelsofAbstractioninMetadataDescriptionWeseeinthisexamplepropertiesoftheimageitself(likeitstitleandsubjectmatterinlines8and23)describedalongsidepropertiesofthefilewhichencodestheimage(itsMIMEclassificationinlines2and12),propertiesofthemetadatadescription(itscreationdateinline28),andpropertiesoftherepositorysoftwareobjectthat

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

61

expressesthemetadatadescription(e.g.,thatitdisseminatesresources,andhasparticulardatastreamsassociatedwithit;lines1,3,7,11,18,and22).Themainpreservationriskproceedingfromthismixingoflevelsistheinabilitytodistinguish,withoutsemanticinformationabsentinthedescription,thelevelatwhichaparticularpropertyapplies.Forexample,whatisitexactlythathasaMIMEclassificationimage/jpeg?IsittheFedorarecordorisitoneorbothofthedatastreams?Ahumanreadercaneasilyresolvethatkindofambiguitywithoutconsciouseffort,butpreservationtransactions(suchasmigration),whicharetypicallyexecutedthroughsoftware,cannot.Thepreservationaimhereispresumablytopreserveaccesstotheimage.Thataimmayormaynotdependonpreservingthejpegfileexpressingtheimage,andpreservingtheFedoraobjectthatexpressesthemetadataisalmostcertainlynotarequirement.Thisexamplethereforeillustratestheproblemofmixedlevelsofdescription.Weneedtoclarifyandenrichmetadatadescriptionsbylinkingtheirassertionsexplicitlytotheappropriateentities,orelsedrawtheattentionofhumananalyststorecordsthatcannotbedisambiguatedautomatically.

4.2.1.2. Semanticproblemsrelatingtoencodingstandards.Inadditiontoproblemsofdescriptivepractice,wefacesemanticproblemsstemmingfromlimitationsoftheencodingtechnologiesinwhichmetadatadescriptionsareexpressed.Theseproblemsgenerallyfallintooneofthefollowingtwocategories:

4.2.1.2.1. Syntacticoverloadinginconventionalmarkup.

Familiarencodingtechnologiesformetadatadescriptions,suchasthosebasedonXML,workwellmostofthetime,buttheyhavecertainfundamentalproblems,suchasthoseasstemmingfromtheuseofmultiplecompetingsemanticrelationshipsandofunstructureddata.Specifically:

• Competingsemanticrelationships:Preservationmetadataformatstypicallyoverloadasimplesyntaxwithmultiplecompetingsemanticinterpretations.TypicalexamplesincludeXMLapplicationswhereasmallnumberofsyntacticrelationships(e.g.,theparent/childrelationshipbetweenelements)representanynumberofsemanticrelationships(whole/part,propertyname/value,etc.)thatarecontextdependent.Oftenapreciseinterpretationofthesesemanticscanbefoundonlyintheexecutionofapplicationsoftwarethatconsumesthefile‐‐and,presumably,inthemindoftheprogrammerwhowrotetheapplication.

• Unstructureddata:Theinformationinresourcedescriptionsmayonlybeincompletelyavailableformachineprocessingandverification.Crucial

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

62

contextualdatamayexistonlyasnaturallanguageannotationsorasunstructuredinformationinthecontentofmetadatafields.

ThemetadataexamplepresentedinFigure10doesnotexhibitproblemsofsyntacticoverloading,becauseitconformstoastandardserializationoftheRDFabstractmodelinwhichpropertiesandrelationshipsareexplicitlyidentified.Butthesecondproblemisevidentinhowmuchinformationinthisdescriptionisexpressedinnaturallanguagetextandannotations.(Note,forexample,thedc:dateelement(line28)inwhichthedocumentedevent(scanningandprocessing)isprependedtothedatestring.)

4.2.1.2.2. Problemswithobjectmodels.

Otherpotentialsemanticproblemsstemmingfromlimitationsofmetadataencodingtechnologiesconcerntheobjectmodelsofrepositorysystemsthemselves.Modelingdecisionsinrepositorydesigncancreatedescriptiveartifactsthatleavetheirmarkevenafterrecordmigration.Forexample,arepositorymaymingleinformationaboutrepositoryobjectswiththeinformationthattherepositoryobjectsaremeanttopreserve,creatingproblemswhenthoserecordsarefurtherprocessedandcontextualinformationisnolongeravailabletohelpinterprettherecordsandmakefurtherpreservationdecisions.AgoodillustrationofthisissuecanbeseenbyrevisitingthemetadatadescriptioninFigure10,whichwasserializedfromtriplesthatwereextractedfromtheRDFdatabasebackingaFedorarepositoryinstallation.InFigure10,noticethatinRDFtermsthisentiremetadatadescriptionis"about"anobjectidentifiedasinfo:fedora/changeme:97(line1).Thisrepositorysoftwareobjectistheonlyresourceidentifiedbyanrdf:typearc,andisthereforetheonlyentitywithanobjectclassidentification.Barringanyexplicittypeidentificationinaresourcedescription,FedoraobjectsseemtobetheonlykindofthingthattheFedorarepositoryknowsabout.Expressedinthatform,wecannotpreserveanyinformationexceptFedorarecords,andthoserecordsassertnoexplicitpreservationtargets.AsystemlikeFedoracanpreserveobjectswithinthecontextofitsowntransactions,buttheimplicitknowledgedirectingsuchoperationsdependsontheinterpretationofprogrammers,withalltheproblemsdiscussedsofar.Ontheotherhand,itisnotadesignflawofFedorathatitsmetadatarecordiscenteredinternallyontheFedoradigitalobject.Preservationontologyisproperlyamatterofdescriptivepractice,notsoftwareengineering.Infairness,ourmetadataexamplecomesfromamigrationscenarioinwhichRDFtriplesareextractedfromFedora'sRDFstoredirectly,ratherthanthroughaconventionalexportprocess.ButthisexampleservestoremindusthatobjectmodelinginasystemsuchasFedoraplaysthesameroletothesameendsaswithotherkindsofsoftware:efficientsourcecodemanagementbyandforthesystemdevelopers.Objectmodelingdecisionsarenotintendedandcannotbeexpectedtoaddresstheweaknessesof

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

63

resourceanalysisanddescription.Forlong‐termpreservation,therefore,itisimportanttoreduceambiguousorimplicitsemanticsinrepositoryobjectmodels.Thatcanmeaneithermodifyingthosemodelsor,aswehaveattempted,providingtoolsandtechniquesformigratingfromrepositoryobjectmodelstomodelsthatincludebetterrepresentationsofpreservationtargets.

4.2.1.3. Semanticproblemswithpublishedmetadataschemas.Finally,inadditiontosemanticproblemsrelatingtodescriptivepracticetoencodingstandards,weseesemanticproblemsstemmingfromlimitationsofpublishedmetadataschemasthemselves.Publishedschemasformalizeelementsetsonwhichtheproperty/valueascriptionsarebased.Eachofthesemetadataschemesnotonlyexpressesitsuniqueviewoftheuniversebutisitselfgroundedinbasicontologicalassumptions.Avarietyofambiguitiescanstillarise,asillustratedbelow,drawingagainonourrunningexamplefromFigure10.Weneedtobeginbyunderstandingthelogicalpartsofthemetadatarecordandtheirrelationshipstooneanother.Ametadatarecorddescribessomeentity‐‐aninstanceofaclassliketheclassofbooks,images,oraudiorecordings.Metadatadescriptionslistpropertiesofthatentity,eachofwhichhasavalue.Forinstancethe“author”propertyofthebookmighttakeasitsvaluethenameoftheauthor.Membershipinaclassrequiresthattheinstancerespectdefinedclassconstraints(movies,forexample,haverunningtimes,butbooksdonot).Considerthemetadatastatementdc:type>image</dc:type>(line9)fromouroriginalexampleinFigure10.Weeasilyrecognizethattheword"image"pointsustoaninstanceofaclass,justasthenameofanauthorpointsustoaparticularperson.Ahumanreaderwouldneverconcludethatabookwasauthoredbyanameorbyastringexpressinganame.Similarly,theword"image"licensesourinferencethatthepropertyvaluefordc:typeisaclassofentitiesintheworldratherthan,forexample,aquantity(suchas14centimeters)oraquality(likemonochrome).Inthiscasewearecuedtotheexistenceofnotjustanyentity,buttotheverytargetofourpreservationefforts‐‐somethingmuchmoreimportanttousinthelongrunthanthedigitalfileorthebitsequencethatonlyexpressesthisimagecontingently.Computersoftwarecannotmakethosekindsofmeaningfuldistinctionswithouthelp.Onekindofhelpwouldbeaconstraintontherangeofallowablepropertyvalues,buttheDublinCoreelementschemaenforcesnosuchconstraint:dc:typecantakeanyvaluethatindicatesthe"natureorgenreoftheresource."(DCMINamespacefortheDublinCoreMetadataElementSet,Version1.1,2008)Asecondkindofhelpwouldbeavaluestringthat,throughitsmachine‐readablestructureornotation,indentifiesaclass.InanRDFexpressionthiswouldbeaURIlinkedbyanrdf:typepropertytosomeclassdeclaration.TheDCMITypeVocabulary

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

64

hasthisstructure(http://dublincore.org/2008/01/14/dctype.rdf),andinterestingly,thescopenotefordc:typeintheDCElementsRDFschemarecommendstheuseofthatvocabulary(http://dublincore.org/2008/01/14/dcelements.rdf).HadtheauthorofourmetadatadescriptionusedtheURIdcmitype:Image,insteadoftheword"image,"wewouldbeonestepclosertoidentifyingtheabstractimageasanentity.Theword"image,"althoughitcontainsthesamesequenceofletters,isnotlinkedinastandardizedwaytothedeclarationofaclass.AssigningaDCMItyperesourcetothedc:typeelementsimplifiestheinferencethatanimageexists,andthatoneormoreofthemetadatastatementsinthatdescriptionareascribingpropertiesofanimage–semantically,asignificantstep.Butasourrunningexamplestands,theschema'sflexibilityinvitesambiguity,andadditionalinformationisnecessarytoconnecttheliteralvalue"image,"withaformalizedclasssuchasdcmitype:Image.

4.2.2. UnderstandingtheSemanticPreservationProblem:SummaryWehaveseeninthissectionthatdescriptivepractice,encodingstandards,andpublishedspecificationsmayallcomplicatedigitalobjectpreservation.Impreciseresourcedescriptionscanmakeitimpossibletodeterminethelevelatwhichaparticularpropertyapplies.Theflexibilityofferedbyencodingstandardsbringsrisksaswellasbenefits.We’vealsoseenhowobjectmodelingdecisionsandsemanticallyunderspecifiedmetadataschemascanleadtoincorrectorambiguoususage.Inthenextsection,wemovefromunderstandingthecoresemanticproblemsassociatedwithdescriptivepracticeandstructurestolookingattheresourcesandtoolsbeingdevelopedbytheECHODEPositoryprojecttoidentifysemanticambiguityinreal‐worldmetadatadescriptions,andhighlightpotentialpreservationrisks.

4.3. TowardMoreCapableArchivesandRepositories

4.3.1. Recap:TheneedforautomatedinferencecapabilityDigitalresourcepreservationeffortsaredistributednotonlyovertimebuttypicallyacrosstheresponsibilitiesofpeoplewhomayneverconsultwithoneanother.Transactionslikemigrationbetweensystemsareexecutedoverlargecollectionswherecloseattentiontoindividualrecordsistooexpensive,butwherecorrecttreatmentofaresourceoftendependsonknowledgethatisincompletelyorimpreciselyrepresentedinpreservationmetadata.Suchambiguitiespresentfewproblemsforhumanbeings:ourflexiblemindsmakecorrectinferenceswithoutconsciouseffort.Butthedatatosupportthoseinferencesarenotexpressedinaformthatcanguidetheexecutionofourprogramsandutilities.Wethereforeneedtoolsandmethodsthatsupportthediscoveryandcorrectionofpreservationrisks.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

65

Thenextsectiondescribesourexperimentsindevelopingthesemethodsandtools.Specifically,welookatourdevelopmentofanontologyofmetadatadescriptions,theResourceDescriptionVocabulary,andhowthismaybeappliedtoidentifysemanticambiguityinmetadatadescriptionsusingthereasoningtoolBECHAMEL.

4.3.2. BECHAMELandBuildingaMetadataOntologyBECHAMELisatoolforexpressingandtestingsemanticmodelsofdigitalresources.(Dubinatal,2003)IthasbeendevelopedbyresearchersattheUniversityofIllinois,theWorldWideWebConsortium,andtheUniversityofBergen.ABECHAMELapplicationcan,forexample,translatethebibliographicmetadataforajournalarticlefromonestandardformatintoanotherbyconstructingamodeloftheauthor'saffiliationwithaninstitution.(Renear&Dubin,2003)InourrecentandcurrentexperimentstheinputtoBECHAMELaremetadatadescriptionsretrievedfromanRDFrepository(Tupelo),togetherwithschemasdefinedintheOWLWebOntologyLanguage(OWL,2004).Newfactsdeducedfromthoseinputsareaddedbacktotherepositoryasannotationstothedescription.Atechnicaloverviewofthisapproachispresentedinthefollowingsections.

4.3.3. OvercomingSemanticProblemsinMetadataEncoding:AResourceandDescriptionVocabulary

Ouraimistoenrichmetadatawithnewassertionsinferredfromexistingresourcedescriptions.Towardthataimwehaveidentifiedclasses,properties,andrelationshipsforovercomingencodingproblems,andwehaveexpressedtheseinaschema.Thisvocabularydoesnotrepresentclassesorpropertiesforspecifictypesofresources.Instead,itoffersanontologyofmetadatadescriptionsthemselves.Simplystated,thevocabularyincludestermsthatcanbeusedtodescriberecords,metadatadescriptions,andrelationshipsbetweenthemandpreservationtargets.(TheResourceandDescriptionVocabularyisprovidedinAppendix6.7.)Morespecifically,thevocabularyisdividedintothefollowingsections:

• W3Cstandardclassesandpropertieso Theseincludeclassesandpropertiessuchasrdfs:Resource,

rdf:Statement,andowl:ObjectProperty.• Alternatereificationclasses

o ConventionaluseoftheRDFreificationvocabularyisbasedonanunderstandingthattriplesstandinatype/instancerelationshipwith"tokens"appearinginRDFdocuments(RDFSemantics,2004).Butthisinterpretation,intendedtosupportprovenancedocumentation,presentspuzzlesforunderstandinghowaserializedexpressioncanstandindirectrelationshipswithresourcesreferredtobyanabstracttriple.(ForthoseanalystswhomaybeconcernedwithabusingtheofficialaccountofRDFreification,thevocabularyincludesseparate

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

66

classesforgeneralizedstatements,RDFstatements,andabstracttriples.)

• Indicationrelationshipso Thissectionincludesagroupofhierarchicallyorganized

relationships,basedonrecommendationsinPiotrKaminiski's2002thesis.Therelationshipsincludeindication,representation,denotation,identification,description,depiction,ascription,expression,andencoding.

• ClassesbasedontheDCMIAbstractModelo IntheDublinCoreAbstractModel,theterm"metadataelement"is

usedsynonymouslywiththeterm"property."Butourclasses,thoughbasedonthatmodel,representmetadataelementsasspecializednamesofproperties,ratherthanaspropertiesthemselves.Classesinthissectionincludemetadataelement,metadataelementset,metadatastatement,andmetadatadescription.

• Markupstructureso AthirdalternativetoreifyingRDFstatementsundertheofficialW3

interpretation,orthroughuseofalternateclasses,istoreifythenotationexpressingtheRDF.ThissectionofthevocabularyincludesclassesforXMLelements,XMLdocuments,XMLschemas,XMLattributes,andURIs.

Insummary,theResourceDescriptionVocabularyisanontologyofmetadatadescriptionsthemselves.ItsaimistoprovideasemanticallysoundframeworkforovercomingtheencodingproblemsdescribedinSection4.2ofthisreport.Thenextsectionwalksusthroughademonstrationofhowthisontology,asusedbyBECHAMEL,canhelptohighlightpotentialpreservationrisks.

4.3.4. ResolvingSemanticAmbiguity:anInferenceExampleInSection4.2ofthispaperwediscussedproblemsofdescriptivepractice,encodingstandards,andschemadesign.Nowwepresentanillustrationofhowourinferencingsoftwarerespondstothoseproblems.Intheexamplebelow,anambiguousmetadatastatementfromtherecordshowninFigure10isidentifiedandassociatedwiththeimpliedpreservationtargetitdescribes.Figure11belowshowsoneRDFstatementextractedfromFigure10,ourrunningexample:

Figure11:AfragmentoftherecordshowninFigure10.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

67

ThisRDFstatementviewshowsthestringvalue"image"assignedtotheDublinCoretypepropertyfortheFedoraObjectidentifiedasinfo:fedora/changeme:97–aswediscussedearlierwhenviewingtheoriginalmarkuprecord(Figure10).Themainissueisoneofclearlyidentifyingthetargetofourpreservationefforts:animageinthiscase.Summarizingtheconcernsdiscussedearlier:

o TheFedoraobjectisanamorphousresource,whichseemstosharepropertiesoftheimageitself,theimagecontent,andthebitstreamencodingtheimage.TheFedoraobjectcannot,therefore,beourpreservationtarget.

o Accordingtotheformalschemadefinition,theDublinCoreTypepropertyindicatesthe"thenatureorgenreofaresource,"butneednotidentifytheexistenceofanyparticularconcreteobjectorabstractentity.Asalreadyseen,thisvaguenessintheformalschemaopensthedoortotheuseofvalues(suchastheliteralstring“image”)thatarecleartohumanreadersbutwhichposeproblemsformachineprocessing.

o Althoughtheword"image"invitesahumanreadertoinferthatourpreservationtargetisanimage,thatinformationisnotexplicitenoughtosupportautomatedprocessing.Theinferencedependsnotonlyonwordmeaningbutalsoonthetacitbackgroundknowledgethatthepropertyvaluemustinthiscasebeaclass(ratherthan,forexample,aquality,quantity,orname).

Torecapthen,thisimage(Figure11)illustratestherelationshipsbetweentheFedoraobject,theDCelement“type”,andthevalue“image”ambiguouslyexpressedintheoriginalrecord(Figure10).Inthenextstep,webegintoclarifytheserelationships.Figure12belowshowsthefirstinferencestage:

Figure12:BECHAMELhasidentifiedthefragmentasametadatastatement.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

68

ThisRDFstatementshowsthattheoriginaldc:typearchasbeenidentifiedasametadatastatement,andanewTAGURIhasbeengeneratedtodenotethatstatement.AlthoughthisstageoftheprocessingbeganwithconventionalRDFreification,ourassignmentofrdf:subject,rdf:predicate,andrdf:objectpropertiestoournewMetadata_StatementinstanceisadeparturefromorthodoxRDFsemantics.Thisfirststageofinferenceprocessinghasidentifiedthemetadatastatement.Inthenextstagewetakethisastepfurthertoidentifythepreservationtarget.

Figure13:BECHAMELhasidentifiedthemetadatastatementasadescriptionofanimage.Figure13showstheidentificationofthepreservationtarget.ThesysteminfersthatthismetadatastatementmustbedescribinganabstractimagethathasbothaclassidentityandanobjectidentitydistinctfromtheJPEGfile,thebitstreamencodingthatfile,thegeographydepictedintheimage,andtheFedoraobjectthatservesasthelocusforpropertyattributionsatallthoselevels.Inaddition,themetadatastatementisidentifiedasbelongingtoametadatadescription.Identifyingthepreservationtargetshouldsimplifythevalidationoflaterpreservationtransactions,makingiteasiertoverifythatessentialpropertiespersistacrossmigrationsandthroughtranslationsfromoneformattoanother.

4.3.5. AutomatedInferenceasaPreservationServiceTheontologyandinferencesthatitsupportsallowus,evenincaseswheremetadatarecordsareterseandincomplete,torecoverimportantdistinctions,suchasthe

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

69

distinctionbetweenapersonandthemetadatarecorddescribingthatperson.Thisknowledgeisexpressedinaportablesyntax(RDF/OWL)withexplicitly‐definedsemantics,soitcanbemaintainedwithouthavingtomodifytheoriginalrecordortransformitintoanothersyntax(eitherofwhichcouldintroducefurtherpreservationrisks).Indeed,BECHAMEL’sabilitytoreadandwritefromRDFdatabases(usingTupelo)meansthatitcanreadmetadatarecords,applyrulesandinfernewassertions,andwritethoseassertionsbacktotheRDFdatabasewithoutalteringtheoriginalrecordsinanyway.The“openworld”ofRDF/OWLmeansthatautomatedinferencecanbecomeapartofthepreservationprocesswithoutrequiringthatweredesignandreimplementinstitutionalrepositoriestoaccommodateit.Instead,inferenceisakindofservicethatcanbeusedalongsidethosetoolstoheadoffpreservationrisksandfillgapsinrepresentation.Thenextsectionlooksmorecloselyatthearchitectureandproof‐of‐conceptimplementationofanarchivethataugmentsaninstitutionalrepositorywithinferencecapabilitiesandservices.

4.4. SystemArchitectureWerespondtothepractice,standardization,andtechnicalproblemspreviouslyoutlinedintwoways:

o First,wedesignoursystemsforaworldwheremetadatawillvarygreatlyintheircompleteness,expressivity,andconsistency.Preservationriskswillarise,andwebuildtoolswiththeaimofamelioratingthoseproblems.

o Second,weproposeanarchitectureforrepositoriesthatwehopewillsupportmoreeffectiveresourcedescriptionandencoding:onethatincludescapabilitiesandservicesthatwillbeneededinthenextgenerationofdigitalcontentmanagementsystems.

4.4.1. Architecture:OverviewTheproposedarchitectureaugmentstypicalinstitutionalrepositoryarchitectureswithtwonewcapabilities:

o Theabilitytomanagenotjustbitstreamsandassociatedmetadata,butalsoassociatedsemantics,expressedinstandardRDFandOWLsyntax.

o Automatedservicesfordetectingand/orcorrectingsemanticambiguityinmetadatadescriptions.

4.4.1.1. Architecture:theTupelomodelTupeloisamiddlewarecomponentprovidingsemanticcontentmanagementfordistributed,heterogeneousapplications.Bymiddleware,wemeanthatTupeloprovidesabstractions(knownascontexts)thatencapsulatedifferentstorageandretrievaltechnologiesfordataandmetadata,includingfilesystems,webservices,

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

70

relationaldatabases,andRDFstores.Bywayofthesecontexts,applicationscanexchangeRDFstatementsandaccessrawoctetstreamsassociatedwiththem.Tupelocanthereforeservethesameroleasacontentmanagementsystem(CMS)orinstitutionalrepository.ButTupelodiffersfromthesesystemsinmakingonlyminimalassumptionsaboutthestructureoftheinformationitmanages,allowingapplicationstoencodethatstructureasexplicitRDFstatements.RDF'sopen‐worldassumptionanduseofUniformResourceIdentifiersmeansthatTupelocanassembledescriptionsfrommultiple,independentsources,evenifthosesourcesarenototherwisecoordinated.Tupelohasoriginallybeendesignedtosupportscienceapplicationswheredataisproduced,processed,andtransformedbymultiplepeopleandsoftwarecomponents.Suchapplicationsrequirepreservationofworkflowtracesandthetrackingofrelationshipsbetweenrawinputandoutputresultsacrossdistributedsystems.Thesesamechallengesariseindigitalpreservation,wherecriticaltransformationsmayoccuroutsideofthecontrolofarepositorysystem,orwithinmetadatawhosesemanticsareknownatonestageoftheprocessandunknownatanother.Suchtransformationsaredistributed,heterogeneousprocesses,andtyingadigitalartifacttotheprocessinwhichitparticipatedrequiresportable,globally‐scopedidentifiersthatcanbemanagedindependentlyoftheprocessitself.RDFusageenforcestheglobalscopeofidentifiersbyusingURIstoidentifynodes.

4.4.1.2. ConnectingBECHAMELtoTupeloOurBECHAMELclientapplicationretrievesanXML‐serializedsubgraphoftherepositorycontentsfromTupeloviaTupelo’sHTTP‐basedclient/serverprotocol,whichisbasedonextendingNokia’sproposedURIQAprotocol(http://sw.nokia.com/uriqa/URIQA.html)ThesubgraphissubmittedtoBECHAMEL,togetherwithsupportingOWLOntologiesandstandardizedRDFSvocabularies(e.g.,DublinCore).NewRDFstatementsandannotationsemergingfromBECHAMEL'sexecution(seetheinferenceexampleinFigures2‐4)arethendeliveredbacktotheTupeloserver.

4.4.1.3. ObservationsonImplementationLikethecharacteristicsoftheTupeloarchitecture,wepredictthatinferentialcapabilities(suchasthoseillustratedearlierinthissection)willbebasicservicesprovidedbyandforfuturedigitalrepositories.Butthefunctionalcomponentsofthoserepositorieswillbelooselycoupledanddistributed.Interpretiveservicesare,furthermore,neededrightawayforsystemsbasedoncurrentContentManagementSystemtechnologies,andtoaidinreformingdescriptivepracticeasitstandstoday.Forallthesereasons,wehavesoughtinourimplementationtomaketheinterpretationcomponentastructurallydistinctlayer,communicatingwiththeTupelomiddlewareviageneral‐purposeclient/serverprotocolssuchasHTTP.Whileweassumetheresourcedescriptionsandinferredknowledgewillconformto

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

71

theRDFabstractmodel,wehavechosentodelivertheminconventionalserializedforms,suchasRDF/XML.Aswithanysimilarproject,avarietyofengineeringchallengesrequirefurtherexperimentationandimprovement.Forexample,attheBECHAMELapplicationlayer,allRDFstatementsareexpressedasifpartofasingleglobalgraph,whetherretrievedfromTupelo,parsedfromanRDFSvocabulary,inferredbyBECHAMELitself,ordrawnfromanyothersource.Butobviously,onlyafiniteamountofinputknowledgecanbeefficientlysharedoverthenetworkbetweenclientandserver.Ourinterpretiverulesarethemselvesseparatefromthestrategiesforselecting,retrieving,andstoringRDFstatements,butpragmaticallytheycannotbetotallyindependentofeachother.

4.5. LessonsLearnedandNextStepsOurresearchcontributioncanbeseenfromoneperspectiveasthetechnicalgroundworkforafuturegenerationofimprovedautomateddigitalpreservationsystemsandmethods.Butonecanalsounderstandourfindingsasopportunitiestoapplyhumanintelligencemoreeffectivelywithexistingtoolsandstandards.Itmightneveroccurtoadigitallibrarianthathispreservationmethodsarebeingexecutedwithoutclearlyidentifiabletargets,orthatasimplechange(suchasdcmitype:Imageinsteadof"image")coulddramaticallyreducetheworkrequiredtocorrectthatproblem.Theexerciseofencodingsemanticknowledgewithenoughclarityandprecisionforacomputerrevealscomplexitiesthatourremarkablehumanmindswouldotherwiseallowustoignore.Withtheaidofthatinsight,muchprogresscouldbemadeinreformingthepracticesthatpromptourdevelopmentandresearch.

4.6. ConclusionInstitutionalrepositoriesandothercurrenteffortsforpreservingdigitalartifactsfacechallengesresultingfromunderspecifiedmetadataschemas,ambiguoususage,andmetadatamodelsthatrelatemoretorepositoryimplementationthantoissuesofmeaning.Theseentailveryrealriskstotheintegrityandusefulnessofpreserveddigitalartifactsastheyarestored,managed,andretrieved.Descriptivepracticesthatseemcorrectmayintroduceinconsistenciesthatareundetectablewithoutmanualinspectionofeachrecord‐‐anunreasonablerequirementforcollectionsofevenmoderatesize.Improvedmetadatastandardsandrepositorymetadatamodelsarepartofthesolutiontotheseproblems,butwealsoseearoleforautomationindetectingandmitigatingpreservationrisks.Ourexperimentalarchivingtechnologies,BECHAMEL

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

72

andTupelo,demonstratethatwecanlocateandcorrectambiguousmetadataexpressionsinthecontextoftransactionssuchasimportandexport.Asbestpracticesevolvefordigitalpreservation,weseereasoningcapabilitieslikethosedemonstratedbyBECHAMELbecominganintegralcomponentofdigitalpreservationsystems,allowingcuratorstotransformlargecollectionswithgreaterconfidencethatrecordswillfaithfullyrepresenttheinformationtheyareintendedtopreserve.ComplementinginteroperabilitymodelslikeECHODEPository'sHubandSpoketoolsuite,webelievethetechniquesdescribedherepointtoanewgenerationofpreservationtools,andrevealwaystouseexistingtoolswithmoresuccess.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

73

5. ANotefromthePIsWhentheUniversityofIllinoisLibraryandtheGraduateSchoolofLibraryandInformationSciencesubmittedtheproposalfortheECHODEPprojecttotheLibraryofCongressin2003,thedigitalpreservationlandscapewasradicallydifferent.Thenumberofwebarchivingtoolswasstillsmall;farfewerinstitutionsthannowhadinstancesofrepositorysoftwareapplicationsintheirlibraries;theproblemspaceofinteroperabilitybetweenrepositoryplatformswasjustgainingground;andtechniquesformigratingthesemanticcontentofdocumentsovertimeandthroughvariousencodingschemeswerestillonthehorizon.TheaccomplishmentsofECHODEPPhase1projects,intheformoftechnicalframeworksandsoftwareapplications,aswellasofpublishedresearchandenduringpartnerships,havecontributedtotheredesignofthislandscapeforthericherandmoresustainable.Forexample,oureffortsatenablingrepositoryinteroperabilityhaveresultedintheregistrationoftheECHODEPGenericMETSProfileforPreservationandDigitalRepositoryInteroperabilitywiththeLibraryofCongress.Becauseofourworkinthisarea,institutionssuchasHarvardUniversity,theArizonaStateLibrary,andtheGeorgiaInstituteofTechnologyhavecontactedustolearnmoreaboutthetechnicalarchitectureissuesinvolvedinourframework.Thesecontactsbespeakknowledgesharingandcommunitybuildingtowardapublicgood–interactionsthatareintegraltothedevelopmentofanetworkedapproachtodigitalcontentstewardship.Anotherbeneficialoutcomehasbeenthepartnershipsthemselves,initiallyestablishedduringPhase1,suchaswithOCLC;IllinoisandOCLCarecollaboratingagaininECHODEPPhase2,thistimeonanamed‐entityextractionandrecognitiontooldevelopmentprojectthatseekstoautomatecreationandextractionofmetadataforpreservationpurposesandcontexts.Indeed,theworkofstartingandsustainingcross‐organizationalcollaborationforaproject’speriodofperformanceshouldnotbeoverlooked.Asourteamshavelearnedintheprocess,effectivecollaborationentails–butisnotlimitedto–layingafoundationforacommunicationinfrastructurethatdrawsonanarrayoftools,suchaswikisandvirtualmeetingapplications;nurturingahealthybalancebetweenencouragementofnewdirectionsinresearchanddevelopmentandmeetingthedeliverablestowhichtheprojectiscommitted;andunderstandingfromthestartthattheoutcomeofoureffortswillonlybeasmeaningfulandsuccessfulasthecollaborationsthemselvesarerichandproductive.TheUniversityofIllinoisisgratefultotheLibraryofCongressforfundingitsdigitalpreservationresearchactivitiesunderNDIIPP.TheworkachievedduringPhase1hasaffordedusagreaterunderstandingofthechallengessurroundingpreservationstrategies,whichwehopetheNDIIPPcommunityatlargewillcontinuetolearnfromanddrawuponinfuturestewardshipendeavors.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

74

6. References

6.1. ArchivingtheWeb:theWebArchivesWorkbench

6.1.1. ResourcesCobb,J.,Pearce‐Moses,R.&Surface,T.(2005).ECHODEPositoryProject.In

Archiving2005:finalprogramandproceedings,April26,2005,Washington,D.C.,(175‐178).Springfield,VA:TheSocietyforImagingScienceandTechnology,2005.RetrieveJuly5,2008,fromhttp://www.ndiipp.uiuc.edu/pdfs/IST2005paper_final.pdf/.

TheISOReferenceModelforOpenDistributedProcessing–AnIntroductionECHODepGenericMETSProfileforPreservationandDigitalRepository

Interoperability.(2005).RetrievedAugust27,2008,fromhttp://www.loc.gov/standards/mets/profiles/00000015.html.

ECHODepMETSProfileforWebSiteCaptures(2006).RetrievedAugust27,2008,

fromhttp://www.loc.gov/standards/mets/profiles/00000016.html.TheECHODEPository:AnNDIIPP‐PartnerProjectoftheUniversityofIllinoisat

Urbana‐ChampaignwithOCLCandtheLibraryofCongress.(n.d.).RetrievedJuly5,2008,fromhttp://www.ndiipp.uiuc.edu/.

“TheISOReferenceModelforOpenDistributedProcessing–AnIntroduction.”

(1996).RetrievedAugust27,2008,fromhttp://www.enterprise‐architecture.info/Images/Documents/RM‐ODP2.pdf.

OCLCDigitalManagementServices.(2008).RetrievedJuly5,2008,fromhttp://www.oclc.org/us/en/services/collection/default.htm.TheNationalDigitalInformationandInfrastructurePreservationProgram.(n.d.).

RetrievedJuly5,2008,fromhttp://www.digitalpreservation.gov/.Pearce‐Moses,R.&Kaczmarek,J.(2005).AnArizonaModelforPreservationand

AccessofWebDocuments.DttP:DocumentstothePeople.33(1),17‐24.RetrievedJuly5,2008,fromhttp://www.ndiipp.uiuc.edu/pdfs/azmodel.pdf/.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

75

Rani,S.,Goodkind,J.,Cobb,J.,Habing,T.,Eke,J.,Urban,R.&Pearce‐Moses,R.(2006).Technicalarchitectureoverview:toolsforacquisition,packaging,andingestofwebobjectsintomultiplerepositories(poster).Openinginformationhorizons:6thACM/IEEE‐CSJointConferenceonDigitalLibraries:June11‐15,2006,ChapelHill,NC,USA:JCDL2006/sponsoredbyACMSIGonInformationRetrieval,ACMSIGonHypertext,HypermediaandtheWeb,IEEETechnicalCommitteeforDigitalLibraries,(360‐360).NewYork:ACM,2006.

WebArchivesWorkbench.(2008).RetrievedJuly5,2008fromhttp://sourceforge.net/projects/webarchivwkbnch/.Webarchiving.(2008,August21).InWikipedia,thefreeencyclopedia.Retrieved

August27,2008,fromhttp://en.wikipedia.org/wiki/Web_archiving.

6.2. RepositoryEvaluationandInteroperability

6.2.1. RepositoryEvaluationDLIFull‐TextJournalCollection.RetrievedApril7,2009,from

http://forseti.grainger.uiuc.edu/pubs/tocdli.asp.HistoricalAerialPhotoImageDatabase.RetrievedApril7,2009,from

http://images.library.uiuc.edu/projects/aerial_photos/.IllinoisDigitalOrthophotoQuarterQuadrangleData.RetrievedApril7,2009,from

http://www.isgs.illinois.edu/nsdihome/webdocs/doq05/.RLG.(2005).Anauditchecklistforthecertificationoftrusteddigitalrepositories.

MountainView,CA:RLG.RetrievedSeptember10,2008,fromhttp://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf.

VincentVoiceAudioLibraryatMichiganStateUniversityLibraries.RetrievedApril7,2009,fromhttp://vvl.lib.msu.edu/showfindingaid.cfm?findaidid=CoolidgeC.

6.2.2. HandSToolsSuiteAllinson,F.,François,S.,&Lewis,S.(January,2008).SWORD:SimpleWeb‐service

OfferingRepositoryDeposit.Ariadne,(54).RetrievedSeptember11,2008,fromhttp://www.ariadne.ac.uk/issue54/allinson-et-al/.

Boyko,A.,Kunze,J.,Littman,J.,&Madden,L.(2008).TheBagItFilePackageFormat(V0.95).RetrievedSeptember15,2008,fromhttp://www.cdlib.org/inside/diglib/bagit/bagitspec.html.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

76

ConsultativeCommitteeforSpaceDataStandards.(2002).ReferenceModelforanOpenArchivalInformationSystem(OAIS).CCSDS650.0‐B‐1.BlueBook.RetrievedSeptember12,2008,fromhttp://public.ccsds.org/publications/archive/650x0b1.pdf.

DLFAquiferMetadataWorkingGroup.(2006).DigitalLibraryFederation/AquiferImplementationGuidelinesforShareableMODSRecords.RetrievedSeptember10,2008,fromhttp://wiki.dlib.indiana.edu/confluence/download/attachments/24288/DLFMODS_ImplementationGuidelines_Version1-2.pdf?version=1.

GlobalDigitalFormatRegistry(GDFR)InformationSite.(n.d.).RetrievedSeptember15,2008,fromhttp://www.gdfr.info/.

Guenther,R.(2008).GuidelinesforusingPREMISwithMETSforexchange.RetrievedSeptember11,2008,fromhttp://www.loc.gov/standards/premis/guidelines-premismets.pdf.

Guenther,R.(2008).BattleoftheBuzzwords:Flexibilityvs.InteroperabilityWhenImplementingPREMISinMETS.D­LibMagazine,14(7/8).RetrievedSeptember12,2008,fromhttp://www.dlib.org/dlib/july08/guenther/07guenther.html.

Habing,T.G.(2005).ECHODepgenericMETSprofileforpreservationanddigitalrepositoryinteroperability.RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/mets/profiles/00000015.xml.

Habing,T.G.(2006).ECHODepMETSprofileforwebsitecaptures.RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/mets/profiles/00000016.xml.

Habing,T.G.(2007).LightweightrepositoryCRUDService(LRCRUDS).RetrievedSeptember10,2008,fromhttp://dli.grainger.uiuc.edu/echodep/hands/LRCRUDS.htm.

JHOVE:JSTOR/HarvardObjectValidationEnvironment.(2007).RetrievedSeptember11,2008,fromhttp://hul.harvard.edu/jhove/.

Kaczmarek,J.,Habing,T.G.,&Eke,J.(2006).Repositorysoftwareevaluationusingtheauditchecklistforcertificationoftrusteddigitalrepositories.InProceedingsofthe6thACM/IEEE­CSjointconferenceondigitallibraries2006,ChapelHill,NC,USAJune11­15,2006.NewYork:AssociationforComputingMachinery.RetrievedSeptember10,2008,fromhttp://doi.acm.org/10.1145/1141753.1141774.

Kaczmarek,J.,Hswe,P.,Eke,J.,&Habing,T.G.(2006).Usingthe‘Auditchecklistforthecertificationofatrusteddigitalrepository’asaframeworkforevaluating

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

77

repositorysoftwareapplications.D­LibMagazine,12(12).RetrievedSeptember10,2008,fromhttp://www.dlib.org/dlib/december06/kaczmarek/12kaczmarek.html.

METS:Metadataencoding&transmissionstandard,officialwebsite(2008).RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/mets/.

MIX:NISOmetadataforimagesinXMLschema,technicalmetadatafordigitalstillimagesstandard,officialwebsite.(2008).RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/mix/.

MODS:Metadataobjectdescriptionschema,officialwebsite.(2008).RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/mods/.

OAI­PMH:OpenArchivesInitiative–ProtocolforMetadataHarvesting.(2008).RetrievedSeptember12,2008,fromhttp://www.openarchives.org/pmh/.

PREMIS:Preservationmetadatamaintenanceactivity,officialwebsite.(2008).RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/premis/.

PREMISWorkingGroup.(2005).Datadictionaryforpreservationmetadata.Dublin,OH:OCLCandRLG.RetreivedSeptember10,2005,fromhttp://www.oclc.org/research/projects/pmwg/premis-final.pdf.

RLG.(2005).Anauditchecklistforthecertificationoftrusteddigitalrepositories.MountainView,CA:RLG.RetrievedSeptember10,2008,fromhttp://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf.

JISC.(2008).SWORD.RetrievedSeptember11,2008,fromhttp://www.ukoln.ac.uk/repositories/digirep/index/SWORD.

textMD:TechnicalMetadataforText,OfficialWebSite.(2008).RetrievedSeptember10,2008,fromhttp://www.loc.gov/standards/textMD/.

TheECHODEPositoryproject.(n.d.).RetrievedSeptember9,2008,fromhttp://ndiipp.uiuc.edu/.

TheApacheSoftwareFoundation.(2008).WelcometoXMLBeans.RetrievedSeptember11,2008,fromhttp://xmlbeans.apache.org/.

UIUCEchodepHubandSpokeFrameworkToolSuite.(n.d.).RetrievedSeptember12,2008,fromhttp://dli.grainger.uiuc.edu/echodep/hands/

6.3. PreservingMeaning,NotJustObjects:SemanticsandDigitalPreservation

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

78

DCMINamespacefortheDublinCoreMetadataElementSet,Version1.1.(2008).RetrievedOctober2,2008,fromhttp://dublincore.org/2008/01/14/dcelements.rdf

DCMITypeSchema.(2008).RetrievedOctober3,2008,from

http://dublincore.org/2008/01/14/dctype.rdfDubin,D.,Sperberg‐McQueen,C.M.,Renear,A.,andHuitfeldt,C.(2003).Alogic

programmingenvironmentfordocumentsemanticsandinference.LiteraryandLinguisticComputing,18(2):225–233.

Habing,T.,Ingram,W.,Cordial,M.,Manaster,R.andEke.J.(2008).Developmentsin

digitalpreservationattheUniversityofIllinois:theHubandSpokearchitectureforsupportingrepositoryinteroperabilityandemergingpreservationstandards.LibraryTrends,57(4),[pagenos.].

Kaczmarek,J.,Hswe,P.,Hauser,L.,andEke.J.(2008).TheWebArchivesWorkbench:

takinganarchivalapproachtothepreservationofWebcontent.LibraryTrends,57(4),[pagenos].

Kaminski.P.(2002).Integratinginformationonthesemanticwebusingpartially

orderedmultihypersets.Unpublishedmaster’sthesis.UniversityofWaterloo.RetrievedSeptember12,2008,fromhttp://www.ideanest.com/braque/Thesis-web.pdf.

OWLWebOntologyLanguage.(2004).RetrievedOctober2,2008,fromhttp://www.w3.org/TR/owl‐features/

RDFSemantics.(2004).RetrievedOctober2,2008,from

http://www.w3.org/TR/rdf-mt/Renear,A.,Dubin,D.,Sperberg‐McQueen,C.M.,andHuitfeldt,C.(2002).Towardsa

semanticsforXMLmarkup.InE.Munson,R.Furuta,andJ.I.Maletic(eds.)Proceedingsofthe2002ACMSymposiumonDocumentEngineering(119‐126).NewYork:ACM.

Renear,A.andDubin,D.(2003).Towardsidentityconditionsfordigitaldocuments.

InS.Sutton,editor,Proceedingsofthe2003DublinCoreConference.UniversityofWashington,Seattle,WA.

Tupelo.(2008).RetrievedOctober2,2008,from

http://dlt.ncsa.uiuc.edu/wiki/index.php/Main_Page.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

79

URIQA.TheURIQueryAgentModel:ASemanticWebEnabler.(2003‐2008).RetrievedOctober2,2008,fromhttp://sw.nokia.com/uriqa/URIQA.html.

ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC

80

7. Appendices

7.1. WebArchivesUserGuide

7.2. WebArchivesWorkbenchImplementationGuide

7.3. AnnotatedTrustedDigitalRepositoryChecklist

7.4. UsingtheAuditChecklistfortheCertificationofaTrustedDigitalRepositoryasaFrameworkforEvaluatingRepositorySoftwareApplications(DLibarticle)

7.5. RepositoryTestingFindings:Narrative

7.6. RepositoryFindingsCommentaryUsingtheAnnotatedTrustedDigitalRepositoryChecklist

7.7. ResourceDescriptionVocabulary:AnOntologyofMetadataDescriptions

7.8. SustainedAccesstoEjournals:ContextValue,andFutureProspectus