
    Library Hi Tech

    Archiving in the networked world: authenticity and integrity

    Michael Seadle

    Article information:

    To cite this document: Michael Seadle, (2012), "Archiving in the networked world: authenticity and integrity", Library Hi Tech, Vol. 30 Iss 3 pp. 545-552

    Permanent link to this document: http://dx.doi.org/10.1108/07378831211266654


    References: this document contains references to 5 other documents.


    REGULAR PAPER

    Archiving in the networked world: authenticity and integrity

    Michael Seadle
    Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Berlin, Germany

    Abstract

    Purpose – This article aims to discuss how concepts from the analog world apply to a purely digital environment, and to look in particular at how authenticity needs to be viewed in the digital world in order to make some form of validation possible.

    Design/methodology/approach – The article describes authenticity and integrity in the analog world and looks at how to measure them in a digital environment.

    Findings – Authenticity in the digital world generally means, in a purely technical sense, that a document’s integrity has been checked using mathematical algorithms against other copies on independently managed servers, and that provenance records show that the document has a clearly established succession from a clearly defined original. Readers should recognize that this is different than how one defines authenticity and integrity in the analog world.

    Originality/value – Most of the key issues surrounding digital authenticity have not yet been tested, but they will be when the economic value of an authentic digital work reaches the courts.

    Keywords Archiving, Digital preservation, Digital documents, Electronic publishing, Information technology, Preservation, Computer networks, Digital libraries

    Paper type Research paper

    Introduction

    Authenticity and integrity are concepts that lie at the heart of long-term archiving, whether digital or analog. Essentially every long-term archiving system claims to pay attention to guaranteeing the authenticity of content. Portico, for example, lists authenticity as one of the “key goals of digital preservation” on its web site:

    Authenticity – the provenance of the content must be proven and the content an authentic replica of the original (Portico, 2012).

    While the goal of maintaining authenticity is clear, the steps for achieving the goal are not, despite long years of discussion. This article does not pretend to a systematic analysis of the literature in our field about authenticity, but begins with highlights from a few key figures to show the progression of thought on the topic.

    A number of authors refer to Clifford Lynch’s (2000) report about authenticity as the defining work. In fact he is careful to point out the problems with our attempts at a precise definition and why they tend to fail:

    This distrust of the immaterial world of digital information has forced us to closely and rigorously examine definitions of authenticity and integrity – definitions that we have historically been rather glib about – using the requirements for verifiable proofs as a benchmark. As this paper will demonstrate, authenticity and integrity, when held to this standard, are elusive properties. It is much easier to devise abstract definitions than testable ones. When we try to define integrity and authenticity with precision and rigor, the definitions recurse into a wilderness of mirrors, of questions about trust and identity in the networked information world (Lynch, 2000).

    The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

    Received June 2012. Revised June 2012. Accepted June 2012.

    Library Hi Tech, Vol. 30 No. 3, 2012, pp. 545-552. © Emerald Group Publishing Limited, 0737-8831. DOI 10.1108/07378831211266654

    Seamus Ross (2002) argues that the community needs “... to investigate the role trust plays in authenticity and integrity of digital objects.” He goes on to raise serious questions about the role of trust in establishing authenticity:

    In many instances users and preservers establish authenticity on the grounds of trust in the organization involved or technology used in the preservation of the digital object. The current understanding of the major factors that drive trust decisions in the digital world, as well as the risks involved with having and implementing this sort of trust is limited (Ross, 2002).

    Instead of relying on trust, Reagan Moore and MacKenzie Smith (2007) link claims of both integrity and authenticity to a verification process:

    A trustworthy preservation environment is one in which all assertions on integrity and authenticity have been verified within a specified time period. The verification process must be repeated periodically. Preservation environments are live systems that require constant appraisal and validation.
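    The idea that every assertion must have been verified within a specified period can be sketched as a simple bookkeeping check. This is only an illustration of the principle: the `Assertion` record, the `overdue` function, and the 90-day window are assumptions made for the example and do not come from Moore and Smith.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Assertion:
    """One integrity or authenticity claim about a stored object (hypothetical record)."""
    object_id: str
    claim: str
    last_verified: datetime


def overdue(assertions, now, window):
    """Return the assertions whose last verification is older than the window."""
    return [a for a in assertions if now - a.last_verified > window]


# Illustrative data: one recently verified object and one that has been neglected.
now = datetime(2012, 6, 1)
records = [
    Assertion("article-1", "checksum matches", datetime(2012, 5, 20)),
    Assertion("article-2", "checksum matches", datetime(2011, 12, 1)),
]
stale = overdue(records, now, window=timedelta(days=90))
print([a.object_id for a in stale])
```

    An archive following this principle would rerun the verification for every object the check reports and record the new date, so that no assertion is ever older than the chosen window.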

    This is also the position that David Rosenthal (2011) takes:

    Note that because each LOCKSS box collects content independently, and then audits the content against the other LOCKSS boxes with the same content, we have the kind of automated authenticity check discussed earlier by Howard Besser and Seamus Ross, albeit only for static Web content (Rosenthal, 2011).

    This article will discuss how concepts from the analog world apply to a purely digital environment, and look in particular at how authenticity needs to be viewed in the digital world in order to make some form of validation possible without resorting either to trust or to Cliff Lynch’s “wilderness of mirrors”.

    The meaning of authenticity

    Authenticity in the physical world implies genuineness, and with physical objects a sense that it is the actual original, but the concepts of authentic, genuine, and original grow less clear the more closely an object is examined. Is, for example, a contemporary printed copy of Charles Dickens’ novel “Oliver Twist” authentic? Likely it contains many of the original words, but some words and phrases from the initial publication were corrected in later editions. The format and type-fonts would be different than the original, which appeared in serial form in periodicals. Notes may have been added.

    In one sense, the true authentic version of the novel might be the one that Dickens wrote by hand. In another sense, the genuine original might be the one first made available to the public. In a third sense, any subsequent edition is arguably authentic if it faithfully reproduces the author’s intent – in so far as that is known. Even in the physical world the concept of authenticity rapidly becomes open-ended once it is divorced from a specific object: this work in this version at this time.

    Mutability and provenance

    Authenticity in the analog world relies, in part, on the relative immutability of physical materials such as print on paper or paint on canvas. This immutability is only partially reliable. The block of a book could be taken apart and individual pages replaced with altered texts printed on old paper using old ink and a hand-operated press. This does not happen in part because many copies of most important works exist, and because no obvious market for faked or altered versions exists. The situation for paintings is different because of the high market value of works by certain artists.

    Authenticity in the physical world generally relies on the chain of provenance, for example on evidence that a particular printed book passed from its creator (in this case the publisher, not the author) to the vendor (a reliable bookstore or, in the case of libraries, a company like Harrassowitz or Yankee) to the library itself. For most contemporary library books this chain can usually be documented with little trouble.

    Older and rarer materials in “special collections” departments may have less clear provenance. The source of a valuable book picked up in a reputable used bookstore may be traceable. The provenance of a book bought at an auction or from an individual collector may be harder to establish. In fact, however, few librarians worry about the authenticity of even their most valuable works. A strong assumption exists that a printed book is what it claims to be and thus far there is little evidence to suggest that this assumption is wrong.

    The situation with medieval or ancient manuscripts is far more complex, because provenance is much harder to establish. For many manuscripts no original exists and paleographers must often struggle to reconstruct a virtual original by comparing manuscripts. Copying by hand was prone to error and subject to willful changes on the part of the person doing the copying.

    Authenticity matters whenever there is a market or other (for example, intellectual) value in having a clearly defined original. The situation for paintings shows that clearly, because most paintings exist in only one physical original, which may have such high market value that the effort to create imitations becomes worthwhile. The market for “undiscovered” paintings by famous artists has a long history, and temptation among collectors often overcomes their desire for reliable proofs of the provenance – even proofs can, of course, also be faked. Even an original painting with a well-established provenance may not be authentic, in the sense that time has aged its colors and well-meaning restorers may have introduced changes not present in the original. In this case the physical object may be authentic, but perhaps not the image as the painter saw or intended it.

    Style has proven to be a particularly unreliable basis for authenticity judgments, because style can be imitated. The general rule of thumb is that fakes become more obvious over time because of subtle anachronisms that are culturally invisible at the time of creation. Provenance has been more reliable, but the actual chain of provenance for older works can be lost for legitimate reasons: wars, thefts, fires, and the like.

    Authenticity problems with books tend to involve plagiarism or mass copying. An example comes from the market in China for illegal copies of English-language textbooks from US publishers. Sometimes the content is genuinely what the US edition has, and the main problem is that the rights owner does not get paid. More often the copies contain inaccuracies or are missing features. Sometimes the cover claims to be a current edition, but the contents come from earlier versions. In these cases the librarian’s usual authenticity tests (title page, for example) fail. This problem is neither new nor limited to Asia. In the nineteenth century US publishers notoriously copied British (and other) best-selling works, often with inaccuracies that embarrassed the original authors. Some of these works are in libraries today, though generally labeled as fakes.

    Digital authenticity

    In the digital world, there are no originals, only copies, and the mutability of digital objects makes authenticity especially challenging. The web pages that readers see on their screens are not precisely what sits on the server. They are rather copies of code sent in packets via the internet and rendered according to sets of rules by the browsers on the client computers. Different browsers may render web pages differently and different computer and screen types will almost certainly render colors differently than the original, unless someone has taken the trouble to calibrate the colors. For text-based works these minor variations probably matter little, but for full-color art the issue is significant. Some variations in paper-based works may exist too, but generally fewer.

    This type of mutability is not even a matter of deliberate changes to content, but it goes to the heart of the question: what is authentic? If an authentic original looks different to different users because of their browsers and screens, what remains that is measurable? The answer, from a computing perspective, is the code.

    The idea of authenticity in the digital world is more closely related to integrity than in the analog world. A digital work that exists measurably unchanged in multiple independent copies possesses a form of integrity that can support its claim to authenticity. A digital object whose integrity is lost through changes could theoretically still be authentic in some meanings of the word in an analog environment, but its authenticity becomes harder to prove.

    Provenance is also a measure of authenticity in the digital world, but one that needs to be judged carefully on the basis of controlled conditions. A digital object on a secure server with a proven record of resistance to viruses and external attacks can make a reasonable claim to authenticity when it comes (with appropriate measures for integrity checking) from another secure server. A digital object sitting on a server that had previously been hacked or had had viruses could transfer an authentic copy – but it could also transfer an infected or altered one. A problematic provenance raises questions, as it would in the analog world, but an integrity test could allay doubts, as long as a secure and genuine comparison copy could be found.

    Authenticity in the digital world can mean an exact copy of a text or image. It could equally well mean that text and image in the original context. An example of this problem comes from a copyright case in which the defendant used an in-line link to display an authentic version of a Dilbert cartoon from the publisher’s web site, but deliberately provided a different context that he felt was more appropriate to the work. Was this the equivalent of hanging an authentic painting in a different museum, or more like altering a portion of a published work? Both answers could be reasonable in the digital world, but the original publisher argued that the altered context created an illegal and thus inauthentic copy (www.cs.rice.edu/~dwallach/dilbert/).

    While no clear and established measures of authenticity exist for digital objects, a reasonable argument can be made that digital objects have a claim to authenticity when their integrity can be measured and can be shown to be the same as other digital content on a secure server, for example that of the original publisher. The same need for integrity should apply to digital proofs of provenance. It also seems reasonable to claim that digital content cannot be considered authentic if its integrity is questionable.

    Integrity is, however, also no simple concept. For this reason it is worth looking at how integrity is judged, first in the analog and then in the digital world.

    Integrity in the analog world

    The integrity of a published book is easy to take for granted. Librarians tend to assume that all of the books in their collections have uncompromised integrity unless they discover that pages are missing or have been damaged beyond readability. Even then the integrity loss is generally considered to be limited to the specifically damaged area. Agreements with other libraries allow them to get copies of the missing or damaged pages and libraries generally have staff who know how to insert (“tip in”) the new pages so that the work is whole again.

    In practice, these integrity checks occur only during use: a reader or a librarian wants to read a work and finds pages missing. No systematic integrity checks take place for works that sit unread in the stacks. If the threat to integrity were to come only from external users, and not from library staff or from environmental factors like insects or mold, this lack of systematic checking might raise no problems. In fact staff do sometimes steal valuable pictures or articles from works in the library, and insects and mold are major problems in moist, warm climates, especially in libraries with limited climate control. Problems with insects and mold can also occur in well-managed Northern libraries, though they are rarer.

    Larger but less frequent disasters such as fire, water, and building collapse damage the integrity of physical works. Fire is rarely as much of a problem as water, since books do not burn readily, but their paper does quickly absorb water. Modern quick-freezing methods can recover damaged works if action is taken fast enough. Some visible damage is still likely, but not generally so much that the integrity of an analog work is considered to be compromised. A lost or missing volume is an integrity problem for a series and for a collection. When a work is misshelved, it can be missing for a long time and may be hard to find. In a large collection, misplaced books become a significant problem. This is especially true in open-stacks libraries where readers sometimes deliberately put volumes in the wrong place in order to hide them for their own use.

    While these integrity problems are serious, other libraries generally duplicate the holdings. Truly unique works (medieval manuscripts, for example) tend to be held in special locations under relatively tight security. They are still vulnerable to major disasters, such as fire, flood, or building collapse (such as occurred in Cologne some years ago). More insidious is the damage to works where librarians assume plenty of other copies exist. Early in JSTOR’s history it discovered that some articles had been stolen from so many libraries that it became difficult to find an undamaged copy. Incautious assumptions about the long-term integrity of print-on-paper works may in fact endanger them. Paper may sit undamaged on a shelf for hundreds of years, but only if the right environmental and usage conditions are in place. Banning readers from ever touching a book may be the most important way to protect the work, since users do most of the intentional and unintentional damage, but that fits poorly with the library mission of making information available.


    Digital integrity

    Integrity in the digital world receives considerable attention, since digital objects are in their nature highly mutable and can change in unexpected ways due to conditions in the storage medium that alter the bitstream. A bitstream is what defines a digital object. It is the sequence of binary 0s and 1s that computers interpret to create images of letters, numbers, and other formatting characteristics on a computer screen. A single change in a bitstream may be as harmless as a printing error in a book. If an ASCII hexadecimal 3A becomes hexadecimal 3B, the visual rendering only changes from “:” to “;”. Such changes may not always be harmless or meaningless, though. “Bit rot” (bits changing due to environmental or media changes) can damage parts of a bitstream in ways that make it impossible for normal programs to render the file correctly on the computer screen. Such changes may appear to destroy a bitstream completely (though modern tools may still be able to recover damaged files) or they may just make it not work in a particular program, or they may only inflict minor, hardly visible, damage.
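    The 3A-to-3B example can be reproduced directly. The following minimal sketch flips single bits in a short ASCII text; the sample sentence and the helper name `flip_bit` are assumptions made for the illustration.

```python
def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 is the least significant)."""
    changed = bytearray(data)
    changed[byte_index] ^= 1 << bit
    return bytes(changed)


original = b"Dear reader: the archive is open."
colon_at = original.index(b":")

# Flipping the lowest bit of ":" (0x3A) yields ";" (0x3B): visible but harmless.
damaged = flip_bit(original, colon_at, 0)
print(hex(original[colon_at]), hex(damaged[colon_at]))
print(damaged.decode("ascii"))

# Flipping the highest bit of the first byte pushes it outside the ASCII range,
# so a strict ASCII reader can no longer decode the text at all.
broken = flip_bit(original, 0, 7)
try:
    broken.decode("ascii")
except UnicodeDecodeError:
    print("no longer valid ASCII")
```

    Both changes involve exactly one bit. Whether the result is a cosmetic slip or an undecodable file depends only on which bit happens to change.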

    When a digital object loses its integrity, a range of problems arises. The simplest is that the object has undergone some change that makes it different than before. This may be as innocent as a marginal comment in a PDF document, the equivalent of a reader writing a pencil note in a book and theoretically equally reversible. The change could also represent some deliberate tampering to censor or change meanings. In some contemporary societies this could be a real problem. There are also hackers who would willingly alter digital versions of documents to disprove the facts of history: those in Iran who claim the Holocaust never happened, for example. Digital objects with these integrity problems remain readable.

    Integrity loss may also make a digital object unreadable. Many librarians view this as the most serious problem, though deliberate changes in meaning – if undetected – are arguably more serious. Readability is not a simple binary state in which a file is or is not readable, but a broad range of conditions and problems that are often solvable with enough time and resources.

    The simple solutions for integrity loss of the latter sort fall into two kinds. The first is to look for a medium that promises stability over long periods such as decades. There is good reason, however, to doubt that any physical medium is a good long-term carrier for digital information, and the concern about having reading devices for contemporary storage media in 100 years is real, though of course the devices could be reengineered at need (cost permitting). A second kind of simple solution is to believe the assurance of commercial firms that say their professional storage systems address all the key integrity problems. The commercial firms do not necessarily lie when they give such assurances, and commercial operations have considerable experience with digital archiving. Nonetheless, the time perspective of commercial firms tends to be shorter than that of librarians – five or ten years rather than 100 – and their tolerance for error greater. As long as the integrity problems remain within an acceptable margin, it is generally not worthwhile for businesses to push for greater assurance.

    The fact is that data left on any contemporary storage medium can rot, that is, can lose integrity. Systematic and frequent checking against multiple copies can detect and avoid loss by replacing damaged files, but this cannot effectively be done for offline storage and it cannot be done without appropriate algorithms. Since proprietary archiving systems tend not to disclose their back-end storage mechanisms fully, librarians who choose them have no choice except to trust their assurances.
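    Checking against multiple copies and replacing damaged files can be sketched in a few lines. The example below is an illustration under stated assumptions, not a description of the LOCKSS protocol or of any product: copies are held in an in-memory dictionary standing in for independently managed servers, SHA-256 is used as the comparison measure, and a strict majority of matching copies is treated as intact.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 digest of a copy, as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: dict) -> list:
    """Overwrite minority copies with the majority version.

    Returns the names of the copies that were repaired. Raises ValueError
    when no version is held by a strict majority of the copies.
    """
    digests = {name: digest(data) for name, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no strict majority; cannot decide which copy is intact")
    good = next(data for name, data in copies.items() if digests[name] == winner)
    repaired = [name for name in copies if digests[name] != winner]
    for name in repaired:
        copies[name] = good
    return repaired


# Hypothetical server names; the third copy has one damaged character.
copies = {
    "server_a": b"text of the article",
    "server_b": b"text of the article",
    "server_c": b"text of the articlf",
}
repaired = audit_and_repair(copies)
print(repaired)
```

    The majority rule is itself an assumption that can fail, for example when most copies descend from one already damaged source. That is one reason the independence of the servers matters as much as their number.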

    Testing standards, or at least insisting on detailed technical answers to questions about integrity checking, could address this problem. It would also help to force commercial firms to prove to what degree they can really guarantee integrity. At the same time everyone should realize that no system can provide 100 percent assurance, only a probability that content will be unchanged. Analog media, such as paper, cannot provide 100 percent integrity guarantees over time either: every object is liable to threats.

    Digital measures of integrity

    Digital integrity needs an appropriate measure. Generally systems use checksums or hashes to give a reasonable approximation of whether two files are identical. Checksums, as the name suggests, add up the values of the bytes or bits in a file or part of a file. The checksum from a file ought to be identical with that of its copy. Any change indicates an integrity loss. The internet uses checksums to detect packet damage. Not all checksum algorithms will necessarily detect a simple situation where two bits have flipped, but most bit-rot problems and almost any deliberate alteration of the digital object tend to create changes on a larger scale, making checksums an effective means of integrity assurance.
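    The two-bit case is easy to demonstrate. The sketch below compares a deliberately simple additive checksum with a cryptographic hash; the 16-bit modulus and the choice of SHA-256 are assumptions made for the illustration, since the text does not name particular algorithms.

```python
import hashlib


def additive_checksum(data: bytes) -> int:
    """Sum of all byte values modulo 2**16: a deliberately simple checksum."""
    return sum(data) % 65536


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


original = b"AB: authentic copy"

# Flip the lowest bit of the first two bytes: "A" (0x41) becomes "@" (0x40)
# and "B" (0x42) becomes "C" (0x43). One byte loses 1, the other gains 1.
damaged = bytes([original[0] ^ 1, original[1] ^ 1]) + original[2:]

print(damaged)
print("checksums equal:", additive_checksum(original) == additive_checksum(damaged))
print("hashes equal:", sha256_hex(original) == sha256_hex(damaged))
```

    The two flips cancel in the sum, so the simple checksum reports no change, while the hash differs. A hash is also far harder to satisfy with a deliberately crafted alteration, which is a reason to prefer one where tampering, and not only accidental damage, is a concern.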

    The major problem with using checksum and hash algorithms as integrity measures is that they are either/or in character: a file either has integrity according to the measure, or it does not. This means that integrity in the digital world has none of the gray-area uncertainties and flexibilities that are a common feature in judging analog integrity, which offers advantages (clarity) but also problems (rejecting files with minor problems).

    Waiting to see whether a standard program can open a digital file is another means of integrity checking at a crude level. Today virtually any word processing program can open a pure ASCII file and would detect changes only if the ASCII encoding itself changed into non-ASCII characters. In other words, deliberate additions or deletions with correct encoding would pass unnoticed, which they would not with a checksum. Minor damage may also go undetected if a human did not scan the whole file on the screen.
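    The difference between the "does it open" test and a hash comparison can be made concrete. In this minimal sketch the function `opens_as_ascii` stands in for a program that accepts any pure ASCII file; the sample sentences are invented for the illustration.

```python
import hashlib


def opens_as_ascii(data: bytes) -> bool:
    """Crude check: does the content decode as pure ASCII?"""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


original = b"The vote was not unanimous."
altered = b"The vote was unanimous."                # deliberate deletion, still valid ASCII
corrupted = original[:4] + b"\xff" + original[5:]   # one byte outside the ASCII range

reference = hashlib.sha256(original).digest()
for label, data in [("altered", altered), ("corrupted", corrupted)]:
    same_hash = hashlib.sha256(data).digest() == reference
    print(label, "| opens:", opens_as_ascii(data), "| hash matches:", same_hash)
```

    The altered text opens without complaint although its meaning is reversed, and only the hash comparison flags it. The corrupted text fails both tests.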

    One of the considerations for new measures of integrity would be to separate the equivalent of a penciled note in a margin from more substantial changes. This is conceptually difficult, since it involves an element of human judgment about the amount of the change. Potentially, a system could isolate the changed area and compare other parts of a file to check integrity. Also potentially, an intelligent system could attempt to judge the significance of a change to give a qualified measure of integrity.
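    One possible way to isolate a changed area is to hash a file in fixed-size blocks and compare the blocks one by one. The sketch below is an assumption about how such a measure could look, not an established standard; the 16-byte block size and the function names are chosen only for the illustration.

```python
import hashlib


def block_digests(data: bytes, block_size: int = 16) -> list:
    """SHA-256 digest of each consecutive block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(reference: bytes, candidate: bytes, block_size: int = 16) -> list:
    """Indexes of blocks that differ; a block present in only one file counts as changed."""
    ref = block_digests(reference, block_size)
    cand = block_digests(candidate, block_size)
    longest = max(len(ref), len(cand))
    return [i for i in range(longest)
            if i >= len(ref) or i >= len(cand) or ref[i] != cand[i]]


def intact_fraction(reference: bytes, candidate: bytes, block_size: int = 16) -> float:
    """Share of blocks that match: a graded instead of a yes/no measure."""
    total = max(len(block_digests(reference, block_size)),
                len(block_digests(candidate, block_size)), 1)
    return 1 - len(changed_blocks(reference, candidate, block_size)) / total


reference = bytes(range(64))                              # four blocks of 16 bytes
candidate = reference[:20] + b"\x00" + reference[21:]     # byte 20 overwritten
print(changed_blocks(reference, candidate))
print(intact_fraction(reference, candidate))
```

    Fixed-size blocks only localize changes made in place. An insertion or deletion shifts every later block, so this sketch would report the rest of the file as changed even though its content is intact. It also says nothing about the significance of a change, which remains a matter of judgment.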

    Conclusion

    Authenticity in the digital world generally means, in a purely technical sense, that a document’s integrity has been checked using mathematical algorithms against other copies on independently managed servers, and that provenance records show that the document has a clearly established succession from a clearly defined original. Readers should recognize that this is different than how one defines authenticity and integrity in the analog world. Digital integrity is not therefore better or worse, but it is different and the differences need to be understood.


    Librarians, archivists, museum staff and others who deal regularly with authenticity issues in the analog world may well want digital definitions of authenticity and integrity to approximate analog practices more closely. This is not realistic. The circumstances in the analog and digital worlds are simply different and new authenticity standards and practices need to be established for digital content.

    This does not mean that the digital definition of authenticity needs to remain inflexibly binary: yes, authentic, no, not authentic. Shades of meaning are possible where certain kinds of integrity flaws or possibly provenance losses occur, but these conditions need a measurable (calculable) digital definition, where, for example, one could explain time-gaps or compare copies by stripping back the layers of notes and comments. Whether this is possible depends on the software and the nature of the change.
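    Stripping back a layer of notes before comparing copies can be sketched for a very simple case. The example assumes a plain-text document in which a reader's annotations sit on their own lines and begin with a marker; the `NOTE:` marker and the three labels returned by `classify` are hypothetical and would differ for real formats such as PDF.

```python
import hashlib

NOTE_PREFIX = "NOTE:"  # hypothetical marker for a reader's annotation line


def text_digest(text: str) -> str:
    """SHA-256 digest of the text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def strip_notes(text: str) -> str:
    """Drop annotation lines, leaving the underlying text."""
    return "\n".join(line for line in text.splitlines()
                     if not line.startswith(NOTE_PREFIX))


def classify(reference: str, candidate: str) -> str:
    """Return 'identical', 'annotated' (same text under the notes), or 'altered'."""
    if text_digest(candidate) == text_digest(reference):
        return "identical"
    if text_digest(strip_notes(candidate)) == text_digest(strip_notes(reference)):
        return "annotated"
    return "altered"


reference = "First line.\nSecond line."
annotated = "First line.\nNOTE: check this date\nSecond line."
altered = "First line.\nSecond lime."

for copy in (reference, annotated, altered):
    print(classify(reference, copy))
```

    The middle category is the digital counterpart of the penciled note: the copy fails a plain hash comparison, yet the text underneath the annotation layer is demonstrably unchanged.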

    It may also be that in the digital world shades of authenticity and integrity do not matter to the extent that they do in the analog world, because a genuine document with the digital equivalent of a pencil note simply becomes a new document that is authentically itself, and not inauthentically something older. Most of the key issues surrounding digital authenticity have not yet been tested, but they will be when the economic value of an authentic digital work reaches the courts.

    References

    Lynch, C. (2000), “Authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust”, Council on Library and Information Resources, Washington, DC, available at: www.clir.org/pubs/reports/pub92/lynch.html (accessed May 28, 2012).

    Moore, R. and Smith, M. (2007), “Automated validation of trusted digital repository assessment criteria”, Journal of Digital Information, Vol. 8 No. 2, available at: http://dspace.mit.edu/bitstream/handle/1721.1/39091/Moore-Smith.htm?sequence=1

    Portico (2012), “Preservation approach: digital preservation defined”, available at: www.portico.org/digital-preservation/services/preservation-approach

    Rosenthal, D.S.H. (2011), “How few copies?”, DSHR’s Blog, available at: http://blog.dshr.org/2011/03/how-few-copies.html (accessed May 28, 2012).

    Ross, S. (2002), “Position paper on integrity and authenticity of digital cultural heritage objects”, DigiCULT: Integrity and Authenticity of Digital Cultural Heritage Objects, Vol. 1, available at: www.digicult.info/downloads/thematic_issue_1_final.pdf

    Corresponding author

    Michael Seadle can be contacted at: [email protected]
