
Page 1: 1 Gerhard Schneider – Rechenzentrum der Universität Freiburg Aspects of Long Term Preservation of Digital Libraries gerhard.schneider@rz.uni-freiburg.de.


Aspects of Long Term Preservation of Digital Libraries

gerhard.schneider@rz.uni-freiburg.de

Gerhard Schneider
Computing Centre & CS Department
University of Freiburg


Storage on Paper

• Longevity of the media
  – paper lasts for centuries with no special care
  – except perhaps for acid in the paper, water from burst pipes, fire, etc.
• Longevity of the description language
  – except perhaps for Old English or the old German alphabet
  – in abstract terms: decoding is possible as long as related information is available
  – how about old Assyrian writings?
• Loss of information is a well-known phenomenon
  – loss of old information is not so relevant to current society
    • e.g. the 5th book of Aristotle
  – loss of new information is more or less impossible because knowledge is distributed to many places
    • thanks to Gutenberg


Storage on Paper

• Accessibility of printed information
  – no special device is needed, except perhaps glasses
  – no technical knowledge is required: “hands on”
• The handling of knowledge distribution has been outsourced to publishers
  – economically very successful, so successful that we can no longer afford to buy the books we wrote
• Long-term storage of information has been centralised in libraries
  – high running costs
    • library building, maintenance, staff required to manage books
  – the cost of storage may by far exceed the cost of acquisition


Storage on Paper

• If you don’t live close to the library, accessing information can be very difficult (3rd world countries)
• A rather costly machinery has been set up to ease the problem
  – long-distance inter-library loans
    • staff intensive, cost of transportation
    • photocopies of articles vs. copyright
    • now: scanning articles and delivery via fax (sic!) or email
    • the user is charged a nominal fee: nominal w.r.t. the cost of operation, not w.r.t. the user’s own budget
• Information is produced electronically
  – most features are lost when the information is brought to paper
• It is only natural that scientists are asking for electronic libraries, given all the benefits


Electronic storage

• There are a few pitfalls when it comes to digital storage
  – can you still read your old 5 1/4" floppies?
• Do you still have a device to read them?
  – Well-known problem in other areas:
    • record players are rare these days.
• And if so, is there still anything on them?
  – Magnets can erase information, and each information bit is a little magnet interfering with the others
  – well-known phenomenon in other areas of magnetic storage as well
    • music cassettes, tape recorders, video tapes
• Solution: digitally stored information can be copied to new media without any loss!
  – This is the problem the old-fashioned industry is now facing w.r.t. CD writers
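The claim that a digital copy loses nothing can be checked mechanically. The following is a minimal sketch, not part of the original talk: it copies a file and compares SHA-256 checksums before and after. The function names `sha256_of` and `migrate` and the file names are hypothetical, chosen only for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, target: Path) -> str:
    """Copy source to target and fail loudly if the copy differs."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"copy of {source} does not match the original")
    return after


if __name__ == "__main__":
    # Hypothetical stand-ins for an old and a new storage medium.
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "old_medium.bin"
        new = Path(tmp) / "new_medium.bin"
        old.write_bytes(b"digital library content")
        print("verified copy, sha256 =", migrate(old, new))
```

Storing the checksum next to the file would let every later migration be verified against the first one, not only against its immediate predecessor.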


Electronic storage

• Thus in principle we have a solution to the media problem:
  – keep converting
  – conversion can be done in a fully automated way, using robots
  – the technology is available in most computer centres and is used for automated backup and archiving
  – typical archive software recycles overused tapes: it copies the information onto new tapes and ejects the old ones
• Interpretation of the contents
  – bits carry no real information; interpretation by software is required before it can be presented to the human eye/ear
• New problem: convert the software that was used to generate the information.
  – Well-known problem: word processors can’t read old files


Format issues

• What do the bits mean?
  – Simple but good example: TeX
    • Information and control commands are stored in plain ASCII
    • The functionality of the control commands is exactly described in the TeX manual
    • So, if you sit on an island with nothing but the bits and the TeX manual, you can find out what the paper is supposed to look like
  – Try this with MS-Word or, even better, MS-PowerPoint
• Putting data into electronic libraries only makes sense if the format is 100% specified
  – Whether this description has to be in the document or in an accompanying file is of secondary interest.
  – Keep it simple: the original document should be understandable even if additional structure information gets lost (or is difficult to retrieve)
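To make the TeX argument concrete, here is a small sketch (an illustration, not the author's tooling) that checks whether a document is pure 7-bit ASCII and lists the control sequences it uses, i.e. exactly the entries one would have to look up in the TeX manual. The sample text and the function names are made up for this example.

```python
import re

# A TeX control sequence is a backslash followed either by a run of letters
# (control word) or by one single non-letter character (control symbol).
CONTROL_SEQUENCE = re.compile(r"\\(?:[A-Za-z]+|[^A-Za-z])")


def is_plain_ascii(data: bytes) -> bool:
    """True if every byte decodes as 7-bit ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


def control_sequences(data: bytes) -> list[str]:
    """Sorted list of the distinct control sequences in an ASCII TeX source."""
    text = data.decode("ascii")
    return sorted(set(CONTROL_SEQUENCE.findall(text)))


# Hypothetical plain TeX fragment.
sample = rb"\centerline{\bf Title}\par Caf\'e prices rose 5\% this year.\bye"

if __name__ == "__main__":
    print("plain ASCII:", is_plain_ascii(sample))
    for name in control_sequences(sample):
        print("look up in the manual:", name)
```

No comparable few-line inventory is possible for a binary format whose specification is not published, which is the point of the slide.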


Format issues

• Example: the Kodak imaging software in MS-Windows allows the annotation of TIFF files
  – The annotations can only be read with the Kodak software
  – Or the annotation can be added permanently to the document, thus making it visible (and not removable) in any other TIFF reader.
• Text formats are precise, i.e. we know what has been typed
• Image formats are different, as information is lost during the scanning process
  – By the lens itself
  – By the sensor (300 dpi means that only 300 dots per inch are stored)
  – By the storage format (i.e. do we get back what we stored?)
    • Lossy vs. faithful
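A short worked example of what a sensor resolution means in practice. It assumes an A4 page (210 mm × 297 mm), which the slides do not mention, and computes how many dots a scan keeps and how much space they take uncompressed. The function names are illustrative.

```python
MM_PER_INCH = 25.4


def scan_dimensions(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Number of dots stored horizontally and vertically at a given resolution."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def raw_size_bytes(width_mm: float, height_mm: float, dpi: int, bits_per_dot: int) -> int:
    """Uncompressed size of the scan in bytes."""
    width, height = scan_dimensions(width_mm, height_mm, dpi)
    return width * height * bits_per_dot // 8


if __name__ == "__main__":
    A4 = (210, 297)  # assumed page size in millimetres
    for dpi in (100, 200, 300):
        dots = scan_dimensions(*A4, dpi)
        bilevel = raw_size_bytes(*A4, dpi, bits_per_dot=1)
        colour = raw_size_bytes(*A4, dpi, bits_per_dot=24)
        print(f"{dpi} dpi: {dots[0]} x {dots[1]} dots, "
              f"{bilevel} bytes black/white, {colour} bytes 24-bit colour")
```

Everything on the page that is finer than one dot at the chosen resolution is gone before any storage format is involved.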


Format issues

• What does “lossy” mean?
  – We do not get back every piece of information that we stored, which sounds scary
  – Did we see it in the first place? Is fax lossy? (fax = 100 dpi or 200 dpi)
  – Analog recording vs. CD vs. MP3
    • CD is a lossy process, but what is really lost?
    • MP3 is good enough, even for the young generation.
  – What we lose depends on the algorithm
    • Doctor’s scare: “vital X-ray data is lost” is completely wrong
  – Why lossy? Keeping the original information needs too much space and does not give any gain in knowledge.
    • In addition, “writing things down” is already a lossy process.
• “Lossy” does not imply that we lose more and more information over time
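The last point can be illustrated with a toy model. The sketch below uses plain quantisation (rounding each sample to a multiple of a step size) as a stand-in for a lossy encoder. This is an assumption made for illustration only, real codecs such as MP3 are far more elaborate. The loss happens once, at encoding time. Copying the stored result, or even quantising it again with the same step, loses nothing further.

```python
def quantize(samples: list[int], step: int) -> list[int]:
    """Toy lossy step: snap every sample to the nearest multiple of step."""
    return [step * round(sample / step) for sample in samples]


original = [3, 14, 27, 92, 66, 38]   # made-up sample values
once = quantize(original, 10)        # information is lost here
copied = list(once)                  # exact copy to a "new medium"
twice = quantize(copied, 10)         # same encoder applied again

if __name__ == "__main__":
    print("original:", original)
    print("stored:  ", once)
    print("further loss after copy and re-encode:", twice != once)
```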


Complexity of storage

• When it comes to electronic media, we tend to ask for overkill, forgetting that we cannot do anything like that on paper
• When moving to the paperless office (my office does!!)
  – after having solved the format issue in favour of RTF and TIFF, how do we store the documents?
  – We use the filesystem and nothing else, as it reflects the current structure of an office pretty well.
  – Thus we are independent of the operating system and the management software
    • all I need are long filenames and a tree structure, possibly access rights
• Thus we can get quite far before running into another really hard problem
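A filesystem-only archive can be inventoried with nothing but the standard library. The sketch below walks a directory tree and identifies RTF and TIFF files by their leading bytes (`{\rtf` for RTF, `II*\0` or `MM\0*` for TIFF) instead of trusting file extensions. The folder and file names are invented, and the inventory layout is only one possible choice.

```python
import tempfile
from pathlib import Path

# Leading bytes that identify the two formats named on the slide.
SIGNATURES = {
    "RTF": (b"{\\rtf",),
    "TIFF": (b"II*\x00", b"MM\x00*"),  # little-endian and big-endian TIFF
}


def sniff_format(path: Path) -> str:
    """Identify a file by its first bytes, independent of its name."""
    with path.open("rb") as handle:
        head = handle.read(8)
    for name, prefixes in SIGNATURES.items():
        if head.startswith(prefixes):
            return name
    return "unknown"


def inventory(root: Path) -> dict[str, str]:
    """Map every file below root (relative path) to its detected format."""
    return {
        path.relative_to(root).as_posix(): sniff_format(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "contracts" / "2001").mkdir(parents=True)
        (root / "contracts" / "2001" / "lease agreement.rtf").write_bytes(b"{\\rtf1\\ansi Hello}")
        (root / "scans").mkdir()
        (root / "scans" / "letter.tif").write_bytes(b"II*\x00\x08\x00\x00\x00")
        for name, kind in inventory(root).items():
            print(f"{kind:8} {name}")
```

Files reported as "unknown" are the ones that would need attention before the archive can be considered fully specified.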


Software issues

• In a multimedia environment, it may not be enough to convert the media; the software has to be recompiled
  – a standard job in science: whenever a new computer architecture appears, just recompile and run.
  – Most scientific software has little sophisticated I/O
  – what happens if the software is intimately married to the underlying operating system
    • like Word to Windows???
  – Can we really afford to store our information in proprietary systems?
    • i.e. systems which we cannot look into?
• Use system-independent data storage
  – even if a loss of information occurs
    • don’t put this information in in the first place....


Live documents

• After all we want live documents
  – query and retrieval
    • how many libraries are locked into old-fashioned systems because their data cannot be converted?
  – hyperlinks
  – computer games
  – simulation
• upgrades to new versions on new operating systems are upward compatible, hopefully
• a manufacturer may decide NOT to move to a new platform
  – make as much money as possible and vanish
• reimplementation may not be a solution
  – incompatibilities, copyright issues, errors become historic features


Solution

• Why not specify the programming environment along the lines of the file format discussion?
• Use Java!
  – Port the Java engine to a new environment and you are “done”
• Unfortunately:
  – Users like their own programming environment
  – Environments are made for performance (databases)
  – And not for long-term storage
• So we have to face the real world


Solutions??

• Keep a museum of running machinery?
• Emulation?? (Idea of Rothenberg, RAND Corp.)
• During a phase of transition emulators are typically available
• Example: Lots of games were available for the C64 and are still kept (collected) in libraries, without a working environment
• Emulators are available: CCS64 v 1.09 runs under Windows


Emulators under Windows

• Sinclair ZX Spectrum
• Atari emulator
• Even emulators for modern PalmPilots


Even more emulators

• Even for Sony’s PlayStation, there is an emulator under Win98
• There is a Game Boy emulator for the Palm, running in the Windows emulator of the Palm, which runs…


What about Windows?

• Running NT under Linux on an Intel machine
• Or: running Linux under NT on an Intel machine
• Or: running NT under Windows XP


What about other hardware?

• Emulate Windows on other hardware (Macintosh):


Observation

• Many software developers use emulators to cross-compile applications for new environments
• Thus emulators do exist in most environments
• Can we obtain them from the manufacturers?
  – Copyright issues
  – company secrets
  – maybe enforce a deposit of software emulators in a safe??
    • For later use?


Using emulators

• Emulators typically store an environment in one special file
• Application example (tested) for VMWARE
  – install Windows 98 in a VMWARE box
    • keep the resulting file as a reference installation
  – install one computer game (or one programme setup) under a copy of the reference installation
  – store the resulting file in a digital library with the name of the game as metadata
  – to play the game, start your computer (either NT, Win2k or Linux), start VMWARE with that specific file and .... Play!
    • the file can be exchanged between operating systems
  – to convert the file from one storage medium to another, use the standard process
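The file-handling part of this workflow can be sketched as follows. This is a hedged illustration only: it treats the emulator image as an opaque file, calls no VMWARE interface, and the installation of the game inside the emulator remains a manual step between the two functions. The function names, file names and the JSON sidecar layout are assumptions, not something described in the talk.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of the image so later media conversions can be verified."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clone_reference(reference: Path, workdir: Path, title: str) -> Path:
    """Copy the reference installation; the game is installed into the copy."""
    working_copy = workdir / f"{title}{reference.suffix}"
    shutil.copyfile(reference, working_copy)
    return working_copy


def deposit(image: Path, library: Path, title: str) -> Path:
    """Store the finished image in the library with the title as metadata."""
    library.mkdir(parents=True, exist_ok=True)
    stored = library / image.name
    shutil.copyfile(image, stored)
    record = {
        "title": title,
        "file": stored.name,
        "bytes": stored.stat().st_size,
        "sha256": sha256_of(stored),
    }
    sidecar = library / f"{stored.name}.json"
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        reference = root / "win98_reference.img"   # hypothetical image file
        reference.write_bytes(b"reference installation")
        copy = clone_reference(reference, root, "Some Game")
        # ... here the game would be installed inside the emulator ...
        print(deposit(copy, root / "library", "Some Game").read_text(encoding="utf-8"))
```

The reference installation is never modified, so it can be reused for every further title.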


Using emulators

• At some stage, PC technology will die. Very likely there will be an emulator for the old-fashioned PC on the new hardware, at least for a limited time.
• During this time, set up a scheme to use that emulator to run your favourite operating system and install your favourite emulator under the emulated environment.
• If this works, continue to use all the old files
• If it fails, some development has to be carried out
  – money on such projects is wisely spent: one local solution is a solution for the whole world
• Performance is not an issue!


Performance of emulators

• Machines get faster
  – VMWARE loses roughly a factor of 2, so on an 800 MHz machine it appears as if the original code were running on a 350 MHz machine
    • we will thus keep even the original “feeling” of the software
    • for some time, before machines get faster
• Experience: a whole server setup can be run under emulators
  – VMWARE even has network and USB connections
  – a complete digital library system, when installed under VMWARE, can be kept in one (huge) file and preserved for the future, at least for a limited time
    • which is better than losing it right away
  – The hardest part is to convince a sysadmin not to use the real machine
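The numbers on this slide can be turned into a small calculation. The sketch assumes that the slowdown factors of nested emulators simply multiply, which is a simplification and not a measured result. It also shows that the 800 MHz to 350 MHz example corresponds to a factor somewhat above 2. The function names are illustrative.

```python
from math import prod


def implied_slowdown(host_mhz: float, apparent_mhz: float) -> float:
    """Slowdown factor implied by a host clock and the apparent guest clock."""
    return host_mhz / apparent_mhz


def apparent_clock_mhz(host_mhz: float, slowdowns: list[float]) -> float:
    """Apparent guest clock after one or more nested emulation layers.

    Assumption: the slowdown factors of the layers multiply.
    """
    return host_mhz / prod(slowdowns)


if __name__ == "__main__":
    print("factor implied by the slide:", round(implied_slowdown(800, 350), 2))
    print("one layer, factor 2: ", apparent_clock_mhz(800, [2]), "MHz")
    print("two layers, factor 2:", apparent_clock_mhz(800, [2, 2]), "MHz")
```

Under this assumption each additional emulation layer halves the apparent speed, so every doubling of host speed pays for one more layer.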


Using emulated environments even today

• A typical “library loan” requires the retrieval of the software and the handing over to the customer
  – customer may lose parts of the software (diskette, documentation)
  – customer may have problems with the installation and the librarian cannot help, since a computer expert is required
• Using the emulated version means the retrieval of a file from a digital library (electronic storage) and its installation (i.e. a copy process) on the library computer (which has an emulator installed)
  – no manpower involved, instant service to the customer
• It suffices to have one reference installation in the world
  – libraries could trade the files, provided they own the copyright of the “computer game”


Summary

• Emulators may be the only way to preserve a complex software environment
  – a “living” environment in contrast to a “dead” environment like a book (text or image)
• Digital libraries themselves are complex software environments, which depend on hardware and operating systems
• This is a current Ph.D. project at the University of Freiburg.
  – How far can we go?
  – Apparently very far…..