OAIS Rathachai Chawuthai [email protected] Information Management CSIM / AIT Issued document 1.0.
Semantic Digital Preservation
Rathachai Chawuthai
Information Management, CSIM / AIT
Introduction
Issued document 1.0
2
Agenda
• 22nd Century
• Digital Preservation
• Needs of Archive in IR
• Knowledge Preservation
• Technology Review
3
22nd Century
4
Scenario
Assume that the following scenario takes place in the 22nd century.
5
Present
Imagine how a person in the future will be able to read the digital documents you create today.
Alice (Reader), Bob (Archivist)
6
22nd Century
Hi Bob, do you have any information about the US president “Barack Obama”?
Oh! It is hard to find, because the information is more than 100 years old.
7
22nd Century
Hi Alice. Luckily, I found a DVD containing his information.
What is a DVD?
8
Present
Do you believe that your current media will be usable in the future?
9
22nd Century
Don’t be silly, Alice. It was popular 100 years ago. It can be read by a DVD reader. See!
Error: DVD unreadable
No! That thing is unreadable.
10
Present
The lifespan of digital media is quite short. Do you have a plan to move your data to new media?
11
22nd Century
Fortunately, I was able to get that file. Can you open “obama2009.pdf”?
Error: No program can open file format PDF
Hey, … how do I open a PDF file?
12
Present
Do you tell future readers which software, hardware, and versions are needed to open your file?
13
22nd Century
As far as I can see, it needs Adobe Reader 9.0 to open it.
File is read-protected. Please enter the password.
How would I know the password?
14
Present
Your file might be secured. Do you tell future readers how to access it?
15
22nd Century
!7rò??àÕ??ߟ²ÂÚ
Õ??ߟ²ÂÚðŽɳ
!Z?g! Õr/ÕŸ/?rò?
Why did the author write in an alien language?
16
Present
There are still issues with encoding (such as ASCII, ANSI, ISO-8859, UTF-7, big-endian, little-endian) and with fonts (such as Tahoma, Verdana).
How do you tell future readers what is required to render your file?
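A minimal Python sketch of why the encoding has to be recorded: the same bytes decoded with the wrong character set turn into the kind of "alien language" seen in the scenario. The sample string and the choice of UTF-8 and Latin-1 are illustrative assumptions, not taken from the slides.

```python
# Illustrative only: the sample text and both encodings are assumptions.
original = "café"

# The producer stores the text as UTF-8 bytes.
stored_bytes = original.encode("utf-8")

# A future reader who guesses the wrong encoding sees garbled characters.
garbled = stored_bytes.decode("latin-1")

# A reader who is told the encoding (for example via metadata) recovers the text.
recovered = stored_bytes.decode("utf-8")

print(garbled)
print(recovered)
```

The bitstream is identical in both cases; only the knowledge about how to decode it differs.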
17
22nd Century
Barack Obama
44th president of the USA
Born 08/04/1961
Confusing! When was he born? 4th August or 8th April?
No idea! You would need to ask the author, who lived 100 years ago.
18
Present
The knowledge of today's creator and that of the future reader might be different. How do you ensure that the reader understands the document correctly?
19
22nd Century
What should I do if I need to find more information relevant to Barack Obama’s family?
You may have to browse every file from here. Good luck …
20
Present
Many files are related to other files. How do you let future readers know?
21
22nd Century
It would have been good if the older generation had made a good plan for digital preservation.
22
Digital Preservation
Age of Information
• Printed Age
  – Paper is a durable format
  – Stored under proper conditions
• Digital Age
  – Information is fragile
    • Technological obsolescence
    • Deterioration of media
24
Preservation Object
Digitized Object
• A digital object copied from a printed document.
• Stored in a common format such as TIFF.
25
Preservation Object
Born-Digital Object
• A digital object created with software.
• Its versions need to be kept, not only the finalized document.
26
To be digital
Capacity vs. Age
1000 Years
15 Years
Digital media can hold much more information than printed paper in the same volume, but their lifespan is shorter than that of printed paper. Fortunately, content on digital media is easily duplicated to other media.
27
Digital Preservation
Objective
• The active management of digital information to ensure, over time, its
  – Maintainability: the bitstream still exists in its original form
  – Accessibility: the bitstream forming a file can be opened
  – Renderability: an opened file presents the digital object as it originally appeared
  – Understandability: a reader understands the digital object as originally intended
wikipedia.org
28
Maintainability
Issue
How can a bitstream be preserved when the lifespan of digital media is short and the media themselves become obsolete?
Do you have these?
29
Maintainability
Proposed Solution
The current solution is migration: duplicating the bitstream from one medium to another at regular intervals.
Challenge
• How to be notified that it is time to migrate?
• Does anyone have the right, granted by the intellectual property owner, to copy the work?
• How to guarantee that nothing is lost during the migration process?
• How to keep track of changes made by the migration process?
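One common way to check that nothing is lost during a migration is to compare checksums before and after the copy. The following is a minimal Python sketch under that assumption; the helper names `sha256_of` and `migrate`, and the temporary files standing in for old and new media, are hypothetical and not part of the slides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, target: Path) -> str:
    """Copy source to target and verify that the bitstream is unchanged.

    Hypothetical helper: raises ValueError if the checksums differ.
    """
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise ValueError(f"Migration of {source} lost or altered data")
    return after


# Temporary files stand in for the old and the new storage medium.
with tempfile.TemporaryDirectory() as workdir:
    old_medium = Path(workdir) / "old_medium.bin"
    new_medium = Path(workdir) / "new_medium.bin"
    old_medium.write_bytes(b"Barack Obama, 44th president of the USA")
    checksum = migrate(old_medium, new_medium)
    migrated_bytes = new_medium.read_bytes()
```

Recording the returned checksum together with the migration date would also give a simple log of each migration event.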
30
Accessibility
A bitstream needs to be represented as a file in order to be opened by software.
Issue
• To form an accessible file, the bitstream must be structured as an object that software understands.
  – Datatype: number, string, array, …
  – Format: text, image, video, audio, …
• Opening a file requires an environment that includes hardware, software, and their versions.
• Furthermore, some files cannot be accessed because they are protected for security reasons.
31
Accessibility
Proposed Solution
• Use metadata to record the information that anyone needs in order to access the file, such as
  – Byte encoding
  – File format
  – Hardware and software, and their versions
  – Password to open the file
• Provide a way to access the file
  – Use a virtual environment to access the file
  – Migrate the file to suit newer software
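The access metadata listed above could be sketched as a small record stored next to the file. This is an illustrative Python sketch only: the class `AccessMetadata`, its field names, and the sample values (other than the file name and Adobe Reader 9.0 from the scenario) are assumptions, and the record holds access instructions rather than the password itself.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AccessMetadata:
    """Hypothetical record of what a future reader needs to open a file."""
    file_name: str
    byte_encoding: str
    file_format: str
    software: str
    software_version: str
    hardware: str
    access_instructions: str  # how to obtain the password, not the password


record = AccessMetadata(
    file_name="obama2009.pdf",
    byte_encoding="binary",
    file_format="PDF",
    software="Adobe Reader",
    software_version="9.0",
    hardware="any machine able to run the listed software",
    access_instructions="request the password from the archive management",
)

# Serialize to plain JSON so the record itself stays easy to read.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)

# Reading the record back yields the same information.
restored = AccessMetadata(**json.loads(serialized))
```

A plain-text serialization is used here so that the metadata does not itself depend on special software.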
32
Accessibility
Challenge
• How to define a common metadata structure?
  – Which information does every organization agree to include?
• How to be notified that it is time to migrate to new software?
• Does anyone have the right, granted by the intellectual property owner, to copy and modify the work in order to support newer software?
• How to guarantee that nothing is lost during the migration process?
• How to keep track of changes made by the migration process?
33
Renderability
Issue
Even if a digital object can be opened, how can we guarantee that it is rendered as it originally appeared?
34
Renderability
Proposed Solution
• Use metadata to record information about the look and feel of the digital object, such as
  – Character code
  – Font
  – Color template
Challenge
• Which information is necessary to include in the metadata?
• Is there a process to verify the correctness of the rendered object?
35
Understandability
Issue
• How can we ensure that the characteristics of today's digital objects, including:
  – Documentation style
    • Date format
    • Number format
    • Grammar, sentence, phrase, vocabulary, symbol
  – Contemporary knowledge
    • Common sense
    • Contextual knowledge
    • Knowledge understood implicitly within a community
  are understood by future readers who have different knowledge?
36
Understandability
Proposed Solution
• Preserve the underlying community knowledge along with the digital objects
• Link relevant digital objects and their contents to explore original knowledge and new knowledge
  – Using semantic technology
37
Understandability
Challenge
• How to model and implement a theory of underlying community knowledge?
• How to collect contextual knowledge for each period?
• How to establish the correctness of knowledge?
38
Archive Information System
To meet the preservation requirements, an archive information system appears to be the answer. A good system should therefore support:
– Flexible information model
– Long-term storage
– Well-formed metadata
– Preservation activities
– Browsing and searching
– Knowledge exploration
– Preservation policy
– Access policy
– Rights and agreement policy
39
Stakeholder
To be fully featured, the system needs to support the following roles:
• Provider
  – One who ingests digital objects into the archive
• Consumer
  – One who retrieves preserved information
• Management
  – One who provides preservation strategies and carries out preservation activities such as migration
A good system should support the use cases of each of these roles well.
40
Summary
• The goal of preservation is to maintain knowledge over time.
• Preservation needs well-established metadata and a well-established system.
• A preservation system should provide functionality to providers, consumers, and management.
Institutional Repositories and Digital Preservation: Assessing Current Practices at Research Libraries
Yuan Li, Syracuse University
Meghan Banach, University of Massachusetts Amherst
Need of Archive in IR
Digital Archive
• Archive
  – Is a collection of historical records, or the physical place where they are located.
  – Contains primary source documents that have accumulated over the course of an individual's or organization's lifetime, and are kept to show the function of an organization.
• Digital Archive
  – Is an archive in digital form that requires digital preservation of
    • Digital media
    • Environment to render
wikipedia.org
Institutional Repository
• An Institutional Repository is an online locus for collecting, preserving, and disseminating - in digital form - the intellectual output of an institution, particularly a research institution.
• For a university, this would include materials such as research journal articles, before (preprints) and after (postprints) undergoing peer review, and digital versions of theses and dissertations, but it might also include other digital assets generated by normal academic life, such as administrative documents, course notes, or learning objects.
• The four main objectives for having an institutional repository are:
  – to provide open access to institutional research output by self-archiving it;
  – to create global visibility for an institution's scholarly research;
  – to collect content in a single location;
  – to store and preserve other institutional digital assets, including unpublished or otherwise easily lost ("grey") literature (e.g., theses or technical reports).
wikipedia.org
44
Introduction
• Review
  – Archiving within IRs
  – Managing digital content
  – Producing digital copies
45
Introduction
• A preservation system requires
  – Natural and juridical persons
  – Institutions
  – Applications
  – Infrastructure
  – Procedures
46
Introduction
• Issues of Preservation
  – Little control over the ingestion process
  – Less-than-optimal formats
  – Poor metadata
  – Insufficient intellectual property rights clearance
  – Difficult or costly to preserve
47
Objective
• Analyze the needs of digital preservation (digital archive) in the domain of institutional repositories
48
Question?
• Is preservation part of the mission and goal of IRs?
• What preservation policies exist for IRs?
• What preservation strategies are IRs currently implementing?
• Are the necessary rights and agreements in place to preserve the content of IRs?
• Are all of the materials in IRs of sufficient quality and importance to warrant long-term preservation (Content policies)?
• Do IRs currently have the necessary sustainability in terms of funding and staffing to carry out long-term preservation of their contents?
49
Is preservation part of the mission and goal of IRs?
Question?
50
Answers
Is preservation part of IRs?
[Chart: YES / NO responses, with shares of 97.4% and 2.6%]
51
What preservation policies exist for IRs?
Question?
52
Answers
Preservation Policies
• Duration
  – Short | Medium | Long
• Recommended file formats
  – Text formats: pdf, txt, rtf, xml, odb, ods, odp
  – Image file formats: tiff, jp2, jpg
  – Audio formats: aif, aiff, wav
  – Video formats: avi, mj2, mjp2
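A recommended-format policy like the one above can be checked automatically at ingest time. The sketch below is a hypothetical Python helper: the table reproduces the formats listed on the slide, while the function name `recommended_category` and the idea of checking by file extension are assumptions for illustration.

```python
from pathlib import PurePath

# Recommended formats as listed on the slide, grouped by content type.
RECOMMENDED_FORMATS = {
    "text": {"pdf", "txt", "rtf", "xml", "odb", "ods", "odp"},
    "image": {"tiff", "jp2", "jpg"},
    "audio": {"aif", "aiff", "wav"},
    "video": {"avi", "mj2", "mjp2"},
}


def recommended_category(file_name: str):
    """Return the content type whose list contains the extension, else None."""
    extension = PurePath(file_name).suffix.lower().lstrip(".")
    for category, extensions in RECOMMENDED_FORMATS.items():
        if extension in extensions:
            return category
    return None
```

A file extension is only a hint about the real format, so an actual repository would likely combine such a check with format identification of the file content.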
53
What preservation strategies are IRs currently implementing?
Question?
54
Answers
Preservation Strategies
Backup System
Secure Storage System
Checksum
55
Answers
Preservation Strategies
By IR system
By external system
Preservation metadata
56
Answers
Preservation Strategies
• Metadata varies with the sophistication of the collection
• Work is under way on standards and best practices that address all types of metadata
57
Are the necessary rights and agreements in place to preserve the content of IRs?
Question?
58
Answers
Rights and Agreements
• Digital content may be changed when technology changes
• Does this impact copyright?
• Players
  – Content contributor
  – Copyright holder
59
Answers
Rights and Agreements
• What is an agreement?
  – Click-through
  – Written
  – Policies
  – MOUs
  – Verbal
Most agreements: the contributor needs permission to submit work that is owned by another party.
60
Are all of the materials in IRs of sufficient quality and importance to warrant long-term preservation
(Content policies)?
Question?
61
Answers
Content Policies
Collect
Manage
Disseminate
62
Answers
Content Policies
• Problem
  – Format obsolescence
  – Poor quality
  – Unreadable
  – Insufficient metadata
    • To manage
    • To preserve
63
Answers
Content Policies
• It should
  – Track user activities, e.g. submitting work
  – Require peer review before deposit in IRs (to ensure quality)
    • Journal articles
    • Conference proceedings
64
Do IRs currently have the necessary sustainability in terms of funding and staffing to carry out long-term preservation of
their contents?
Question?
65
Answers
Sustainability
[Figure labels: Period, Time, Technology Change, Short-term, Medium-term, Long-term, Infinity]
66
Summary
• Recognize the need to implement a digital archive in the institutional repository
• Make agreements and secure permissions for preserving IR contents
• Give content contributors guidance on digital formats for preservation
• Plan for long-term digital preservation
• Solve the lack of preservation funding
Terminology and Wish List for a Formal Theory of Preservation
Giorgos Flouris, FORTH-ICS
Carlo Meghini, CNR-ISTI
Knowledge Preservation
68
Introduction
Barack Obama
44th president of the USA
Born 04/08/1961
Currently, the system can do:
Bit Preservation
The bitstream is preserved for the long term on modern media.
Object Preservation
The bitstream can be rendered and displayed to the user as it originally appeared.
69
Introduction
Barack Obama
44th president of the USA
Born 08/04/1961
Information Preservation
Currently, the system may not focus on this.
A new challenge is for the system to preserve the ability to understand the rendered object over time.
To meet this challenge, the reader must be able to understand the rendered object’s content by understanding the terms, concepts, or other information that appear in it, and by placing it in its correct context. Currently, this feature does not exist in existing preservation approaches.
70
Objective
Barack Obama
44th president of the USA
Born 04-Aug-1961
[Diagram: Producer → Ingest → Archive System → Render → Consumer]
The objective is that a reader (consumer) is able to perceive the information in context, based on his/her background knowledge, and understand it as originally intended.
71
Discussion
Terms
P: Producer. The creator of the digital object.
D: Digital Object. An object that presents knowledge in an understood language.
C: Consumer. A reader who reads the digital object.
DC: Designated Community. A group of readers who share common characteristics and knowledge.
72
Discussion
Understanding Process
1. The Producer produces a Digital Object and stores it on storage media.
2. The Consumer opens the Digital Object from the storage media by rendering the sequence of bit values that represents the document.
3. The Consumer perceives the Digital Object through light travelling from the output device to his/her eyes.
4. The Consumer understands the meaning of the Digital Object from D itself and from the contextual knowledge of his/her Designated Community.
Goal
The Consumer is able to understand the Digital Object as originally intended, over time.
73
Underlying Community Knowledge
• The key is the “meaning” of digital knowledge.
  – The meaning of a digital object can be viewed as a special kind of mapping that associates a symbol with a particular real-world concept.
  – This association is not always clear from looking at the digital object alone.
• A date format is a good example of something that confuses people.
  – In European notation, he was born on the 8th of April.
  – In American notation, he was born on the 4th of August.
Barack Obama
44th president of the USA
Born 08/04/1961
Flouris & Meghini
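The date ambiguity can be reproduced with Python's standard `datetime` module. This is a small illustrative sketch; treating the month-first reading as the producer's intent is an assumption that follows the deck's later example, in which the producer means the 4th of August.

```python
from datetime import datetime

written = "08/04/1961"

# Day-first reading (the slide's "European notation").
european = datetime.strptime(written, "%d/%m/%Y").date()

# Month-first reading (the slide's "American notation").
american = datetime.strptime(written, "%m/%d/%Y").date()

# An unambiguous rendering no longer depends on the reader's convention.
unambiguous = american.isoformat()

print(european, american, unambiguous)
```

Both parses succeed without error, which is exactly the problem: nothing in the symbol itself says which convention applies.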
74
Underlying Community Knowledge
• To capture the “meaning” of a Digital Object, the Digital Object needs to be described in a Language.
L: Language. An arrangement of symbols that are associated with real-world concepts.
• The Language should be a formal language that can be interpreted by both Producer and Consumer.
• The purposes of the Language are
  – Providing formulation rules that encode real-world concepts as symbols.
  – Providing logical semantics that use contextual, background, or commonsense information to decode symbols into real-world concepts.
75
Underlying Community Knowledge
[Diagram: Producer P encodes “August 4” as “08/04” in Digital Object D using Language L]
• The producer needs to represent “4th of August” in a common language. Thus, she needs to use contextual, background, or commonsense information that she shares with her community in order to write a symbol representing “4th of August”.
• She decides to use “08/04” because everyone in the same community understands this and can interpret it as “4th of August”.
• This means that she and the readers in the same community at that period understand the same meaning.
76
Underlying Community Knowledge
Starting from a simple math function, f(x) = y:
Everyone uses an interpret function to understand the meaning of language.
producer.interpret( “08/04” ) = “4th of August”
reader01.interpret( “08/04” ) = “4th of August”
reader02.interpret( “08/04” ) = “4th of August”
In this case, everyone interprets the language “08/04” as “4th of August” because the interpret process contains a formula.
The formula comes from knowledge. If the knowledge is agreed on within a community, the formula is produced from community knowledge. This means that the Producer and all readers have the same formula, so they understand the same thing.
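The interpret function can be sketched in Python as follows. This is only an illustration of the idea that the "formula" comes from community knowledge: the `Reader` class, the `date_order` parameter ("MD" for month/day, "DM" for day/month), and the `ordinal` helper are all hypothetical names introduced here.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def ordinal(day: int) -> str:
    """Return the day with an English ordinal suffix, e.g. 4 -> '4th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


class Reader:
    """A person whose 'formula' comes from their community's knowledge."""

    def __init__(self, name: str, date_order: str):
        # date_order is the assumed community convention: "MD" or "DM".
        self.name = name
        self.date_order = date_order

    def interpret(self, symbol: str) -> str:
        """Decode a 'NN/NN' symbol into a real-world date description."""
        first, second = (int(part) for part in symbol.split("/"))
        if self.date_order == "MD":
            month, day = first, second
        else:
            day, month = first, second
        return f"{ordinal(day)} of {MONTHS[month - 1]}"


# Members of the same community share the same convention.
producer = Reader("producer", "MD")
reader01 = Reader("reader01", "MD")
reader02 = Reader("reader02", "MD")
```

With the same `date_order`, all three calls to `interpret("08/04")` give the same answer; a reader constructed with `"DM"` would decode the same symbol differently.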
77
Underlying Community Knowledge
Underlying Community Knowledge (UCK)
Knowledge from a designated community (DC) that helps its members to understand the association between language and real-world concepts in the same way. Therefore, the key feature of UCK is to produce formulas that are able to
- Encode real-world concepts as language
- Decode language into real-world concepts
78
Evaluation of DC
[Diagram: Consumer C decodes “08/04” in Digital Object D using Language L as “April 8”]
producer.interpret( “08/04” ) = “4th of August”
consumer.interpret( “08/04” ) = “8th of April”
Why does the consumer understand it incorrectly?
79
• As time passes, the designated community may change, and its knowledge may change.
• Thus, “understanding” may change, too.
• The critical cause is a change of UCK.
  – A different UCK produces a different formula, which produces a different understanding.
• The next challenge is “How to capture the change of UCK”.
Evaluation of DC
80
Evaluation of DC
UCK Evolution Structure (UCKES)
A structure that represents the difference (delta) between UCKs. UCKES captures changes in a UCK’s language that result from changes in the UCK’s theory, such as ontology evolution.
UCKES represents the gap between UCKs.
[Diagram: UCKES between P and C]
81
Evaluation of DC
UCK Mapping Structure (UCKMS)
A complex mechanism that uses UCKES to produce the relationship between the Consumer’s formula and the Producer’s formula. Its main function is to change the language so that both reach the same understanding of the real-world concept.
[Diagram: UCKMS between P and C]
82
Is it possible?
Evaluation of DC
producer.interpret( “08/04” ) = “4th of August”
consumer.interpret( “04/08” ) = “4th of August”
83
Evaluation of DC
Right now, the Consumer gets an incorrect understanding of the language that the Producer intended to present.
[Diagram: Producer (UCK → Formula) and Consumer (UCK → Formula) both read “08/04” from Digital Object D]
84
Evaluation of DC
The system should understand the knowledge on the Consumer’s side and generate a mapping between the Producer’s formula and the Consumer’s formula using the UCKES and UCKMS mechanisms.
[Diagram: Producer (UCK → Formula) and Consumer (UCK → Formula), connected by UCKES and UCKMS; Digital Object D contains “08/04”]
85
Evaluation of DC
Then, the system transforms the digital object D into D’. D’ contains language that makes the Consumer understand the same thing as the Producer.
[Diagram: Producer (UCK → Formula) reads “08/04” from Digital Object D; Consumer (UCK → Formula) reads “04/08” from Digital Object D’; UCKES and UCKMS connect the two sides]
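The transformation from D to D’ can be illustrated with a deliberately tiny Python sketch. It reduces the difference between the two UCKs to a single fact, the order of the date fields, which is a strong simplifying assumption; the function `transform` and the order codes "MD" (month/day) and "DM" (day/month) are hypothetical and stand in for the much richer UCKES and UCKMS mechanisms.

```python
VALID_ORDERS = {"MD", "DM"}


def transform(symbol: str, producer_order: str, consumer_order: str) -> str:
    """Rewrite a 'NN/NN' date symbol from the producer's field order
    into the consumer's field order (hypothetical helper)."""
    if producer_order not in VALID_ORDERS or consumer_order not in VALID_ORDERS:
        raise ValueError("date order must be 'MD' or 'DM'")
    first, second = symbol.split("/")
    if producer_order == consumer_order:
        # Same convention on both sides: nothing needs to change.
        return symbol
    # Different conventions: swap the two fields.
    return f"{second}/{first}"


# D uses the producer's month/day convention.
d = "08/04"

# D' is rewritten for a consumer who reads day/month.
d_prime = transform(d, producer_order="MD", consumer_order="DM")
```

Here `d_prime` is "04/08", the symbol that a day-first consumer reads as the 4th of August, matching what the producer meant by "08/04".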
86
Summary
D: Barack Obama, 44th president of the USA, Born 08/04/1961
D’: Barack Obama, 44th president of the USA, Born 04/08/1961
The Consumer understands D’ in the same way as the Producer understands D.
This means that D’ has a preservability relation with D.
87
Summary
Next step
How can the underlying community knowledge be preserved along with the digital object?
• Preservation needs to keep the “Reader” in mind, providing information that ensures the reader can understand the digital object as originally intended, starting from their own knowledge.
88
Technology Review
89
PREMIS
• The PREMIS Data Dictionary defines preservation metadata as “the information a repository uses to support the digital preservation process”
• The metadata includes
  – Intellectual information
    • An intellectual unit such as a book, map, movie, song, …
  – Digital object information
    • A digital object that realizes the intellectual information
    • E.g. pdf, image, video, audio, …
  – Agent information
    • A person or system involved with the digital object
  – Event information
    • A record of the activities performed on a digital object
  – Rights information
    • Agreements concerning the digital object
wikipedia.org, LOC.gov
90
OAIS
• An Open Archival Information System (OAIS) is a reference model for an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
• Features
  – Ingest, Archive, Preservation Plan, Administration, Dissemination, and Access
• End users
  – Provider, Consumer, and Management
wikipedia.org, OCLC.org
91
?
92
References
• http://www.dlib.org/dlib/may11/yuanli/05yuanli.html
• http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.9681&rep=rep1&type=pdf
• http://www.loc.gov/standards/premis/
• http://en.wikipedia.org/wiki/Preservation_Metadata:_Implementation_Strategies_(PREMIS)
• http://www.oclc.org
• http://public.ccsds.org/publications/archive/650x0b1.pdf
• http://en.wikipedia.org/wiki/Open_Archival_Information_System