“Think. Learn. Succeed.” Ver 1.2
Methods for Knowledge Management & Digital Preservation
The Theory and Practice of Digital History
Carl A. Young, M.A. in waiting
1 December 2009
Project Overview
Challenge
Resource- and skill-constrained historians and archivists require efficient methods for capturing, analyzing, and sharing original artifacts.
Methodology
• Multi-phase project
• Develop a low-cost process for digitally archiving documents
• Store them in a standards-based data storage platform
• Set the conditions to scale with future phases
• Create a collaborative, accessible, online digital repository fully leveraging the optionality of the digital domain
Major Phases
• Phase I: Prototyping
• Phase II: Capture
• Phase III: Web Access
• Phase IV: Initial Expansion
• Phase V: Infinite Expansion
Phase I: Prototype
Completed in November 2009, this phase established a usable, affordable methodology for project development by prototyping the capture and conversion of an original artifact for testing and exploration purposes.
Phase I: Prototype (cont.)
Demonstration
• Original: digital camera, .JPG file format, 2 MB
• Treatment w/ Photoshop: .TIFF, 29 MB
• Adobe conversion: .pdf, 278 KB
Time elapsed:
• Photo: <1 min
• Treatment: ~3 min
• Conversion: <1 min
Phase I: Prototype (cont.)
Process Flowchart
Phase II: Capture
Completed in November 2009, this phase performed and documented low-budget document capture, artifact preservation, and conversion to a distributable format. A historic text is extracted from the original document, archived, and presented to the user in both the original capture (.jpg or .tiff) and distributable (.pdf and .xml) formats. The phase also evaluated optical character recognition (OCR) and transcription requirements.
Phase II: Capture (cont.)
Image Treatment
• Select Area
• Image > Adjustments > Curves (“Digitization”)
  – Channel: RGB
  – Output: 203
  – Input: 160
• Filter > Blur
  – Smart Blur: Radius 100, Threshold 100, Quality High, Mode Normal
  – Surface Blur: Radius 100, Threshold 25
  – Surface Blur (if needed): Radius 100, Threshold 25
  – Lens Blur: Shape Octagon, Radius 5, Blade Curvature 50, Rotation 300, Brightness 10, Threshold 75, Noise 3, Distribution Uniform
• Select > Color Range: Shadows, no Invert
• Select > Modify > Expand: 2
• Cut
• File > New*: Width 1600, Height 2500, Resolution 300, Color Mode RGB 16-bit
• Paste
• Flatten
• Clean up as needed
• Save As .TIFF
* Recommend saving as a preset.
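Two of the numeric settings above can be sanity-checked with a little arithmetic. The sketch below is not Photoshop's implementation: it approximates the Curves adjustment (Input 160 to Output 203) with a piecewise-linear lookup table on the 0-255 scale, whereas Photoshop fits a smooth curve through the control points. It also estimates the uncompressed size of the 1600 x 2500, 16-bit RGB canvas. The function names are hypothetical.

```python
def curve_lut(input_level=160, output_level=203):
    """Build a 256-entry lookup table through (0, 0), (input, output), (255, 255).

    Assumption: straight segments between control points. Photoshop itself
    interpolates a smooth curve, so values between the points will differ.
    """
    lut = []
    for v in range(256):
        if v <= input_level:
            mapped = v * output_level / input_level
        else:
            mapped = output_level + (v - input_level) * (255 - output_level) / (255 - input_level)
        lut.append(round(mapped))
    return lut


def uncompressed_bytes(width=1600, height=2500, channels=3, bits_per_channel=16):
    """Raw pixel payload of the canvas, ignoring TIFF headers and compression."""
    return width * height * channels * bits_per_channel // 8


lut = curve_lut()
print(lut[0], lut[160], lut[255])    # endpoints stay fixed, level 160 is lifted to 203
print(uncompressed_bytes() / 2**20)  # payload size in MiB
print(1600 / 300, 2500 / 300)        # canvas size in inches at 300 pixels per inch
```

The raw payload works out to 24,000,000 bytes, roughly 22.9 MiB, which is in line with the ~23 MB per-page .tiff figure quoted in the labor estimates.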
Phase II: Capture (cont.)
OCR and Transcription Demo
Time elapsed:
• OCR: <1 min
• Transcription: ~5 min
OCR / Transcription
Phase II: Capture (cont.)
TEI Demo
Time elapsed:
• Preliminary data: ~45 min
• Page: ~5 min
Look at UVA’s TEI How To
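As a rough illustration of what the TEI step produces, the sketch below builds a minimal TEI-style XML document with Python's standard library. The element names (`teiHeader`, `fileDesc`, `titleStmt`, `publicationStmt`, `sourceDesc`, `pb`, `p`) and the namespace come from the TEI P5 Guidelines. The helper names, sample page text, and placeholder header statements are hypothetical, and the output has not been validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, pages):
    """Return a TEI element: one header (the one-time preliminary data)
    plus one <pb/> and one <p> per transcribed page (the per-page work)."""
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    pub = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub, tei("p")).text = "Unpublished prototype."  # placeholder
    src = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(src, tei("p")).text = "Transcribed from page photographs."  # placeholder

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for number, transcription in enumerate(pages, start=1):
        ET.SubElement(body, tei("pb"), {"n": str(number)})  # page break marker
        ET.SubElement(body, tei("p")).text = transcription
    return root


doc = build_tei("Militiaman's Guide", ["Sample text of page one.", "Sample text of page two."])
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Separating the header from the per-page loop mirrors the timing split on this slide: the header is filled in once, while each page adds only a page break and its transcription.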
Phase II: Capture (cont.)
Methodology Flow Chart
Phase II: Capture (cont.)
Labor Estimates
Militiaman’s Guide: 155 pages total, type text, fair condition
40 hours (optimal) / 5 GB
Per Page Estimates
• Photography:
  – ~30 sec
  – 2.5 MB @ 5 Mpxl
• .tiff Conversion
  – ~3 min
  – 23 MB
• .pdf Conversion
  – ~1 min
  – 300 KB
• OCR: ~45 sec
• Error Correction/Transcription: ~5 min
• TEI: ~5 min (~45 min overhead)
Case Estimates
• Photography:
  – ~1:15
  – ~400 MB
• .tiff Conversion
  – ~7:45
  – 3.5 GB
• .pdf Conversion
  – ~2:30
  – 50 MB
• OCR: ~2 hours
• Error Correction/Transcription: ~13 hrs
• TEI: ~14 hrs
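The case estimates follow from the per-page figures multiplied by the 155-page count, plus the one-time TEI overhead. A minimal sketch of that arithmetic, using only the numbers on this slide (the dictionary layout and function names are illustrative):

```python
PAGES = 155

# Per-page time in seconds, taken from the per-page estimates.
SECONDS_PER_PAGE = {
    "photography": 30,
    "tiff_conversion": 3 * 60,
    "pdf_conversion": 60,
    "ocr": 45,
    "transcription": 5 * 60,
    "tei": 5 * 60,
}
ONE_TIME_SECONDS = {"tei": 45 * 60}  # TEI preliminary-data overhead

# Per-page storage in megabytes.
MB_PER_PAGE = {"photography": 2.5, "tiff_conversion": 23, "pdf_conversion": 0.3}


def task_hours(pages=PAGES):
    """Hours per task for the whole document."""
    return {
        task: (secs * pages + ONE_TIME_SECONDS.get(task, 0)) / 3600
        for task, secs in SECONDS_PER_PAGE.items()
    }


def storage_gb(pages=PAGES):
    """Total storage in GB, taking 1 GB = 1000 MB."""
    return sum(MB_PER_PAGE.values()) * pages / 1000


hours = task_hours()
for task, h in hours.items():
    print(f"{task:16s} {h:5.1f} h")
print(f"{'total':16s} {sum(hours.values()):5.1f} h")
print(f"{'storage':16s} {storage_gb():5.1f} GB")
```

Run as written, the total comes to roughly 40 hours, in line with the slide's 40-hour figure. The storage sum comes to about 4 GB, under the 5 GB quoted, which may reflect rounding or headroom.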
Equipment Baseline
• Consumer-grade HP 5 Mpxl digital camera ($125)
• Slightly above consumer-grade PC ($1100)
  – 4 GB RAM
  – 1 GB VRAM
  – 500 GB SATA HD
  – Dual screens
• Consumer software ($600)
  – Adobe Creative Suite 3
Lessons Learned
• Use a tripod/mount
• Use consistent lighting
• Safely flatten pages as much as possible
• Use a mounting frame
• Use the highest resolution available
• OCR is NOT reliable
• Need an efficient method for TEI
Phase III: Web-Access
This phase is the subject of this grant funding request. A team of professional developers will construct a suitable multimedia database for storage and access of original artifact captures, distributable .pdf versions, and XML-based data and metadata derived from the original. The team will also develop a working prototype web site to access the data. Data archiving and disaster recovery are fundamental to this phase. Successful conclusion of this phase will yield a working version 1.0 available for release and continued development.
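One way to picture the storage layer is a small relational schema linking each artifact to its page captures and derived files. The sketch below uses SQLite from Python's standard library purely for illustration. The table names, column names, and file paths are hypothetical and not taken from the project, which leaves the choice of database platform to the development team.

```python
import sqlite3

SCHEMA = """
CREATE TABLE artifact (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    page_count INTEGER
);
CREATE TABLE page_file (
    id          INTEGER PRIMARY KEY,
    artifact_id INTEGER NOT NULL REFERENCES artifact(id),
    page_number INTEGER NOT NULL,
    kind        TEXT NOT NULL CHECK (kind IN ('jpg', 'tiff', 'pdf', 'xml')),
    path        TEXT NOT NULL,
    UNIQUE (artifact_id, page_number, kind)
);
"""


def open_repository(db_path=":memory:"):
    """Create the schema in a new database and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_page_files(conn, artifact_id, page_number, paths_by_kind):
    """Register the capture and derived files for one page."""
    conn.executemany(
        "INSERT INTO page_file (artifact_id, page_number, kind, path) VALUES (?, ?, ?, ?)",
        [(artifact_id, page_number, kind, path) for kind, path in paths_by_kind.items()],
    )


conn = open_repository()
cur = conn.execute(
    "INSERT INTO artifact (title, page_count) VALUES (?, ?)", ("Militiaman's Guide", 155)
)
add_page_files(
    conn, cur.lastrowid, 1,
    {"tiff": "captures/p001.tiff", "pdf": "dist/p001.pdf", "xml": "tei/p001.xml"},
)
conn.commit()
rows = conn.execute(
    "SELECT kind, path FROM page_file WHERE page_number = 1 ORDER BY kind"
).fetchall()
print(rows)
```

Keeping file paths in the database and the files themselves on disk is one design option among several; storing the binaries directly in the database would also fit the description above.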
Phase III: Web-Access (cont.)
Flow Chart
Phase III: Web-Access (cont.)
Work Breakdown Structure
• Database Development
• Prototype Evaluation
• Prototype Web Development
• Alpha Test & Mod
• Beta Test & Mod
• RC1 Test & Mod
• v1.0
• Documentation
• Disaster Recovery Testing
Estimated Cost: $52,000
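Disaster recovery testing usually comes down to proving that a restored copy matches what was archived. One common approach, sketched below as an assumption rather than the project's stated method, is to record a SHA-256 checksum for every archived file and re-verify after a restore. The function names and the demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large .tiff captures need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return relative paths that are missing or whose contents changed."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)


with tempfile.TemporaryDirectory() as archive:
    (Path(archive) / "p001.tiff").write_bytes(b"stand-in for image data")
    (Path(archive) / "p001.xml").write_text("<TEI/>", encoding="utf-8")
    manifest = build_manifest(archive)
    clean = verify(archive, manifest)      # nothing has changed yet
    (Path(archive) / "p001.xml").write_text("<TEI>damaged</TEI>", encoding="utf-8")
    damaged = verify(archive, manifest)    # the altered file is reported

print(clean, damaged)
```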
Phase III: Web-Access (cont.)
Project Gantt Chart
Phase IV: Initial Expansion
Beyond the scope of this grant request, this phase seeks to develop partnerships and data shares across multiple institutions with similar projects in development or production. The level of participation directly influences the scale of this phase. It is anticipated that the minimal costs will be shared across participating institutions.
Phase IV: Initial Expansion (cont.)
Work Breakdown Structure
• Conduct Lifecycle Management Review
• Documentation
• Disaster Recovery Testing
• Publish Methodology
• Find Partners
• Large Scale Capture
• Leverage v1.0
• Update Code and Processes
Estimated Cost: $8,000
Phase V: Infinite Expansion
Optionally, and depending on the success of the earlier phases, this phase will greatly expand collaborative efforts by potentially making this capability available to amateur and resource-constrained archivists and historians. It will provide a standards-based methodology and data capture technique, plus a collaborative platform to share the data once stored. This aspect of the final phase will be limited only by technology maintenance and scalability costs.
Phase V: Infinite Expansion (cont.)
Work Breakdown Structure
• Publish Updated Methodology
• Publish Membership Schema
• Open Data Models
• Leverage Current Version
• Conduct Lifecycle Management Review
• Documentation
• Disaster Recovery Testing
• Release New Version(s)
Estimated Cost: $82,000
Summary
Project Summary
• 5-Phase Approach
• “How-To”
  – Digitization
  – TEI
  – Manage the project
• Sets the stage
  – Broad/ambitious goals and plan
  – Manageable pieces
  – Flexible optionality
Grant Request / Funding Summary
• Phase III support:
  – $51,733.33
  – Prototype Validation
  – Database Development
  – Web Development
  – Hosting
  – Disaster Recovery
• Phase IV and V templates
  – Future expansion as desired
  – Flexible Planning
QUESTIONS
CONCLUSION
Dead Guy Quote
Man had always assumed that he was more intelligent than dolphins because he had achieved so much... the wheel, New York, wars, and so on, whilst all the dolphins had ever done was muck about in the water having a good time. But conversely the dolphins believed themselves to be more intelligent than man for precisely the same reasons.
- Douglas Adams
BACKUP
Phase I: Prototype (cont.)
Work Breakdown Structure
• Image Capture
• Image Preservation
• Image Manipulation
• Database Development
• TEI Process Development
• Data Development
• Static Web-Page Prototyping
• Documentation
• Disaster Recovery Testing
Estimated Cost: $5,000
Phase I: Prototype (cont.)
Gantt Chart
Phase II: Capture (cont.)
Work Breakdown Structure
• Image Capture
• TEI
• Prototype Database Input
• Documentation
• Disaster Recovery Testing
Estimated Cost: $2,000
Phase II: Capture (cont.)
Gantt Chart