Million Book Project @ Bibliotheca Alexandrina, Noha Adly, 20 November 2006

BA Digitization Workflow

Statistics - November 2006

Stage      Unit       Arabic      Latin      Total
Scanned    Books      22,023      4,646     26,669
Scanned    Pages   7,003,185  1,350,688  8,353,873
Processed  Books      21,947      4,642     26,589
Processed  Pages   6,987,392  1,348,900  8,336,292
OCRed      Books      16,652      4,600     21,252
OCRed      Pages   5,248,337  1,327,385  6,575,722

Total archived data: 1,500 GB
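The figures above can be cross-checked with a short script. This is only a sketch: the names STATS, check_totals and share_of_scanned are made up for illustration and are not part of the BA workflow. It verifies that Arabic + Latin equals the stated total in every row, and reports what fraction of the scanned pages had reached a later stage.

```python
# Figures copied from the November 2006 statistics slide.
# Each value is (Arabic, Latin, Total). All names here are hypothetical.
STATS = {
    ("Scanned", "Books"): (22_023, 4_646, 26_669),
    ("Scanned", "Pages"): (7_003_185, 1_350_688, 8_353_873),
    ("Processed", "Books"): (21_947, 4_642, 26_589),
    ("Processed", "Pages"): (6_987_392, 1_348_900, 8_336_292),
    ("OCRed", "Books"): (16_652, 4_600, 21_252),
    ("OCRed", "Pages"): (5_248_337, 1_327_385, 6_575_722),
}


def check_totals(stats):
    """Return the rows whose Arabic + Latin count differs from the stated total."""
    return [key for key, (arabic, latin, total) in stats.items()
            if arabic + latin != total]


def share_of_scanned(stats, stage, unit="Pages"):
    """Fraction of scanned items (Arabic, Latin) that have reached `stage`."""
    scanned = stats[("Scanned", unit)]
    done = stats[(stage, unit)]
    return done[0] / scanned[0], done[1] / scanned[1]


if __name__ == "__main__":
    print("Inconsistent rows:", check_totals(STATS))
    arabic, latin = share_of_scanned(STATS, "OCRed")
    print(f"OCRed share of scanned pages: Arabic {arabic:.1%}, Latin {latin:.1%}")
```

On these numbers the Latin OCR stage has nearly caught up with scanning, while Arabic OCR lags behind it, which matches the lower Arabic OCR rate reported in the daily rates.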

Statistics (Contd)

Daily Rates

– Scan: ≈ 1800 pages/person

– Process: ≈ 1800 pages/person

– Latin OCR: ≈ 4000 pages/person

– Arabic OCR: ≈ 1500 pages/person

Five Minolta scanners, 2 shifts, 7 days a week
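A rough capacity estimate can be derived from these rates. The slide does not say how many operators work per scanner or whether the rate is per shift, so the sketch below ASSUMES one operator per scanner per shift, each achieving the quoted scan rate. The function names are hypothetical.

```python
import math

SCAN_RATE = 1800   # pages per person per day, from the slide
SCANNERS = 5       # from the slide
SHIFTS = 2         # from the slide


def daily_scan_capacity(scanners=SCANNERS, shifts=SHIFTS, rate=SCAN_RATE):
    """Pages scanned per day.

    ASSUMPTION (not stated in the slides): one operator per scanner per
    shift, each achieving `rate` pages.
    """
    return scanners * shifts * rate


def days_needed(pages, capacity):
    """Whole days needed to scan `pages` at `capacity` pages per day."""
    return math.ceil(pages / capacity)


if __name__ == "__main__":
    capacity = daily_scan_capacity()
    print("Estimated pages per day:", capacity)
    print("Days for 1,000,000 pages:", days_needed(1_000_000, capacity))
```

Changing the staffing assumption scales the estimate linearly, so the result is best read as an order of magnitude rather than a measured throughput.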

OCR

Image to Text

OCR - Arabic

Poses unique challenges:

– Written cursively, with blocks of connected characters

– A block of characters can have more than one baseline

– Uses external objects such as dots, 'Hamza' and 'Madda'.

– Diacritization

– Characters can have more than one shape according to their position

– Overlapping makes it difficult to determine the spacing

Sakhr Automatic Reader is used

Tricky with old books

Requires learning
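Two of the challenges listed above, position-dependent letter shapes and diacritization, are visible directly in Unicode. The sketch below uses only Python's standard unicodedata module. It is an illustration of the script's properties, not a description of how Sakhr Automatic Reader works, and the helper names are hypothetical.

```python
import unicodedata


def positional_forms(first_codepoint):
    """Unicode names of four consecutive presentation-form code points.

    For the letter BEH these start at U+FE8F and cover the isolated,
    final, initial and medial shapes.
    """
    return [unicodedata.name(chr(first_codepoint + i)) for i in range(4)]


def strip_diacritics(text):
    """Drop combining marks (e.g. fatha, shadda) and keep the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


if __name__ == "__main__":
    for name in positional_forms(0xFE8F):
        print(name)
    # "kataba" written with a fatha on each of its three letters
    vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(len(vocalised), "code points ->", len(strip_diacritics(vocalised)))
```

An OCR engine sees the rendered shapes, not the abstract letters, so one letter may need up to four glyph models, and a diacritized page carries extra small marks that must not be confused with the dots that distinguish letters.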

Arabic Script Is Cursive

Old, Smudgy, and Stuck Together

Use of Diacritics

16 Font Groups

Font    Low Bound   High Point   % Books
AR-H1   97.70%      99.50%        0.24%
AR-H2   97.60%      99.50%        2.66%
AR-H3   97.04%      99.10%        8.01%
AR-H4   Under construction
AR-L4   92.70%      96.70%        6.62%
DT-M1   Under construction
DT-L2   88.40%      96.80%        7.24%
TA-H1   97.30%      99.10%        1.26%
TA-H2   97.60%      99.20%       11.89%
TA-H3   Under construction
TA-H4   96.50%      97.74%        2.99%
TA-L1   94.00%      97.70%        1.65%
TA-L4   94.00%      97.90%        8.68%
TA-M2   95.80%      98.80%       23.47%
TA-M4   94.50%      97.50%       15.57%
X       -           -             9.72%
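The per-font figures can be summarised with a share-weighted average. The sketch below ASSUMES that the three "Under construction" labels belong to AR-H4, DT-M1 and TA-H3 (the rows that carry no numbers in the slide) and that the X row has a book share but no accuracy figures. The names FONT_GROUPS, total_share and weighted_bounds are hypothetical.

```python
# (low bound %, high point %, share of books %) per font group.
# None marks a group assumed to be "Under construction"; the X row has
# a share but no accuracy figures in the slide.
FONT_GROUPS = {
    "AR-H1": (97.70, 99.50, 0.24),
    "AR-H2": (97.60, 99.50, 2.66),
    "AR-H3": (97.04, 99.10, 8.01),
    "AR-H4": None,
    "AR-L4": (92.70, 96.70, 6.62),
    "DT-M1": None,
    "DT-L2": (88.40, 96.80, 7.24),
    "TA-H1": (97.30, 99.10, 1.26),
    "TA-H2": (97.60, 99.20, 11.89),
    "TA-H3": None,
    "TA-H4": (96.50, 97.74, 2.99),
    "TA-L1": (94.00, 97.70, 1.65),
    "TA-L4": (94.00, 97.90, 8.68),
    "TA-M2": (95.80, 98.80, 23.47),
    "TA-M4": (94.50, 97.50, 15.57),
    "X": (None, None, 9.72),
}


def total_share(groups):
    """Sum of the book shares over all groups that report one."""
    return sum(g[2] for g in groups.values() if g is not None)


def weighted_bounds(groups):
    """Share-weighted mean low bound and high point over rated groups only."""
    rated = [g for g in groups.values() if g is not None and g[0] is not None]
    weight = sum(share for _, _, share in rated)
    low = sum(lo * share for lo, _, share in rated) / weight
    high = sum(hi * share for _, hi, share in rated) / weight
    return low, high


if __name__ == "__main__":
    print(f"Shares add up to {total_share(FONT_GROUPS):.2f}%")
    low, high = weighted_bounds(FONT_GROUPS)
    print(f"Weighted low bound {low:.2f}%, weighted high point {high:.2f}%")
```

The listed shares add up to 100%, which is consistent with the groups marked under construction not yet having any books assigned.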

Evaluation of VERUS and AR

Research agreement with NovoDynamics

Preliminary evaluation on two data sets is promising

– Challenge: difficult to OCR, degraded images

– Normal: known to return acceptable accuracy

Set         VERUS     AR
Challenge   95%       5%
Normal      38.67%    61.33%
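The slide does not define what the chart percentages measure. One plausible reading, taken here as an ASSUMPTION, is the share of test samples on which each engine returned the better result. Under that reading the shares could be computed as follows. The function name and the scores in the usage example are made up.

```python
def win_shares(verus_scores, ar_scores):
    """Percentage of samples on which each engine scored strictly higher.

    Both lists hold one accuracy value per sample, in the same order.
    Ties are reported separately rather than credited to either engine.
    """
    if len(verus_scores) != len(ar_scores) or not verus_scores:
        raise ValueError("need two equally long, non-empty score lists")
    n = len(verus_scores)
    verus = sum(v > a for v, a in zip(verus_scores, ar_scores))
    ar = sum(a > v for v, a in zip(verus_scores, ar_scores))
    return {
        "VERUS": 100 * verus / n,
        "AR": 100 * ar / n,
        "tie": 100 * (n - verus - ar) / n,
    }


if __name__ == "__main__":
    # Hypothetical per-page accuracies, for illustration only.
    print(win_shares([0.90, 0.80, 0.70, 0.60], [0.80, 0.90, 0.90, 0.60]))
```

Counting wins per sample hides the size of each win, so it is usually reported together with the accuracies themselves.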

Encoding

Image on Text

Image-on-Text

Multilayered:

– Visible page image

– Hidden OCR text

View exact original layout while searching and highlighting

Supported by some OCR suites only

Supported formats: DjVu and PDF
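The two-layer idea can be sketched as a small data structure: the page image is what the reader sees, while each OCR word keeps its bounding box so that a search hit can be highlighted on the image. This is a simplified model with hypothetical class and field names, not the actual DjVu or PDF text-layer format.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    box: tuple  # (left, top, right, bottom) in image pixels


@dataclass
class Page:
    image_path: str                            # visible layer: the scanned image
    words: list = field(default_factory=list)  # hidden layer: OCR words

    def find(self, query):
        """Return the boxes of all words containing `query`, ignoring case."""
        q = query.casefold()
        return [w.box for w in self.words if q in w.text.casefold()]


if __name__ == "__main__":
    page = Page("page_0001.tif", [          # hypothetical file name
        Word("Bibliotheca", (100, 50, 320, 90)),
        Word("Alexandrina", (330, 50, 560, 90)),
    ])
    print(page.find("alex"))   # boxes a viewer would highlight on the image
```

Because the boxes refer to the original image, the viewer never has to re-render the text, which is why the exact original layout is preserved.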

Quality Assurance

No missing cover or pages

All pages are in order

Text quality

Image quality

PDF quality
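The first two checks (no missing pages, all pages in order) lend themselves to automation. A minimal sketch, assuming each scanned image has already been tagged with its page number and the expected page count is known from the catalog record. The function name is hypothetical.

```python
from collections import Counter


def check_page_sequence(page_numbers, expected_count):
    """Report missing pages, duplicated pages, and whether the order is ascending.

    `page_numbers` lists the page number of each image in file order.
    Pages are expected to run from 1 to `expected_count`.
    """
    counts = Counter(page_numbers)
    missing = [p for p in range(1, expected_count + 1) if p not in counts]
    duplicates = sorted(p for p, c in counts.items() if c > 1)
    in_order = all(a < b for a, b in zip(page_numbers, page_numbers[1:]))
    return {"missing": missing, "duplicates": duplicates, "in_order": in_order}


if __name__ == "__main__":
    print(check_page_sequence([1, 2, 4, 3, 3], expected_count=5))
```

Text, image and PDF quality are harder to judge mechanically and would still need sampling by a human reviewer.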

DAR

Digital Assets Repository

System Architecture

DAF/DAK APIs

Digital Assets Keeper (DAK)

Repository Database

Authentication and Authorization Subsystem

Users/groups/permissions Database

Storage Subsystem

Offline Storage

Online Storage

Integrated Library System

Catalog Database

User Interface

Administration Tool

Digitization Client

Archiving Tool

Cataloging Tool

Publishing Interface

OAI Gateway

Digital Assets Factory (DAF)

Digitization Database

Encoding Tool
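The slides give no detail on the OAI Gateway. ASSUMING it exposes the standard OAI-PMH protocol, a harvester would address it with requests of the form built below. The base URL is a placeholder, not DAR's real endpoint, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for `verb` with its arguments."""
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint, for illustration only.
    base = "https://dar.example.org/oai"
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListRecords", metadataPrefix="oai_dc"))
```

Identify and ListRecords are standard OAI-PMH verbs, and oai_dc is the Dublin Core metadata format that every OAI-PMH repository is required to support.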

DAK Publishing Module

Transfer of Digitized Books

Challenges

– Storage: CD vs Online

– Bandwidth: 10 Mbps vs 155 Mbps

– Copyright: not published

Actions:

– Transferred 8,500+ books to the Internet Archive

– Process is still going on
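The bandwidth challenge is easy to quantify. The sketch below estimates how long a bulk transfer takes at the two link speeds mentioned, using the 1,500 GB archive size from the statistics slide as an example volume. It ASSUMES decimal gigabytes and a fully utilised link by default, so real transfers would take longer. The function name is hypothetical.

```python
def transfer_days(size_gb, link_mbps, utilisation=1.0):
    """Days needed to move `size_gb` gigabytes over a `link_mbps` link.

    ASSUMPTIONS: 1 GB = 10**9 bytes, 1 Mbps = 10**6 bits per second, and
    the link carries only this transfer at the given utilisation (0-1).
    """
    bits = size_gb * 8e9
    seconds = bits / (link_mbps * 1e6 * utilisation)
    return seconds / 86_400


if __name__ == "__main__":
    for mbps in (10, 155):
        print(f"{mbps:>3} Mbps: {transfer_days(1500, mbps):.1f} days")
```

At 10 Mbps the example volume needs roughly two weeks of uninterrupted transfer, against less than a day at 155 Mbps, which explains why shipping physical media remains a serious alternative on the slower link.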

Books From India

Towards better collaboration

Books From India

Language          Number of Books
Arabic            832
Arabic + French     3
Arabic + German     1
Persian           101
French              2
English             1
Spanish             1
German              1
Total             942

Progress

Phase Name   Done as of November 1, 2006   Expected to be finished by   Comments
Cataloging   801                           -                            35 have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           -                            -
Publishing   171                           -                            -

Metadata Problems

Processing

OCR Using VERUS or AR?

Calculated accuracy for a small sample

– Images processed once with darkening effect and once without

– VERUS likes darkening, AR does not

– Overall, AR won 70% of cases

Cases won: VERUS 30%, AR 70%
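The slides do not say how accuracy was calculated. A common choice for OCR, used here as an ASSUMPTION, is character accuracy: one minus the edit distance between the OCR output and a manually verified reference text, divided by the reference length. The function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Share of reference characters recovered correctly, between 0 and 1."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


if __name__ == "__main__":
    # One wrong character out of four
    print(character_accuracy("abcd", "abxd"))
```

Running both engines on the darkened and the original version of each image and keeping the higher score per image would give the kind of per-case comparison summarised on this slide.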

Thank You