Million Book Project @ Bibliotheca Alexandrina Noha Adly 20 November 2006.

BA Digitization Workflow

Statistics - November 2006

Stage      Unit   Arabic     Latin      Total
Scanned    Books  22,023     4,646      26,669
Scanned    Pages  7,003,185  1,350,688  8,353,873
Processed  Books  21,947     4,642      26,589
Processed  Pages  6,987,392  1,348,900  8,336,292
OCRed      Books  16,652     4,600      21,252
OCRed      Pages  5,248,337  1,327,385  6,575,722

Total Archived Data: 1,500 GB
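
The page counts above can be turned into a quick progress check. The following is a minimal sketch, not part of the original slides: the dictionary layout and the function name `ocr_coverage` are illustrative assumptions, and only the numbers come from the table.

```python
# Page counts copied from the November 2006 statistics table.
PAGES = {
    "Arabic": {"scanned": 7_003_185, "processed": 6_987_392, "ocred": 5_248_337},
    "Latin": {"scanned": 1_350_688, "processed": 1_348_900, "ocred": 1_327_385},
}


def ocr_coverage(script: str) -> float:
    """Fraction of scanned pages for one script that have already been OCRed."""
    row = PAGES[script]
    return row["ocred"] / row["scanned"]


def total_pages(stage: str) -> int:
    """Sum one stage (scanned, processed, ocred) across both scripts."""
    return sum(row[stage] for row in PAGES.values())


for script in PAGES:
    print(f"{script}: {ocr_coverage(script):.1%} of scanned pages OCRed")
print("Total scanned pages:", total_pages("scanned"))
```

The totals computed this way agree with the Total column of the table, which is a useful sanity check on the extracted figures.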

Statistics (Contd)

Daily Rates

– Scan: ≈ 1800 pages/person

– Process: ≈ 1800 pages/person

– Latin OCR: ≈ 4000 pages/person

– Arabic OCR: ≈ 1500 pages/person

Five Minolta scanners

2 shifts – 7 days a week
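
As a rough illustration of what these rates imply, the sketch below estimates the person-days needed to OCR the pages that were processed but not yet OCRed as of November 2006. The backlog definition (processed minus OCRed) and the names `DAILY_RATE` and `person_days` are assumptions made for this example; the slides do not state staffing levels, so no calendar dates are derived.

```python
# Daily rates from the slide, in pages per person per day.
DAILY_RATE = {"scan": 1800, "process": 1800, "latin_ocr": 4000, "arabic_ocr": 1500}


def person_days(pages: int, rate: int) -> float:
    """Person-days needed to handle `pages` at `rate` pages/person/day."""
    return pages / rate


# Assumed backlog: pages already processed but not yet OCRed (November 2006 table).
arabic_backlog = 6_987_392 - 5_248_337
latin_backlog = 1_348_900 - 1_327_385

print(f"Arabic OCR backlog: {arabic_backlog:,} pages, "
      f"{person_days(arabic_backlog, DAILY_RATE['arabic_ocr']):,.0f} person-days")
print(f"Latin OCR backlog: {latin_backlog:,} pages, "
      f"{person_days(latin_backlog, DAILY_RATE['latin_ocr']):,.1f} person-days")
```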

OCR

Image to Text

OCR - Arabic

Poses unique challenges

– Written cursively, with blocks of connected characters

– A ‘block of characters’ can have more than one baseline.

– Uses external objects such as dots, 'Hamza' and 'Madda'.

– Diacritization

– Characters can have more than one shape according to their position

– Overlapping makes it difficult to determine the spacing

Sakhr Automatic Reader is used

Tricky with old books

Requires learning
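
The point about position-dependent shapes can be seen directly in Unicode, which encodes the isolated, final, initial, and medial glyphs of each Arabic letter as separate compatibility characters. The sketch below uses the letter BEH as an example; it only illustrates the script property and says nothing about how Sakhr Automatic Reader works internally.

```python
import unicodedata

# BEH (U+0628) and its four positional glyphs in Arabic Presentation Forms-B.
BEH = "\u0628"
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def positional_forms(chars):
    """Return (Unicode name, compatibility decomposition) for each glyph."""
    return [(unicodedata.name(c), unicodedata.decomposition(c)) for c in chars]


for name, decomposition in positional_forms(BEH_FORMS):
    print(f"{name}: {decomposition}")

# NFKC normalization folds all four shapes back to the single base letter,
# which is the mapping an OCR engine has to learn in reverse.
folded = {unicodedata.normalize("NFKC", c) for c in BEH_FORMS}
print(folded == {BEH})
```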

Arabic Script Is Cursive

Old, Smudgy, and Stuck Together

Use of Diacritics
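
In Unicode text, Arabic diacritics are combining marks that follow their base letter, so the same word can appear with or without them. The sketch below is an assumption-laden illustration of one common way to compare vocalized and unvocalized text, by dropping combining marks; the function name `strip_diacritics` is hypothetical and the slides do not say whether the project does this.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (category Mn) and keep base letters."""
    return "".join(c for c in text if unicodedata.category(c) != "Mn")


# KAF, TEH, BEH, each followed by FATHA (U+064E): a fully vocalized word.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)

print(len(vocalized), len(bare))
```

Note that this removes every nonspacing mark, not only short vowels, so it is suited to loose matching rather than faithful transcription.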

Font   Low Bound  High Point  % Books
AR-H1  97.70%     99.50%      0.24%
AR-H2  97.60%     99.50%      2.66%
AR-H3  97.04%     99.10%      8.01%
AR-H4  Under construction
AR-L4  92.70%     96.70%      6.62%
DT-M1  Under construction
DT-L2  88.40%     96.80%      7.24%
TA-H1  97.30%     99.10%      1.26%
TA-H2  97.60%     99.20%      11.89%
TA-H3  Under construction
TA-H4  96.50%     97.74%      2.99%
TA-L1  94.00%     97.70%      1.65%
TA-L4  94.00%     97.90%      8.68%
TA-M2  95.80%     98.80%      23.47%
TA-M4  94.50%     97.50%      15.57%
X      -          -           9.72%

16 Font Groups
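
One way to summarize the font-group figures is a book-share-weighted average of the accuracy bounds. The sketch below makes two assumptions that the slide does not confirm: that "% Books" is the share of books set in each font group, and that where two group names run together in the extracted text the values belong to the second name (AR-L4, DT-L2, TA-H4). Group X and the groups under construction have no accuracy figures and are left out.

```python
# (font group, low bound %, high point %, share of books %), as read from the slide.
GROUPS = [
    ("AR-H1", 97.70, 99.50, 0.24),
    ("AR-H2", 97.60, 99.50, 2.66),
    ("AR-H3", 97.04, 99.10, 8.01),
    ("AR-L4", 92.70, 96.70, 6.62),
    ("DT-L2", 88.40, 96.80, 7.24),
    ("TA-H1", 97.30, 99.10, 1.26),
    ("TA-H2", 97.60, 99.20, 11.89),
    ("TA-H4", 96.50, 97.74, 2.99),
    ("TA-L1", 94.00, 97.70, 1.65),
    ("TA-L4", 94.00, 97.90, 8.68),
    ("TA-M2", 95.80, 98.80, 23.47),
    ("TA-M4", 94.50, 97.50, 15.57),
]

LOW, HIGH, SHARE = 1, 2, 3  # tuple positions


def weighted_mean(column: int) -> float:
    """Average one accuracy column, weighting each group by its book share."""
    total_share = sum(g[SHARE] for g in GROUPS)
    return sum(g[column] * g[SHARE] for g in GROUPS) / total_share


print(f"Covered share of books: {sum(g[SHARE] for g in GROUPS):.2f}%")
print(f"Weighted low bound:  {weighted_mean(LOW):.2f}%")
print(f"Weighted high point: {weighted_mean(HIGH):.2f}%")
```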

Evaluation of VERUS and AR

Challenge Set: VERUS 95%, AR 5%

Normal Set: VERUS 38.67%, AR 61.33%

Research agreement with NovoDynamics

Preliminary evaluation on two data sets is promising

– Challenge: difficult to OCR, degraded images

– Normal: known to return acceptable accuracy
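
The slides do not say how accuracy was measured in this evaluation. A common choice for OCR is character accuracy based on edit distance against a hand-corrected reference, sketched below as an assumption; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy in [0, 1] relative to the reference length."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


print(char_accuracy("Bibliotheca", "Bibliotheea"))
```

Because the distance is computed over code points, the same function works for Arabic and Latin text alike.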

Encoding

Image on Text

Image-on-Text

Multilayered:

– Visible page image

– Hidden OCR text

View exact original layout while searching and highlighting

Supported with some OCR suites only

Supported formats: DjVu and PDF
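
The image-on-text idea can be sketched as a small data structure: the page image is what the reader sees, and each OCRed word carries a bounding box so that search hits can be highlighted on the image. The classes and field names below are hypothetical and are not the DjVu or PDF data model.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    box: tuple  # (x, y, width, height) in image pixels


@dataclass
class Page:
    image_path: str                            # visible layer: the scanned image
    words: list = field(default_factory=list)  # hidden layer: OCR text with positions

    def find(self, query: str) -> list:
        """Return the boxes of words containing `query`, for highlighting."""
        q = query.casefold()
        return [w.box for w in self.words if q in w.text.casefold()]


page = Page("page_0001.tif", [
    Word("Alexandria", (10, 20, 90, 14)),
    Word("library", (110, 20, 60, 14)),
])
print(page.find("alex"))
```

Keeping the text layer separate from the image is what lets the viewer preserve the exact original layout while still supporting search.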

Quality Assurance

No missing cover or pages

All pages are in order

Text quality

Image quality

PDF quality
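
The first two checks (no missing pages, all pages in order) lend themselves to automation. The sketch below assumes each scanned image has a page number, for example parsed from its file name, and that the expected page count is known from the catalog record; both are assumptions and the function name is hypothetical.

```python
def check_page_sequence(page_numbers, expected_count):
    """Report missing pages and adjacent pairs that are out of order."""
    missing = sorted(set(range(1, expected_count + 1)) - set(page_numbers))
    out_of_order = [
        (a, b) for a, b in zip(page_numbers, page_numbers[1:]) if b <= a
    ]
    return {
        "missing": missing,
        "out_of_order": out_of_order,
        "ok": not missing and not out_of_order,
    }


print(check_page_sequence([1, 2, 3, 4], expected_count=4))
print(check_page_sequence([1, 3, 2, 5], expected_count=5))
```

Text, image, and PDF quality are harder to check mechanically and would still need sampling by a reviewer.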

DAR

Digital Assets Repository

System Architecture

DAF/DAK APIs

Digital Assets Keeper (DAK)

Repository Database

Authentication and Authorization Subsystem

Users/groups/permissions Database

Storage Subsystem

Offline Storage

Online Storage

Integrated Library System

Catalog Database

User Interface

Administration Tool

Digitization Client

Archiving Tool

Cataloging Tool

Publishing Interface

OAI Gateway

Digital Assets Factory (DAF)

Digitization Database

Encoding Tool
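
The diagram lists an OAI Gateway among the components. Assuming this refers to the OAI-PMH harvesting protocol, a harvester would address it with plain HTTP requests built from a verb and a few parameters, as in the sketch below. The base URL is a placeholder and the helper `oai_request` is hypothetical; only the verb and parameter names come from the OAI-PMH standard.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real gateway address is not given in the slides.
BASE_URL = "https://example.org/oai"


def oai_request(base_url: str, verb: str, **params) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


# List all records as unqualified Dublin Core, the format every repository must offer.
print(oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
# Ask the repository to describe itself.
print(oai_request(BASE_URL, "Identify"))
```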

DAK Publishing Module

Transfer of Digitized Books

Challenges

– Storage: CD vs Online

– Bandwidth: 10 Mbps vs 155 Mbps

– Copyright: not published

Actions:

– Transferred 8,500+ books to the Internet Archive

– Process is still going on
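
The bandwidth challenge is easy to quantify. The sketch below estimates how long a bulk transfer would take at the two link speeds on the slide, using the 1,500 GB archive size reported in the November 2006 statistics as an example volume. It assumes decimal gigabytes, a fully utilized link, and no protocol overhead, so real transfers would be slower; the slides do not say how much data was actually sent.

```python
def transfer_days(size_gb: float, link_mbps: float) -> float:
    """Days needed to move `size_gb` gigabytes over a `link_mbps` link."""
    megabits = size_gb * 1000 * 8    # decimal GB to megabits
    seconds = megabits / link_mbps
    return seconds / 86_400          # seconds per day


ARCHIVE_GB = 1500  # total archived data from the statistics slide

for mbps in (10, 155):
    print(f"{mbps} Mbps: {transfer_days(ARCHIVE_GB, mbps):.1f} days")
```

At these rates the slower link needs about two weeks of uninterrupted transfer, which helps explain why shipping physical media is listed as an alternative.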

Books From India

Towards better collaboration

Books From India

Language         Number of Books
Arabic           832
Arabic + French  3
Arabic + German  1
Persian          101
French           2
English          1
Spanish          1
German           1
Total            942

Progress

Phase Name  Done as of November 1, 2006  Expected to be finished by  Comments
Cataloging  801                          -                           35 have metadata problems
Processing  742                          November 20, 2006
OCRing      200                          March 1, 2007
Encoding    171                          -                           -
Publishing  171                          -                           -

Metadata Problems

Processing

OCR Using VERUS or AR?

Calculated accuracy for a small sample

– Images processed once with darkening effect and once without

– VERUS likes darkening, AR does not

– Overall, AR won 70% of cases

VERUS 30%, AR 70%
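
The comparison described above can be sketched as a per-page tally: each page is OCRed by both engines with and without darkening, the best-scoring engine wins the page, and the wins are converted to shares. The accuracy values below are made up purely to exercise the code, and the function names are hypothetical; the project's actual sample and scoring are not given in the slides.

```python
from collections import Counter


def best_engine(scores: dict) -> str:
    """Pick the engine with the highest accuracy over both image variants.

    `scores` maps (engine, darkened) to an accuracy in [0, 1].
    """
    engine, _darkened = max(scores, key=scores.get)
    return engine


def win_shares(samples: list) -> dict:
    """Fraction of pages won by each engine."""
    wins = Counter(best_engine(s) for s in samples)
    total = sum(wins.values())
    return {engine: count / total for engine, count in wins.items()}


# Hypothetical accuracies for three pages (not project data).
samples = [
    {("VERUS", True): 0.91, ("VERUS", False): 0.88, ("AR", True): 0.90, ("AR", False): 0.95},
    {("VERUS", True): 0.93, ("VERUS", False): 0.85, ("AR", True): 0.80, ("AR", False): 0.89},
    {("VERUS", True): 0.87, ("VERUS", False): 0.84, ("AR", True): 0.86, ("AR", False): 0.92},
]
print(win_shares(samples))
```

Scoring each engine on its preferred image variant keeps the comparison fair when one engine benefits from darkening and the other does not.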

Thank You