Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Sawood Alam
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529

Fateh ud din B Mehmood
National University of Sciences and Technology, Islamabad, Pakistan

Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529


The Time Travel

OK Google, Define Dictionary
a book or electronic resource that lists the words of a language (typically in alphabetical order) and gives their meaning, or gives the equivalent words in a different language, often also providing information about pronunciation, origin, and usage.

Dictionaries Are Different
Read: random access
Write: maintain sort order
The most compact mode to preserve a language

Problem: English Dictionary

Johnson's English dictionary

Problem: Urdu Dictionary

Farhang-e-Asifiyah

Related Work

Unicode Collation
Ordered assembly of written information
Unicode values != natural collation
Arabic script: U+0600 to U+06FF
Out of order alphabets in derived languages
Common Locale Data Repository (CLDR)
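The point that Unicode values differ from natural collation can be illustrated with a small sketch. It uses the first six letters of the Urdu alphabet; the names `URDU_ORDER`, `RANK`, and `collation_key` are hypothetical and chosen for this example only, and a real system would more likely rely on CLDR collation rules than on a hand-written table.

```python
# First six letters of the Urdu alphabet in dictionary order:
# alef, beh, peh, teh, tteh, theh (written as escapes to avoid bidi issues).
URDU_ORDER = ["\u0627", "\u0628", "\u067E", "\u062A", "\u0679", "\u062B"]

# Hypothetical rank table: letter -> position in the alphabet.
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}


def collation_key(word):
    """Map each character to its alphabet rank; unknown characters sort last."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


letters = list(URDU_ORDER)

# Default string sorting compares code points, so peh (U+067E) and
# tteh (U+0679) are pushed behind teh (U+062A) and theh (U+062B).
by_code_point = sorted(letters)

# Sorting with the rank table restores the alphabet order.
by_alphabet = sorted(letters, key=collation_key)
```

Here `by_code_point` places peh and tteh at the end, because Urdu-specific letters were assigned higher code points in the Arabic block than the letters they precede in the alphabet.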

Collation Discrepancies
Compound letters
Diacritical marks
Half letters
Prefixes
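As an illustration of the diacritical marks item, one common way to keep marks from disturbing lookup is to strip combining marks before comparing words. This is a minimal sketch and not necessarily the approach taken in the presented work; `strip_marks` is a hypothetical helper name. The operation is lossy, and some languages treat a marked letter as a distinct letter, so it would need per-language review.

```python
import unicodedata


def strip_marks(text):
    """Remove nonspacing combining marks (Unicode category Mn) from text."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Arabic letter beh followed by the fatha mark reduces to the bare letter.
bare = strip_marks("\u0628\u064E")
```

A lookup could then compare `strip_marks(query)` with `strip_marks(headword)`, so that a word typed without marks still matches a fully marked dictionary entry.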

Nested Ordering
Root word sorting (Arabic)
  Morphological derivation
  Derived word simplification
Radicals and strokes (Chinese)

Indexing: Ordered Pages

Indexing: Sparse Index
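The slide body is a figure that did not survive in the transcript, so the following is only a sketch of how a sparse index could be queried. It assumes the index records the first headword of selected pages together with the page number; `SPARSE_INDEX` and `estimate_page` are hypothetical names, and the sample entries are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical sparse index: (first headword on the page, page number),
# kept sorted by headword.
SPARSE_INDEX = [("aardvark", 1), ("cat", 40), ("lamp", 95), ("quill", 140)]


def estimate_page(word, index=SPARSE_INDEX):
    """Return the last indexed page whose first headword is <= word."""
    keys = [head for head, _ in index]
    pos = bisect_right(keys, word) - 1
    # Words that sort before the first entry fall back to the first page.
    return index[max(pos, 0)][1]
```

With these sample entries, `estimate_page("dog")` returns 40, and the reader pages forward from there. Plain string comparison follows code point order, so for complex scripts the headwords and the query would need to be mapped through a collation key before the binary search.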

Indexing: Full Index

Indexing: Location Index

Indexing State Transition

Annotation

Digitization

Dictionary Explorer
Multilingual Multi-dictionary Lookup
Searching and Exploring
Annotation and digitization
User Contribution and Feedback
Open Source => GitHub:/urduweb/DictionaryExplorer

Dictionary Explorer: English

Dictionary Explorer: English

Dictionary Explorer: Urdu

Dictionary Explorer: Urdu

Indexing Time

Dictionary                  Pages   Index    Mode                Time
English to Urdu             180     Sparse   Manual and Script   10 minutes
Monolingual Urdu            2,500   Sparse   Manual              2 hours
Monolingual Classic Urdu    3,200   Full*    Crowdsource**       60 days

* 75,000 words, phrases, proverbs, and idioms
** 13 contributors

Prefix Permutations

Prefix: One

Prefix: Two

Prefix: Three

Prefix: Four

Prefix: Five

Prefix: Six
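The prefix slides above are figures whose content is not in the transcript. One plausible way to evaluate prefixes of length one to six, offered here purely as an assumption, is to count how many headwords share each prefix of a given length. `prefix_counts` is a hypothetical helper and the word list is made up.

```python
from collections import Counter


def prefix_counts(words, length):
    """Count how many words share each prefix of the given length."""
    # Words shorter than the requested length have no such prefix and are skipped.
    return Counter(w[:length] for w in words if len(w) >= length)


sample_words = ["cat", "car", "cart", "dog"]
two_letter = prefix_counts(sample_words, 2)
four_letter = prefix_counts(sample_words, 4)
```

In this sample, the prefix "ca" is shared by three words while the four-letter prefix "cart" matches only one, which shows how longer prefixes narrow the set of candidate matches.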

Conclusions and Future Work
Identified issues
  Too many matches
  Lack of fielded searching
  Lack of OCR support
  No input method assistance
Collation challenges
Accessibility levels: Ordered Pages, Sparse, Full, and Location indexes, annotation, and digitization
Implemented a multi-lingual multi-dictionary explorer
Effort and prefix evaluation
In future: elastic index and automatic region estimate
GitHub:/urduweb/DictionaryExplorer