پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل

Sarmad Hussain (سرمد حسین)

Collation Sequences and Related Issues for Pakistani Languages

Center For Research in Urdu Language Processing

National University of Computer and Emerging Sciences

Purpose of Presentation

► Briefly discuss character sets

► Discuss the Urdu collating sequence

► Propose a possible Urdu collation sequence

► Overview of collation for other languages of Pakistan

اردو (Urdu)

ل ف س ر د ج ا

م ق ش رھ دھ جھ آ

مھ ک ص ڑ ڈ چ ب

ں کھ ض ڑھ ڈھ چھ بھ

ںھ گ ط ز ذ ح پ

ن گھ ظ ژ   خ پھ

نھ   ع       ت

و   غ       تھ

وھ           ٹ

ہ           ٹھ

ة           ث

ء            

ی            

ے            

بلوچی (Balochi)

ل ف س ر د ج ا

م ق ش ڑ ڈ چ آ

ن ک ص ز ذ ح ب

و گ ض ژ خ پ

ہ ط ت

ء ظ   ٹ

ی   ع       ث

ے   غ      

ۓ          

پشتو (Pashto)

ل ف س ر د ج ا

م ق ش ړ ډ ب ځ

ن ک ښ ز ذ چ پ

ڼ ص ګ ژ   څ ت

و   ض ږ   ح ټ

ہ   ط     خ ث

ي   ظ        

ې   ع        

ۍ   غ        

ٸ            

ے            

پنجابی (Punjabi)

ل ف س ر د ج ا

لھ ک ش رھ دھ جھ ب

م کھ ص ڑ ڈ چ بھ

مھ ق ض ڑھ ڈھ چھ پ

ن گ ط ز ذ ح پھ

نھ گھ ظ ژ   خ ت

ڼ   ع       تھ

و   غ       ٹ

ہ           ٹھ

ء           ث

ی            

ے            

سندھی (Sindhi)

ل ف س ر د ج ا

لھ ڦ ش ڙ ڌ ڄ آ

م ق ص ڙھ ڏ جھ ب

مھ ڪ ض ز ڊ ڃ ٻ

ن ک ط   ڍ چ ڀ

نھ گ ظ   ذ ڇ ت

ڻ ڳ ع     ح ٿ

ڻھ گھ غ     خ ٽ

و ڱ         ٺ

ھ           ث

ہ           پ

ء            

ي            

Sources

► Urdu: Akhbar-e-Urdu (Special Supplement on Urdu Software; Jan-Feb. 2002), National Language Authority, Islamabad

► Balochi: Fax communication (Sept. 2002), Balochi Academy, Quetta

► Pashto: Fax communication (Sept. 2002), Pashto Academy, Peshawar

► Punjabi: Punjabi Qaida (Experimental), Punjabi Adabi Board, Lahore

► Sindhi: Sindhi Boli (July-Dec. 2001) and SLA Letter Circulation of Sindhi Collation (June 2002), Sindhi Language Authority, Hyderabad

اردو (Urdu)

آ ا ب پ ت ٹ ث ج چ ح خ

د ڈ ذ ر ڑ ز ژ

س ش ص ض

ط ظ ع غ

ف ق ک گ

ل م ن و ہ ء ی ے

Source: Urdu Qaida (اردو قائدہ), Ferozsons, Lahore

Urdu Alphabet: State of Affairs

Are the following letters of Urdu?

آ ، ٶ ، أ ، بھ پھ تھ ... ، ں ، ة ، لھ مھ نھ ںھ وھ

If yes, where are they placed in the alphabet?

Sources

► Data from eight dictionaries of Urdu

1. Feroz-ul-Lughat Jame (فیروزاللغات جامع), Ferozsons, Lahore (FLJ)

2. Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD)

3. Farhang-e-Talaffuz (فرہنگ تلفظ), National Language Authority (مقتدرہ قومی زبان), Islamabad (FT)

4. Jadeed Urdu Lughat (جدید اردو لغت), National Language Authority, Islamabad (JUL)

5. Urdu Lughat (اردو لغت), Urdu Dictionary Board (اردو لغت بورڈ), Karachi (UL)

6. A Dictionary of Urdu, Classical Hindi and English, Crosby Lockwood and Son, London (1911) (UHE)

7. Farhang-e-Asifiya (فرہنگ آصفیہ), Delhi (1918) (FA)

8. Noor-ul-Lughat (نوراللغات), Sang-e-Meel, Lahore (NL)

Urdu Alphabet: State of Affairs

FT, JUL, UL

ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و ہ ء ی ے

FLJ, NL

آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے

UHE, FA, STCD

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ھ ء ی ے

Conclusions: Urdu Character Set

► No general agreement on the Urdu character set among dictionary publishers

► The standard character set defined by the National Language Authority and the Urdu Dictionary Board is
   not traditional
   not well publicized
   not completely adopted

► The GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set

► UZT 1.01 will soon be fully represented in Unicode/ISO/IEC 10646

Character Set

► Alphabet

► Harakat (Aerab)

► Other Symbols

“Familiar” Harakaat (Aerab)

Jazm, Zabar, Zer, Pesh, Khari zabar, Khari zer, Ulta pesh

Do zabar, Do zer, Do pesh, Tashdeed, Noon ghunna

“Common” Other Symbols

Numbers

0 ۰   1 ١   2 ٢   3 ٣   4   5 ۵   6 ٦   7   8 ٨   9 ٩

Punctuation

؟ ؛ ٬ -

Honorifics

Other Symbols

Current GoP Standard: UZT 1.01

Logical Sections of UZT 1.01

► Alphabet (80 – 122)

► Aerab/diacritics/harakat (66 – 79, 123 – 126)

► Other characters
   Punctuation and arithmetic symbols (32 – 47, 58 – 65)
   Digits (48 – 57)
   Special symbols (160 – 176, 192 – 199)
   Miscellaneous
      ► Control characters (0 – 31, 127)
      ► Reserved control space (128 – 159, 255)
      ► Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
      ► Vendor area (208 – 239)
      ► Toggle character (254)
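The section layout above can be turned into a small lookup, which also makes it easy to check that the listed ranges cover the whole 8-bit code space. This is only a sketch: the code ranges are copied from the slide, while the section labels and the function name `uzt_section` are illustrative and not part of UZT 1.01.

```python
# Code ranges are taken from the slide; the labels are illustrative names only.
UZT_SECTIONS = [
    (0, 31, "control"),
    (32, 47, "punctuation/arithmetic"),
    (48, 57, "digit"),
    (58, 65, "punctuation/arithmetic"),
    (66, 79, "aerab"),
    (80, 122, "alphabet"),
    (123, 126, "aerab"),
    (127, 127, "control"),
    (128, 159, "reserved control"),
    (160, 176, "special symbol"),
    (177, 191, "reserved expansion"),
    (192, 199, "special symbol"),
    (200, 207, "reserved expansion"),
    (208, 239, "vendor"),
    (240, 253, "reserved expansion"),
    (254, 254, "toggle"),
    (255, 255, "reserved control"),
]


def uzt_section(code):
    """Return the logical section label for an 8-bit UZT 1.01 code value."""
    if not 0 <= code <= 255:
        raise ValueError("UZT 1.01 is an 8-bit code page: expected 0..255")
    for low, high, name in UZT_SECTIONS:
        if low <= code <= high:
            return name
    raise AssertionError("the ranges above should cover 0..255")


if __name__ == "__main__":
    print(uzt_section(80))    # first code in the alphabet section
    print(uzt_section(0x41))  # hex 41 falls in the 58-65 punctuation range
```

Because the ranges are contiguous, every value from 0 to 255 maps to exactly one section.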

Urdu Collation Sequence

► How do the following figure in?
   Basic letters
   Other letters
   Basic aerab
   Other aerab
   Others

► Arguments should be consistent and simple

Character vs. Phoneme

► Character = written content = letters

► Phoneme = linguistic content

► In the word “phone”:
   5 characters = p h o n e
   3 phonemes = f o n

Urdu Collating Sequence: Letters

What is the status and sequence of the following characters?

آ ا ، ٶ أ ، ن ں ، ہ ھ ، ہ ت ة ، ی ے

آ ا Variation

► FLJ

آب (= ا ا ب)، آپ (= ا ا پ)، اب، ایوان

► FT, JUL, UL

اب، ایوان، آب (= ا ا ب)، آپ (= ا ا پ)

• آ = ا ا

• stylistic variation of ا ا

• adds a character to a single alif

• not a character in the pure sense

► STCD, UHE, FA, NL

(all under ا) آب، آپ، اب، ایوان
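The reading "آ = ا ا" can be sketched as an expansion step before comparison: each آ is replaced by two alifs, and the word is then compared letter by letter. This is only an illustration under stated assumptions. The `ORDER` list contains just the letters needed for the four example words, and the names `expand` and `primary_key` are hypothetical.

```python
import unicodedata

# Illustrative letter order: only the letters used by the four example words.
ORDER = ["ا", "ب", "پ", "ن", "و", "ی"]
RANK = {letter: index for index, letter in enumerate(ORDER)}

ALIF_MADDA = "\u0622"  # آ
ALIF = "\u0627"        # ا


def expand(word):
    """Rewrite every alif madda as two alifs (the 'آ = ا ا' reading)."""
    word = unicodedata.normalize("NFC", word)  # compose alif + madda if decomposed
    return word.replace(ALIF_MADDA, ALIF + ALIF)


def primary_key(word):
    """Letter-by-letter ranks of the expanded word."""
    return [RANK[letter] for letter in expand(word)]


words = ["اب", "ایوان", "آب", "آپ"]
ordered = sorted(words, key=primary_key)

if __name__ == "__main__":
    print(ordered)
```

With this expansion the two آ words sort ahead of اب and ایوان, because a second alif ranks below ب and ی. Treating آ as a separate letter placed after ا would instead move both آ words to the end of the list.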

ٶ أ Status

► Not a character in ANY dictionary, including dictionaries by
   National Language Authority
   Urdu Dictionary Board

► Has the same bearing on collation sequences as ا ء and و ء

► Included in UZT 1.01 as per terms of reference given by NLA

► May be made by combination of ء followed by و ، ا

► Should be taken out of UZT 1.01 in its next version

ن ں Variation

► FLJ, FT, STCD, NL, FA, UHE

ماں، مان

► JUL, UL

مان، ماں

• ں is a vowel modifier which nasalizes the vowel but DOES NOT add any “phonemic content”

• not a phoneme, but is a character

• does not represent any other character or combination

• written adjacent to ن

• lighter goes up! ں would come before ن

ما = C V

ماں = C V

مان = C V C

ہ ھ Variation

► FLJ, UHE, FA, NL (ھ then ہ; بھ not a character)

باپ، بھابی، بہن، بہنگی، بھنگی، بیٹا

► STCD (ہ then ھ; بھ not a character)

باپ، بھابی، بہن، بھنگی، بہنگی، بیٹا

► FT, JUL, UL (بھ a character)

باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی

ہ ھ Variation

► Just as ں is a vowel modifier, ھ is a consonant modifier and DOES NOT add any “phonemic content”

• as with ں, ھ is not a phoneme

• written adjacent to ہ

• lighter goes up!

• would come before ہ

ب = C

بھ = C

بہ = C V C

بھ، پھ، ۔۔۔ Status as “Character”

► The Urdu Dictionary Board and the National Language Authority assert that these are phonemes, and that each character combination should therefore be made a character

► If character combinations which are phonemes are to be promoted to characters, then to be consistent the following combinations should also be made characters: یں، وں، اں

► However, it is common in languages for character combinations to represent phonemes: p h → f (in English), so پ ھ → پھ (in Urdu)

► Even if it is not a phoneme, ھ may remain a character, like ں

► بھ، پھ، ۔۔۔ are not characters but character combinations
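Whether بھ, پھ and the like count as characters or as combinations, the FT/JUL/UL ordering shown earlier can be reproduced by treating each aspirated pair as one collation element (a contraction) at comparison time. The sketch below assumes the FT/JUL/UL alphabet as read from the earlier slide. The helper names `elements` and `contraction_key` are hypothetical.

```python
# Alphabet order as read from the FT/JUL/UL slide; aspirated pairs are single entries.
ALPHABET = (
    "ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ "
    "س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و ہ ء ی ے"
).split()
RANK = {element: index for index, element in enumerate(ALPHABET)}

DO_CHASHMI_HE = "\u06be"  # ھ


def elements(word):
    """Split a word into collation elements, joining consonant + ھ pairs."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair[1] == DO_CHASHMI_HE and pair in RANK:
            out.append(pair)   # contraction: two code points, one element
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def contraction_key(word):
    """Ranks of the collation elements, compared left to right."""
    return [RANK[element] for element in elements(word)]


words = ["باپ", "بھابی", "بہن", "بہنگی", "بھنگی", "بیٹا"]
ordered = sorted(words, key=contraction_key)

if __name__ == "__main__":
    print(ordered)
```

All ب words come out before all بھ words, matching the FT, JUL, UL listing. The stored text still holds ب and ھ as two separate characters, so the contraction is purely a property of the sort key.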

ة Status as “Character”

► Not a character in ANY dictionary, including dictionaries by
   National Language Authority
   Urdu Dictionary Board

► Stylistic variation of ت (e.g. STCD, NL, …): زکوة، زکوت

► Not a character

ی ے Variation

► FLJ, FT, JUL, UL, NL

بی، بی بی، بے، بیابان

► STCD, UHE, FA

بی، بے، بیابان، بی بی

► Middle ے or ی predicament

بے کار = بیکار، ٹیلی وژن = ٹیلیوژن

ی ے Variation

► Like ا، و، ی the character ے is a vowel (phoneme)

► Unlike ں, ے is not a vowel modifier, because ے behaves differently from ں
   ے replaces ی: بی → بے
   ں adds onto ا: ما → ماں

► Placed at the end of the alphabet (based on traditional collation)

► Collated as “heavier” than ی at ligature endings but “equal to” ی ligature-medially

Role of Aerab in Sorting

► Aerab are ignored in the first (primary) pass of sorting an Urdu string
   only characters are considered

بہار، بہانہ، بہائی (= بہاءی)

► However, aerab are relevant in the second pass, when the first pass gives an exact match

بن، بن، بن

سن، سن، سن

Vocalic Aerab - Zabar, Zer, Pesh

► FT, FLJ, JUL, UL

بن، بن، بن

بیر، بیر، بیر، بیر

► STCD

بن، بن، بن

سن، سن، سن

بیر، بیر

بہر، بہر، بہر، بہر، بہر (UL)

Vocalic Aerab – Khari Zabar

► No effect at primary-level sorting

اعلا، اعلان، اعلم، اعلی، موسی

► No minimal pairs were found at the secondary level, so its involvement could not be determined

Consonantal Aerab - Tashdeed

► Ignored at primary level (FT, UL, NL, …)

► Affects secondary-level sorting
   “heavier”
   lighter goes up

رانارانابب اناانابر بر رایارایاب ب

بدیبدی بدی بدی بدیا بدیا

پتاپتا تا تاپ پ پتاپتا

Ligature-Break (Half Space)

► Hex 41 (UZT) and Hex 200B (Unicode)

► Ignored at primary level and secondary level

ٹیلیوژن ، ٹیلی وژن

ٹیلیفون ، ٹیلی فون

بے کار ، بیکار

► But given each pair, which word comes first?
   Tertiary-level decision

► lighter goes up!

► single word without break comes first?

Word-Break (Normal Space)

► Ignored at primary level?

► American Heritage Dictionary (2nd Collegiate ed.)
   black art
   black bear
   blackberry
   black box
   blacken
   Black Death
   black gold

► Space ignored at primary level
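The effect of ignoring the space can be seen by sorting a few of the dictionary entries above in two ways. The sketch uses only five of the listed entries. The key function `letter_by_letter` is an illustrative name, and plain code-point comparison stands in for a real collation table.

```python
# A subset of the dictionary entries listed on the slide, in slide order.
entries = ["black art", "black bear", "blackberry", "black box", "black gold"]


def letter_by_letter(entry):
    """Primary key that ignores word breaks (and letter case)."""
    return entry.replace(" ", "").casefold()


# Default string comparison: the space (U+0020) sorts before every letter,
# so all two-word entries come ahead of "blackberry".
word_by_word = sorted(entries)

# Space ignored at the primary level: "blackberry" falls between
# "black bear" and "black box", as in the dictionary.
ignoring_space = sorted(entries, key=letter_by_letter)

if __name__ == "__main__":
    print(word_by_word)
    print(ignoring_space)
```

Only the second ordering matches the dictionary's sequence for these entries, which supports treating the word break as ignorable at the primary level.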

Word-Break (Normal Space)

► FLJ, UL

1. بانگ

2. بانگ درا

3. بانگ دینا

If sorting is done at word break then 1, 3, 2

So sorting ignores word break

Conclusions: Urdu Character Set

آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ھ ہ ء ی ے

• Two levels of characters

• Core characters

• Non-core characters

Conclusions: Urdu Collating Sequence

► Multi-level complex problem

► Pre-processing
   Contractions (ب ھ → بھ)
   Insert unwritten aerab

► Primary level
   characters

► Secondary level
   aerab
   Others (?)

► Tertiary level
   Ligature break
   Others (?)

► Ignorable
   Space
   secondary aerab (?)
   Symbols (?)
   Others (?)
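The levels listed above can be combined into a single sort key that is compared level by level. The following is a rough sketch and not the proposed standard. It assumes the letter order as reconstructed from the character-set conclusions slide, uses Unicode code points as placeholder weights for aerab (their relative order is left open in the slides), takes U+200B as the ligature break, and resolves the open tertiary question as "unbroken word first". It skips the pre-processing steps, since inserting unwritten aerab would need a lexicon. The function name `sort_key` is hypothetical.

```python
import unicodedata

# Assumed letter order, reconstructed from the character-set conclusions slide.
ALPHABET = unicodedata.normalize(
    "NFC",
    "آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ "
    "ف ق ک گ ل م ں ن و ھ ہ ء ی ے",
).split()
PRIMARY = {letter: index for index, letter in enumerate(ALPHABET)}

LIGATURE_BREAK = "\u200b"  # Unicode value given on the ligature-break slide


def sort_key(text):
    """Return (primary, secondary, tertiary) for level-by-level comparison."""
    primary = []    # letter ranks only
    secondary = []  # aerab attached to each letter, () when there are none
    breaks = 0      # number of ligature breaks (tertiary level)
    for ch in unicodedata.normalize("NFC", text):
        if ch in PRIMARY:
            primary.append(PRIMARY[ch])
            secondary.append(())
        elif unicodedata.category(ch) == "Mn" and secondary:
            # Aerab are nonspacing marks; the code point is a placeholder weight.
            secondary[-1] += (ord(ch),)
        elif ch == LIGATURE_BREAK:
            breaks += 1
        # Spaces and any other symbols are treated as ignorable.
    return tuple(primary), tuple(secondary), breaks


plain = "پتا"
geminated = "پت\u0651ا"       # same letters with tashdeed on the second one
joined = "ٹیلیوژن"
broken = "ٹیلی\u200bوژن"      # same word with a ligature break
spaced = "بانگ درا"

if __name__ == "__main__":
    print(sorted([geminated, plain], key=sort_key) == [plain, geminated])
    print(sorted([broken, joined], key=sort_key) == [joined, broken])
```

Because Python compares tuples element by element, the secondary level is consulted only when the primary levels are identical, and the tertiary level only when both match. That mirrors the "second pass on exact match" behaviour described for aerab, with "lighter goes up" falling out of the empty tuple sorting first.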

What Needs to be Done for Urdu

► Debate and standardize
   Character set

► Develop a computational model to implement sorting
   Culturally acceptable
   Collation Element Table to generate sort keys

► Standardize and publicize this computational model for Urdu sorting

What Needs to be Done

► Take national standards to international forums: Unicode/ISO

► Complete similar work for all other local languages of Pakistan
   Character set
   Script
   Collating sequence

Relevant National and Provincial Government Organizations

► National
   Urdu and Regional Languages’ Software Development Forum (URLSDF), Ministry of Science and Technology (MoST), Islamabad
   National Language Authority (NLA), Islamabad (Urdu)
   Pakistan Standards and Quality Control Authority (PSQCA), Karachi

► Provincial
   Balochi Academy, Quetta
   Pashto Academy, Peshawar
   Punjabi Adabi Board, Lahore
   Sindhi Language Authority (SLA), Hyderabad

Thank you (شکریہ)