Final project in the course Introduction to Artificial Intelligence:

Natural Language Identification

Lior Resisi, lresisi, 201564895

Mika Barkan, mikab4, 200790111

23 March 2014

Table of contents

I Introduction
II The language identification task and related work in the field
III Collecting the data for the training material and the test material
    1 Choice of languages
    2 Source of the texts and handling of the training material
    3 Handling diacritics
IV Algorithms
    4 The naive approach: Zipf's law
        4.1 Plain counting
        4.2 The "Borda Count" method
    5 Statistical approach
        5.1 By n-grams
            5.1.1 Unigram
            5.1.2 Frequency of a letter at the start of a word
            5.1.3 Frequency of a letter at the end of a word
            5.1.4 Bigram (letter pairs)
        5.2 How the vectors of each language are merged
            5.2.1 Uniform weight
            5.2.2 Choosing the maximum value
        5.3 How the distance between the language vector and the tested vector is measured
            5.3.1 "Simple" distance
            5.3.2 Euclidean distance
            5.3.3 Similarity by the cosine of the angle between the vectors
            5.3.4 Kullback-Leibler distance (symmetric and asymmetric versions)
            5.3.5 Sum of rank differences (Ranks)
            5.3.6 Infinity norm
    6 Decision trees
        6.1 By n-grams
            6.1.1 Unigram
            6.1.2 Frequency of a letter at the start of a word
            6.1.3 Frequency of a letter at the end of a word
            6.1.4 Bigram (letter pairs)
        6.2 How we partitioned the values into groups
        6.3 The fitness (gain) functions
            6.3.1 By Entropy
            6.3.2 By Information Gain
            6.3.3 By Information Gain Ratio
            6.3.4 By Gini Gain
            6.3.5 By Train Error
        6.4 The algorithm for building the tree
        6.5 The algorithm for classifying the language (classification)
V Results
    7 Naive approach (Zipf's law)
    8 Statistical approach
        8.1 By the method of measuring the distance between the vectors
        8.2 By n-gram
        8.3 Learning rate (computation time)
        8.4 Consistency and overfitting
        8.5 The recall, precision and F1 measures
    9 Decision trees
        9.1 By n-gram with respect to the fitness (gain) functions
        9.2 Learning rate (computation time)
VI Discussion, conclusions and thoughts for the future
VII References
VIII Appendix A: the main statistical results in detail
IX Appendix B: the main decision-tree results in detail

Part I

Introduction

The Latin alphabet is one of the most widespread writing systems in the world, and it is used to write more than 120 different languages from different language families (Romance, Germanic, Slavic, etc.). It includes 26 basic letters, to which certain languages have added special marks ("diacritics") intended to adapt the alphabet to the language. The widespread use of this alphabet raises an interesting and challenging computational problem: given a text written in Latin letters, can we identify the natural language in which it is written, and if so, how?

This problem touches on many fields (such as linguistics, computational linguistics, cognitive science and many branches of computer science) and raises many further questions. For example, one could discuss Chomsky's linguistic views on universal laws of natural languages, or cognitive questions such as how we acquire a language or learn to recognize words. At the same time, it is important to remember that the problem is not purely theoretical: it has many practical uses, such as making text search and document handling more efficient, serving as a first step in machine translation, and so on. Statistical analysis of the differences between languages, and in particular analysis of letter frequencies in different languages, can also help in cryptanalysis and information security (for example, with substitution ciphers and classical ciphers, as part of encrypting and decrypting texts, or in identifying patterns in the original document versus the decrypted one), in coding and compression of information (such as Huffman coding), and so on.

Our work deals with a simplified version of the natural language identification problem (a sub-topic of NLP, which is a field of great interest in the AI world). In it we describe ways to learn to identify automatically the language in which a given text is written, from among a predefined set of languages that are written in Latin letters:

Italian      German      Spanish      Romanian
Indonesian   Hungarian   Polish       Swedish
English      Turkish     Portuguese
Afrikaans    Latin       French

The basic idea that guided us throughout the project was that every language has a characteristic pattern in the distribution of letters across its words. For example, we expect the relative frequency of single letters (unigrams) or of letter pairs (bigrams) to differ from language to language, and the frequency of a given letter at the start or at the end of a word to depend on the language, so that it can indicate the language and help identify it. In light of this idea, we decided to try to learn these patterns for each of the 14 languages above, and to check whether we can predict the language of a given text based on the data and the patterns we learned.

Our work is in dialogue with the work "Natural Language Detection", written by Moran and Tamar in the 2010-2011 academic year, which tackled a similar task. We try to attack the problem in a different way and to challenge the results achieved by the original authors, using many additional tools from artificial intelligence that should help us reach even better results. Like the original authors, we approach the problem with supervised learning methods (statistical tools and decision trees) and try to build language classification functions from the training material. The essential difference between the two works lies in the choice of the features by which the language is identified, which can lead to large differences in the decision trees and the statistical tools that are used, as well as in the optimal way of computing and in the algorithms we chose. In addition, we apply computational and mathematical ideas beyond those applied by the original authors, and we also extend the classification task: instead of a classifier that chooses among "only" six languages, we try to identify a language from a wider range of languages written in the Latin alphabet (which of course complicates the task considerably and requires many additional parameters to be taken into account).

The code for the project was written in Python 3.0. The vast majority of it was written by us, since we wanted full control over the algorithms and the materials we worked with, and did not want to rely on ready-made tools (such as NLTK) that could have limited our flexibility. All of the code (including run instructions in the file README.txt) is available at the following address:

http://tinyurl.com/qda6f5h

Part II

The language identification task and related work in the field

Feature selection is an essential and decisive part of identifying the language of a given text. In our work we use linguistic features, so that we can attack the problem with different supervised learning methods that rely on those features. As part of the process we examine the frequencies of letter unigrams and bigrams in each of the languages studied. That is, we find the probability that a given letter or letter pair occurs in a given language, and check whether we can infer (or learn) from it the language in which the text is written. Additional features that we examine, which the original work did not address, include the frequency of a given letter at the start or at the end of a word (for example, the letter "a" at the end of a word is common in Italian and rare in English), and so on.

Unlike in linguistics, we try to learn these features automatically (some of them are known and have been studied by linguists and NLP researchers), and we use them both for a "simple" statistical test (vector comparison) and for building decision trees with various gain functions (such as entropy, information gain, etc.). In this work we try to challenge the methods used in the original work, and we use many additional methods that the original work did not examine. For example, we check whether the results obtained previously with the infinity norm for comparing languages can be improved, and we also use additional tools such as the Kullback-Leibler distance. Ultimately we want to reach better results by applying tools from artificial intelligence and creating a set of rules and patterns learned from the training material.

The naive way to identify the language of a given text is to look up the words of the text in a dedicated database that contains all the words of each supported language (that is, to maintain a dictionary for each supported language, look up each word of the text in it, and return the "closest" language). However, such a naive solution is very expensive in space, and searching such a huge database (for each language) may take a very long time, which hurts efficiency dramatically. A possible improvement can use the observation of George Zipf¹, who showed that the distribution of word frequencies is quite similar in all languages, with a short "body" and a long "tail": for example, the 10 most common words in English cover about 25% of the occurrences in a typical text, and the 100 most common words already cover about 45%. The database can therefore be reduced to a "compact" one that contains a list of a few dozen (or a few hundred) of the most common words in each language, or distinctive connective words that characterize the language. Such an approach, although significantly less complicated and far less space-hungry, is not really "interesting" from an artificial intelligence point of view. Moreover, it is not useful when we want to identify single words, or a sequence of a few words, that do not include prepositions or very common words. We therefore want a different way, one that relies on given training material and uses it to build a classifier that identifies the language of the given text with as high a probability as possible. Nevertheless, we also briefly examined the naive approach in our work.

Learning is the most natural and interesting way to solve the language identification task, and it was also discussed in our introductory courses on computational linguistics. Many automatic NLP tools, such as Google Translate, are based on learning and statistics. Many studies in computational linguistics likewise "go in this direction", and we want to continue and check, on a small scale of course, what results we can reach.

¹ George K. Zipf (1949), Human Behavior and the Principle of Least Effort, Addison-Wesley.

Part III

Collecting the data for the training material and the test material

In approaching the task, we had to make several very significant decisions:

1 Choice of languages

One of the original goals we set ourselves was to increase considerably the number of languages we can identify (compared with six languages in the original work). The choice of how many languages and which ones is very significant, since it strongly affects the chosen strategy as well as the volume of training material. Moreover, every language chosen may "add" new letters that are unique to it (such as å in Swedish). On the one hand this can make identification easier, but on the other hand it can considerably increase the number of unigrams and (worse) bigrams, and thereby significantly increase the size of the tree, that is, dramatically lengthen the running time and enlarge the files we create. Another possible problem is mainly technical: obtaining training material in more "exotic" languages may be harder. A further problem concerns similarity between languages: for example, Afrikaans and Dutch are very similar, which could have made identification harder. After some deliberation we decided on the 14 languages listed above. They include the 6 languages of the original work, to which we added interesting languages: "dead" languages and languages with a shared historical origin (for example, Italian and Latin), and languages from different language families (for example, Romanian is a Romance language, Swedish is a Germanic language, Polish is a Slavic language, etc.).

2 Source of the texts and handling of the training material

One of the important goals we set ourselves at the start of the project was to be able to identify the language of a text regardless of its source: whether it comes from a newspaper, a poetry book, a Wikipedia entry or a blog. Since the source and type of a text can strongly affect its vocabulary, its letter distributions, and the lengths of its sentences and words, it was important for us to obtain training material from different sources, covering as many language registers as possible. In addition, we did not want to settle for random texts that might bias the results. We wanted training material that is as reliable and high-quality as possible, at a uniform level and of similar scope across the languages: both in accessible languages, in which material is easy to obtain (such as English, German and French), and in languages in which material is less accessible (such as Indonesian, Turkish and Afrikaans). It was also hard for us to assess the quality of the material, since we do not speak these languages. After a thorough examination and a search through a very wide variety of texts, we found that the best source for this purpose was downloading e-books in the various languages from the web. We made sure to choose books from different categories (fantasy, history, poetry, etc.), and thereby ensured that learning is carried out on different language registers.

We split the texts into sentences and kept only sentences of at least five words, since we were concerned that shorter sentences do not contain enough letters to form a reliable profile that resembles any language. We made sure that the training material in each language includes at least 2000 different sentences. In addition, we converted all letters to lower case.

To make the test material "independent" and as varied as possible, we took it from an entirely different source: Wikipedia. For each language, the test material consisted of sentences from the Wikipedia entry (in the desired language) of the country in which the tested language is the official language. The thinking was that the information there would be reliable and would contain very little foreign-language content (for example, it would not contain many names or scientific terms in English, but rather terms and concepts in the language itself). For example, for Spanish we used the entry "Spain", and for Afrikaans we used the entry "South Africa".

3 Handling diacritics

Many of the languages written in Latin letters have additional letters beyond the 26 familiar from English. These marks, called diacritics, are orthographic marks that affect the pronunciation of the word, and they are formed by adding some mark above or below the "regular" letter. For example, the letter a can become ä, á, ă, ǎ, etc. A diacritic can change not only the pronunciation of a word but also its meaning (for example, in German the word "schön" means "beautiful", while the word "schon" means "already").

The decision on how to treat diacritics can significantly affect the project, and can make language identification easier (or harder). For example, the letter ß clearly indicates that the tested language is German, while the absence of diacritics hints at English. However, even if we wanted to remove the diacritics, we would still have to decide how: one could use a conversion table to a single letter (for example, turning ä into a plain a) or to two letters (for example, turning ä into ae or ß into ss, as is customary in German), or even ignore these marks entirely. As noted, any such change can considerably affect the distributions and all of the vectors we use, and change the results of the project. Beyond that, removing diacritics has no linguistic value, since we change the language completely and invent words.

After much deliberation, which included consulting several linguists and computational linguistics researchers, we decided to examine the results in two ways: with diacritics and without them. The diacritics were removed using Python's unicodedata library and the NFD format (Normalization Form Canonical Decomposition). This lets us test our hypothesis (that diacritics make language identification much easier), and also lets us compare our results with the original work, which removed the diacritics.
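As a minimal sketch of the removal step described above: the text only states that `unicodedata` and NFD were used, so the filtering of combining characters after decomposition, and the function name `strip_diacritics`, are assumptions made here for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop the combining marks (assumed filtering step)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("schön"))   # the umlaut decomposes into o + combining mark
print(strip_diacritics("Straße"))  # ß has no canonical decomposition and is kept
```

Note that under this sketch letters such as ß survive, since NFD only separates base letters from their combining marks; a conversion such as ß to ss would need a separate table.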

Part IV

Algorithms

As noted, our work emphasizes the statistical approach and decision trees, but we also chose to examine briefly the naive approach, which is based on Zipf's law.

4 The naive approach: Zipf's law

Zipf's law is an empirical law: if we rank the words of some natural language by how frequently they are used (in descending order of frequency), we find that the frequency of the i-th word is proportional to $1/i$:

$$occurrences(w_i) = \frac{K}{i}$$

where $occurrences(w_i)$ is the number of occurrences of the word ranked i-th by frequency, and $K$ is some constant.

As stated in the introduction, this naive approach is not really "interesting" from an artificial intelligence point of view, since there is no learning here, only a simple search in a word list. Nevertheless, we saw fit to present it as a reference point, and felt it would be wrong to discuss automatic language identification without mentioning this method.

In the first stage we compiled the list of the most common words in each of the languages examined (except Latin), and then kept only the x most common words in each language, where x ∈ {10, 20, 50, 100, 500, 1000}. We then went over every word in the tested text and looked it up in the list of the x most common words. Depending on the method we worked with (plain counting, or a method resembling Borda Count), we gave the word a score (a positive score when the word appeared in the list of common words, and a negative score when it did not). The idea is that the more words from a language's common-word list appear in the test file, the greater the chance that the text is written in that language. We therefore returned the language that received the highest score.

As noted, we worked with two scoring methods:

4.1 Plain counting

In this case, we looked up every word of the tested text in the list of the x most common words. If the word appeared there it received a score of 1, and otherwise a score of −1. At the end of the pass we summed the scores of all the words that appeared in the test file, and returned the language that received the highest total.

4.2 The "Borda Count" method

In this case, we looked up every word of the tested text in the list of the x most common words. If the word appeared in the list at frequency rank i (out of the x most common words), it received a score of x − i (that is, the more frequent the word, the higher its score), and otherwise a score of −1. At the end of the pass we summed the scores of all the words that appeared in the test file, and returned the language that received the highest total. The rationale for this scoring method was to give priority to the more common words in the language, which, by Zipf's observation, are indeed more frequent and therefore indicate membership in the language more reliably. In addition, a given word may appear in several languages, and in that case we prefer the language in which the word is more common.
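The two scoring methods can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the word lists are tiny hypothetical examples, ranks are taken as 1-based, and tokenization is a plain whitespace split.

```python
def plain_score(words, common):
    """+1 for each word found in the common-word list, -1 otherwise."""
    common_set = set(common)
    return sum(1 if w in common_set else -1 for w in words)


def borda_score(words, common):
    """x - i for a word at rank i (1-based, assumed) among x common words, else -1."""
    x = len(common)
    rank = {w: i for i, w in enumerate(common, start=1)}
    return sum(x - rank[w] if w in rank else -1 for w in words)


def identify(text, common_words, score=plain_score):
    """Return the language whose common-word list yields the highest score."""
    words = text.lower().split()
    return max(common_words, key=lambda lang: score(words, common_words[lang]))


# Hypothetical toy lists, ordered from most to least frequent
common_words = {
    "english": ["the", "of", "and", "to"],
    "german": ["der", "die", "und", "in"],
}
print(identify("the cat and the dog", common_words))
print(identify("der Hund und die Katze", common_words, score=borda_score))
```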

5 Statistical approach

In this approach we learned recurring patterns in the different languages by analyzing the training material and turning it into vectors according to various parameters. We merged the vectors obtained from the different examples in each language into a single typical vector that represents the language. We then analyzed the test material in the same way and built a vector from it as well. The language identification task was thus "translated" into a comparison between the resulting vector (which represents the test material) and the typical vector of each of the languages examined. In the classification stage we measured (with different methods, described below) the distance between the tested vector and each of the languages' representative vectors, and returned the language whose pattern, as learned from the training material, was the most similar.

Each stage of the process described above was carried out in several different ways. The Cartesian product of the measures, the merging method and the distance measurement method amounts to a great many different tests for every text examined. At the end of the process, we compared the result produced by each combination with the true language and grouped the results by measure, so that we could determine which of the measures gives the most accurate approximation.

n-gram                              Diacritics   Merging of each language's vectors   Distance between language vector and tested vector
Unigram (letter frequency)          With         Uniform weight                       "Simple" distance
Letter frequency at start of word   Without      Choosing the maximum value           Euclidean distance
Letter frequency at end of word                                                       Similarity by the cosine of the angle
Bigram (letter pairs)                                                                 Asymmetric Kullback-Leibler distance
                                                                                      Symmetric Kullback-Leibler distance
                                                                                      Sum of rank differences
                                                                                      Infinity norm

Table 1: The cross-sections examined in the statistical approach

5.1 By n-grams

The different measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters across its words:

5.1.1 Unigram

We carried out the process described above, where the measure examined was the relative frequency of each single letter in the language, in each of the training texts.

5.1.2 Frequency of a letter at the start of a word

We carried out the process described above, where the measure examined was the frequency of each letter as the first letter of a word, in each of the training texts.

5.1.3 Frequency of a letter at the end of a word

We carried out the process described above, where the measure examined was the frequency of each letter as the last letter of a word, in each of the training texts.

5.1.4 Bigram (letter pairs)

We carried out the process described above, where the measure examined was the frequencies of pairs of letters within words, in each of the training texts.
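The four measures above can be sketched as frequency profiles computed from a list of words. The function names, the dictionary representation, and the choice to count bigrams only inside words are assumptions made for illustration; tokenization and punctuation handling are left out.

```python
from collections import Counter


def relative_frequencies(items):
    """Map each distinct item to its share of all items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def unigram_profile(words):
    return relative_frequencies(ch for w in words for ch in w)


def first_letter_profile(words):
    return relative_frequencies(w[0] for w in words if w)


def last_letter_profile(words):
    return relative_frequencies(w[-1] for w in words if w)


def bigram_profile(words):
    # Pairs of adjacent letters inside each word (assumed: no pairs across words)
    return relative_frequencies(
        w[i:i + 2] for w in words for i in range(len(w) - 1)
    )


words = ["casa", "bella"]
print(unigram_profile(words)["a"])
print(last_letter_profile(words))
```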

5.2 How the vectors of each language are merged

To form a typical vector for each of the languages examined, we had to combine all the vectors obtained for the different training texts in that language into a single vector that includes all the features of all the vectors in the language. The way the vectors are combined can certainly affect the results, so we chose to do it in two ways:

5.2.1 Uniform weight

In this method, every vector was given the same relative weight: if a language had x vectors, the relative weight of each of them was $1/x$.

For example, if for some language we obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the i-th coordinate represents the relative frequency of letter i), the merged vector was $v = (0.4, 0.1, 0.2, 0.3)$.

5.2.2 Choosing the maximum value

In this method, for every feature we chose its maximum frequency over all the vectors, and then normalized the frequencies.

For example, if for some language we obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the i-th coordinate represents the relative frequency of letter i), the merged vector before normalization was $v = (0.5, 0.2, 0.3, 0.4)$, and after normalization it was

$$v = \frac{1}{1.4} \cdot (0.5, 0.2, 0.3, 0.4) = \left(\frac{5}{14}, \frac{2}{14}, \frac{3}{14}, \frac{4}{14}\right)$$
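The two merging methods can be sketched on the example vectors above. The dictionary representation (feature name to frequency) and the function names are assumptions made for illustration; features missing from a vector are treated as frequency 0.

```python
def merge_uniform(vectors):
    """Average the vectors, giving each one the same weight 1/x."""
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}


def merge_max(vectors):
    """Take the maximum frequency per feature, then normalize to sum to 1."""
    keys = set().union(*vectors)
    peak = {k: max(v.get(k, 0.0) for v in vectors) for k in keys}
    total = sum(peak.values())
    return {k: p / total for k, p in peak.items()}


v1 = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.4}
v2 = {"a": 0.5, "b": 0.0, "c": 0.3, "d": 0.2}
print(merge_uniform(v1 and [v1, v2]))
print(merge_max([v1, v2]))
```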

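5.3 How the distance between the language vector and the tested vector is measured

We now have a typical vector that represents the distributions in each of the languages, as well as a vector that represents the test material. We want to compare the test vector with the representative vector of each language in order to find the vector that is most similar to the test material, and then predict that the language of the test material is the language represented by that vector. The language vector "most similar" to the test vector was found by looking for the vector whose distance from the test vector is minimal among all the vectors. The essential question was how to estimate the distance between two vectors. The choice of computation can strongly affect the final results, so we used more than the single method that was used in the original work.

For convenience, in all of the following subsections we denote the test vector by $P = (P_1, ..., P_n)$ and the learned language vector by $Q = (Q_1, ..., Q_m)$. Note that the number of attributes in the two vectors may differ (even when they represent the same language), since the vectors depend on the training material and on the test material. To deal with this difficulty, we added to P and to Q the attributes that do not appear in them but do appear in the other vector, and gave them weight 0. As a result, both P and Q are vectors of size x, where $x \geq \max(m, n)$.

5.3.1 "Simple" distance

In this method, the distance between the vectors P and Q is given by the formula

$$\sum_{i=1}^{x} |P_i - Q_i|$$

Since we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector (Q) minimizes this expression.

5.3.2 Euclidean distance

In this method, the distance between the vectors P and Q is given by the formula

$$\sum_{i=1}^{x} (P_i - Q_i)^2$$

(the formula is written without the square root; omitting it does not change which language attains the minimum). Since we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector (Q) minimizes this expression.

5.3.3 Similarity by the cosine of the angle between the vectors

Given two-dimensional vectors in the plane, the similarity between them can be measured by the cosine of the angle between them:

$$\cos(\alpha - \beta) = \cos(\alpha)\cos(\beta) + \sin(\alpha)\sin(\beta) = \frac{P_1}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_1}{\sqrt{Q_1^2 + Q_2^2}} + \frac{P_2}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_2}{\sqrt{Q_1^2 + Q_2^2}} = \frac{P \cdot Q}{|P| \, |Q|}$$

That is, the general formula for an x-dimensional vector is:

$$\frac{\sum_{i=1}^{x} P_i \cdot Q_i}{\sqrt{\sum_{i=1}^{x} P_i^2} \cdot \sqrt{\sum_{i=1}^{x} Q_i^2}}$$

We want the angle between the two vectors to be as small as possible (the larger the angle between the two vectors, the larger the difference between them). A smaller angle corresponds to a larger cosine, so the predicted language is the one whose representative vector (Q) maximizes this expression.

5.3.4 Kullback-Leibler distance (symmetric and asymmetric versions)

The Kullback-Leibler measure finds the difference between two vectors by the formula

$$D_{KL}(P, Q) = \sum_{i=1}^{x} P_i \cdot \log\left(\frac{P_i}{Q_i}\right)$$

This measure is not a "classical" distance measure, since in its "simple" version it is not symmetric (in general $D_{KL}(P, Q) \neq D_{KL}(Q, P)$). To deal with this possible difficulty, we did not settle for computing the KL distance in its classical version, but also used a symmetric KL measure, which treats the two vectors equally:

$$D_{Symmetric\text{-}KL} = \frac{1}{2}\left(D_{KL}(P, Q) + D_{KL}(Q, P)\right)$$

In each of the two methods we want the distance between the two vectors to be as small as possible, so the predicted language is the one whose representative vector (Q) minimizes the corresponding expression.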
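The measures of sections 5.3.1 to 5.3.4 can be sketched as below. The zero-padding in `align` follows the text; everything else is an assumption for illustration: the function names, the use of the natural logarithm, and in particular the `eps` floor in `kl`, since the text does not say how zero entries of Q were handled.

```python
import math


def align(p, q):
    """Pad both frequency dicts with 0 for attributes present only in the other."""
    keys = sorted(set(p) | set(q))
    return [p.get(k, 0.0) for k in keys], [q.get(k, 0.0) for k in keys]


def simple_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl(p, q, eps=1e-9):
    # Terms with P_i = 0 contribute 0; eps (an assumption) avoids dividing by Q_i = 0
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q, eps=1e-9):
    return 0.5 * (kl(p, q, eps) + kl(q, p, eps))


P, Q = align({"a": 0.5, "b": 0.5}, {"a": 0.5, "c": 0.5})
print(simple_distance(P, Q), squared_euclidean(P, Q), cosine_similarity(P, Q))
```

With these definitions, the predicted language minimizes the three distances and the KL variants, and maximizes the cosine similarity.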

5.3.5 Sum of rank differences (Ranks)

In this method (which we developed ourselves), we sorted the features of each vector in descending order of frequency, and gave every feature an integer score between 1 and x according to its position in the frequency ordering. We then summed the absolute values of the differences between the scores of each letter, that is,

$$\sum_{i=1}^{x} |Rank(P_i) - Rank(Q_i)|$$

The language that yields the lowest sum of differences with respect to the tested vector is the most similar language.
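A minimal sketch of the rank-difference measure follows. The text does not say how ties in frequency are ordered, so breaking ties alphabetically is an assumption here, as are the function names.

```python
def ranks(vector):
    """Rank features 1..x by descending frequency (ties broken alphabetically, assumed)."""
    ordered = sorted(vector, key=lambda k: (-vector[k], k))
    return {k: i for i, k in enumerate(ordered, start=1)}


def rank_distance(p, q):
    """Sum of absolute rank differences over the union of features."""
    keys = set(p) | set(q)
    rp = ranks({k: p.get(k, 0.0) for k in keys})
    rq = ranks({k: q.get(k, 0.0) for k in keys})
    return sum(abs(rp[k] - rq[k]) for k in keys)


p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(rank_distance(p, q))
```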

n-gram                              Diacritics   Number of records   Fitness functions
Unigram (letter frequency)          With         500                 Gini Gain
Letter frequency at start of word   Without      1000                Entropy
Letter frequency at end of word                  1500                Information Gain
Bigram (letter pairs)                            2000                Information Gain Ratio
                                                                     Train Error

Table 2: The cross-sections examined in the decision trees

5.3.6 Infinity norm

As in the original work, this method was applied only to bigrams.

The infinity norm is defined as the maximum over the rows of the sum of the (absolute) differences in each row, and it was computed as follows:

1. For each of the first letters in the list of bigrams, compute the sum of the (absolute) differences of the bigrams that start with that letter.

2. Choose the first letter for which the sum of the (absolute) differences of the bigrams starting with that letter yields the largest difference between the tested vector and the language vector:

$$\| A \|_\infty = \max_{1 \leq i \leq x} \sum_{j=1}^{x} |P_{ij} - Q_{ij}|$$

3. Return the language for which the value found in the previous step is minimal.
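The three steps above can be sketched as follows, assuming bigram vectors stored as dictionaries from two-letter strings to frequencies (a representation chosen here for illustration; the function names are likewise hypothetical).

```python
from collections import defaultdict


def infinity_norm_distance(p, q):
    """Largest per-first-letter sum of absolute bigram frequency differences."""
    row_sums = defaultdict(float)
    for bigram in set(p) | set(q):
        # Steps 1-2: accumulate |P_ij - Q_ij| in the row of the bigram's first letter
        row_sums[bigram[0]] += abs(p.get(bigram, 0.0) - q.get(bigram, 0.0))
    return max(row_sums.values(), default=0.0)


def predict(test_vector, language_vectors):
    """Step 3: return the language with the smallest infinity-norm distance."""
    return min(
        language_vectors,
        key=lambda lang: infinity_norm_distance(test_vector, language_vectors[lang]),
    )


p = {"ab": 0.6, "ac": 0.4}
q = {"ab": 0.5, "ba": 0.5}
print(infinity_norm_distance(p, q))
print(predict(p, {"x": {"ab": 0.6, "ac": 0.4}, "y": q}))
```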

6 Decision trees

In this approach we learned recurring patterns in the different languages by analyzing the training material and building decision trees according to various parameters, relying on the different examples.

As in the statistical method, in the training stage we measured the frequencies of each of our n-grams in each of the training texts, and used them to build a typical vector that models the relative frequency of every n-gram in that training text. Unlike in the statistical method, here we did not build a single vector for each language, but kept the vectors as they were: one per example. We chose a number of vectors at random from all the training vectors, and these became the records of our table, whose columns (that is, the attributes of the table) were the n-grams we found (for example, different pairs of letters). In the next stage we built the decision tree itself (using the ID3 algorithm, described below), so that every attribute is a node in the tree, every edge represents a possible value of that attribute (where the values went through discretization, according to Table 3), and every leaf is a final label (that is, one of the 14 languages). We then classified the test material according to the algorithm described below, relying on the decision tree we built.

Each stage of the process described above was carried out in several different ways, with the aim of checking which method returns the best prediction:

6.1 By n-grams

The different measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters across its words:

6.1.1 Unigram

We carried out the process described above, where the measure examined was the relative frequency of each single letter in the language, in each of the training texts.

6.1.2 Frequency of a letter at the start of a word

We carried out the process described above, where the measure examined was the frequency of each letter as the first letter of a word, in each of the training texts.

6.1.3 Frequency of a letter at the end of a word

We carried out the process described above, where the measure examined was the frequency of each letter as the last letter of a word, in each of the training texts.

6.1.4 Bigram (letter pairs)

We carried out the process described above, where the measure examined was the frequencies of pairs of letters within words, in each of the training texts.

Unlike the work with unigrams and single letters, when working with bigrams the number of attributes is very large (since these are permutations of two letters), and we therefore reached a recursion depth that was too large. To deal with this difficulty, in many cases we chose "only" the 400-800 most common attributes and built the tree from them.

6.2 How we partitioned the values into groups

The attribute values are the frequencies of unigrams or bigrams, that is, numbers in the range 0-1. Every sentence contains a different number of words of different lengths, made up of different numbers of letters. We therefore expect the frequencies to differ greatly from one another, and in order to compare different attributes we have to discretize the frequencies.
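The discretization can be sketched as a lookup of the segment index for a frequency. The bin edges below are the "our work" unigram segments listed in Table 3; assigning a value that falls exactly on an edge to the upper segment is an assumption, as the table does not say which side is inclusive.

```python
from bisect import bisect_right

# Upper edges of segments 0..10 (unigrams, first letter, last letter, per Table 3);
# segment 11 covers everything from 0.1 up to 1.
UNIGRAM_EDGES = [0.0015, 0.003, 0.005, 0.01, 0.015, 0.02,
                 0.025, 0.035, 0.05, 0.07, 0.1]


def discretize(freq, edges=UNIGRAM_EDGES):
    """Return the index of the segment that contains freq."""
    return bisect_right(edges, freq)


print(discretize(0.081))  # falls in the segment 0.07-0.1
print(discretize(0.014))  # falls in the segment 0.01-0.015
```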

6.3 The fitness (gain) functions

The attributes are the different features by which we build the tree (the different single letters, bigrams, etc.), taken together with their values, which are their frequencies in the language.

# Example   a       b       ...   Language
1           0.081   0.014   ...   English
2           0.12    0.022   ...   Spanish
3           0.068   0.017   ...   English
...         ...     ...     ...   ...

6.3.1 By Entropy

Entropy is a measure from information theory that represents the degree of uncertainty about the distribution of a given attribute, given some data.

          The original work               Our work
Segment   Unigrams     Bigrams            Unigrams, first letter, last letter   Bigrams
0         0-0.001      0-0.00001          0-0.0015                              0-0.00015
1         0.001-0.03   0.00001-0.0003     0.0015-0.003                          0.00015-0.0003
2         0.03-0.06    0.0003-0.0006      0.003-0.005                           0.0003-0.0005
3         0.06-0.09    0.0006-0.0009      0.005-0.01                            0.0005-0.001
4         0.09-0.12    0.0009-0.012       0.01-0.015                            0.001-0.0015
5         0.12-0.15    0.012-0.015        0.015-0.02                            0.0015-0.002
6         0.15-0.18    0.015-0.018        0.02-0.025                            0.002-0.0025
7         0.18-1       0.018-1            0.025-0.035                           0.0025-0.0035
8                                         0.035-0.05                            0.0035-0.005
9                                         0.05-0.07                             0.005-0.007
10                                        0.07-0.1                              0.007-0.01
11                                        0.1-1                                 0.01-0.013
12                                                                              0.013-1

Table 3: The partition into groups

For a given table A (when computing the entropy for the attribute Language) we define the entropy to be

$$H(A) = -\sum_{v \in LanguageList} p_v \log(p_v)$$

where v represents a language from the list of languages, and $p_v = \frac{|A_v|}{|A|}$ is the probability of the corresponding label (we use a uniform probability; $A_v$ is the restriction of the table to the records that have value v in attribute a, and $|A|$ is the number of records in the table).

To determine the degree of uncertainty for a given attribute a, we define its entropy to be

$$H(A, a) = \sum_{v \in Values(a)} H(A_v)$$

Note that, unlike the functions presented next, for which we want the attribute with the highest value, here we want to find the attribute with the lowest entropy value.

6.3.2 By Information Gain

IG measures the expected reduction in entropy given a particular attribute as a node in the tree. For every attribute a in table A we define:

$$IG(A, a) = H(A) - \sum_{v \in value(a)} \frac{|A_v|}{|A|} \cdot H(A_v)$$

6.3.3 By Information Gain Ratio

For every attribute a we define the Information Gain Ratio to be:

$$IGR(A, a) = \frac{IG(A, a)}{H(A, a)}$$
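The three definitions above can be sketched on a toy table. The records and function names are hypothetical; the base-2 logarithm is an assumption (the formulas do not fix the base), and returning infinity when $H(A, a) = 0$ is also an assumption, since the text does not say how that case was handled.

```python
import math
from collections import Counter, defaultdict


def entropy(records, target="Language"):
    """H(A): entropy of the label distribution in the table."""
    n = len(records)
    counts = Counter(r[target] for r in records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def split(records, attr):
    """Group the records into sub-tables A_v by their value of attr."""
    parts = defaultdict(list)
    for r in records:
        parts[r[attr]].append(r)
    return parts


def attribute_entropy(records, attr):
    """H(A, a): sum of the entropies of the sub-tables, as defined in the text."""
    return sum(entropy(part) for part in split(records, attr).values())


def information_gain(records, attr):
    n = len(records)
    weighted = sum(len(p) / n * entropy(p) for p in split(records, attr).values())
    return entropy(records) - weighted


def information_gain_ratio(records, attr):
    h = attribute_entropy(records, attr)
    return information_gain(records, attr) / h if h else float("inf")


# Hypothetical table with already discretized attribute values
records = [
    {"a": 0, "b": 1, "Language": "English"},
    {"a": 0, "b": 0, "Language": "English"},
    {"a": 1, "b": 1, "Language": "Spanish"},
    {"a": 1, "b": 0, "Language": "Spanish"},
]
print(information_gain(records, "a"), information_gain(records, "b"))
```

In this toy table attribute a separates the two languages perfectly, while attribute b carries no information about the label.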

    6.3.4 By Gini Gain

    Although this measure was not taught in the course this year, we learned from our own searches that the Gini index measures the deviation between the probabilities of the different values of the target attribute (Language, in our case).

    As with entropy, we first define the Gini Index deviation measure of the given table A over the values of Language:

    \[ GI(A) = 1 - \sum_{v \in LanguageList} \left( \frac{|A_v|}{|A|} \right)^2 \]

    We now define the gain function of each attribute as:

    \[ GG(A, a) = GI(A) - \sum_{v \in value(a)} \frac{|A_v|}{|A|} \cdot GI(A_v) \]

    We want the attribute that maximizes the difference between the overall deviation and the deviation of each sub-table multiplied by the probability of the value that produces it.
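    The Gini index and Gini gain defined above can be sketched as follows. The list-of-dictionaries table and the function names are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def gini_index(labels):
    """GI(A) = 1 - sum_v (|A_v| / |A|)^2 over the target labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_gain(rows, attr, target="Language"):
    """GG(A, a) = GI(A) - sum_v |A_v|/|A| * GI(A_v)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    weighted = sum(len(g) / len(rows) * gini_index(g) for g in groups.values())
    return gini_index([row[target] for row in rows]) - weighted

rows = [
    {"a": 0, "b": 0, "Language": "English"},
    {"a": 0, "b": 1, "Language": "English"},
    {"a": 1, "b": 0, "Language": "Spanish"},
    {"a": 1, "b": 1, "Language": "Spanish"},
]
print(gini_gain(rows, "a"))  # perfect split: the gain equals GI(A)
print(gini_gain(rows, "b"))  # uninformative split: no gain
```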

    6.3.5 By Train Error

    This function measures the decrease in the expected training error for a particular attribute; we want to choose the attribute a that maximizes the expected decrease:

    \[ TE(A, a) = \min_{v \in LanguageList}(p_A) - \sum_{v \in value(a)} \left( \frac{|A_v|}{|A|} \right) \min_{Language \in LanguageList}(p_{A_v}) \]

    where p_A is defined with respect to table A.

    6.4 The tree-building algorithm

    The algorithm we used to build the decision trees is the recursive algorithm we saw in the course (see footnote 2), ID3, using the fitness (gain) functions described above.
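    The following is a minimal sketch of an ID3-style recursive builder with a pluggable gain function. It is not the project's code: the nested-dictionary tree representation, the function names, and the use of Information Gain as the default gain function are all assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    rest = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - rest

def id3(rows, attrs, target="Language", gain=info_gain):
    """Build a tree: a leaf is a label, an inner node is {"attr": ..., "children": {...}}."""
    labels = [r[target] for r in rows]
    # Leaf: all records share one label, or no attributes are left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute with the highest gain.
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    children = defaultdict(list)
    for r in rows:
        children[r[best]].append(r)
    return {"attr": best,
            "children": {v: id3(sub, remaining, target, gain) for v, sub in children.items()}}

rows = [
    {"a": 0, "b": 0, "Language": "English"},
    {"a": 0, "b": 1, "Language": "English"},
    {"a": 1, "b": 0, "Language": "Spanish"},
    {"a": 1, "b": 1, "Language": "Spanish"},
]
tree = id3(rows, ["a", "b"])
print(tree)
```

    Passing the gain function as a parameter makes it possible to build one tree per fitness function over the same table. A function that should be minimized, such as the entropy measure above, would need its sign flipped before being passed in.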

    6.5 The language classification algorithm

    The initial classification algorithm was a "simple" recursive algorithm that "walks" down the decision tree: in each iteration it chooses the subtree that matches the value the test vector holds for the current node's attribute, and calls itself recursively on that subtree. The algorithm stops when it reaches a leaf and returns that leaf's label (that is, the language). However, because we choose the training material at random from all the records, and because in many cases the trees were too large and we had to restrict ourselves to the leading 400-800 attributes, not all possible values of a given attribute are always represented. We may therefore look for a subtree that does not exist in the tree we learned; in other words, the classification process gets stuck because there is no matching subtree.

    To solve this problem, we decided that whenever we reach a node that has no subtree matching the attribute's value in the vector being classified, we go over all the subtrees of that attribute. We define each such new walk as a "guess" and continue walking down the tree until we reach a leaf. Every time we hit a "problematic" state while walking down a subtree, we count it as another "guess" and go over all the subtrees of that node's attribute. At the end of the process we obtain, for each path, a label and the number of "guesses" made along the way, and we return the label that was reached with the smallest number of "guesses" (since every "guess" indicates a gap between the test material and the language we are approaching, and we want to keep that gap as small as possible).

    In addition, because the trees can be large and contain many such nodes, and in order to avoid overfitting, we decided to make the process more efficient by pruning during the "walk" down the tree: in every recursive call we also passed along the lowest number of guesses that had yielded a label so far, and if the walk down the tree exceeded that number, we stopped branching and abandoned that path, since it cannot yield the optimal result.
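    The walk with "guesses" and pruning described above can be sketched as follows. The nested-dictionary tree format, the function name classify, and the tie-breaking rule (the first path found with the minimal number of guesses wins) are assumptions made for the illustration.

```python
def classify(tree, sample, guesses=0, best=float("inf")):
    """Return (label, guesses) for the path with the fewest guesses, or None if pruned."""
    # Pruning: this path already needs more guesses than the best one found so far.
    if guesses > best:
        return None
    # A leaf holds the label (the language).
    if not isinstance(tree, dict):
        return tree, guesses
    value = sample.get(tree["attr"])
    if value in tree["children"]:
        return classify(tree["children"][value], sample, guesses, best)
    # No matching subtree: try every subtree, counting one more guess.
    result = None
    for child in tree["children"].values():
        found = classify(child, sample, guesses + 1, best)
        if found is not None and found[1] < best:
            result, best = found, found[1]
    return result

tree = {"attr": "a", "children": {
    0: "English",
    1: {"attr": "b", "children": {0: "Spanish", 1: "German"}},
}}
print(classify(tree, {"a": 0, "b": 1}))  # direct match, no guesses
print(classify(tree, {"a": 2, "b": 1}))  # value 2 was never seen for "a": one guess
print(classify(tree, {"a": 1, "b": 5}))  # value 5 was never seen for "b": one guess
```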

    Footnote 2: Recitation 10, slide 24.

  • Part V

    Results

    In this part we present the results we obtained with each of the approaches we took. The analysis of the specific results is given in this chapter, while a broader discussion appears in the next chapter. The full results, for both the statistical approach and the decision trees, are given in Appendices A and B.

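    7 Naive approach (Zipf's law)

    As expected, the results achieved by the naive approach were excellent, as can be seen in the graph (see footnote 3):

    [Graph: results of the naive approach]

    It is interesting to note that the identification rate of the Borda Count method was exactly the same for 500 and for 1000 words; that is, there is (apparently) a kind of "glass ceiling" on the predictive power that is based on common words.

    It is important to stress again that this approach is very different from the learning methods on which the project focused, and therefore we did not compare the results we obtained with these results.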

    8 Statistical approach

    As explained above, the learning process in this method included a very large number of runs, according to the many configurations detailed in Table 1. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to test.

    Comparing the prediction we obtained with the true result was quite simple, since we worked with labeled files (that is, we knew which language we were supposed to get).

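    8.1 By the method of measuring the distance between the vectors

    As mentioned, the part of the original work that dealt with the statistical approach used the infinity norm to estimate the distances between the vector of each language and the vector representing the sentence under test, whereas in this project we applied additional tools.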
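    To make the comparison of distance measures concrete, here is a minimal sketch of a few of them and of choosing the closest language profile. The exact definitions used in the project appear earlier in the document; the versions below are the textbook forms, and the smoothing constant eps, the direction of the KL divergence, the symmetric variant as a sum of both directions, and all function names are assumptions.

```python
import math

def infinity_norm(p, q):
    """Largest absolute difference between corresponding coordinates."""
    return max(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def angle(p, q):
    """Angle between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q); eps avoids division by zero (an assumed smoothing)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

def symmetric_kl(p, q, eps=1e-9):
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)

def closest_language(profiles, sentence_vec, distance):
    """Return the language whose profile vector is nearest to the sentence vector."""
    return min(profiles, key=lambda lang: distance(profiles[lang], sentence_vec))

# Toy frequency vectors over three features.
profiles = {"English": [0.5, 0.3, 0.2], "Spanish": [0.2, 0.3, 0.5]}
sentence = [0.45, 0.35, 0.2]
for d in (infinity_norm, euclidean, angle, kl_divergence, symmetric_kl):
    print(d.__name__, closest_language(profiles, sentence, d))
```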

    Footnote 3: A general note regarding the graphs: because of space constraints, and in order to present the resulting functions (which gave results close to one another) as clearly as possible, the y axis does not always start at 0 and end at 100, since on such a scale the graphs might have been less readable.

    When we ran only on the languages from the original work, we obtained the following results:

    [Graph: results by distance measure, original languages]

    As can be seen, in almost all runs and configurations the infinity norm yielded the worst results, and in most cases by a significant margin from the other measures. In the original work the data was tested with training material of 600 records, and the success rate was about 46%. We reached results similar to those of the original work: slightly worse when we ignored the diacritical marks, and slightly better when we took them into account. Overall, we reached between 40% and 50% identification success with this method, with the diacritical marks helping slightly.

    Unlike the infinity norm, the additional mathematical methods we used performed very well: up to 70% identification without diacritical marks, and up to 80% when taking them into account. The leading method, both with and without diacritical marks, was the KL method, which yielded similar results in its symmetric and non-symmetric versions. Most methods yielded similar results regardless of the diacritical marks (with a tendency to improve when they were taken into account); the exceptions are the KL and Ranks methods. Interestingly, it was precisely when we took the diacritical marks into account (which supposedly provide more information) that the Ranks method worked less well. The reason is most likely that converting the diacritical characters to regular letters (for example, ǎ to a) changed the ordering of the common letters in the training material differently than in the test material, which is smaller and therefore more sensitive to changes.

    When we ran on all 14 languages, we obtained the following results:

    [Graph: results by distance measure, all 14 languages]

    The quality of the results dropped here: when the classifier was required to predict a language out of the 6 original languages only, it did so better than when it was required to predict a language out of 14 (70%, compared with 80% for the 6 original languages). This is still excellent identification, especially considering the number of languages the classifier had to separate. It is interesting to see that even when classifying among 14 languages (and not only 6), in most cases we obtained better results than those of the infinity norm method used in the original work. Here too the relationship between the different measuring functions is preserved, with KL continuing to lead. Except for Angle and Ranks, all functions yielded similar or better results when we took the diacritical marks into account; these marks most likely provide the information that is vital for distinguishing between the languages, as detailed at the beginning of the work.

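    8.2 By n-gram

    In the original work, learning was performed on bigrams and unigrams. As mentioned, in this project we decided to add further measures that rely on the frequency of letters at the beginning and at the end of a word.

    When we ran only on the languages from the original work, we obtained the following results:

    [Graph: results by n-gram type, original languages]

    When we ran on all 14 languages, we obtained the following results:

    [Graph: results by n-gram type, all 14 languages]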

    As we expected, the most significant information came from the bigrams, since the number of possible bigrams is very large. Among the single letters, we were surprised to discover that the letter at the end of the word provided the most information in the learning process. Nevertheless, this finding is consistent with the fact that in many cases the last letter has a grammatical role (for example, marking the plural form, verb conjugation in the different tenses, and so on), so it makes sense that it helps identify the language (for example, the plural in English is marked by the letter s, whereas in German it is marked by n and in Italian by i or e). It is interesting to note that the test by first letter yielded the (relatively) weakest results, but these were still not bad: above 45% identification (similar to the identification rate of the infinity norm).

    8.3 Learning rate (computation time)

    In the first stage we created a data structure containing all the training sentences from all the languages. Given this data structure, we ran algorithms that learned the frequencies of the bigrams, the unigrams, the letters at the beginning of a word and the letters at the end of a word in each of the 14 languages.

    The learning rate was very fast: about 25 seconds to learn all of these statistics for all the languages.
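    The four statistics can be gathered in a single pass over the training sentences, as in the following sketch. It assumes whitespace tokenization, lower-casing, and bigrams that do not cross word boundaries; none of these details is stated in the text, and the function name learn_profile is hypothetical.

```python
from collections import Counter

def learn_profile(sentences):
    """Return relative frequencies of unigrams, bigrams, first letters and last letters."""
    unigrams, bigrams, first, last = Counter(), Counter(), Counter(), Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            # Keep letters only; str.isalpha() also accepts letters with diacritics.
            letters = [ch for ch in word if ch.isalpha()]
            if not letters:
                continue
            unigrams.update(letters)
            bigrams.update(a + b for a, b in zip(letters, letters[1:]))
            first[letters[0]] += 1
            last[letters[-1]] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()} if total else {}

    return {name: normalize(c) for name, c in
            [("unigrams", unigrams), ("bigrams", bigrams), ("first", first), ("last", last)]}

profile = learn_profile(["The cats sat"])
print(profile["last"])
```

    Since every character is visited a constant number of times, the cost is linear in the size of the training material, which is consistent with the short learning time reported above.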

    8.4 Consistency and overfitting

    Overfitting may occur when the classifier starts to "memorize" the training material instead of learning generalizations from it, and as a result gives greater weight to "noise". In such a case the classifier returns better and more accurate results on the familiar training material, but the results on new, unfamiliar test material are worse (that is, the classifier is less good at predicting from new information). To verify that no overfitting occurred on the one hand, and to see that the training material is indeed classified properly on the other, we ran the algorithm on part of the training material after learning. As can be seen in the table below, the results of running the algorithm on the training material are indeed better in most cases, but not to a significant degree that would indicate memorization. We can therefore conclude that there is no overfitting. In addition, the results are relatively high, which indicates that the classifier did identify the language in most cases.

                                         Kullback  Symmetric Kullback  Angle  Euclidean  Infinity  Ranks  Simple Difference
    Original results                     69.34     68.87               46.93  59.19      44.94     42.56  53.07
    Small part of the training material  71.41     69.3                50.95  61.12      40.07     51.91  57.14

    8.5 The recall, precision and F1 measures

    As part of the project we ran an algorithm on test material and ultimately obtained a prediction of the language in which the test material is written. Note that the predictions can be viewed in the following ways:

    1. True Positive = we identified the sentence as belonging to some language, and it does belong to it.

    2. False Positive = we identified the sentence as belonging to some language, but it does not belong to it.

    3. True Negative = we identified the sentence as not belonging to some language, and it indeed does not belong to it.

    4. False Negative = we identified the sentence as not belonging to some language, but it does belong to it.

    Another way to evaluate the results is through the recall and precision measures.

    Recall measures the rate of correct assignments to a language out of all the assignments we should have obtained under perfect identification. In other words, it reflects the miss rate.

    Precision measures the rate of correct assignments to a language out of all the assignments we actually obtained for that language. In other words, it reflects the noise rate.

    \[ recall = \frac{True\ Positive}{True\ Positive + False\ Negative} \]

    \[ precision = \frac{True\ Positive}{True\ Positive + False\ Positive} \]

    Both of these measures are important, but neither stands on its own (good recall can come together with poor precision, so recall alone is not enough to establish that the prediction is good). It is therefore customary to also look at combined measures, the most prominent of which is the F1 measure, the harmonic mean of the two, obtained from the following formula:

    \[ F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} \]

    The possible values of all these measures range between 0 and 1, and the closer they are to 1, the better the prediction.
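    As an illustration of the three formulas, the following sketch computes precision, recall and F1 for a single language from lists of true and predicted labels. The function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(actual, predicted, language):
    """Compute (precision, recall, F1) for one language from parallel label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == language and p == language for a, p in pairs)
    fp = sum(a != language and p == language for a, p in pairs)
    fn = sum(a == language and p != language for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual = ["en", "en", "es", "es", "en"]
predicted = ["en", "es", "es", "en", "en"]
print(precision_recall_f1(actual, predicted, "en"))
```

    In this toy example there are two true positives, one false positive and one false negative for "en", so all three measures equal 2/3.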

    We measured these values in our runs:

                     with diacritics               without diacritics
                     500    1000   1500   2000     500    1000   1500   2000
    original         0.648  0.651  0.654  0.653    0.65   0.646  0.642  0.637
    all languages    0.618  0.606  0.599  0.593    0.59   0.586  0.583  0.581

    9 Decision trees

    As explained above, in this method we performed a very large number of runs, according to the many configurations mentioned in Table 2. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to test.

    Here too, comparing the prediction we obtained with the true result was quite simple, since we worked with labeled files (that is, we knew which language we were supposed to get).

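    9.1 By n-gram, with respect to the fitness (gain) functions

    When we ran only on the languages from the original work, we obtained the following results:

    [Graph: decision tree results by gain function, original languages]

    When we ran on all 14 languages, we obtained the following results:

    [Graph: decision tree results by gain function, all 14 languages]

    The following graph shows the patterns obtained for all the trees, filtered by the leading measures:

    [Graph: patterns across all trees, leading measures only]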

    To our surprise, in all runs the algorithms that work on the first letter and on the last letter yielded relatively low prediction rates (less than 30%). In our opinion, the reason is that we used for them the same division into groups as for unigrams in general (as detailed in Table 3), rather than a dedicated division that fits the distributions we found.

    Regarding the Gini function, our results are very similar to those of the original work, for both bigrams and unigrams. However, unlike the original work, where this function yielded the best results, we managed to reach even better results using the IGR function, for both unigrams and bigrams. This result is consistent with what we learned during the course, namely that IGR was developed to reduce the overfitting that may arise when using the Information Gain function. Indeed, there is a gap of between 5% and 15% in the performance of these two functions, for both bigrams and unigrams.

    As we expected, and unlike the original work, in most cases the functions that worked with bigrams gave better results, even when we reduced the number of attributes we used.

    It should be noted that building the trees created with the entropy function took a long time, but the results they yielded were good, both in absolute terms and relative to the other functions. That is, entropy is not efficient and each split it makes gives less information than with the other functions, but in the end the sum of all the splits gives no less information (and sometimes even more) than with most of the other functions.

    9.2 Learning rate (computation time)

    The average build time of each tree was about half an hour. The time required to build a decision tree was directly proportional to the number of attributes and to the number of rows in the training material: the more of them there were, the longer the trees took to build. In many cases we had to limit the number of attributes to 800 or even fewer. Using diacritical marks lengthened the computation considerably, since the number of attributes grew dramatically.

    The trees for unigrams, first letter and last letter were built relatively quickly (a few minutes), whereas the bigram trees required a long computation time (for example, the IG unigram tree on 2000 records was built in less than a minute, whereas the Gini Gain bigram tree on 2000 records was built in 56 minutes). This can be explained by the relatively small number of attributes in trees of the first kind (the attributes were single letters, so there were only a few dozen of them: the "regular" letters plus the diacritical characters), whereas the number of attributes in the bigram trees was significantly larger (in fact, on the order of the square of the number of attributes of the first kind, as explained above).

    Among the measures, the entropy measure required the longest computation time. The reason is that entropy is too general a measure, one that does not take sufficient account of the different options and the different features, and therefore does not partition the data clearly enough. This leads to a much greater recursion depth and to a longer computation. This is exactly why better measures such as IGR are usually used; they compute the entropy but weight it with additional calculations.

  • Part VI

    Discussion, conclusions and thoughts for the future

    In the summary of their work, Moran and Tamar wrote that "we believe that on the basis of our work even more accurate results can be reached". Indeed, the results we reached in our work are significantly better: we managed to reach above 70% identification with a variety of methods and configurations (for bigrams, for unigrams, and with different statistical functions and fitness (gain) functions), and even 79% identification with the KL method and 73% with the IGR method. In the original work, by contrast, the best result was about 60%; it was obtained for training material of 2000 records per language and with one method only, while for all the other methods the identification rate was 45% at most.

    It is interesting to note that Moran and Tamar's best results were obtained for unigrams, of all things, and that in the vast majority of cases their success rates were below 50%. When they worked with bigrams, they reached about 25% success. They hypothesized that the reason was that the division they made was probably not fine enough, and that a division into different segments might yield better results. Our work, however, suggests otherwise. As part of the work on the trees, we tried to "play" with the division into clusters (as can be seen in Table 3): we ran the trees many times on varied divisions, the idea being that the finer and more balanced the division between the clusters, the shallower the tree and the better the partition. In the end, however, we found no notable differences in the results between the division we tried and the original division; sometimes the new division achieved better results, and sometimes the original one did. In our opinion, the reason for the relatively low results in the original work may be that their training or test materials were biased and incorrect, or that their test material was not cleaned (unlike the training material), so the two did not match. Another possibility is that learning the bigrams caused overfitting, so their classifier could not cope well with new material.

    Moran and Tamar removed the diacritical marks, since they assumed that keeping them raises the complexity of the task and enlarges the data structures and the learning time. In our project, by contrast, we chose to support both options: keeping the diacritical marks and removing them. As we expected and detailed in the introduction, the diacritical marks added significant information, and therefore most of the algorithms, in both the statistical method and the decision trees, showed better results when the diacritical marks were taken into account (an average improvement of about 10% in the statistical tools and of up to 20% in the decision trees). Nevertheless, it is important to note that even without the diacritical marks the various algorithms proved themselves and reached very good results. It is worth noting that keeping the diacritical marks had a rather negligible effect on the running times of the statistical algorithms, but a considerable effect on the running times of building the trees (though not on classification once the trees were built). This is not surprising: keeping the diacritical marks adds at most a small, constant number of additional calculations to the statistical algorithms, but may considerably increase the number of attributes in each table (as described above). In many cases this increases the recursion depth, and accordingly the computation time of building the trees.

    Regarding the statistical approach, Moran and Tamar wrote that they think "that the results can be improved... by means of inference methods more advanced... [than] the infinity norm". Indeed, all the new tools we used showed very high identification capabilities, whereas the measure based on the infinity norm reached weaker results. In addition, it was interesting to discover that the KL measure yielded almost identical results in its symmetric and non-symmetric versions.

    Regarding the decision trees, we found that their main drawback is that they are very large and require a long computation time to create. In many cases we reached the maximum recursion depth and therefore had to reduce the number of attributes we used. As a result, the classification was less good than in the statistical cases. As mentioned, we tried to "play" with the divisions in an attempt to improve the results, but these changes did not show a notable improvement. Nevertheless, most of the decision tree results came out the same whether the tree was built on "only" the 6 original languages or on all 14 languages, which strengthens our view that decision trees are indeed good tools for language identification and that they are almost unaffected by the number of languages tested.

    It can be seen that the first letter and last letter measures did not work well with the decision trees, whereas they reached significantly better results with the statistical tools. As mentioned, we believe this is related to the division into groups. Since developing all the algorithms in the project, running them and analyzing them took many weeks, we did not manage to investigate the matter in depth. In our opinion, a future project could examine the data and try different runs with new divisions, finer and coarser, until an ideal division is obtained that best expresses the language profiles.

    The results of our work, like the results of many other works in the field, prove anew the remarkable claim that natural languages can be identified even without prior linguistic knowledge. There is no need to know rules of grammar, different registers or the semantic meaning of words in order to identify the language; a "cold", dry analysis of letters and words suffices, and these can be taken from a huge variety of accessible sources, such as newspapers, websites and literature. In addition, the results of the work and the tools we used can serve to enrich our linguistic knowledge about different languages: the degree of closeness between them and various rules (for example, discovering a rule according to which the letter a is more common at the end of a word in Italian than in English), and to examine the cognitive implications that follow.

    Our project can be continued and extended by adding further measures, such as a combination of bigrams and unigrams (for example, one can build a tree whose nodes are the x leading bigrams and the y leading unigrams). In our opinion this could lead to even better results than those we reached. Unfortunately, we did not manage to implement this within the project, because of the shortage of time and the large number of measures we used. Further topics that can be explored relate to measures that were not tested, such as the average number of words in a language or the average word length in a language (for example, the average word in German is longer than the average word in English). In addition, one can try to improve the division into clusters and reach even better results.

  • Part VII

    Sources

    • The original work by Moran and Tamar: http://www.cs.huji.ac.il/~ai/projects/NLP.pdf

    • Lecture and recitation slides of course 67842, "Introduction to Artificial Intelligence"

    • Lecture slides of course 36622, "Computational Learning and Cognitive Aspects of Language"

    • Gutenberg Project - http://www.gutenberg.org/

    • http://www.bookrix.com/

    • http://www.e-book.com.au/morefreebooks/freemultilingualbooks.htm

    • http://tnlessone.wordpress.com/2007/05/13/how-to-detect-which-language-a-text-is-written-in-or-when-science-meets-human/

    • http://en.wikipedia.org/wiki/List_of_languages_by_writing_system#Latin_script

    • http://en.wikipedia.org/wiki/Letter_frequency

    • http://stackoverflow.com/questions/3194516/replace-national-characters-with-ascii-equivalent

    • http://staff.science.uva.nl/~tsagias/?p=185

    • http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf

    • http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=4

    • http://www.101languages.net/common-words/


  • Part VIII

    Appendix A - Details of the main statistical results

    Original languages    w/ diacritics                  w/o diacritics
                          500    1000   1500   2000      500    1000   1500   2000
    Kullback              79.43  78.07  78.07  79.07     67.72  68.36  66.92  66.86
    Symmetric Kullback    77.85  75.71  75.5   76.71     67.28  67.28  65.86  64.92
    Angle                 59.5   58.28  60.57  59.43     57     56.5   60.22  58.78
    Euclidean             70.21  66.57  68.71  67.5      67.07  66.72  66.78  66.78
    Infinity              48.85  43.14  47.71  46.29     41.14  42.57  45.42  42
    Ranks                 58.07  60     58.57  60.71     69.28  67.07  65.22  68.14
    Simple Difference     62.85  65.14  64.14  64.79     58.36  61.07  60.5   61.78

    All languages         w/ diacritics                  w/o diacritics
                          500    1000   1500   2000      500    1000   1500   2000
    Kullback              69.34  69.22  69.59  69.88     62.73  63.45  62.53  61.9
    Symmetric Kullback    68.87  68.33  69.19  69.4      59.09  60.43  59.02  57.75
    Angle                 46.93  46.17  45     44.75     49.48  49.25  49.25  48.96
    Euclidean             59.19  58.77  57.38  57.12     56.66  57.98  57.56  56.59
    Infinity              44.94  41.49  43.33  40.8      39.78  39.78  41.03  38.39
    Ranks                 42.56  43.85  46.35  45.60     55     57.53  58.45  57.75
    Simple Difference     53.07  52.25  51.78  50.67     53.17  53.21  52.98  51.5

    Original languages    w/ diacritics                  w/o diacritics
                          500    1000   1500   2000      500    1000   1500   2000
    Unigrams              62.57  62.05  61.91  60.1      56.96  55.61  57.81  58.57
    Bigrams               69.55  66.89  68.48  69.39     69.6   65.31  68.86  68.81
    First                 58.42  60.86  61     61.1      55.1   54.23  52.91  55.71
    Last                  77.96  75.42  75.52  77.72     71.42  79.05  73.52  70.61

    All languages         w/ diacritics                  w/o diacritics
                          500    1000   1500   2000      500    1000   1500   2000
    Unigrams              53.23  52.04  54.54  53.14     52.1   52.25  54.06  53.23
    Bigrams               68.01  65.59  65.53  65.65     68.88  67.42  67.07  66.84
    First                 47.6   48.52  47.85  46.53     45.09  48.32  46.55  45.03
    Last                  53.95  55.54  54.57  55.47     53.14  55.32  54.5   53.1


  • Part IX

    Appendix B - Details of the main decision tree results

                   First Letter                   Last Letter
                   500    1000   1500   2000      500    1000   1500   2000
    Gini           20     21.15  21.84  18.39     23.45  21.38  20.92  22.76
    Entropy        20.68  20.68  22.06  22        25.74  26.43  23.9   29.86
    IG             18.85  19.54  20.23  21.61     20.92  20.46  20     20.1
    IGR            22.53  27.36  29.66  29.89     21.38  26.9   28.28  26.67
    Train Error    16.09  17.93  18.62  18.16     15.86  20.69  20.69  20.69

                   Unigrams                       Bigrams
                   500    1000   1500   2000      500    1000   1500   2000
    Gini           51.03  49.2   52.41  54.71     30.11  30.8   28.9   31.03
    Entropy        57.24  62.29  70.11  68.28     61.38  64.83  67.13  62.56
    IG             42.53  46.67  53.79  56.55     56.32  61.84  61.61  63.51
    IGR            61.38  62.07  72.64  71.49     69.65  71.3   73.33  72.64
    Train Error    39.77  42.53  44.83  46.44     27.58  28.05  33.1   31.72
