Final Project in the Course Introduction to Artificial Intelligence:
Natural Language Identification
Lior Resisi, lresisi, 201564895
Mika Barkan, mikab4, 200790111
March 23, 2014
Table of Contents
I Introduction
II The Language Identification Task and Related Work
III Collecting Data for the Training and Test Material
  1 Choice of languages
  2 Source of the texts and handling of the training material
  3 Handling diacritics
IV Algorithms
  4 The naive approach: Zipf's law
    4.1 Plain counting
    4.2 The "Borda Count" method
  5 Statistical approach
    5.1 By n-grams
      5.1.1 Unigram
      5.1.2 Frequency of a letter at the start of a word
      5.1.3 Frequency of a letter at the end of a word
      5.1.4 Bigram (letter pairs)
    5.2 How the vectors of each language are merged
      5.2.1 Uniform weight
      5.2.2 Choosing the maximal value
    5.3 How the distance between the language vector and the tested vector is measured
      5.3.1 "Simple" distance
      5.3.2 Euclidean distance
      5.3.3 Similarity by the cosine of the angle between the vectors
      5.3.4 Kullback-Leibler distance (symmetric and asymmetric versions)
      5.3.5 Sum of rank differences (Ranks)
      5.3.6 Infinity norm
  6 Decision trees
    6.1 By n-grams
      6.1.1 Unigram
      6.1.2 Frequency of a letter at the start of a word
      6.1.3 Frequency of a letter at the end of a word
      6.1.4 Bigram (letter pairs)
    6.2 How we split the values into groups
    6.3 The fitness (gain) functions
      6.3.1 By Entropy
      6.3.2 By Information Gain
      6.3.3 By Information Gain Ratio
      6.3.4 By Gini Gain
      6.3.5 By Train Error
    6.4 The tree-building algorithm
    6.5 The language classification algorithm
V Results
  7 Naive approach (Zipf's law)
  8 Statistical approach
    8.1 By distance measure between the vectors
    8.2 By n-gram
    8.3 Learning speed (computation time)
    8.4 Consistency and overfitting
    8.5 Recall, precision and F1 measures
  9 Decision trees
    9.1 By n-gram, relative to the fitness (gain) functions
    9.2 Learning speed (computation time)
VI Discussion, Conclusions and Thoughts for the Future
VII References
VIII Appendix A: Details of the main statistical results
IX Appendix B: Details of the main decision-tree results
Part I
Introduction
The Latin alphabet is one of the most widespread writing systems in the world. It is used to write more than 120 languages from different language families (Romance, Germanic, Slavic and others). It consists of 26 basic letters, to which some languages have added special marks ("diacritics") intended to adapt the alphabet to the language. The wide use of this alphabet raises an interesting and challenging computational problem: given a text written in Latin letters, can we identify the natural language it is written in, and how?
This problem touches many fields (linguistics, computational linguistics, cognitive science and many branches of computer science) and raises many further questions. For example, one can discuss Chomsky's linguistic views on universal laws of natural languages, or cognitive questions such as how we acquire a language or learn to recognize words. At the same time, it is important to remember that the problem is not purely theoretical. It has many practical uses, such as making text search and document handling more efficient, or serving as the first step of an automatic translation process. Statistical analysis of the differences between languages, and in particular analysis of letter frequencies in different languages, can also help in cryptanalysis and information security (for example with substitution ciphers and classical ciphers, as part of encrypting and decrypting texts, or for identifying patterns in an original document versus the decrypted one), and in the coding and compression of information (for example Huffman coding).
Our work deals with a simplified version of the natural language identification problem (a sub-topic of NLP, which is a very interesting area within AI). In it we describe ways to learn to identify automatically the language in which a given text is written, out of a predefined set of languages that are written in Latin letters:

Italian      German      Spanish      Romanian
Indonesian   Hungarian   Polish       Swedish
English      Turkish     Portuguese
Afrikaans    Latin       French
The basic idea that guided us throughout the project was that every language has a characteristic pattern in the distribution of letters within its words. For example, we expect the relative frequency of single letters (unigrams) or of pairs of letters (bigrams) to differ from language to language. We also expect the frequency of a particular letter at the start or at the end of a word to depend on the language, so that it can serve as evidence for the language and help identify it. In light of this idea, we decided to try to learn these patterns for each of the 14 languages above, and to check whether we can predict the language of a given text based on the data and patterns we learned.
Our work is in dialogue with the project "Natural Language Detection", written by Moran and Tamar in the 2010-2011 academic year, which dealt with a similar task. We try to attack the problem differently and to challenge the results the original authors achieved, using many additional tools from artificial intelligence that should help us reach even better results. Like the original authors, we approach the problem with supervised learning methods, namely statistical tools and decision trees, and try to build language classification functions from the training material. The essential difference between the two works lies in the choice of the features by which the language is identified. This choice can lead to large differences in the decision trees and statistical tools used, as well as in the optimal way of computing and in the algorithms we chose. In addition, we apply further computational and mathematical ideas beyond those used by the original authors. We also extend the classification task: instead of choosing among "only" six languages, the classifier tries to identify a language out of a wider variety of languages written in the Latin alphabet. This naturally complicates the task considerably and requires many additional parameters to be taken into account.
The project code was written in Python 3.0. The vast majority of it was written by us, because we wanted full control over the algorithms and over the materials we worked with, and did not want to rely on ready-made tools (such as NLTK) that could have limited our flexibility. All of the code (including running instructions in the README.txt file) is available at the following address:
http://tinyurl.com/qda6f5h
Part II
The Language Identification Task and Related Work
Feature selection is an essential and decisive part of identifying the language of a given text. For our work we rely on linguistic features, which let us attack the problem with different supervised learning methods built on top of them. As part of the process we examine the frequencies of letter unigrams and bigrams in each of the examined languages. That is, we find the probability that a certain letter or pair of letters appears in a certain language, and check whether we can infer (or learn) from it the language in which the text is written. Additional features that we examine, which the original work did not consider, include the frequency of a certain letter at the start or at the end of a word (for example, the letter "a" at the end of a word is common in Italian and rare in English).
Unlike the science of linguistics, we try to learn these features automatically (some of them are known and have been studied by linguists and NLP researchers). We use them both for a "simple" statistical test (vector comparison) and for building decision trees with different gain functions (such as entropy and information gain). Within this work we try to challenge the methods used in the original work, and we use many additional methods that were not examined there. For example, we check whether the results previously obtained with the infinity norm for comparing languages can be improved, and we use additional tools such as the Kullback-Leibler distance. Ultimately we want to reach better results by applying tools from artificial intelligence and creating a set of rules and patterns learned from the training material.
The naive way to identify the language of a given text is to look up the words of the examined text in a dedicated database containing all the words of each supported language. In other words, we would maintain a dictionary for each supported language, look up every word of the given text in it, and return the "closest" language. However, such a naive solution is very expensive in space, and searching such a huge database (for each of the languages) may take a very long time, which hurts efficiency dramatically. A possible improvement uses an observation by George Zipf, who showed [1] that the distribution of word frequencies is similar in all languages and has a short "body" and a long "tail". For example, the 10 most common words in English cover about 25% of the occurrences in a typical text, and the 100 most common words already cover about 45% of the occurrences. The database can therefore be reduced to a "compact" one holding a list of a few dozen (or a few hundred) of the most common words in each language, or unique connective words that characterize the language. Such an approach is significantly less complicated and needs significantly less space, but it is not really "interesting" from an artificial intelligence point of view. Moreover, it is not useful when we want to identify single words, or a sequence of a few words that contains no prepositions or very common words. We therefore want to find another way: one that relies on given training material and builds a classifier from it that identifies the language of the given text with as high a probability as possible. Nevertheless, in our work we also briefly examined this naive approach.
Learning is the most natural and interesting way to solve the language identification task, and we also discussed it in the course Introduction to Computational Linguistics. Many automatic tools in NLP, such as Google Translate, are based on learning and on statistics. Many studies in computational linguistics are also "going in this direction", and we want to continue and check, on a small scale of course, what results we can reach.

[1] George K. Zipf (1949), Human Behavior and the Principle of Least Effort, Addison-Wesley.
Part III
Collecting Data for the Training and Test Material
When we came to tackle the task, we had to make several very significant decisions:
1 Choice of languages
One of the original goals we set for ourselves was to increase considerably the number of languages we can identify (compared to six languages in the original work). The choice of how many languages and which ones is very significant, because it strongly affects the chosen strategy as well as the volume of training material. Moreover, every chosen language may "add" new letters that are unique to it (such as å in Swedish). On the one hand this can ease the identification task. On the other hand it can considerably increase the number of unigrams and (worse) of bigrams, and thus significantly increase the size of the tree, which dramatically lengthens the running time and enlarges the files we create. Another possible problem is mainly technical: obtaining training material in more "exotic" languages may be more complicated. A further problem concerns similarity between languages. For example, Afrikaans and Dutch are very similar languages, which could have made identification harder. At the end of this deliberation we decided to choose the 14 languages listed above. They include the 6 languages from the original work, to which we added interesting languages: similar languages with a common historical origin (for example Italian and Latin), and languages from different language families (for example, Romanian is a Romance language, Swedish is a Germanic language, Polish is a Slavic language, and so on).
2 Source of the texts and handling of the training material
One of the important goals we set at the start of the project was to be able to identify the language of a text regardless of its source, whether it is taken from a newspaper, a poetry book, a Wikipedia entry or an internet blog. The source and type of a text can strongly affect the vocabulary, the letter distributions, and the lengths of the sentences and words used. It was therefore important for us to obtain training material from different sources covering as many language registers as possible. We also did not want to settle for random texts that could bias the results. We wanted training material that is as reliable and high-quality as possible, at a uniform level and of similar size across the languages. This applied both to accessible languages in which material is easy to obtain (such as English, German and French) and to languages in which material is less accessible (such as Indonesian, Turkish and Afrikaans). In addition, it was hard for us to assess the quality of the material because we do not speak these languages. After a thorough examination and a search over a very wide variety of texts, we found that the best source for this purpose is electronic books downloaded from the web in a variety of languages. We made sure to choose books from different categories (fantasy, history, poetry and so on), so that learning is carried out on different language registers.
We split the texts into sentences and kept only sentences of at least five words. We feared that shorter sentences do not contain enough letters to form a reliable profile that resembles any language. We made sure that the training material in each language includes at least 2000 different sentences. In addition, we converted all letters to lower case.
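To make the test material as "independent" and varied as possible, we deliberately took it from a completely different source: Wikipedia. For each language, the test material consisted of sentences from the Wikipedia entry (in the desired language) of the country in which the examined language is the official language. The thinking was that the information there would be reliable and would contain very little text in foreign languages (for example, it would not include many names or scientific terms in English, but terms and concepts in that same language). For example, for Spanish we used the entry "Spain", and for Afrikaans we used the entry "South Africa".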
3 Handling diacritics
Many of the languages written in Latin letters have additional letters beyond the 26 letters familiar to us from English. These marks, called diacritics, are orthographic marks that affect the pronunciation of the word. They are obtained by adding some mark above or below the "regular" letter. For example, the letter a can become ä, á, ă, ǎ and so on. A diacritic can change not only the pronunciation of a word but also its meaning (for example, in German the word "schön" means "beautiful", whereas the word "schon" means "already").
The decision about diacritics can significantly affect the project and make language identification easier (or harder). For example, the letter ß clearly indicates that the examined language is German, whereas the absence of diacritics hints at English. However, even if we wanted to remove the diacritics, we would still have to decide how to do so. One could use a conversion table to a single letter (for example, turn ä into a plain a), to two letters (for example, turn ä into ae or ß into ss, as is customary in German), or even ignore these marks altogether. Any such correction can considerably affect the distributions and all the vectors we use, and change the project's results. Beyond that, removing the diacritics is worthless from a linguistic point of view, because we completely change the language and invent words.
After much deliberation, which included consultations with several linguists and computational linguistics researchers, we decided to examine the results in two ways: with diacritics and without diacritics. The diacritics were removed using Python's unicodedata library and the NFD format (Normalization Form Canonical Decomposition). This lets us test our hypothesis (that diacritics make language identification much easier), and also compare the results of our work with the original work, which removed the diacritics.
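The removal step described above can be sketched as follows. This is a minimal illustration that assumes the removal amounts to NFD decomposition followed by dropping combining marks. The function name `strip_diacritics` is hypothetical and is not taken from the project code.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop the combining marks (assumed approach)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("schön"))   # the umlaut is dropped
print(strip_diacritics("straße"))  # ß has no canonical decomposition and is kept
```

Note that under this approach letters without a canonical decomposition, such as ß, survive the removal, so some language-specific letters remain even in the "without diacritics" setting.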
Part IV
Algorithms
As stated, in our work we emphasized the statistical approach and decision trees, but we also chose to examine briefly the naive approach, which is based on Zipf's law.
4 The naive approach: Zipf's law
Zipf's law is an empirical law. It states that if we rank the words of some natural language by their frequency of use (in descending order of frequency), the i-th word has a frequency proportional to $1/i$:
\[ \mathrm{occurrences}(w_i) = \frac{K}{i} \]
where $\mathrm{occurrences}(w_i)$ is the number of occurrences of the word ranked i-th by frequency, and $K$ is some constant.
As stated in the introduction, this naive approach is not really "interesting" from an artificial intelligence point of view, because there is no learning here, only a simple lookup in a word list. Nevertheless, we saw fit to present it as a reference point, and felt it would be wrong to discuss automatic language identification without mentioning this method.
In the first stage we collected the list of the most common words in each of the examined languages (except Latin), and then extracted only the x most common words in each language, where $x \in \{10, 20, 50, 100, 500, 1000\}$. We then went over every word in the tested text and looked it up in the list of the x most common words. Depending on the method we worked with (plain counting or a Borda Count-like method), we gave the word a score: a positive score when the word appeared in the list of common words, and a negative score when it did not. The idea is that the more words from the common-word list of some language appear in the test file, the greater the chance that the text is written in that language. Accordingly, we finally returned the language that received the highest score.
As stated, we worked with two scoring methods:
4.1 Plain counting
In this case we looked up every word of the tested text in the list of the x most common words. If the word appeared there it received a score of 1, and otherwise it received a score of −1. At the end of the pass we summed the scores of all the words that appeared in the test file, and returned the language that received the highest score.
4.2 The "Borda Count" method
In this case we looked up every word of the tested text in the list of the x most common words. If the word appeared in the list at frequency rank i (out of the x most common words), it received a score of $x - i$ (that is, the more frequent the word, the higher its score), and otherwise it received a score of −1. At the end of the pass we summed the scores of all the words that appeared in the test file, and returned the language that received the highest score. The rationale for this scoring method was to give priority to words that are more common in the language. According to Zipf's observation these words are indeed more frequent, and therefore better evidence of belonging to the language. In addition, a certain word may appear in several languages, and we then prefer the language in which the word is more common.
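The two scoring methods can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the names `plain_score`, `borda_score` and `identify` are hypothetical, the common-word lists are assumed to be ordered from most to least frequent, and the rank i is assumed to start at 1.

```python
def plain_score(words, common):
    """+1 for every word found in the common-word list, -1 otherwise."""
    common_set = set(common)
    return sum(1 if w in common_set else -1 for w in words)


def borda_score(words, common):
    """x - i for a word at rank i (1-based, assumed) in the list, -1 otherwise."""
    x = len(common)
    rank = {w: i for i, w in enumerate(common, start=1)}
    return sum(x - rank[w] if w in rank else -1 for w in words)


def identify(words, common_by_language, score=plain_score):
    """Return the language whose common-word list gives the highest score."""
    return max(common_by_language,
               key=lambda lang: score(words, common_by_language[lang]))


# Toy lists for illustration only; real lists would hold x = 10..1000 words.
common_words = {"en": ["the", "of", "and"], "de": ["der", "die", "und"]}
print(identify("the cat and the dog".split(), common_words))
```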
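5 Statistical approach
In this approach we learned recurring patterns in the different languages by analyzing the training material and turning it into vectors according to different parameters. We merged the vectors obtained from the different examples in each language into a single typical vector that represents that language. We then performed a similar analysis of the test material and created a vector from it as well. The language identification task was thus "translated" into a comparison between the resulting vector (which represents the test material) and the typical vector of each of the examined languages. In the classification stage we measured (with different methods, detailed below) the distance between the examined vector and each of the representative vectors of the languages, and returned the language for which the examined pattern was the most similar to the training material.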
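Every stage of the process described above was carried out in several different ways. The Cartesian product of the measures, the merging method and the distance measure adds up to a very large number of different tests for each requested text. At the end of the process we compared the result produced by each combination with the true language, and grouped the results by measure, so that we could determine which measure gives the most accurate approximation.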
n-gram                                          Diacritics   Merging of each language's vectors   Distance between the language vector and the tested vector
Unigram (letter frequency)                      With         Uniform weight                       "Simple" distance
Frequency of a letter at the start of a word    Without      Choosing the maximal value           Euclidean distance
Frequency of a letter at the end of a word                                                        Similarity by the cosine of the angle
Bigram (letter pairs)                                                                             Asymmetric Kullback-Leibler distance
                                                                                                  Symmetric Kullback-Leibler distance
                                                                                                  Sum of rank differences
                                                                                                  Infinity norm
Table 1: The configurations tested in the statistical approach
5.1 By n-grams
The different measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters within its words:
5.1.1 Unigram
We carried out the process detailed above, where the examined measure was the relative frequency of every single letter of the language in each of the training materials.
5.1.2 Frequency of a letter at the start of a word
We carried out the process detailed above, where the examined measure was the frequency of each letter as the first letter of words in each of the training materials.
5.1.3 Frequency of a letter at the end of a word
We carried out the process detailed above, where the examined measure was the frequency of each letter as the last letter of words in each of the training materials.
5.1.4 Bigram (letter pairs)
We carried out the process detailed above, where the examined measure was the frequencies of pairs of letters within words in each of the training materials.
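The four measures can be sketched as frequency dictionaries computed from one sentence. This is a minimal illustration: the function names are hypothetical, words are assumed to be separated by whitespace, non-letter characters are assumed to be discarded, and bigrams are assumed not to cross word boundaries.

```python
from collections import Counter


def relative_frequencies(items):
    """Map every distinct item to its share of all the items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def extract_features(sentence):
    """Return the four frequency vectors for one lower-cased sentence."""
    cleaned = ["".join(ch for ch in w if ch.isalpha())
               for w in sentence.lower().split()]
    words = [w for w in cleaned if w]
    return {
        "unigram": relative_frequencies(ch for w in words for ch in w),
        "first": relative_frequencies(w[0] for w in words),
        "last": relative_frequencies(w[-1] for w in words),
        "bigram": relative_frequencies(
            w[i:i + 2] for w in words for i in range(len(w) - 1)),
    }


features = extract_features("Anna ate")
print(features["last"])
```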
5.2 How the vectors of each language are merged
To form a typical vector for each of the examined languages, we had to combine all the vectors obtained for the different training materials in that language into one vector that includes all the features from all the vectors of that language. The way of combining can certainly affect the results, so we chose to do it in two ways:
5.2.1 Uniform weight
In this way, each vector received the same relative weight: if there were x vectors for a certain language, the relative weight of each of them was $1/x$.
For example, if for a certain language we had obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the i-th coordinate represents the relative frequency of letter i), the combined vector would be $v = (0.4, 0.1, 0.2, 0.3)$.
5.2.2 Choosing the maximal value
In this way, we chose for each feature its maximal frequency over all the vectors, and then normalized the frequencies.
For example, if for a certain language we had obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the i-th coordinate represents the relative frequency of letter i), the combined vector before normalization would be $v = (0.5, 0.2, 0.3, 0.4)$, and after normalization it would be
\[ v = \frac{1}{1.4} \cdot (0.5, 0.2, 0.3, 0.4) = \left( \frac{5}{14}, \frac{2}{14}, \frac{3}{14}, \frac{4}{14} \right) \]
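Both merging schemes can be sketched on the example vectors above. This is a minimal illustration that assumes each vector is stored as a dictionary from feature to frequency, with missing features counted as 0. The function names are hypothetical.

```python
def merge_uniform(vectors):
    """Average the vectors, giving each of the x vectors weight 1/x."""
    keys = set().union(*vectors)
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}


def merge_max(vectors):
    """Take the maximal frequency per feature, then normalize to sum 1."""
    keys = set().union(*vectors)
    peak = {k: max(v.get(k, 0.0) for v in vectors) for k in keys}
    total = sum(peak.values())
    return {k: p / total for k, p in peak.items()}


v1 = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.4}
v2 = {"a": 0.5, "b": 0.0, "c": 0.3, "d": 0.2}
print(merge_uniform([v1, v2]))
print(merge_max([v1, v2]))
```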
5.3 How the distance between the language vector and the tested vector is measured
We now have a typical vector that represents the distributions in each language, as well as a vector that represents the test material. We want to compare the test vector with the representative vector of each language in order to find the vector most similar to the test material, and predict that the language of the test material is the language represented by that vector. The language vector "most similar" to the test vector is found by finding the vector whose distance from the test vector is minimal among all the vectors. The essential question was how to estimate the distance between two vectors. The choice of computation can strongly affect the final results, so we used more than the single method used in the original work.
For convenience, in all the following sections we denote the test vector by $P = (P_1, \ldots, P_n)$ and the language vector we learned by $Q = (Q_1, \ldots, Q_m)$. Note that the number of attributes in the two vectors may differ (even when they represent the same language), because the vectors depend on the training material and on the test material. To cope with this difficulty, we added to P and to Q the attributes that do not appear in them but do appear in the other vector, and gave them weight 0. As a result, both P and Q are vectors of size x, where $x \geq \max(m, n)$.
5.3.1 "Simple" distance
In this method, the distance between the vectors P and Q is given by the formula
\[ \sum_{i=1}^{x} |P_i - Q_i| \]
Since we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector (Q) minimizes the expression above.
5.3.2 Euclidean distance
In this method, the distance between the vectors P and Q is given by the formula
\[ \sqrt{\sum_{i=1}^{x} (P_i - Q_i)^2} \]
Since we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector (Q) minimizes the expression above.
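The zero-padding step and the two distances above can be sketched as follows. This is a minimal illustration that assumes vectors are dictionaries from attribute to frequency. The names `align`, `simple_distance`, `euclidean_distance` and `closest_language` are hypothetical.

```python
import math


def align(p, q):
    """Pad both vectors with 0 for attributes that only the other one has."""
    keys = sorted(set(p) | set(q))
    return [p.get(k, 0.0) for k in keys], [q.get(k, 0.0) for k in keys]


def simple_distance(p, q):
    """Sum of absolute coordinate differences."""
    a, b = align(p, q)
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    a, b = align(p, q)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_language(test_vector, language_vectors, distance):
    """Return the language whose representative vector is nearest."""
    return min(language_vectors,
               key=lambda lang: distance(test_vector, language_vectors[lang]))


languages = {"en": {"a": 0.5, "b": 0.5}, "it": {"a": 0.9, "b": 0.1}}
print(closest_language({"a": 0.6, "b": 0.4}, languages, simple_distance))
```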
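5.3.3 Similarity by the cosine of the angle between the vectors
Given two-dimensional vectors in the plane, the similarity between them can be measured by the cosine of the angle between them:
\[ \cos(\alpha - \beta) = \cos\alpha \cdot \cos\beta + \sin\alpha \cdot \sin\beta = \frac{P_1}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_1}{\sqrt{Q_1^2 + Q_2^2}} + \frac{P_2}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_2}{\sqrt{Q_1^2 + Q_2^2}} = \frac{P \cdot Q}{|P| \cdot |Q|} \]
That is, the general formula for an x-dimensional vector is:
\[ \frac{\sum_{i=1}^{x} P_i \cdot Q_i}{\sqrt{\sum_{i=1}^{x} P_i^2} \cdot \sqrt{\sum_{i=1}^{x} Q_i^2}} \]
We want the angle between the two vectors to be as small as possible (the larger the angle between the two vectors, the larger the difference between them). Because the cosine grows as the angle shrinks, the predicted language is the one whose representative vector (Q) maximizes the expression above, which is equivalent to minimizing the angle.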
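The general formula can be sketched as follows, assuming the vectors are dictionaries from attribute to frequency. The function name is hypothetical. A larger value corresponds to a smaller angle, so identical directions give 1 and vectors with no shared attributes give 0.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_p * norm_q)


print(cosine_similarity({"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.2}))
```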
5.3.4 Kullback-Leibler distance (symmetric and asymmetric versions)
This measure finds the difference between two vectors using the formula
\[ D_{KL}(P, Q) = \sum_{i=1}^{x} P_i \cdot \log\left( \frac{P_i}{Q_i} \right) \]
It is not a "classical" distance measure, because in its "simple" version it is not symmetric ($D_{KL}(P, Q) \neq D_{KL}(Q, P)$). To cope with this possible difficulty we did not settle for computing the KL distance in its classical version, and also used a symmetric KL measure that treats the two vectors "equally":
\[ D_{Symmetric\text{-}KL} = \frac{1}{2} \left( D_{KL}(P, Q) + D_{KL}(Q, P) \right) \]
In each of the two methods we want the distance between the two vectors to be as small as possible, so the predicted language is the one whose representative vector (Q) minimizes the expression above.
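Both versions can be sketched as follows. The text does not say how zero entries were handled, although they arise naturally once the vectors are padded with zeros. This sketch therefore assumes that a zero $Q_i$ is replaced by a small constant `EPSILON` and that terms with $P_i = 0$ contribute nothing. Both choices, the natural logarithm, and the function names are assumptions made for illustration.

```python
import math

EPSILON = 1e-9  # assumed smoothing constant, not taken from the source


def kl_divergence(p, q, eps=EPSILON):
    """Asymmetric D_KL(P, Q) over sparse vectors, with assumed smoothing."""
    total = 0.0
    for k in set(p) | set(q):
        pi = p.get(k, 0.0)
        if pi > 0:  # a term with P_i = 0 is taken to be 0
            total += pi * math.log(pi / max(q.get(k, 0.0), eps))
    return total


def symmetric_kl(p, q, eps=EPSILON):
    """Average of the two asymmetric divergences."""
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q), kl_divergence(q, p), symmetric_kl(p, q))
```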
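5.3.5 Sum of rank differences (Ranks)
In this method (which we developed ourselves), we sorted the features of each vector in descending order of frequency, and gave each feature an integer score between 1 and x according to its position in the frequency ordering. We then summed the absolute value of the difference in score for each of the letters, that is,
\[ \sum_{i=1}^{x} \left| \mathrm{Rank}(P_i) - \mathrm{Rank}(Q_i) \right| \]
The language that yields the lowest sum of differences with respect to the examined vector is the most similar language.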
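The rank-based measure can be sketched as follows. The text does not specify how ties between equal frequencies are ordered, so this sketch assumes ties are broken alphabetically by attribute name. The function names are hypothetical.

```python
def ranks(vector, keys):
    """Rank attributes 1..x by descending frequency (ties: alphabetical, assumed)."""
    ordered = sorted(keys, key=lambda k: (-vector.get(k, 0.0), k))
    return {k: i for i, k in enumerate(ordered, start=1)}


def rank_distance(p, q):
    """Sum over all attributes of |Rank(P_i) - Rank(Q_i)|."""
    keys = set(p) | set(q)
    rank_p, rank_q = ranks(p, keys), ranks(q, keys)
    return sum(abs(rank_p[k] - rank_q[k]) for k in keys)


p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(rank_distance(p, q))
```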
n-gram                                          Diacritics   Number of records   Fitness functions
Unigram (letter frequency)                      With         500                 Gini Gain
Frequency of a letter at the start of a word    Without      1000                Entropy
Frequency of a letter at the end of a word                   1500                Information Gain
Bigram (letter pairs)                                        2000                Information Gain Ratio
                                                                                 Train Error
Table 2: The configurations tested in the decision trees
5.3.6 Infinity norm
As in the original work, this method was applied only to bigrams.
The infinity norm is defined as the maximum over the rows of the sum of differences (in absolute value) within a row, and it was computed as follows:
1. For each of the first letters in the bigram list, compute the sum of differences (in absolute value) of the bigrams that begin with that letter.
2. Choose the first letter for which the sum of differences (in absolute value) of the bigrams beginning with that letter gives the largest difference between the examined vector and the language vector:
\[ \| A \|_\infty = \max_{1 \leq i \leq x} \sum_{j=1}^{x} |P_{ij} - Q_{ij}| \]
3. Return the language for which the value found in the previous step was minimal.
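The three steps can be sketched as follows, treating the bigram frequencies as a matrix whose row is the first letter and whose column is the second letter. This is an illustration that assumes bigram vectors are dictionaries keyed by two-letter strings. The function name is hypothetical.

```python
from collections import defaultdict


def infinity_norm_distance(p, q):
    """Largest per-first-letter sum of |P_ij - Q_ij| over all bigrams."""
    row_sums = defaultdict(float)
    for bigram in set(p) | set(q):
        row_sums[bigram[0]] += abs(p.get(bigram, 0.0) - q.get(bigram, 0.0))
    return max(row_sums.values(), default=0.0)


p = {"ab": 0.5, "ba": 0.5}
q = {"ab": 0.25, "ba": 0.25, "ca": 0.5}
print(infinity_norm_distance(p, q))
```

The predicted language is then the one whose vector gives the smallest such value, for example by taking `min` over the languages.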
6 Decision trees
In this approach we learned recurring patterns in the different languages by analyzing the training material and creating decision trees according to different parameters, based on the different examples.
As in the statistical method, in the training stage we measured the frequencies of each of our n-grams in each of the training materials, and created from them a typical vector that models the relative frequency of every n-gram in that training material. Unlike the statistical method, here we did not create a single vector for each language, but left the vectors as they were, one per example. We randomly chose a number of vectors out of all the training vectors, and they became the records of our table, whose columns (that is, the attributes of the table) were the n-grams we found (for example, different pairs of letters). In the next stage we built the decision tree itself (using the ID3 algorithm, described below), so that every attribute is a node in the tree, every edge represents a possible value of that attribute (the values were discretized according to Table 3), and every leaf is a final label (that is, one of the 14 languages). We then classified the test material according to the algorithm described below, based on the decision tree we built.
Every stage of the process described above was carried out in several different ways, with the aim of checking which method returns the best prediction:
6.1 By n-grams
The different measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters within its words:
6.1.1 Unigram
We carried out the process detailed above, where the examined measure was the relative frequency of every single letter of the language in each of the training materials.
6.1.2 Frequency of a letter at the start of a word
We carried out the process detailed above, where the examined measure was the frequency of each letter as the first letter of words in each of the training materials.
6.1.3 Frequency of a letter at the end of a word
We carried out the process detailed above, where the examined measure was the frequency of each letter as the last letter of words in each of the training materials.
6.1.4 Bigram (letter pairs)
We carried out the process detailed above, where the examined measure was the frequencies of pairs of letters within words in each of the training materials.
Unlike the work with unigrams and single letters, when working with bigrams the number of attributes is very large (since these are permutations of two letters), and we therefore reached a recursion depth that was too large. To cope with this difficulty, in many cases we chose "only" the 400-800 most common attributes and built the tree from them.
6.2 How we split the values into groups
The attribute values are the frequencies of unigrams or bigrams, that is, numbers in the range 0-1. Every sentence contains a different number of words of different lengths, composed of a different number of letters. We therefore expect the frequencies to differ greatly from one another, and in order to compare different attributes we have to discretize the frequencies.
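The discretization can be sketched as a lookup of the interval that contains a frequency. The boundaries below are the ones listed for our work in Table 3 of this document. How a frequency that falls exactly on a boundary is treated is not stated, so assigning it to the upper interval is an assumption of this sketch, as are the names used.

```python
from bisect import bisect_right

# Interval boundaries from the "our work" columns of Table 3.
UNIGRAM_EDGES = [0.0015, 0.003, 0.005, 0.01, 0.015, 0.02,
                 0.025, 0.035, 0.05, 0.07, 0.1]
BIGRAM_EDGES = [0.00015, 0.0003, 0.0005, 0.001, 0.0015, 0.002,
                0.0025, 0.0035, 0.005, 0.007, 0.01, 0.013]


def discretize(frequency, edges):
    """Return the index of the interval that contains the frequency."""
    return bisect_right(edges, frequency)


print(discretize(0.081, UNIGRAM_EDGES))   # falls in the interval 0.07-0.1
print(discretize(0.0002, BIGRAM_EDGES))   # falls in the interval 0.00015-0.0003
```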
6.3 The fitness (gain) functions
The attributes are the different features by which we build the tree (single letters, bigrams and so on), taken together with their values, which are their frequencies in the language.

# Example   a       b       ...   Language
1           0.081   0.014   ...   English
2           0.12    0.022   ...   Spanish
3           0.068   0.017   ...   English
...         ...     ...     ...   ...
6.3.1 By Entropy
Entropy is a measure from information theory that represents the degree of uncertainty about the distribution of a certain attribute given some data.
          The original work                   Our work
Segment   Unigrams     Bigrams                Unigrams, first letter,   Bigrams
                                              last letter
0         0-0.001      0-0.00001              0-0.0015                  0-0.00015
1         0.001-0.03   0.00001-0.0003         0.0015-0.003              0.00015-0.0003
2         0.03-0.06    0.0003-0.0006          0.003-0.005               0.0003-0.0005
3         0.06-0.09    0.0006-0.0009          0.005-0.01                0.0005-0.001
4         0.09-0.12    0.0009-0.012           0.01-0.015                0.001-0.0015
5         0.12-0.15    0.012-0.015            0.015-0.02                0.0015-0.002
6         0.15-0.18    0.015-0.018            0.02-0.025                0.002-0.0025
7         0.18-1       0.018-1                0.025-0.035               0.0025-0.0035
8                                             0.035-0.05                0.0035-0.005
9                                             0.05-0.07                 0.005-0.007
10                                            0.07-0.1                  0.007-0.01
11                                            0.1-1                     0.01-0.013
12                                                                      0.013-1
Table 3: The split into groups
For a given table A (when we compute the entropy for the attribute Language) we define the entropy to be
\[ H(A) = -\sum_{v \in \mathrm{LanguageList}} p_v \log(p_v) \]
where v represents a language from the list of languages, and $p_v = \frac{|A_v|}{|A|}$ is the probability of the corresponding label (we use a uniform probability, $A_v$ is the restriction of the table to the records that have value v in attribute a, and $|A|$ is the number of records in the table).
To determine the degree of uncertainty for a certain attribute a, we define its entropy to be
\[ H(A, a) = \sum_{v \in \mathrm{Values}(a)} H(A_v) \]
Note that, unlike the functions presented next, for which we want the attribute with the highest value, here we want the attribute with the lowest entropy value.
6.3.2 By Information Gain
IG measures the expected reduction in entropy given a certain attribute as a node in the tree. For every attribute a in table A we define:
\[ IG(A, a) = H(A) - \sum_{v \in \mathrm{Values}(a)} \frac{|A_v|}{|A|} \cdot H(A_v) \]
6.3.3 By Information Gain Ratio
For every attribute a we define the Information Gain Ratio to be:
\[ IGR(A, a) = \frac{IG(A, a)}{H(A, a)} \]
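The three entropy-based functions can be sketched as follows. This is an illustration under stated assumptions: a table is assumed to be a list of `(features, label)` pairs with discretized feature values, the logarithm is assumed to be base 2, and the value returned when $H(A, a) = 0$ (where the ratio is undefined) is a convention chosen here. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(rows):
    """H(A) over the Language labels of the rows."""
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def partition(rows, attribute):
    """Split the table into sub-tables A_v, one per value v of the attribute."""
    parts = defaultdict(list)
    for features, label in rows:
        parts[features[attribute]].append((features, label))
    return parts


def attribute_entropy(rows, attribute):
    """H(A, a): the sum of H(A_v) over the values of the attribute."""
    return sum(entropy(part) for part in partition(rows, attribute).values())


def information_gain(rows, attribute):
    """IG(A, a): H(A) minus the weighted entropies of the sub-tables."""
    n = len(rows)
    parts = partition(rows, attribute).values()
    return entropy(rows) - sum(len(part) / n * entropy(part) for part in parts)


def gain_ratio(rows, attribute):
    """IGR(A, a) = IG(A, a) / H(A, a), with an assumed convention for H = 0."""
    gain = information_gain(rows, attribute)
    h = attribute_entropy(rows, attribute)
    if h == 0:
        return math.inf if gain > 0 else 0.0
    return gain / h


table = [({"a": 0}, "en"), ({"a": 0}, "es"), ({"a": 1}, "es"), ({"a": 1}, "es")]
print(entropy(table), information_gain(table, "a"), gain_ratio(table, "a"))
```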
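6.3.4 By Gini Gain
We did not study this measure in the course this year, but from searches we carried out we learned that the Gini index is a measure of the deviation between the probabilities of the different values of the target attribute (which in our case is Language).
As with entropy, we first define the Gini Index deviation measure of the given table A for the values of Language:
\[ GI(A) = 1 - \sum_{v \in \mathrm{LanguageList}} \left( \frac{|A_v|}{|A|} \right)^2 \]
We now define the gain function for every attribute to be:
\[ GG(A, a) = GI(A) - \sum_{v \in \mathrm{Values}(a)} \frac{|A_v|}{|A|} \cdot GI(A_v) \]
We want the attribute that maximizes the difference between the overall deviation and the deviations of the sub-tables, each multiplied by the probability of the value that produces it.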
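The two Gini formulas can be sketched as follows, assuming a table is a list of `(features, label)` pairs with discretized feature values. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def gini_index(rows):
    """GI(A): one minus the sum of squared label probabilities."""
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(rows, attribute):
    """GG(A, a): GI(A) minus the weighted Gini indices of the sub-tables."""
    parts = defaultdict(list)
    for features, label in rows:
        parts[features[attribute]].append((features, label))
    n = len(rows)
    return gini_index(rows) - sum(
        len(part) / n * gini_index(part) for part in parts.values())


table = [({"a": 0}, "en"), ({"a": 0}, "en"), ({"a": 1}, "es"), ({"a": 1}, "es")]
print(gini_index(table), gini_gain(table, "a"))
```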
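6.3.5 By Train Error
This function measures the decrease in the expected training error for a certain attribute, where we want to choose the attribute a that maximizes the expected decrease:
\[ TE(A, a) = \min_{v \in \mathrm{LanguageList}} (p_A) - \sum_{v \in \mathrm{Values}(a)} \left( \frac{|A_v|}{|A|} \right) \min_{\mathrm{Language} \in \mathrm{LanguageList}} (p_{A_v}) \]
where $p_A$ is defined relative to table A.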
6.4 The tree-building algorithm
The algorithm we used to build the decision trees is ID3, the recursive algorithm we saw in the course [2], together with the fitness (gain) functions described above.
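A compact sketch of ID3 is given below for orientation. It is a generic textbook-style version, not the project's implementation. It assumes a table is a list of `(features, label)` pairs with discretized values, uses information gain (base-2 logarithm) as the default gain function, represents a leaf as a plain label string and an inner node as a dictionary, and falls back to the majority label when no attributes remain. All of these choices and names are assumptions.

```python
import math
from collections import Counter


def entropy(rows):
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def split(rows, attribute):
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attribute], []).append((features, label))
    return groups


def information_gain(rows, attribute):
    groups = split(rows, attribute).values()
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g) for g in groups)


def id3(rows, attributes, gain=information_gain):
    """Return a label (leaf) or {"attribute": ..., "children": {value: subtree}}."""
    labels = Counter(label for _, label in rows)
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]  # pure table, or majority label
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    return {"attribute": best,
            "children": {value: id3(group, remaining, gain)
                         for value, group in split(rows, best).items()}}


table = [({"a": 0, "b": 1}, "en"), ({"a": 0, "b": 0}, "en"),
         ({"a": 1, "b": 1}, "es"), ({"a": 1, "b": 0}, "es")]
print(id3(table, ["a", "b"]))
```

Passing a different function as `gain` is one way the other fitness functions could be plugged into the same recursion.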
6.5 The language classification algorithm
The initial classification algorithm was a "simple" recursive algorithm that "walks" down the decision tree. In every iteration it chooses the subtree that matches the value the test vector has for the attribute of the current node, and calls itself recursively on that subtree. The algorithm stops when it reaches a leaf and returns the label (that is, the language) of that leaf. However, we choose the training material randomly out of all the records, and in many cases the trees were too large and we had to choose the leading 400-800 attributes. As a result, not all the possible values of an attribute are always represented. We might therefore look for a subtree that does not exist in the tree we learned, that is, we get stuck in the classification process because there is no matching subtree.
To solve this problem we decided that once we reach a node in the tree that has no subtree matching the attribute's value in the classified vector, we go over all the subtrees of that attribute. We define every such new walk as a "guess", and continue walking down the tree until we reach a leaf. Every time we reach a "problematic" state while walking down the subtree, we count it as another "guess" and go over all the subtrees of the node's attribute. At the end of the process we obtain the label from that subtree and the number of "guesses" we made on the way, and we return the label that was reached with the smallest number of "guesses" (every "guess" indicates a gap between the test material and the language we are approaching, and we want to make this gap as small as possible).
In addition, the trees can be large and contain many such nodes. To handle this, and to avoid overfitting, we decided to make the process more efficient and perform pruning during the "walk" down the tree. In every recursive call we also passed the lowest number of guesses that had produced a label so far. If during the walk down the tree we exceeded this number of guesses, we stopped branching further in the tree and gave up on that path, because it would not yield the optimal result.
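The walk with "guesses" and pruning can be sketched as follows. This is an illustration under stated assumptions: a leaf is assumed to be a plain label string, an inner node a dictionary of the form `{"attribute": name, "children": {value: subtree}}`, and when several paths need the same number of guesses the first one found is kept. The function name and the tree layout are hypothetical.

```python
import math


def classify(tree, sample, guesses=0, best=math.inf):
    """Return (label, guesses) for the path with the fewest guesses, or None if pruned."""
    if guesses > best:  # pruning: this path already needs more guesses
        return None
    if isinstance(tree, str):  # reached a leaf
        return tree, guesses
    children = tree["children"]
    value = sample.get(tree["attribute"])
    if value in children:  # a matching subtree exists, no guess needed
        return classify(children[value], sample, guesses, best)
    result = None
    for subtree in children.values():  # stuck: try every subtree as a "guess"
        candidate = classify(subtree, sample, guesses + 1, best)
        if candidate is not None and candidate[1] < best:
            result, best = candidate, candidate[1]
    return result


tree = {"attribute": "e", "children": {
    0: "en",
    1: {"attribute": "a", "children": {0: "de", 2: "it"}}}}
print(classify(tree, {"e": 1, "a": 2}))
print(classify(tree, {"e": 7, "a": 9}))
```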
[2] Tutorial 10, slide 24.
Part V
Results
In this part we present the results we obtained with each of the approaches we took. The analysis of the specific results is given in this part, while a broader discussion appears in the next part. The full results, both of the statistical approach and of the decision trees, are given in Appendices A and B.
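7 Naive approach (Zipf's law)
As expected, the naive approach achieved excellent results, as can be seen in the graph (footnote 3).
It is interesting to note that the identification rate of the Borda Count method was exactly the same for 500 and for 1000 words; that is, there (apparently) is a kind of "glass ceiling" on the predictive power that relies on common words.
It is important to stress again that this approach is very different from the learning methods this project focuses on, and therefore we did not compare the results of those methods with the results above.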
8 Statistical approach
As explained above, the learning process in this method involved a very large number of runs, over the many cross-sections listed in Table 1. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to examine.
Comparing the predicted result with the true result was quite simple, since we worked with labeled files (that is, we knew which language we were supposed to get).
8.1 By the way the distance between the vectors is measured
As stated, the part of the original work that dealt with the statistical approach used the infinity norm to estimate the distance between the vector of each language and the vector representing the examined sentence, whereas in this project we applied additional tools.
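The distance measures themselves are defined earlier in the report. As a rough reminder of how two of them compare a sentence vector with language vectors, here is a minimal sketch. It assumes that vectors are dictionaries mapping an n-gram to its relative frequency, that KL is computed from the sentence to the language profile, and that a small `eps` is added to avoid division by zero; all three choices, and the function names, are our assumptions and may differ from the project's implementation.

```python
import math


def kl(p: dict, q: dict, eps: float = 1e-9) -> float:
    """Kullback-Leibler divergence D(p || q) with additive smoothing (assumed)."""
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items() if pv > 0)


def symmetric_kl(p: dict, q: dict) -> float:
    """Symmetric variant: D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)


def infinity_norm(p: dict, q: dict) -> float:
    """Largest absolute difference over all coordinates."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def closest_language(profiles: dict, sample: dict, distance) -> str:
    """Pick the language whose profile is nearest to the sample vector."""
    return min(profiles, key=lambda lang: distance(sample, profiles[lang]))


# Illustrative toy profiles, not real letter statistics.
profiles = {"en": {"e": 0.6, "t": 0.4}, "de": {"e": 0.3, "n": 0.7}}
sample = {"e": 0.55, "t": 0.45}
print(closest_language(profiles, sample, kl))
```

The infinity norm looks only at the single worst coordinate, while KL aggregates over all coordinates, which is one plausible reason for the gap between the two reported below.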
Footnote 3: General note on the graphs: because of space constraints, and because we wanted to present the resulting functions (which gave results close to one another) as clearly as possible, the y axis does not always start at 0 and end at 100 (at such a scale the graphs could be less readable).
When we ran only on the languages from the original work, we obtained the following results:
As can be seen, in almost all runs and cross-sections the infinity norm yielded the worst results, in most cases by a significant margin from the other measures. In the original work the data were tested with a training set of 600 records, and the success rate was about 46%. We reached results similar to those of the original work: slightly worse when we ignored the diacritics, and slightly better when we took them into account. Overall, this method reached between 40% and 50% identification success, with the diacritics helping a little.
Unlike the infinity norm, the additional mathematical methods we used performed very well: up to 70% identification without diacritics, and up to 80% when taking them into account. The leading method, both with and without diacritics, was KL, which yielded similar results in its symmetric and non-symmetric versions. Most methods yielded similar results regardless of the diacritics (with a tendency to improve when we took them into account); the exceptions are KL and Ranks. It is interesting to note that precisely when we took the diacritics into account (which supposedly provide more information), the Ranks method worked less well. The most likely reason is that turning the diacritic characters into regular letters (for example, ǎ into a) changed the ordering of the common letters in the training material differently than in the test material, which is smaller and therefore more sensitive to changes.
When we ran on all 14 languages, we obtained the following results:
The quality of the results dropped here: when the classifier had to predict a language among only the 6 original languages, it did so better than when it had to predict a language among 14 languages (70% versus 80% with the 6 original languages). This is still excellent identification, especially given the number of languages the classifier had to tell apart. It is interesting to see that even when classifying among 14 languages (and not only among 6), in most cases we obtained better results than those of the infinity norm, which was used in the original work. Here too the relation between the different distance functions is preserved, so KL continues to lead. Except for Angle and Ranks, all functions yielded similar or better results when we took the diacritics into account; these most likely provide the information that is vital for distinguishing better between the languages, as detailed at the beginning of the work.
8.2 By n-gram
In the original work, learning was done by bigrams and unigrams. As stated, in this project we decided to add further measures that rely on the frequency of letters at the beginning of a word and at the end of a word.
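To make the four measures concrete, the following sketch computes relative frequencies of unigrams, bigrams, first letters, and last letters for a piece of text. It is only an illustration: the function name `letter_profiles`, the lowercasing, the regular expression used for tokenization, and the choice to count bigrams inside words only are our assumptions, not necessarily what the project did.

```python
import re
from collections import Counter


def letter_profiles(text: str) -> dict:
    """Return relative frequencies for four letter-based measures."""
    # Words are maximal runs of letters (diacritic letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = {name: Counter() for name in ("unigram", "bigram", "first", "last")}
    for word in words:
        counts["unigram"].update(word)
        counts["bigram"].update(word[i:i + 2] for i in range(len(word) - 1))
        counts["first"][word[0]] += 1
        counts["last"][word[-1]] += 1
    # Normalize each counter so that its values sum to 1.
    return {
        name: {key: value / sum(counter.values()) for key, value in counter.items()}
        for name, counter in counts.items()
    }


profile = letter_profiles("cats dogs")
print(profile["last"])
```

A vector like `profile["last"]` can then be compared with the per-language vectors using any of the distance measures from section 8.1.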
When we ran only on the languages from the original work, we obtained the following results:
When we ran on all 14 languages, we obtained the following results:
As we expected, the most significant information came from the bigrams, since the number of possible bigrams is very large. Among the single letters, we were surprised to find that the letter at the end of the word provided the most information in the learning process. This finding is nevertheless consistent with the fact that in many cases the last letter has a grammatical role (for example, marking the plural, verb conjugation in different tenses, and so on), so it is entirely reasonable that it helps to identify the language (for example, the plural is marked by the letter s in English, by n in German, and by i or e in Italian). It is interesting to note that testing by the first letter yielded the (relatively) least good results, but even these were not bad: above 45% identification (similar to the identification rate of the infinity norm).
8.3 Learning rate (computation time)
In the first stage we created a data structure containing all the training sentences from all the languages. Given this data structure, we ran algorithms that learned the frequencies of the bigrams, the unigrams, the letters at the beginning of a word, and the letters at the end of a word in each of the 14 languages.
Learning was very fast: about 25 seconds to learn all these statistics for all the languages.
8.4 Consistency and overfitting
Overfitting may occur when the classifier starts to "memorize" the training material instead of learning generalizations from it, and as a result gives too much weight to "noise". In such a case the classifier returns better and more accurate results on the familiar training material, but worse results on new, unfamiliar test material (that is, the classifier predicts less well on new information). To verify on the one hand that no overfitting occurred, and on the other hand that the training material is indeed classified properly, we ran the algorithm on part of the training material after learning. As the table below shows, the results of running the algorithm on the training material are indeed better in most cases, but not by a margin large enough to indicate memorization. We can therefore conclude that there is no overfitting. The results are also relatively high, which indicates that the classifier did identify the language in most cases.
                                             Kullback  Symmetric Kullback  Angle  Euclidean  Infinity  Ranks  Simple Difference
Original results                             69.34     68.87               46.93  59.19      44.94     42.56  53.07
Results on a small part of the training set  71.41     69.3                50.95  61.12      40.07     51.91  57.14
8.5 The recall, precision, and F1 measures
In this project we ran an algorithm on test material and, in the end, obtained a prediction of the language in which the test material is written. Note that the predictions can be viewed in the following ways:
1. True Positive = we identified the sentence as belonging to some language, and it does belong to it.
2. False Positive = we identified the sentence as belonging to some language, but it does not belong to it.
3. True Negative = we identified the sentence as not belonging to some language, and it indeed does not belong to it.
4. False Negative = we identified the sentence as not belonging to some language, but it does belong to it.
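Another way to evaluate the results is through the recall and precision measures.
Recall measures the rate of correct assignments to a language out of all the assignments we should have obtained under perfect identification. In other words, it captures the miss rate: the lower the recall, the more sentences of that language were missed.
Precision measures the rate of correct assignments to a language out of all the assignments we actually obtained for that language. In other words, it captures the noise rate: the lower the precision, the more sentences were wrongly assigned to that language.
\[ \text{recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]
\[ \text{precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]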
Both measures are important, but neither stands on its own (a good recall can come together with a poor precision, so recall alone is not enough to conclude that the prediction is good). It is therefore customary to look at combined measures as well. The most prominent of these is F1, the harmonic mean of the two, given by the following formula:
\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
The possible values of all these measures lie in the range between 0 and 1, and the closer they are to 1, the better the prediction.
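As a small worked example of the three formulas, the sketch below computes precision, recall, and F1 for one language from lists of true and predicted labels. The function name `per_language_scores` and the convention of returning 0.0 when a denominator is zero are our own assumptions.

```python
def per_language_scores(true_labels, predicted_labels, language):
    """Return (precision, recall, f1) for a single language."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == language and p == language)
    fp = sum(1 for t, p in pairs if t != language and p == language)
    fn = sum(1 for t, p in pairs if t == language and p != language)
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative labels only, not data from the project.
true_labels = ["en", "en", "de", "de", "it"]
predicted = ["en", "de", "de", "de", "en"]
print(per_language_scores(true_labels, predicted, "de"))
```

For "de" in this toy data there are 2 true positives, 1 false positive, and no false negatives, so precision is 2/3, recall is 1, and F1 is 0.8.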
We measured these values in our runs:
                with diacritics                without diacritics
                500    1000   1500   2000      500    1000   1500   2000
original        0.648  0.651  0.654  0.653     0.65   0.646  0.642  0.637
all languages   0.618  0.606  0.599  0.593     0.59   0.586  0.583  0.581
9 Decision trees
As explained above, in this method we performed a very large number of runs, over the many cross-sections mentioned in Table 2. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to examine.
Here too, comparing the predicted result with the true result was quite simple, since we worked with labeled files (that is, we knew which language we were supposed to get).
9.1 By n-gram, relative to the fitness (gain) functions
When we ran only on the languages from the original work, we obtained the following results:
When we ran on all 14 languages, we obtained the following results:
The following graph shows the patterns obtained for all the trees, filtered by the leading measures:
To our surprise, in all runs the algorithms that worked on the first letter and on the last letter yielded relatively low prediction rates (below 30%). In our opinion, the reason lies in the fact that we used for them the same division into groups as for unigrams in general (as detailed in Table 3), rather than a dedicated division that fits the distributions we found.
Regarding the Gini function, our results are very similar to the original work, both for bigrams and for unigrams. However, unlike the original work, where this function yielded the best results, we managed to reach even better results using the IGR function, both for unigrams and for bigrams. This result is consistent with what we learned in the course, namely that IGR was "developed" in order to reduce the overfitting that may arise when the Information Gain function is used. Indeed, there is a gap of between 5% and 15% between the performance of these two functions, both for bigrams and for unigrams.
As we expected, and unlike the original work, in most cases the functions that worked with bigrams gave better results, even when we reduced the number of attributes we used.
It should be noted that building the trees with the entropy function took a long time, but the results these trees yielded were good, both in absolute terms and relative to the other functions. In other words, entropy is not efficient, and each split made with it gives less information than with the other functions; yet in the end the sum of all the splits gives no less information (and sometimes even more) than most of the other functions.
9.2 Learning rate (computation time)
The average build time of a tree was about half an hour. The time needed to build a decision tree was directly proportional to the number of attributes and to the number of rows in the training material: the more there were, the longer the build took. In many cases we were forced to limit the number of attributes to 800 or even fewer. Using the diacritics lengthened the computation considerably, since the number of attributes grew dramatically.
The trees for the unigrams, the first letter, and the last letter were built relatively quickly (a few minutes), whereas the bigram trees required a long computation time (for example, the IG unigram tree on 2000 records was built in less than a minute, whereas the Gini Gain bigram tree on 2000 records was built in 56 minutes). This can be explained by the relatively small number of attributes in trees of the first kind (the attributes were single letters, so there were a few dozen of them: the "regular" letters plus the diacritic characters), whereas the number of attributes in the bigram trees was significantly larger (in fact, it is on the order of the square of the number of attributes of the first kind, as explained above).
Among the measures, entropy required the longest computation time. The reason is that entropy is too general a measure: it does not take the different options and features sufficiently into account, so it does not split the data clearly enough, which leads to a much greater recursion depth and to a longer computation. This is exactly why better measures such as IGR are usually used; they compute the entropy but weight it with additional computations.
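The relation between entropy, Information Gain, and IGR mentioned above can be illustrated with the textbook definitions. The sketch below is a minimal version under our own assumptions: the function names and the list-of-dictionaries data layout are hypothetical, and the gain ratio follows the standard formulation (information gain divided by the entropy of the attribute's own value distribution), which may differ in detail from the definition used earlier in the report.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(rows, labels, attribute) -> float:
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attribute) -> float:
    """Information gain normalized by the split's own entropy."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:          # attribute has a single value: no useful split
        return 0.0
    return information_gain(rows, labels, attribute) / split_info


# Illustrative toy data: "id" is unique per row, "last" is a real feature.
rows = [{"id": 0, "last": "s"}, {"id": 1, "last": "s"},
        {"id": 2, "last": "n"}, {"id": 3, "last": "n"}]
labels = ["en", "en", "de", "de"]
print(gain_ratio(rows, labels, "last"), gain_ratio(rows, labels, "id"))
```

In this toy data both attributes have the same information gain, but the gain ratio penalizes `id`, which splits the data into many tiny groups. This is the sense in which IGR counteracts the overfitting tendency of plain Information Gain.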
Part VI
Discussion, conclusions, and thoughts for the future
Summarizing their work, Moran and Tamar wrote that "we believe that even more accurate results can be reached on the basis of our work". Indeed, the results we reached in our work are significantly better: we reached above 70% identification with a variety of methods and cross-sections (for bigrams, for unigrams, and with different statistical functions and fitness (gain) functions), and even 79% identification with the KL method and 73% with the IGR method. In the original work the best result was about 60%; it was obtained for a large training set of 2000 records per language and with only one method, while for all the other methods the identification rate was 45% at most.
It is interesting to note that Moran and Tamar's best results were obtained for unigrams, of all things, while in the vast majority of cases the success rates were below 50%. When they worked with bigrams, they reached about 25% success. They hypothesized that the reason was that their division was apparently not fine enough, and that a division into different segments might yield better results. Our work, however, suggests otherwise. As part of the work on the trees we tried to "play" with the division into clusters (as can be seen in Table 3): we ran the trees many times on different and varied divisions, the idea being that the finer and more balanced the division between the clusters, the shallower the tree and the better the split. In the end, however, we found no notable differences between the results of the division we tried and those of the original division; sometimes the new division achieved better results, and sometimes the original one. In our opinion, the relatively low results in the original work may be due to training or test materials that were biased or incorrect, or to test material that was not cleaned (unlike the training material), so that the two did not match. Another possibility is that learning the bigrams created overfitting, so that their classifier could not cope well with new material.
Moran and Tamar removed the diacritics, since they assumed that keeping them raises the complexity of the task and increases the size of the data structures and the learning time. In our project, by contrast, we chose to support both options: keeping the diacritics, and removing them. As we expected and detailed in the introduction, the diacritics added significant information, and therefore most of the algorithms, in both the statistical method and the decision trees, showed better results when we took the diacritics into account (an average improvement of about 10% in the statistical tools and of up to 20% in the decision trees). It is nevertheless important to note that even without the diacritics the different algorithms proved themselves and reached very good results. It is worth noting that keeping the diacritics had a fairly negligible effect on the run times of the statistical algorithms, but a noticeable effect on the run times of building the trees (though not on classification once the trees were built). This is not surprising: keeping the diacritics adds at most a small, constant number of extra computations to the statistical algorithms, but it can noticeably increase the number of attributes in each table (as described above). In many cases this raises the recursion depth and, accordingly, the computation time of building the trees.
Regarding the statistical approach, Moran and Tamar wrote that they think "the results can be improved ... by means of more advanced inference methods ... [than] the infinity norm". Indeed, all the new tools we used demonstrated very high identification capabilities, whereas the measure based on the infinity norm reached worse results. It was also interesting to find that the KL measure yielded almost identical results in its symmetric and non-symmetric versions.
Regarding the decision trees, we found that their main drawback is that they are very large and take a long time to build. In many cases we reached the maximum recursion depth and therefore had to reduce the number of attributes we used. As a result, classification was less good than in the statistical cases. As stated, we tried to "play" with the divisions in an attempt to improve the results, but these changes did not show a notable improvement. Despite this, most of the decision-tree results came out the same when the tree was built on "only" the 6 original languages and when it was built on all 14 languages, which strengthens our view that decision trees are indeed good tools for language identification and are almost unaffected by the number of languages examined.
It can be seen that the first-letter and last-letter measures did not work well with the decision trees, whereas with the statistical tools they reached significantly better results. As stated, we believe that this is related to the division into groups. Since developing all the algorithms in the project, running them, and analyzing them took many weeks, we did not manage to investigate the matter in depth. In our opinion, a future project could examine the data and try different runs with new divisions, finer and coarser, until an ideal division is obtained that best expresses the profiles of the languages.
The results of our work, like the results of many other works in the field, prove once again the astonishing claim that natural languages can be identified even without any prior linguistic knowledge. There is no need to know grammar rules, different registers, or the semantic meaning of words in order to identify the language; a "cold", dry analysis of letters and words is enough, and these can be taken from a huge variety of accessible sources, such as newspapers, websites, and literature. In addition, the results of the work and the tools we used can serve to enrich the linguistic knowledge we have about different languages: the degree of closeness between them and various regularities (for example, discovering that the letter a is more frequent at the end of a word in Italian than in English), and to examine the cognitive implications that follow from them.
Our project can be continued and extended by adding further measures, such as a combination of bigrams and unigrams (for example, one can build a tree whose nodes are the x leading bigrams and the y leading unigrams). In our opinion this could lead to even better results than the ones we reached. Unfortunately, we did not manage to implement this within the project, because of the shortage of time and the large number of measures we used. Further topics that can be investigated concern measures that were not examined, such as the average number of words in a language or the average word length in a language (for example, the average word in German is longer than the average word in English). One can also try to improve the division into clusters and reach even better results.
Part VII
Sources
• The original work by Moran and Tamar: http://www.cs.huji.ac.il/~ai/projects/NLP.pdf
• Lecture and recitation slides of course 67842 - "Introduction to Artificial Intelligence"
• Lecture slides of course 36622 - "Computational Learning and Cognitive Aspects of Language"
• Gutenberg Project - http://www.gutenberg.org/
• http://www.bookrix.com/
• http://www.e-book.com.au/morefreebooks/freemultilingualbooks.htm
• http://tnlessone.wordpress.com/2007/05/13/how-to-detect-which-language-a-text-is-written-in-or-when-science-meets-human/
• http://en.wikipedia.org/wiki/List_of_languages_by_writing_system#Latin_script
• http://en.wikipedia.org/wiki/Letter_frequency
• http://stackoverflow.com/questions/3194516/replace-national-characters-with-ascii-equivalent
• http://staff.science.uva.nl/~tsagias/?p=185
• http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
• http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=4
• http://www.101languages.net/common-words/
Part VIII
Appendix A - Main statistical results in detail
                     original languages w/ diacritics   original languages w/o diacritics
                     500    1000   1500   2000          500    1000   1500   2000
Kullback             79.43  78.07  78.07  79.07         67.72  68.36  66.92  66.86
Symmetric Kullback   77.85  75.71  75.5   76.71         67.28  67.28  65.86  64.92
Angle                59.5   58.28  60.57  59.43         57     56.5   60.22  58.78
Euclidean            70.21  66.57  68.71  67.5          67.07  66.72  66.78  66.78
Infinity             48.85  43.14  47.71  46.29         41.14  42.57  45.42  42
Ranks                58.07  60     58.57  60.71         69.28  67.07  65.22  68.14
Simple Difference    62.85  65.14  64.14  64.79         58.36  61.07  60.5   61.78

                     All languages w/ diacritics        All languages w/o diacritics
                     500    1000   1500   2000          500    1000   1500   2000
Kullback             69.34  69.22  69.59  69.88         62.73  63.45  62.53  61.9
Symmetric Kullback   68.87  68.33  69.19  69.4          59.09  60.43  59.02  57.75
Angle                46.93  46.17  45     44.75         49.48  49.25  49.25  48.96
Euclidean            59.19  58.77  57.38  57.12         56.66  57.98  57.56  56.59
Infinity             44.94  41.49  43.33  40.8          39.78  39.78  41.03  38.39
Ranks                42.56  43.85  46.35  45.60         55     57.53  58.45  57.75
Simple Difference    53.07  52.25  51.78  50.67         53.17  53.21  52.98  51.5

                     original languages w/ diacritics   original languages w/o diacritics
                     500    1000   1500   2000          500    1000   1500   2000
Unigrams             62.57  62.05  61.91  60.1          56.96  55.61  57.81  58.57
Bigrams              69.55  66.89  68.48  69.39         69.6   65.31  68.86  68.81
First                58.42  60.86  61     61.1          55.1   54.23  52.91  55.71
Last                 77.96  75.42  75.52  77.72         71.42  79.05  73.52  70.61

                     All languages w/ diacritics        All languages w/o diacritics
Unigrams             53.23  52.04  54.54  53.14         52.1   52.25  54.06  53.23
Bigrams              68.01  65.59  65.53  65.65         68.88  67.42  67.07  66.84
First                47.6   48.52  47.85  46.53         45.09  48.32  46.55  45.03
Last                 53.95  55.54  54.57  55.47         53.14  55.32  54.5   53.1
Part IX
Appendix B - Main decision-tree results in detail
              First Letter                   Last Letter
              500    1000   1500   2000      500    1000   1500   2000
Gini          20     21.15  21.84  18.39     23.45  21.38  20.92  22.76
Entropy       20.68  20.68  22.06  22        25.74  26.43  23.9   29.86
IG            18.85  19.54  20.23  21.61     20.92  20.46  20     20.1
IGR           22.53  27.36  29.66  29.89     21.38  26.9   28.28  26.67
Train Error   16.09  17.93  18.62  18.16     15.86  20.69  20.69  20.69

              Unigrams                       Bigrams
              500    1000   1500   2000      500    1000   1500   2000
Gini          51.03  49.2   52.41  54.71     30.11  30.8   28.9   31.03
Entropy       57.24  62.29  70.11  68.28     61.38  64.83  67.13  62.56
IG            42.53  46.67  53.79  56.55     56.32  61.84  61.61  63.51
IGR           61.38  62.07  72.64  71.49     69.65  71.3   73.33  72.64
Train Error   39.77  42.53  44.83  46.44     27.58  28.05  33.1   31.72