
CS544: Introduction to Weka

Zornitsa Kozareva
USC/ISI
Marina del Rey, CA
[email protected]
www.isi.edu/~kozareva

January 29, 201[?]

WEKA: Waikato Environment for Knowledge Analysis

Weka: Data Mining Software
•  Collection of machine learning algorithms
–  open-source package written in Java
•  Used for research, education and application
•  Main features:
–  data pre-processing tools
–  learning algorithms
–  evaluation methods
–  graphical interface
–  environment for comparing learning algorithms

Weka: Data Mining Software
•  Classification algorithms:
–  decision trees, kNN, SVM, Naive Bayes
•  Prediction algorithms:
–  regression (linear/SVM), perceptron
•  Meta-algorithms:
–  bagging, boosting (AdaBoost)
among others

Getting Started
•  Install Weka software (on Linux):
–  Download link:
•  http://prdownloads.sourceforge.net/weka/weka-3-6-2.zip
•  Unzip the software
–  Requirement: Java 1.5 (or higher)
–  Invoke Weka command:
•  java -cp weka.jar <Weka-command>

Weka GUI Chooser
java -Xmx1000M -jar weka.jar

Data File Format (.arff)

@relation named_entity

@attribute position numeric
@attribute pos_tag {NN, NP, VB, DT}
@attribute word_length numeric
@attribute in_gazetteer {no, yes}
@attribute class {B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, O}

@data
3,DT,3,no,B-ORG
4,NP,10,yes,I-ORG
15,NP,6,yes,O
7,NN,12,?,B-PER
...

Other attribute types:
•  String
•  Date

Missing value: written as "?" (see in_gazetteer in the last instance above)
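The .arff layout can also be produced programmatically. The following is a minimal Python sketch, not part of Weka; the function name `to_arff` and its argument conventions are assumptions made for illustration. It writes the header and data sections, renders missing values as "?", and rejects nominal values that were not declared in the header.

```python
def to_arff(relation, attributes, rows):
    """Render an ARFF document as a string.

    attributes: list of (name, kind) pairs, where kind is the string
                "numeric" or a list of allowed nominal values.
    rows:       list of value lists; None marks a missing value.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # Nominal attributes are written as {v1, v2, ...}.
        spec = "{" + ", ".join(kind) + "}" if isinstance(kind, (list, tuple)) else kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match the number of attributes")
        for (name, kind), value in zip(attributes, row):
            # Every nominal value used in the data must be declared in the header.
            if value is not None and isinstance(kind, (list, tuple)) and value not in kind:
                raise ValueError(f"undeclared value {value!r} for attribute {name}")
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("position", "numeric"),
    ("in_gazetteer", ["no", "yes"]),
    ("class", ["B-PER", "O"]),
]
print(to_arff("named_entity", attributes, [[7, None, "B-PER"], [3, "no", "O"]]))
```

The check on nominal values catches a class label that appears in the data section but is missing from the class attribute declaration.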

The Preprocessing Tab
•  List of attributes (last: class variable)
•  Frequency and categories for the selected attribute
•  Statistics about the values of the selected attribute
•  Classification
•  Filter selection
•  Manual attribute selection
•  Statistical attribute selection
•  Preprocessing

Slide adapted from Marti Hearst

The Classification Tab
•  Choice of classifier
•  The attribute whose value is to be predicted from the values of the remaining ones. Default is the last attribute.
•  Cross-validation: split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one
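The cross-validation procedure can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `k_fold_splits` is hypothetical, and the sketch neither shuffles nor stratifies the data, which a real toolkit may do.

```python
def k_fold_splits(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    # Deal the instances round-robin into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        # Train on the other k - 1 folds.
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


for train, test in k_fold_splits(25, k=10):
    print(len(train), "training instances,", len(test), "test instances")
```

Each instance is used for testing exactly once and for training in the other k - 1 rounds.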

Choosing a classifier


all other numbers can be obtained from it

different/easy class

accuracy


Running on Test Set


WEKA Command Line

Weka Specifications
•  Train classifier on training data and output model
•  java -cp weka.jar <classifier-function> -t <train-file> -d <trained-model>
•  Run trained classifier model on test data
•  java -cp weka.jar <classifier-function> -T <test-file> -l <trained-model>
•  Specifying parameters:
–  -t : training file (.arff)
–  -T : test file (.arff)
–  -d : output filename (trained classifier model)
–  -l : input model (for testing)
–  -K : number of nearest neighbors for kNN algorithm
–  -h : help (check out other parameter options, etc.)

General parameters: -t, -T, -d, -l, -h
Classifier-specific parameters: -K

Example: kNN in Weka
•  Train a classifier using 2NN algorithm
•  java -cp weka.jar weka.classifiers.lazy.IBk -t data/weather.arff -K 2 -d model.2nn
–  weka.classifiers.lazy.IBk : classifier-function in Weka
–  -t data/weather.arff : training file
–  -K 2 : algorithm parameter
–  -d model.2nn : output model name
•  Run the trained classifier on test data
•  java -cp weka.jar weka.classifiers.lazy.IBk -T data/weather.arff -l model.2nn
–  weka.classifiers.lazy.IBk : classifier-function in Weka
–  -T data/weather.arff : test file
–  -l model.2nn : input model name
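To make the role of the -K parameter concrete, here is a pure-Python illustration of k-nearest-neighbor classification. It is a sketch of the general idea only, not Weka's IBk implementation; the function name `knn_predict` and the toy (temperature, humidity) values are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=2):
    """Predict a label for `query` by majority vote among its k nearest
    training instances (Euclidean distance).

    train: list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical (temperature, humidity) -> play instances.
train = [((64, 65), "yes"), ((65, 70), "yes"), ((85, 85), "no"), ((80, 90), "no")]
print(knn_predict(train, (66, 68), k=2))
print(knn_predict(train, (84, 88), k=2))
```

With an even k, votes can tie; how ties are broken is an implementation detail that this sketch does not try to match.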

Sample Weka Output
•  Classification labels for each instance (use “-p 1” option)
•  java -cp weka.jar weka.classifiers.lazy.IBk -T data/weather.arff -l model.2nn -p 1

More Detailed Output

Weka Classification Functions
•  kNN:
•  Decision trees:
•  Naïve Bayes:
•  AdaBoost:

Additional Information
•  General documentation:
–  http://www.cs.waikato.ac.nz/ml/weka/
–  http://prdownloads.sourceforge.net/weka/weka.ppt
•  Command line doc:
–  http://weka.wikispaces.com/Primer

Assignment #1

Named Entity Recognition
•  Given: train and development data sets of English sentences tagged with the classes:
–  B-PER, I-PER (people)
–  B-ORG, I-ORG (organization)
–  B-LOC, I-LOC (location)
–  B-MISC, I-MISC (miscellaneous)
–  O (outside, meaning not a named entity class)
•  Your objective: develop a machine learning NE system which, when given a new, previously unseen text (i.e. the test set), will identify and classify the named entities correctly

Data Description
•  The data consists of three columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence.

U.N. NNP B-ORG
official NN O
Ekeus NNP B-PER
heads VBZ O
for IN O
Baghdad NNP B-LOC
. . O

Columns: word, part-of-speech tag, named entity tag

B-TYPE means the word is the beginning of a phrase of type TYPE
O means the word is not part of a phrase

Make sure to preserve the empty lines in the output of the test data
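Reading and writing this three-column format can be sketched as follows. The function names `read_sentences` and `write_sentences` are hypothetical; the sketch assumes whitespace-separated columns and an empty line after every sentence, as described above.

```python
def read_sentences(lines):
    """Group 'word POS tag' lines into sentences; an empty line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, tag = line.split()
        current.append((word, pos, tag))
    if current:  # last sentence without a trailing empty line
        sentences.append(current)
    return sentences


def write_sentences(sentences):
    """Write sentences back out, keeping the empty line after each sentence."""
    out = []
    for sentence in sentences:
        out.extend(" ".join(token) for token in sentence)
        out.append("")  # the empty separator line must be preserved
    return "\n".join(out) + "\n"


sample = ["U.N. NNP B-ORG", "official NN O", "Ekeus NNP B-PER", "heads VBZ O",
          "for IN O", "Baghdad NNP B-LOC", ". . O", ""]
sentences = read_sentences(sample)
print(write_sentences(sentences), end="")
```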

Results for NE Detection

CoNLL-2002 Spanish Evaluation Data

Data sets      #tokens    #NEs
Train          264,715    18,794
Development     52,923     4,351
Test            51,533     3,558

Carreras et al., 2002    Precision    Recall    F-score
BIO dev.                 92.45        90.88     91.66

Evaluation Measures
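Precision, recall and F-score for named entities can be computed from the tag sequences. The sketch below assumes the exact-match, phrase-level convention (an entity counts as correct only if its type and both boundaries match) and the balanced F-score; the function names are hypothetical and this is not the official scoring script.

```python
def entity_spans(tags):
    """Convert a B-/I-/O tag sequence into (type, start, end) spans, end exclusive."""
    spans, start, kind = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.append((kind, start, i))
            start, kind = None, None
        # A stray I- tag with no open span is treated as the start of a new one.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def precision_recall_f(gold_tags, predicted_tags):
    """Entity-level precision, recall and balanced F-score."""
    gold = set(entity_spans(gold_tags))
    predicted = set(entity_spans(predicted_tags))
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]
predicted = ["B-ORG", "O", "O", "O", "O", "B-PER", "O"]
print(precision_recall_f(gold, predicted))
```

For detection-only scoring, the entity type can be dropped from the spans before comparing.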


Results for NE Classification*

Spanish Dev.    Precision    Recall    F-score
LOC             79.04        80.00     79.52
MISC            55.48        54.61     55.04
ORG             79.57        76.06     77.77
PER             87.19        86.91     87.05
overall         79.15        77.80     78.47

Spanish Test    Precision    Recall    F-score
LOC             85.76        79.43     82.47
MISC            60.19        57.35     58.73
ORG             81.21        82.43     81.81
PER             84.71        93.47     88.87
overall         81.38        81.40     81.39

* System of Carreras et al., 2002

Timeline
•  Release
–  Train/Development Data: January 29th 201[?]
–  Test Data: January [?]th 201[?]
•  Result Submission Deadline: January [?]th 201[?] (11:59 pm GMT)
–  later submissions will not be accepted
•  Technical Report Deadline: January [?]th 201[?]

Submit
•  The source code for the feature generation (make sure it will run under Linux)
•  The official train and test feature files used in the final run, together with the final output of your system for the test data
•  Additionally generated resources (if any)
•  Write a brief (~4 page) description of your approach, explaining:
–  used NLP tools
–  designed features
–  employed machine learning algorithm and motivation

Evaluation is based on
•  ranking of your system against the rest
•  designed features
–  novel, previously unknown features will be favored
–  system's pre or post processing
•  generated resources
–  size, methods and sources for gazetteer extraction
–  trigger lists
•  quality of the paper description
–  structure
–  use of literature
–  error analysis

Generate Your Own Resources
•  Extract gazetteers from Wikipedia
–  People (singers, teachers, mathematicians etc.)
–  Locations (cities, countries)
–  Organizations (universities, IT companies etc.)
•  Extract trigger words from WordNet
–  look for hyponyms of person, location, organization
•  Extract and rank the patterns in which the NEs occurred in the train and development data. Show what percentages of these were found in the final test data.
•  Extract lists of verbs found next to the NEs. Do you find any similarity/regularity of the verbs associated with each one of the NE categories?
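Collecting the verbs found next to named entities, one of the resources suggested on this slide, could start from something like the Python sketch below. It assumes sentences are lists of (word, POS tag, NE tag) triples and that verb tags begin with "VB", as in the VBZ tag of the sample data; the function name `verbs_next_to_entities` is hypothetical.

```python
from collections import Counter, defaultdict


def verbs_next_to_entities(sentences):
    """Count verbs directly before or after a named entity token, per NE category."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, (word, pos, tag) in enumerate(sentence):
            if tag == "O":
                continue
            category = tag[2:]  # strip the B- / I- prefix
            for j in (i - 1, i + 1):
                if 0 <= j < len(sentence):
                    neighbor, neighbor_pos, neighbor_tag = sentence[j]
                    # Only count neighbors that are verbs outside any entity.
                    if neighbor_tag == "O" and neighbor_pos.startswith("VB"):
                        counts[category][neighbor.lower()] += 1
    return counts


sentence = [("U.N.", "NNP", "B-ORG"), ("official", "NN", "O"),
            ("Ekeus", "NNP", "B-PER"), ("heads", "VBZ", "O"),
            ("for", "IN", "O"), ("Baghdad", "NNP", "B-LOC"), (".", ".", "O")]
for category, verbs in verbs_next_to_entities([sentence]).items():
    print(category, verbs.most_common(5))
```

Ranking the per-category counters then shows whether some verbs are strongly associated with one NE category.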

What must I do …
•  Use the train and development data to design and tune your NE system
•  Decide on the features you would like to incorporate in your NE system
•  Choose a machine learning classifier from Weka
–  http://www.cs.waikato.ac.nz/ml/weka/
–  Intro by Marti Hearst: http://courses.ischool.berkeley.edu/i256/f06/lectures/lecture16.ppt

This is a big assignment, start early!
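As a starting point for feature design, each token can be turned into one row of feature values that a classifier can consume. The Python sketch below reuses the attribute names from the .arff example earlier in the deck (position, pos_tag, word_length, in_gazetteer, class); the function name `token_features` and the tiny gazetteer are hypothetical, and real systems will need richer features.

```python
def token_features(sentence, gazetteer):
    """Return one feature row per token:
    [position, pos_tag, word_length, in_gazetteer, class]."""
    rows = []
    for position, (word, pos, tag) in enumerate(sentence, start=1):
        in_gazetteer = "yes" if word in gazetteer else "no"
        rows.append([position, pos, len(word), in_gazetteer, tag])
    return rows


sentence = [("U.N.", "NNP", "B-ORG"), ("official", "NN", "O"),
            ("Ekeus", "NNP", "B-PER"), ("heads", "VBZ", "O"),
            ("for", "IN", "O"), ("Baghdad", "NNP", "B-LOC"), (".", ".", "O")]
gazetteer = {"Baghdad"}  # hypothetical location list
for row in token_features(sentence, gazetteer):
    print(",".join(str(value) for value in row))
```

Each printed line has the shape of an instance in the @data section of an .arff file.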

What you must not do …
•  Use existing named entity system(s) or library either as a feature generator, output generator etc.
•  If you do, then you will have to run your system for two additional languages: Spanish and Italian

Available Resources
•  WordNet http://wordnet.princeton.edu/
•  Part-of-speech taggers
–  TreeTagger http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
–  Stanford PoS Tagger http://nlp.stanford.edu/software/tagger.shtml
•  NP chunker
–  http://www.dcs.shef.ac.uk/~mark/index.html?http://www.dcs.shef.ac.uk/~mark/phd/software/chunker.html
•  Parser
–  Stanford Parser http://nlp.stanford.edu/software/lex-parser.shtml
•  Other http://nlp.stanford.edu/links/statnlp.html

Good Luck!