Exhaustkd: Searching exhaustively for an optimal setting of Timbl’s -k and -d parameters


Overview

• Introduction: – Timbl’s k & distance weighting

• Idea: – Read knn sets and distances from Timbl’s output

• Implementation: – exhaust.py & cvexhaust.py

• Examples: – diminutive & Prosit data

• Discussion: – limitations & improvements

Introduction

[Figure: test instance "?" among training instances of classes X and Y, with neighborhoods marked k=1, k=2, and k=3]

Knn classification without distance weighting

Introduction (cont.)

[Figure: test instance "?" among training instances of classes X and Y, with neighborhoods marked k=1, k=2, and k=3]

Knn classification with distance weighting

Introduction (cont.)

• Distance weighting methods:
– Z (no weighting)
– ID (inverse distance)
– IL (inverse linear)
– EDa (exponential decay with alpha a)
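The four weighting schemes can be sketched as a single function. This is only an illustration under the commonly used definitions (inverse distance with a small constant to avoid division by zero, Dudani-style inverse linear, and exponential decay with exponent 1); the function name `neighbor_weight`, its argument names, and the value of `eps` are assumptions, and Timbl's exact formulas and constants may differ.

```python
import math


def neighbor_weight(method, d, d_nearest, d_farthest, alpha=1.0, eps=1e-6):
    """Return the vote weight of a neighbor at distance d (illustrative sketch).

    d_nearest and d_farthest are the distances of the closest and the k-th
    neighbor set; they are only needed for inverse linear weighting.
    """
    if method == "Z":    # no weighting: every neighbor counts equally
        return 1.0
    if method == "ID":   # inverse distance; eps (assumed) avoids division by zero
        return 1.0 / (d + eps)
    if method == "IL":   # inverse linear: 1 at the nearest, 0 at the farthest distance
        if d_farthest == d_nearest:
            return 1.0
        return (d_farthest - d) / (d_farthest - d_nearest)
    if method == "ED":   # exponential decay with constant alpha
        return math.exp(-alpha * d)
    raise ValueError("unknown weighting method: %r" % method)


for method in ("Z", "ID", "IL", "ED"):
    print(method, neighbor_weight(method, 0.3, 0.1, 0.5, alpha=5.0))
```

Note that under IL the neighbors at the k-th distance get weight 0, so they only matter when all distances are equal.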

Idea

• Knn classification is actually a two-step process:
1. Determine the nearest neighbor sets (= those instances with similar features, optionally using feature weighting and MVDM)
2. Determine the majority class among all nearest neighbors up to and including the k-th distance, optionally using distance weighting

We can do step 2 without repeating step 1!
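Step 2 on its own can be sketched as follows, assuming the neighbor sets of one test instance are already available as a list of (distance, class counts) pairs sorted by distance. The function name `vote` and this data layout are assumptions made for illustration, not exhaustkd's actual internals, and ties are left unresolved here. The sample values are the neighbor sets of the first instance in the dimin example output.

```python
from collections import Counter


def vote(neighbor_sets, k, weight=lambda distance: 1.0):
    """Sum the (optionally distance-weighted) class votes of the first k neighbor sets."""
    scores = Counter()
    for distance, class_counts in neighbor_sets[:k]:
        w = weight(distance)
        for label, count in class_counts.items():
            scores[label] += w * count
    return scores


# Neighbor sets of one instance: (distance, {class: number of neighbors}).
neighbor_sets = [
    (0.0, {"P": 1}),
    (0.0594251, {"E": 1}),
    (0.103409, {"E": 3, "P": 2}),
]

# The same stored sets are reused for every k; step 1 is never repeated.
for k in (1, 2, 3):
    print(k, dict(vote(neighbor_sets, k)))
```

Passing a different `weight` function re-runs the vote under another distance weighting without touching the neighbor sets.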

Idea (cont.)

• The +v option allows you to write the knn sets and their distances to the output file
– +vn := write nearest neighbors
– +vdi := write distance (of the instance to be classified, and of its knn sets)
– +vdb := write class distribution (of the instance to be classified, and of its knn sets)

Example: Timbl -f dimin.train -t dimin.test -k3 +vn+di+db

=,=,=,=,+,k,u,=,-,bl,u,m,E,E { E 4.00000, P 3.00000 } 0.0000000000000
# k=1, 1 Neighbor(s) at distance: 0.00000
#	=,=,=,=,+,k,u,=,-,bl,u,m,{ P 1 }
# k=2, 1 Neighbor(s) at distance: 0.0594251
#	=,=,=,=,+,m,K,=,-,bl,u,m,{ E 1 }
# k=3, 5 Neighbor(s) at distance: 0.103409
#	=,=,=,=,+,m,O,z,-,bl,u,m,{ E 1, P 1 }
#	=,=,=,=,+,st,K,l,-,bl,u,m,{ E 1 }
#	=,=,=,=,+,m,y,r,-,bl,u,m,{ E 1, P 1 }
+,m,I,=,-,d,A,G,-,d,},t,J,J { J 8.00000 } 0.28274085738293
# k=1, 1 Neighbor(s) at distance: 0.282741
#	-,v,@,r,+,v,A,l,-,p,},t,{ J 1 }
# k=2, 6 Neighbor(s) at distance: 0.311890
#	=,=,=,=,=,=,=,=,+,k,},t,{ J 1 }
#	=,=,=,=,=,=,=,=,+,p,},t,{ J 1 }
#	=,=,=,=,=,=,=,=,+,xr,},t,{ J 1 }
#	=,=,=,=,=,=,=,=,+,l,},t,{ J 1 }
#	=,=,=,=,=,=,=,=,+,h,},t,{ J 1 }
#	=,=,=,=,=,=,=,=,+,fr,},t,{ J 1 }
# k=3, 1 Neighbor(s) at distance: 0.325529
#	+,m,K,=,-,d,@,=,-,pr,a,t,{ J 1 }
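Reading this output back in could look roughly like the sketch below. It assumes the layout shown in the example: an instance line ending in the gold class, the predicted class, a distribution, and a distance, followed by "# k=..." header lines and "#"-prefixed neighbor lines. The names `read_instances` and `parse_distribution` and the regular expressions are hypothetical, not taken from exhaustkd, and the sketch does not handle instances whose first feature value is "#" or class labels containing braces or commas.

```python
import re
from collections import Counter

# Patterns inferred from the example output only (assumption).
HEADER = re.compile(r"^#\s*k=(\d+),\s*(\d+) Neighbor\(s\) at distance:\s*(\S+)")
DISTRIBUTION = re.compile(r"\{ ([^{}]*) \}\s*$")
INSTANCE = re.compile(r"^(.*) \{ ([^{}]*) \} (\S+)\s*$")


def parse_distribution(text):
    """Turn 'E 1, P 1' into Counter({'E': 1.0, 'P': 1.0})."""
    counts = Counter()
    for item in text.split(","):
        label, value = item.split()
        counts[label] += float(value)
    return counts


def read_instances(lines, delimiter=","):
    """Yield one dict per test instance, with its neighbor sets in order of distance."""
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        header = HEADER.match(line)
        if header:
            # Start a new neighbor set at the reported distance.
            current["neighbor_sets"].append((float(header.group(3)), Counter()))
        elif line.startswith("#"):
            # A neighbor line: add its class counts to the current set.
            found = DISTRIBUTION.search(line)
            current["neighbor_sets"][-1][1].update(parse_distribution(found.group(1)))
        else:
            if current is not None:
                yield current
            fields = INSTANCE.match(line).group(1).split(delimiter)
            current = {"gold": fields[-2], "predicted": fields[-1], "neighbor_sets": []}
    if current is not None:
        yield current
```

Because `read_instances` is a generator, only one instance and its neighbor sets are held in memory at a time.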

Idea (cont.)

• From this output, you can
– read the knn members and their distances
– repeat classification for smaller k’s and other distance weightings
– without calculating the knn sets and their distances again

• Hence
– classification is potentially much faster
– exhaustively trying all combinations of k and distance weighting becomes feasible

Implementation

• Python scripts
– exhaustkd
– cvexhaustkd

• Requirements:
– Python 2.1 or later
– Expenv libraries

• Input
– List of classes
– Timbl output file(s) produced with +vn+di+db and a high k
– Some option settings

• Output
– Tables with performance measures (accuracy, recall, precision, F-score) for all combinations of k and d

$ exhaustkd -h
usage: exhaustkd [options] CLASSES FILE
       exhaustkd [options] CLASSES <FILE

purpose:
  Timbl's +vn+di+db option causes it to add the nearest neighbors and
  their distances to its output. This output can then be passed on to
  exhaustkd to perform an exhaustive classification over a range of k's
  and distance weighting metrics. It will always try Z, ID, and IL.
  Optionally, various settings of ED can be tried. exhaustkd tabulates
  the performance for all settings. Which evaluation metrics are
  reported depends on the PATTERN of the -o option. A PATTERN is a
  comma-separated list of one or more of the following symbols:

    A  = accuracy
    K  = kappa
    P  = combined precision
    R  = combined recall
    F  = combined F-score
    pC = precision on class C
    rC = recall on class C
    fC = F-score on class C

args:
  CLASSES   classes as a comma-separated list
  FILE      classifier output file

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -aFLOAT1,FLOAT2,...,FLOATn, --alphas=FLOAT1,FLOAT2,...,FLOATn
                        values to try as the alpha constant in the
                        exponential decay metric (default is none)
  -bFLOAT, --beta=FLOAT
                        beta in F score calculation (default is 1.0)
  -dSTRING, --delimiter=STRING
                        column delimiter (default is ' ')
  -f, --full-output     output all available evaluation metrics for every
                        setting
  -kINT, --max-k=INT    the maximum number of nearest neighbors to try
                        (default is 1)
  -nINT, --n-best=INT   the number of settings reported in the n-best list
                        (default is 10)
  -oPATTERN, --output=PATTERN
                        output pattern (default is 'A,P,R,F')
  -rINT, --random-seed=INT
                        seed for random generator (default is current
                        system time)
  -t{once|continue|random}, --tie-resolution={once|continue|random}
                        tie resolution by increasing k once (default), by
                        increasing k continuously, or by choosing randomly
  -%, --percent         output in percentages

examples:

  exhaustkd -k5 -a1.0,2.0 X,Y,Z output_file

    perform an exhaustive classification into classes X,Y,Z with k from 1
    to 5, and distance metrics Z, ID, IL, ED1.0 and ED2.0

  exhaustkd -% -opX,rX,fX X,Y,Z <output_file

    output precision, recall, and F-score percentages on class X

  exhaustkd -k10 -tcontinue -oA X,Y,Z output_file

    output accuracy when using continuous tie resolution up to k=10
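The per-class metrics named in the help text (pC, rC, fC) and the role of the -b beta option can be illustrated with the standard definitions of precision, recall, and F-beta. The sketch below assumes a confusion matrix stored as a dict keyed by (gold, predicted) pairs; the function name `per_class_scores` and that layout are hypothetical, and how exhaustkd computes its "combined" P, R, and F is not covered here.

```python
def per_class_scores(confusion, label, beta=1.0):
    """Return (precision, recall, F-beta) for one class.

    confusion maps (gold, predicted) pairs to counts (assumed layout).
    """
    tp = confusion.get((label, label), 0)
    fp = sum(n for (gold, pred), n in confusion.items() if pred == label and gold != label)
    fn = sum(n for (gold, pred), n in confusion.items() if gold == label and pred != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denominator = b2 * precision + recall
    # Standard F-beta: beta > 1 favors recall, beta < 1 favors precision.
    f_score = (1 + b2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score


confusion = {("X", "X"): 8, ("X", "Y"): 2, ("Y", "X"): 2, ("Y", "Y"): 8}
print(per_class_scores(confusion, "X"))
```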

Example: diminutive

• Commands
– Timbl -f dimin.train -t dimin.test -o out -k5 +vn+db+di
– exhaustkd -d, -k5 -a1,5 -oA -% P,T,J,E,K out

================================================================================
Accuracy (%)
================================================================================

k    Z      ID     IL     ED1.0  ED5.0
1    96.74  96.74  96.74  96.74  96.74
2    97.37  96.74  96.63  96.42  96.63
3    96.42  96.42  97.05  96.53  96.95
4    95.05  95.68  96.42  95.68  96.32
5    95.37  95.26  96.42  95.47  95.79

Rank:  Score:  k:  d:
1      97.37   1   Z
2      97.05   2   IL
3      96.95   2   ED5.0
4      96.74   1   ID
5      96.74   0   Z
6      96.74   0   IL
7      96.74   0   ID

Example: Prosit breaks

• Commands:
– For each of the 10 folds, a Timbl run with -k31 -o out0??
– cvexhaustkd -a1,5 -k30 -% -oA,P,R,F,pB,rB,fB B,- out0?? >exhaustive-report

================================================================================
Accuracy (%)
================================================================================

k:   Z:           ID:          IL:          ED1.0:       ED5.0:
1    94.83 0.35   94.83 0.35   94.83 0.35   94.83 0.35   94.83 0.35
2    96.00 0.44   94.83 0.35   94.83 0.35   94.83 0.35   94.83 0.35
3    96.00 0.44   96.00 0.44   95.89 0.48   96.01 0.44   95.99 0.43
4    96.28 0.41   96.02 0.47   95.99 0.47   96.02 0.46   96.02 0.47
5    96.28 0.41   96.29 0.41   96.22 0.49   96.28 0.41   96.28 0.40
6    96.37 0.44   96.28 0.41   96.24 0.41   96.27 0.42   96.27 0.41
7    96.37 0.44   96.38 0.44   96.36 0.46   96.37 0.44   96.36 0.45
8    96.41 0.45   96.42 0.47   96.37 0.44   96.41 0.47   96.41 0.47
9    96.41 0.45   96.42 0.44   96.42 0.46   96.41 0.44   96.43 0.43
10   96.40 0.42   96.45 0.43   96.45 0.45   96.45 0.43   96.45 0.43
11   96.40 0.42   96.41 0.41   96.46 0.46   96.40 0.42   96.43 0.43
12   96.43 0.41   96.44 0.44   96.46 0.47   96.43 0.45   96.45 0.44
13   96.43 0.41   96.43 0.41   96.47 0.44   96.42 0.41   96.44 0.42
14   96.44 0.47   96.46 0.45   96.47 0.46   96.46 0.45   96.48 0.46
15   96.44 0.47   96.44 0.47   96.47 0.47   96.44 0.47   96.46 0.46
16   96.41 0.49   96.45 0.47   96.45 0.46   96.45 0.47   96.47 0.47
17   96.41 0.49   96.41 0.48   96.46 0.47   96.40 0.49   96.45 0.47
18   96.40 0.48   96.45 0.47   96.47 0.47   96.44 0.47   96.46 0.47
19   96.40 0.48   96.40 0.47   96.47 0.48   96.39 0.48   96.44 0.45
20   96.37 0.49   96.41 0.46   96.48 0.48   96.41 0.47   96.43 0.46
21   96.37 0.49   96.37 0.48   96.48 0.49   96.37 0.49   96.40 0.47
22   96.38 0.49   96.39 0.46   96.48 0.47   96.40 0.47   96.41 0.46
23   96.38 0.49   96.38 0.49   96.48 0.49   96.38 0.49   96.41 0.48
24   96.39 0.51   96.41 0.48   96.48 0.50   96.41 0.48   96.43 0.47
25   96.39 0.51   96.39 0.50   96.47 0.49   96.39 0.51   96.42 0.49
26   96.40 0.48   96.43 0.50   96.49 0.49   96.43 0.49   96.45 0.49
27   96.39 0.48   96.40 0.48   96.48 0.49   96.40 0.48   96.43 0.49
28   96.40 0.46   96.41 0.48   96.48 0.49   96.41 0.48   96.41 0.47
29   96.40 0.46   96.40 0.46   96.48 0.49   96.40 0.46   96.43 0.47
30   96.39 0.49   96.43 0.46   96.48 0.49   96.42 0.47   96.43 0.47

Discussion: Time

• Normal time:
– An average 10-fold CV Timbl experiment on the Prosit breaks requires about 30 hours (min. 20 to max. 50 hours)
– Here we have k x distance weighting = 30 x 5 = 150 CV experiments
– Thus, this would normally require about 150 x 30 = 4500 hours = 188 days

• Time with exhaustkd:
– A single 10-fold CV experiment with dumping of the knn sets requires about 30 hours
– Running cvexhaustkd takes about 3 minutes (!)
– Therefore, we have reduced the required time by a factor of 150
– (BTW the “seconds taken” reported by Timbl are a little off :-)
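The arithmetic behind these estimates can be checked with a few lines. The figures are the ones quoted on the slide; the variable names are only illustrative.

```python
hours_per_cv_experiment = 30      # one 10-fold CV run on the Prosit breaks
k_values = 30                     # k = 1 .. 30
weightings = 5                    # Z, ID, IL, ED1.0, ED5.0

settings = k_values * weightings                  # number of (k, d) combinations
naive_hours = settings * hours_per_cv_experiment  # one full CV run per combination
naive_days = naive_hours / 24

# With exhaustkd: one CV run that dumps the knn sets, plus about 3 minutes.
exhaust_hours = hours_per_cv_experiment + 3 / 60
speedup = naive_hours / exhaust_hours

print(settings, naive_hours, naive_days, speedup)
```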

Discussion: Memory & Space

• Memory:
– Exhaustkd works locally, reading an instance and its nn’s from file, classifying, and adding the result to a confusion matrix
– Consumes very little memory (2-5MB)

• Disk Space:
– Writing knn’s to output can take a lot of space
– E.g. upsampled Prosit break data with k=31 requires about 1.8GB

Discussion: limitations

• Obviously, the k of exhaustkd can never be larger than the real k (= the k of the original Timbl experiment)

• Actually, the k of exhaustkd must be one less than the real k

• Reason: tie resolution
– In case of a tie, k is increased by one

• Also, output of exhaustkd may differ slightly from Timbl’s output

• Reason: tie resolution
– If a tie is still unresolved after increasing k, Timbl resorts to a random choice
– The exact random behaviour cannot be reproduced by exhaustkd
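The three tie resolution strategies (once, continue, random) might be sketched as below, using unweighted votes for brevity. The function names and the neighbor set layout are assumptions, and details such as which classes remain eligible after k is increased may differ from what Timbl and exhaustkd actually do. The sketch also shows why one extra neighbor set beyond the requested k has to be present in the Timbl output.

```python
import random
from collections import Counter


def tally(neighbor_sets, k):
    """Unweighted class counts over the first k neighbor sets."""
    scores = Counter()
    for _distance, class_counts in neighbor_sets[:k]:
        scores.update(class_counts)
    return scores


def winners(scores):
    """All classes sharing the highest score, in sorted order."""
    best = max(scores.values())
    return sorted(label for label, score in scores.items() if score == best)


def classify(neighbor_sets, k, strategy="once", rng=random):
    """Classify with tie resolution 'once', 'continue', or 'random' (sketch)."""
    # Largest k the strategy may grow to, bounded by the stored sets.
    limit = {"once": k + 1, "continue": len(neighbor_sets), "random": k}[strategy]
    limit = min(limit, len(neighbor_sets))
    tied = winners(tally(neighbor_sets, k))
    while len(tied) > 1 and k < limit:
        k += 1
        tied = winners(tally(neighbor_sets, k))
    # Still tied: fall back to a random choice among the tied classes.
    return tied[0] if len(tied) == 1 else rng.choice(tied)


# First instance of the dimin example: tied at k=2, resolved at k=3.
neighbor_sets = [
    (0.0, {"P": 1}),
    (0.0594251, {"E": 1}),
    (0.103409, {"E": 3, "P": 2}),
]
print(classify(neighbor_sets, 2, "once"))
```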

Discussion: limitations (cont.)

• Timbl output (accuracy and #ties):

k    Z             ID           IL           ED1.0
1    96.74 4/5     96.74 4/5    96.63 3/5    96.74 4/5
2    97.37 10/12   96.74 0/1    96.74 4/5    96.42 0/1
3    96.42 6/8     96.42 0/1    96.74 0/1    96.53 1/1
4    95.05 3/5     95.68 0/0    96.74 1/1    95.68 0/0
5    95.37 5/5     95.26 0/0    96.42 0/0    95.47 0/0

• Exhaustkd output (average accuracy and SD):

k:   Z:           ID:          IL:          ED1.0:
1    96.78 0.05   96.74 0.00   96.74 0.00   96.74 0.00
2    97.25 0.09   96.74 0.00   96.63 0.00   96.42 0.00
3    96.46 0.05   96.42 0.00   97.05 0.00   96.53 0.00
4    95.05 0.00   95.68 0.00   96.42 0.00   95.68 0.00
5    95.37 0.00   95.26 0.00   96.42 0.00   95.47 0.00

Discussion: Limitations (cont.)

• Limit on number of nn’s:
– Currently, if the number of nn’s exceeds 500, Timbl will only write the first 500
– Because you don’t want to dump the whole instance base (!)
– However, it would be nice if this limit were an option

Discussion: Plans

• Exhaustkd can be faster:
– Code not really profiled yet
– Code can be (partly) compiled to C

• Exhaustkd can be combined with methods that optimise feature weighting (-w) and feature metric (-m) options
– Paramsearch/Iterative Deepening

• Experiment with exhaustkd’s 3 options for tie resolution:
– Random
– Increase k once
– Increase k continuously until tie is resolved

• Wild plans:
– Can exhaustkd be a part of Timbl?
