Machine Learning Tutorial, Section 1: Machine Learning basic concepts


  • Machine Learning Tutorial

    CB, GS, REC

    Section 1: Machine Learning basic concepts

    Machine Learning Tutorial for the UKP lab, June 10, 2011

  • This presentation includes some slides, slide parts, and text taken from online materials created by the following people: Greg Grudic, Alexander Vezhnevets, Hal Daumé III

  • What is Machine Learning?

    The goal of machine learning is to build computer systems that can adapt and learn from their experience.

    Tom Dietterich


  • A Generic System

    [Figure: a generic "system" box with input variables x_1, x_2, ..., x_N entering, output variables y_1, y_2, ..., y_M leaving, and hidden variables h_1, ..., h_K inside the system]

    Input Variables:  x = (x_1, x_2, ..., x_N)

    Hidden Variables: h = (h_1, h_2, ..., h_K)

    Output Variables: y = (y_1, y_2, ..., y_M)
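A minimal sketch of this generic-system view (the function and the synthetic hidden state are illustrative assumptions, not part of the slides): observable inputs x are mapped to observable outputs y, while the hidden variable h is internal and never observed directly.

```python
# Toy "generic system": N input variables, a hidden variable, M = 2 output variables.
import numpy as np

def system(x, rng=np.random.default_rng(0)):
    """Map input variables x (length N) to output variables y (length M = 2)."""
    h = np.tanh(x).sum()                        # hidden variable: internal state, not observed
    y = np.array([h + rng.normal(scale=0.1),    # output 1: noisy function of h
                  2.0 * h])                     # output 2: deterministic function of h
    return y

print(system(np.array([0.5, -1.2, 3.0])))
```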

  • When are ML algorithms NOT needed?

    When the relationships between all system variables (input, output, and hidden) are completely understood!

    This is NOT the case for almost any real system!

  • The Sub-Fields of ML

    Supervised Learning

    Reinforcement Learning

    Unsupervised Learning

  • Supervised Learning

    Given: training examples { (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P)) } for some unknown function (system) y = f(x)

    Find f(x). Predict y = f(x), where x is not in the training set.
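A minimal sketch of this setting, assuming scikit-learn (the tutorial itself does not prescribe a toolkit): P labeled pairs (x_i, f(x_i)) are used to fit a hypothesis, which is then asked to predict y for an x that was not in the training set.

```python
from sklearn.neighbors import KNeighborsClassifier

# P = 4 training examples (x_i, f(x_i)) of the unknown function f
X_train = [[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]]
y_train = ["neg", "neg", "pos", "pos"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)  # learned hypothesis
print(clf.predict([[6.5, 5.5]]))  # predict y = f(x) for an x not in the training set
```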

  • Model, model quality

    Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    Learned hypothesis: model of the problem/task T
    Model quality: accuracy/performance measured by P
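To make the definition concrete, a small sketch (scikit-learn and the digits dataset are illustrative assumptions): the task T is digit classification, the experience E is the number of training examples, and the performance measure P is accuracy on held-out data, which should typically improve as E grows.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 200, 800):                               # growing experience E
    clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(n, round(clf.score(X_test, y_test), 3))      # performance measure P (accuracy)
```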

  • Data / Examples / Sample / Instances

    Data: experience E in the form of examples / instances, characteristic of the whole input space

    representative sample: independent and identically distributed (no bias in selection / observations)

    Good example: 1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)

    probably i.i.d.; representative?

    if annotation is involved, it is always a question of compromises

    Definitely bad example: all abstracts that have John Smith as an author

    Instances have to be comparable to each other
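As a sketch of the contrast above (the abstract pool is a hypothetical stand-in for PubMed): a uniform random draw approximates an i.i.d. sample, whereas selecting by a property of the instances builds bias into the data.

```python
import random

# Hypothetical pool standing in for the 20M PubMed abstracts
all_abstracts = [f"abstract-{i}" for i in range(20_000)]

random.seed(0)
good_sample = random.sample(all_abstracts, 1000)  # uniform random draw: roughly i.i.d.

# Definitely bad: the selection depends on a property of the instances
# (e.g. author == "John Smith"), so the sample is biased, not representative.
bad_sample = [a for a in all_abstracts if a.endswith("7")]  # placeholder for such a biased filter
```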

  • Data / Examples / Sample / Instances

    Example: a set of queries and a set of top retrieved documents (characterized via tf, idf, tf*idf, PRank, BM25 scores) for each

    try predicting relevance for reranking!
    the top retrieved set is dependent on the underlying IR system!

    issues with representativeness, but for reranking this is fine
    the characterization is dependent on the query (except PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)

    we have to normalize the features per query to have the same mean/variance
    we have to form pairs and compare e.g. the difference of feature values

    Toy example:
    Q = learning,    rank 1: tf = 15, rank 100: tf = 2
    Q = overfitting, rank 1: tf = 2,  rank 10: tf = 0
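The toy example shows why raw term frequencies are only comparable within a single query: tf = 2 sits near the bottom of the list for one query and at the top for another. A minimal sketch of per-query normalization (the data layout is a hypothetical illustration):

```python
import statistics

# tf scores per query; values are only comparable among candidates of the SAME query
candidates = {
    "learning":    [{"doc": "d1", "tf": 15}, {"doc": "d2", "tf": 2}],
    "overfitting": [{"doc": "d3", "tf": 2},  {"doc": "d4", "tf": 0}],
}

for query, docs in candidates.items():
    tfs = [d["tf"] for d in docs]
    mean = statistics.mean(tfs)
    std = statistics.pstdev(tfs) or 1.0          # guard against zero variance
    for d in docs:
        d["tf_norm"] = (d["tf"] - mean) / std    # zero mean / unit variance per query

print(candidates)
```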

  • Features

    The available examples (experience) have to be described to the algorithm in a consumable format

    Here: examples are represented as vectors of pre-defined features
    E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.

    Common feature types:
    binary (criminal record, Y/N)
    nominal (city of residence, X)
    ordinal (income range: 0-10K, 10-20K, ...)
    numeric (debt load, $)
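A sketch of how these four feature types might be turned into one numeric vector (the field values mirror the credit-risk example; the concrete encoding is an assumption, not part of the slides):

```python
# One applicant from the credit-risk example, with one feature of each common type.
applicant = {
    "criminal_record": False,     # binary
    "city": "Darmstadt",          # nominal
    "income_range": "10-20K",     # ordinal
    "debt_load": 12500.0,         # numeric
}

INCOME_ORDER = ["0-10K", "10-20K", "20-50K", "50K+"]   # ordinal values mapped to ranks
CITIES = ["Berlin", "Darmstadt", "Munich"]             # nominal values mapped to one-hot slots

vector = [float(applicant["criminal_record"])]
vector += [1.0 if applicant["city"] == c else 0.0 for c in CITIES]
vector += [float(INCOME_ORDER.index(applicant["income_range"]))]
vector += [applicant["debt_load"]]
print(vector)   # [0.0, 0.0, 1.0, 0.0, 1.0, 12500.0]
```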

  • Machine Learning Tutorial

    CB, GS, REC

    Section 2: Experimental practice

    by now you've learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result, improvement is observed in terms of some performance measure

    Machine Learning Tutorial for the UKP lab, June 10, 2011

  • Model parameters

    2 kinds of parameters
    one that the user sets for the training procedure in advance: hyperparameter

    the degree of the polynomial to fit in regression
    the number/size of hidden layers in a Neural Network
    the number of instances per leaf in a decision tree

    one that actually gets optimized through the training: parameter
    regression coefficients
    network weights
    size/depth of the decision tree (in Weka; other implementations might allow you to control that)

    we usually do not talk about the latter, but refer to hyperparameters as parameters

    Hyperparameters: the fewer the algorithm has, the better
    Naive Bayes the best? No parameters!
    usually algorithms with better discriminative power are not parameter-free

    typically set to optimize performance (on a validation set, or through cross-validation): manual search, grid search, simulated annealing, gradient descent, etc.

    common pitfall: selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results
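A minimal sketch of this practice, assuming scikit-learn (the slides themselves mention Weka): the hyperparameter, here the number of instances per leaf of a decision tree, is tuned by cross-validation on the training data, and performance is reported on a blind test set rather than on the cross-validation folds used for the selection.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter (instances per leaf) tuned by 10-fold CV on the training data only
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"min_samples_leaf": [1, 5, 20]},
                      cv=10)
search.fit(X_train, y_train)

print(search.best_params_)            # selected hyperparameter value
print(search.score(X_test, y_test))   # performance reported on the untouched blind set
```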

  • Cross-validation, Illustration

    X = { x_1, ..., x_k }

    [Figure: the data X split into five folds X_1, X_2, X_3, X_4, X_5; in each iteration one fold serves as Test and the remaining folds as Train]

    The result is an average over all iterations
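A minimal sketch of the illustrated procedure, assuming scikit-learn (illustrative only): each fold serves once as the test set while the others train, and the reported figure is the average over all iterations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # one score per fold X_1 ... X_5
print(scores.mean())   # the reported result: average over all iterations
```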

  • Cross-Validation

    n-fold CV: common practice for making (hyper)parameter estimation more robust

    round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
    typical: random splits, without replacement (each instance tests exactly once)

    the other way: random subsampling cross-validation

    n-fold CV: common practice to report average performance, deviation, etc.
    No Unbiased Estimator of the Variance of K-Fold Cross-Validation (Bengio and Grandvalet, 2004)
    bad practice? problem: training sets largely overlap, test errors are also dependent

    tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
    5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average: less overlap in training sets

    Folding via natural units of processing for the given task
    typically, document boundaries; best practice is doing it yourself!

    ML package / CSV representation is not aware of e.g. document boundaries!
    The PPI case
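A sketch of folding along natural processing units, assuming scikit-learn's group-aware splitter (an assumption; the slides recommend doing the folding yourself): every instance carries the id of its source document, so no document ends up in both the training and the test fold, which a CSV-level random split cannot guarantee.

```python
from sklearn.model_selection import GroupKFold

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]            # six instances
y = [0, 1, 0, 1, 0, 1]
docs = ["doc1", "doc1", "doc2", "doc2", "doc3", "doc3"]    # natural unit: source document

# Whole documents move together, so no document appears in both train and test
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=docs):
    print("train:", train_idx, "test:", test_idx)
```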

  • Cross-Validation

    Ideally the valid settings are:
    take off-the-shelf algorithms, avoid parameter tuning, and compare results, e.g. via cross-validation

    n.b. you probably do the folding yourself, trying to minimize biases!
    do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then normally you have to have a blind set (from the beginning)

    e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn experimental best practice and to align with the predefined standards (you might even benefit from comparative results, etc.)

    You might want to do something different: be aware of these & the consequences

  • The ML workflow

    Common ML experimenting pipeline
    1. define the task

    instance, target variable/labels, collect and label/annotate data
    credit risk assessment: instance = 1 credit request, label = good/bad credit, examples = the requests from the previous year

    2. define and collect/calculate features; define train / validation (development) / test (evaluation) data (a sketch of such a three-way split follows below)
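A minimal sketch of such a three-way split, assuming scikit-learn and illustrative proportions (the digits dataset stands in for the featurized task data): the development set is used for tuning, and the evaluation set stays untouched until the end.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)   # stand-in for the featurized task data

# 70% train, 15% validation (development, used for tuning), 15% test (evaluation, kept blind)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))
```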