1 TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction TMVA Workshop – Introduction Andreas...

1TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction

TMVA Workshop – Introduction

Andreas Hoecker (CERN)TMVA Workshop, 21 January 2011, CERN, Switzerland

http://tmva.sf.net

http://tmva.sf.net/

http://tmva.sf.net/

http://tmva.sf.net/

http://tmva.sf.net/


Goals of the WorkshopIntroduction to multivariate classification and regression with TMVA

• Pedagogical talks on various multivariate methods (morning)

• Talks from users (14:00)

• Tutorial (15:30)


TMVA

• ROOT: is the analysis framework used by most (HEP)-physicists

• Idea: rather than just implementing new MVA techniques and making them available in ROOT:

﹣ Have one common platform / interface for all MVA methods

﹣ Have common data pre-processing capabilities

﹣ Train and test all classifiers on same data sample and evaluate consistently

﹣ Provide common analysis (ROOT scripts) and application framework

﹣ Provide access with and without ROOT, through macros, C++ executables or python


TMVA

• TMVA started in 2006 on the Sourceforge development platform

• 6 core developers, 21 contributors so far

• TMVA is written in C++, and relies on ROOT functionality

• Since ROOT 5.15 / TMVA v3.7.2 TMVA is part of ROOT, and developed directly in ROOT SVN

﹣ Continue to maintain primary tmva-users mailing list on Sourceforge

﹣ New TMVA versions also published as downloadable tgz files on Sourceforge

﹣ For bug reports, use ROOT Savannah

http://tmva.sf.net/

http://root.cern.ch/root/htmldoc/TMVA_Index.html

http://sourceforge.net/mail/?group_id=152074

https://savannah.cern.ch/bugs/?func=additem&group=savroot


Simulated Higgs Event in CMS

Higgs event in an LHC proton–proton collision at high luminosity (together with ~24 other inelastic events)

7 TeVLHC 2010Such events occur only in a tiny fraction

of the proton-proton collisions O(10−10)


Event Classification in HEP

• Most HEP analyses require discrimination of signal from background:﹣ Event level (Higgs searches, …)

﹣ Cone level (Tau-vs-jet reconstruction, …)

﹣ Track level (particle identification, …)

﹣ Object level (flavour tagging, …)

﹣ Parameter estimation (significance, mass, CP violation in B system, …)

• The multivariate input information used for this has various sources ﹣ Kinematic variables (masses, momenta, decay angles, …)

﹣ Event properties (jet/lepton multiplicity, sum of charges, …)

﹣ Event shape (sphericity, Fox-Wolfram moments, …)

﹣ Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …)

• Traditionally few powerful input variables were combined. New methods allow to use up to 100 and more variables w/o loss of classification powere.g. MiniBooNE: NIMA 543 (2005), or D0 single top: Phys.Rev. D78, 012005 (2008)


Event Classification

• Suppose data sample with two types of events: H0, H1

﹣ We have found discriminating input variables x1, x2, …

﹣ What decision boundary should we use to select events of type H1 ?

Linear boundary? A nonlinear one?Rectangular cuts?

H1

H0

x1

x2 H1

H0

x1

x2 H1

H0

x1

x2

Low variance (stable), high bias methods High variance, small bias methods



• Suppose data sample with two types of events: H0, H1

﹣ We have found discriminating input variables x1, x2, …

﹣ What decision boundary should we use to select events of type H1 ?

Linear boundary? A nonlinear one?Rectangular cuts?

H1

H0

x1

x2 H1

H0

x1

x2 H1

H0

x1

x2

• How can we decide this in an optimal way ? Let the machine learn it !


Parameter Regression

• How to estimate a “functional behaviour” from a set of measurements?﹣ Energy deposit in a the calorimeter, distance between overlapping photons, …

﹣ Entry location of a particle in the calorimeter or on a silicon pad, …

x

f(x)

x

f(x)

x

f(x)

Linear function ? A non-linear one ?Constant ?

• Looks trivial? What if we have many input variables?


Multivariate Event Classification


Each event, Signal or Background, has “D” measured variables.

D

“feature space”

Find a mapping from D-dimensional input-observable = ”feature” spaceto one dimensional output class labels



D

“feature space”

y(x)

Most general formy = y(x); x in D x = {x1,….,xD}: input variables

y(x): Rn ® R:

Plotting the resulting y(x) values:

Find a mapping from D-dimensional input-observable = ”feature” spaceto one dimensional output class labels



D

“feature space”

y(x): Rn ® R:

y(x)

y(x): “test statistic” in D-dimensional space of input variables

Distributions of y(x): PDFS(y) and PDFB(y)

Overlap of PDFS(y) and PDFB(y) affects separation power, purity

> cut: signal= cut: decision boundary< cut: background

y(x):Used to set the selection cut !

y(x) = const: surface defining the decision boundary

Efficiency and purity

y(B) 0, y(S) 1


Multi-Class Classification

Binary classification: two classes, “signal” and “background”

Signal Background


Class 1

Class 2Class 3

Class 5

Class 6

Class 4

Multi-Class Classification

Multi-class classification – natural extension for many classifiers


P(Class=C|x) (or simply P(C|x)) : probability that the event class is of type C, giventhe measured

observables x = {x1,….,xD} y(x)

P(y | C) P(C)P(Class = C | y) =

P(y)

Prior probability to observe an event of “class C”, i.e., the relative abundance of “signal” versus “background”

Overall probability density to observe the actual measurement y(x), i.e.,

Classes

P(y) = P(y | Class)P(Class)

Probability density distribution according to the measurements x and the given mapping function

Posterior probability



P(y | C)P(C)P(Class = C | y) =

P(y)ANDMinimum error in misclassification if C chosen such that it has maximum P(C|y)

To select S(ignal) over B(ackground), place decision on:

P(S | y) P(y | S) P(S)c

P(B | y) P(y | B) P(B)

Likelihood ratio as discriminating

function y(x)

Prior odds ratio of choosing a signal event(relative probability of signal vs. bkg)

“c” determines efficiency and purity

x = {x1,….,xD}: measured observablesy = y(x)

[ Or any monotonic function of P(S|y) / P(B|y) ]

Posterior odds ratio

Bayes Optimal Classification


Trying to select signal events:(i.e. try to disprove the null-hypothesis stating it were “only” a background event)

Type-2 error: Fail to identify an event from Class C as such(reject a hypothesis although it would have been true)(fail to reject the null-hypothesis/accept null hypothesis although it is false)

loss of efficiency (in selecting signal events)

Decide to treat an event as “Signal” or “Background”

Signal Back-ground

Signal Type-2 error

Back-ground

Type-1 error

Accept as:Truly is:

Type-1 error: Classify event as Class C even though it is not(accept a hypothesis although it is not true)(reject the null-hypothesis although it would have been the correct one)

loss of purity (in the selection of signal events)

Any Decision Involves a Risk

Significance α: Type-1 error rate:(=p-value): α = background selection “efficiency”

Size β: Type-2 error rate:Power: 1- β = signal selection efficiency

should be small !

should be small !

“A”: region where event is called signal


Neyman-Peason:

The Likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection.

i.e. it maximises the area under the “Receiver Operation Characteristics” (ROC) curve

Varying y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve

• How to choose “cut”? need to know prior probabilities (S, B abundances)

﹣ Measurement of signal cross section: maximum of S/√(S+B) or equiv. √(e·p)﹣ Discovery of a signal : maximum of S/√(B)﹣ Precision measurement: high purity (p)﹣ Trigger selection: high efficiency ( ) e (sometimes high background rejection)

P(x | S)Likelihood Ratio : y(x)

P(x | B)

0 1

1

0

1- e

ba

ckg

r.

esignal

random guessing

good classification

better classification

“limit” in ROC curve

given by likelihood ratio

Type-1 error smallType-2 error large

Type-1 error large Type-2 error small

Neyman-Pearson Lemma


Unfortunately, the true probability densities functions are typically unknown: Neyman-Pearson’s lemma doesn’t really help us…

Supervised (machine) learning

* Hyperplane in the strict sense goes through the origin. Here is meant an “affine set” to be precise.

Use MC simulation, or more generally: set of known (already classified) “events”

Use these “training” events to:

• Try to estimate the functional form of P(x|C) from which the likelihood ratio can be obtained e.g. D-dimensional histogram, Kernel densitiy estimators, MC-based matrix-element methods, …

• Find a “discrimination function” y(x) and corresponding decision boundary

(i.e. hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from backgrounde.g. Linear Discriminator, Neural Networks, Boosted Decision, …

Realistic Event Classification


Unfortunately, the true probability densities functions are typically unknown: Neyman-Pearson’s lemma doesn’t really help us…

Supervised (machine) learning

* Hyperplane in the strict sense goes through the origin. Here is meant an “affine set” to be precise.

Use MC simulation, or more generally: set of known (already classified) “events”

Use these “training” events to:

• Try to estimate the functional form of P(x|C) from which the likelihood ratio can be obtained e.g. D-dimensional histogram, Kernel densitiy estimators, MC-based matrix-element methods, …

• Find a “discrimination function” y(x) and corresponding decision boundary (i.e. hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from backgrounde.g. Linear Discriminator, Neural Networks, …

Realistic Event Classification

Of course, there is no magic in here. We still need to:

﹣ Choose the discriminating variables

﹣ Choose the class of models (linear, non-linear, flexible or less flexible)

﹣ Tune the “learning parameters” bias vs. variance trade off

﹣ Check the generalisation properties (avoid overtraining)

﹣ Consider trade off between statistical and systematic uncertainties


Multivariate Analysis Methods in TMVA Examples for classifiers and regression methods– Rectangular cut optimisation

– Projective and multidimensional likelihood estimator

– k-Nearest Neighbor algorithm

– Fisher and H-Matrix discriminants

– Function discriminants

– Artificial neural networks

– Boosted decision trees

– RuleFit

– Support Vector Machine

Preprocessing methods:

– Decorrelation, Principal Value Decomposition, Gaussianisation

Examples for synthesis methods:

– Boosting, Categorisation (valid for all methods, and their combinations)

Joerg

Jan

Helge

Eckhard

Peter


We have a Users Guide !

Available on http://tmva.sf.net

TMVA Users Guide142pp, incl. code examples

arXiv physics/0703039

http://tmva.sf.net/docu/TMVAUsersGuide.pdf


T MVA tutorial

U s i n g T M V A

A typical TMVA analysis consists of two main steps:

1. Training phase: training, testing and evaluation of classifiers using data samples with known signal and background composition

2. Application phase: using selected trained classifiers to classify unknown data samples

https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome







T MVA tutorial

Code Flow for Training and Application








Can be ROOT scripts, C++ executables or python scripts (via PyROOT), or any other high-level language that interfaces with ROOT

T MVA tutorial

Code Flow for Training and Application








T MVA tutorial

Strong Methods need Strong Evaluation

A lot of evaluation information is already provided in the logging output of the training

Simple GUIs provide access to evaluation plots and tools for single and multi-class classification and regression








T MVA tutorial

Involved Methods need Thorough Evaluation








T MVA tutorial









T MVA tutorial


average no. of nodes before/after pruning: 4193 / 968








Join the tutorial this afternoon !!

T MVA tutorial








Two Wishes to Users :

T MVA tutorial








﹣ A detector element may only exist in the barrel, but not in the endcaps

﹣ A variable may have different distributions in barrel, overlap, endcap regions

Ignoring this dependence may reduce performance and creates correlations between variables, which must be learned by the classifier

﹣ Classifiers such as the projective likelihood, which do not account for correlations, significantly loose performance if the sub-populations are not separated

Categorisation means splitting the data sample into categories defining disjoint data samples with the following (idealised) properties:

﹣ Events belonging to the same category are statistically indistinguishable

﹣ Events belonging to different categories have different properties

Multivariate training samples often have distinct sub-populations of data


MethodCategory is Your Friend !

It provides fully transparent support for categorisation of your input data,

applicable to any TMVA method

See Peter’s talk later today


The HEP community has already a lot of experience with MVA classification

In particular for rare events searches, O(all) mature experiments use it

They increase the experiment’s sensitivity, and may reduce systematic errors due to a smaller background component

MVAs are not black boxes, but (possibly involved) Rn ® R mapping functions

Should acquire more experience in HEP with multivariate regression

Our calibration schemes are often still quite simple: linear or simple functions, look-up-table based, mostly depending on few variables (e.g., η, pT)

Non-linear multivariate regression may significantly boost calibration and corrections applied, in particular if it is possible to train from data

Available since TMVA 4 for: LD, FDA, k-NN, PDERS, PDEFoam, MLP, BDT


S t a t u s & O u t l o o k

• 2-class classification supported by all methods

• Multi-class classification supported by MLP (NN), BDTG, FDA

• Single-target regression: PDE-RS, PDE-Foam, k-NN, LD, FDA, MLP, BDT

• Multi-target regression: PDE-Foam, k-NN, MLP

• All methods support categorised classification and generalised boosting

• Priority on to-do list for future releases﹣ Automatic self-optimisation of parameter settings for all methods

﹣ For this: full support of cross validation

﹣ Increase support of multi-dimensional classification and regression

﹣ Individual improvements of methods (see, eg, MLP talk by Jan)

﹣ Introduction of unsupervised learning


• Several similar data mining efforts with rising importance in most fields

of science and industry

• Important for HEP:﹣ Parallelised MVA training and evaluation pioneered by Cornelius package (BABAR)

﹣ Also frequently used: StatPatternRecognition package by I. Narsky (Cal Tech)

﹣ Many implementations of individual classifiers exist

• TMVA is open source software

• Use & redistribution of source permitted according to terms in BSD license

Contributed to TMVA have: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland).

Copyrights & Credits

http://tmva.sf.net/LICENSE


Software packages for Multivariate Data Analysis/Classification:

Individual classifier software: e.g. “JETNET” C.Peterson, T. Rognvaldsson, L.Loennblad,many, many other packages!

“All inclusive” packagesStatPatternRecognition: I.Narsky, arXiv: physics/0507143http://www.hep.caltech.edu/~narsky/spr.html

TMVA: Hoecker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039http://tmva.sf.net or every ROOT distribution

WEKA: http://www.cs.waikato.ac.nz/ml/weka/

Huge data analysis library available in “R”: http://www.r-project.org/

Literature:

T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, Springer 2001C.M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006

Conferences: PHYSTAT, ACAT,…

A few References

http://www.hep.caltech.edu/~narsky/spr.html

http://www.hep.caltech.edu/~narsky/spr.html

http://tmva.sf.net/

http://tmva.sf.net/

http://www.cs.waikato.ac.nz/ml/weka/

http://www.r-project.org/

1 TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction TMVA Workshop – Introduction Andreas...

Documents

Transcript of 1 TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction TMVA Workshop – Introduction Andreas...