1 TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction TMVA Workshop – Introduction Andreas...
-
Upload
jemima-preston -
Category
Documents
-
view
223 -
download
0
Transcript of 1 TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction TMVA Workshop – Introduction Andreas...
1TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
TMVA Workshop – Introduction
Andreas Hoecker (CERN)TMVA Workshop, 21 January 2011, CERN, Switzerland
http://tmva.sf.net
2TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Goals of the WorkshopIntroduction to multivariate classification and regression with TMVA
• Pedagogical talks on various multivariate methods (morning)
• Talks from users (14:00)
• Tutorial (15:30)
3TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
TMVA
• ROOT: is the analysis framework used by most (HEP)-physicists
• Idea: rather than just implementing new MVA techniques and making them available in ROOT:
﹣ Have one common platform / interface for all MVA methods
﹣ Have common data pre-processing capabilities
﹣ Train and test all classifiers on same data sample and evaluate consistently
﹣ Provide common analysis (ROOT scripts) and application framework
﹣ Provide access with and without ROOT, through macros, C++ executables or python
4TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
TMVA
• TMVA started in 2006 on the Sourceforge development platform
• 6 core developers, 21 contributors so far
• TMVA is written in C++, and relies on ROOT functionality
• Since ROOT 5.15 / TMVA v3.7.2 TMVA is part of ROOT, and developed directly in ROOT SVN
﹣ Continue to maintain primary tmva-users mailing list on Sourceforge
﹣ New TMVA versions also published as downloadable tgz files on Sourceforge
﹣ For bug reports, use ROOT Savannah
5TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Simulated Higgs Event in CMS
Higgs event in an LHC proton–proton collision at high luminosity (together with ~24 other inelastic events)
7 TeVLHC 2010Such events occur only in a tiny fraction
of the proton-proton collisions O(10−10)
6TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Event Classification in HEP
• Most HEP analyses require discrimination of signal from background:﹣ Event level (Higgs searches, …)
﹣ Cone level (Tau-vs-jet reconstruction, …)
﹣ Track level (particle identification, …)
﹣ Object level (flavour tagging, …)
﹣ Parameter estimation (significance, mass, CP violation in B system, …)
• The multivariate input information used for this has various sources ﹣ Kinematic variables (masses, momenta, decay angles, …)
﹣ Event properties (jet/lepton multiplicity, sum of charges, …)
﹣ Event shape (sphericity, Fox-Wolfram moments, …)
﹣ Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …)
• Traditionally few powerful input variables were combined. New methods allow to use up to 100 and more variables w/o loss of classification powere.g. MiniBooNE: NIMA 543 (2005), or D0 single top: Phys.Rev. D78, 012005 (2008)
7TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Event Classification
• Suppose data sample with two types of events: H0, H1
﹣ We have found discriminating input variables x1, x2, …
﹣ What decision boundary should we use to select events of type H1 ?
Linear boundary? A nonlinear one?Rectangular cuts?
H1
H0
x1
x2 H1
H0
x1
x2 H1
H0
x1
x2
Low variance (stable), high bias methods High variance, small bias methods
8TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Event Classification
• Suppose data sample with two types of events: H0, H1
﹣ We have found discriminating input variables x1, x2, …
﹣ What decision boundary should we use to select events of type H1 ?
Linear boundary? A nonlinear one?Rectangular cuts?
H1
H0
x1
x2 H1
H0
x1
x2 H1
H0
x1
x2
• How can we decide this in an optimal way ? Let the machine learn it !
9TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Parameter Regression
• How to estimate a “functional behaviour” from a set of measurements?﹣ Energy deposit in a the calorimeter, distance between overlapping photons, …
﹣ Entry location of a particle in the calorimeter or on a silicon pad, …
x
f(x)
x
f(x)
x
f(x)
Linear function ? A non-linear one ?Constant ?
• Looks trivial? What if we have many input variables?
10TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Multivariate Event Classification
11TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Each event, Signal or Background, has “D” measured variables.
D
“feature space”
Find a mapping from D-dimensional input-observable = ”feature” spaceto one dimensional output class labels
12TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Each event, Signal or Background, has “D” measured variables.
D
“feature space”
y(x)
Most general formy = y(x); x in D x = {x1,….,xD}: input variables
y(x): Rn ® R:
Plotting the resulting y(x) values:
Find a mapping from D-dimensional input-observable = ”feature” spaceto one dimensional output class labels
13TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Each event, Signal or Background, has “D” measured variables.
D
“feature space”
y(x): Rn ® R:
y(x)
y(x): “test statistic” in D-dimensional space of input variables
Distributions of y(x): PDFS(y) and PDFB(y)
Overlap of PDFS(y) and PDFB(y) affects separation power, purity
> cut: signal= cut: decision boundary< cut: background
y(x):Used to set the selection cut !
y(x) = const: surface defining the decision boundary
Efficiency and purity
y(B) 0, y(S) 1
14TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Multi-Class Classification
Binary classification: two classes, “signal” and “background”
Signal Background
15TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Class 1
Class 2Class 3
Class 5
Class 6
Class 4
Multi-Class Classification
Multi-class classification – natural extension for many classifiers
16TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
P(Class=C|x) (or simply P(C|x)) : probability that the event class is of type C, giventhe measured
observables x = {x1,….,xD} y(x)
P(y | C) P(C)P(Class = C | y) =
P(y)
Prior probability to observe an event of “class C”, i.e., the relative abundance of “signal” versus “background”
Overall probability density to observe the actual measurement y(x), i.e.,
Classes
P(y) = P(y | Class)P(Class)
Probability density distribution according to the measurements x and the given mapping function
Posterior probability
Event Classification
17TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
P(y | C)P(C)P(Class = C | y) =
P(y)ANDMinimum error in misclassification if C chosen such that it has maximum P(C|y)
To select S(ignal) over B(ackground), place decision on:
P(S | y) P(y | S) P(S)c
P(B | y) P(y | B) P(B)
Likelihood ratio as discriminating
function y(x)
Prior odds ratio of choosing a signal event(relative probability of signal vs. bkg)
“c” determines efficiency and purity
x = {x1,….,xD}: measured observablesy = y(x)
[ Or any monotonic function of P(S|y) / P(B|y) ]
Posterior odds ratio
Bayes Optimal Classification
18TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Trying to select signal events:(i.e. try to disprove the null-hypothesis stating it were “only” a background event)
Type-2 error: Fail to identify an event from Class C as such(reject a hypothesis although it would have been true)(fail to reject the null-hypothesis/accept null hypothesis although it is false)
loss of efficiency (in selecting signal events)
Decide to treat an event as “Signal” or “Background”
Signal Back-ground
Signal Type-2 error
Back-ground
Type-1 error
Accept as:Truly is:
Type-1 error: Classify event as Class C even though it is not(accept a hypothesis although it is not true)(reject the null-hypothesis although it would have been the correct one)
loss of purity (in the selection of signal events)
Any Decision Involves a Risk
Significance α: Type-1 error rate:(=p-value): α = background selection “efficiency”
Size β: Type-2 error rate:Power: 1- β = signal selection efficiency
should be small !
should be small !
“A”: region where event is called signal
19TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Neyman-Peason:
The Likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection.
i.e. it maximises the area under the “Receiver Operation Characteristics” (ROC) curve
Varying y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve
• How to choose “cut”? need to know prior probabilities (S, B abundances)
﹣ Measurement of signal cross section: maximum of S/√(S+B) or equiv. √(e·p)﹣ Discovery of a signal : maximum of S/√(B)﹣ Precision measurement: high purity (p)﹣ Trigger selection: high efficiency ( ) e (sometimes high background rejection)
P(x | S)Likelihood Ratio : y(x)
P(x | B)
0 1
1
0
1- e
ba
ckg
r.
esignal
random guessing
good classification
better classification
“limit” in ROC curve
given by likelihood ratio
Type-1 error smallType-2 error large
Type-1 error large Type-2 error small
Neyman-Pearson Lemma
20TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Unfortunately, the true probability densities functions are typically unknown: Neyman-Pearson’s lemma doesn’t really help us…
Supervised (machine) learning
* Hyperplane in the strict sense goes through the origin. Here is meant an “affine set” to be precise.
Use MC simulation, or more generally: set of known (already classified) “events”
Use these “training” events to:
• Try to estimate the functional form of P(x|C) from which the likelihood ratio can be obtained e.g. D-dimensional histogram, Kernel densitiy estimators, MC-based matrix-element methods, …
• Find a “discrimination function” y(x) and corresponding decision boundary
(i.e. hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from backgrounde.g. Linear Discriminator, Neural Networks, Boosted Decision, …
Realistic Event Classification
21TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Unfortunately, the true probability densities functions are typically unknown: Neyman-Pearson’s lemma doesn’t really help us…
Supervised (machine) learning
* Hyperplane in the strict sense goes through the origin. Here is meant an “affine set” to be precise.
Use MC simulation, or more generally: set of known (already classified) “events”
Use these “training” events to:
• Try to estimate the functional form of P(x|C) from which the likelihood ratio can be obtained e.g. D-dimensional histogram, Kernel densitiy estimators, MC-based matrix-element methods, …
• Find a “discrimination function” y(x) and corresponding decision boundary (i.e. hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from backgrounde.g. Linear Discriminator, Neural Networks, …
Realistic Event Classification
Of course, there is no magic in here. We still need to:
﹣ Choose the discriminating variables
﹣ Choose the class of models (linear, non-linear, flexible or less flexible)
﹣ Tune the “learning parameters” bias vs. variance trade off
﹣ Check the generalisation properties (avoid overtraining)
﹣ Consider trade off between statistical and systematic uncertainties
22TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Multivariate Analysis Methods in TMVA Examples for classifiers and regression methods– Rectangular cut optimisation
– Projective and multidimensional likelihood estimator
– k-Nearest Neighbor algorithm
– Fisher and H-Matrix discriminants
– Function discriminants
– Artificial neural networks
– Boosted decision trees
– RuleFit
– Support Vector Machine
Preprocessing methods:
– Decorrelation, Principal Value Decomposition, Gaussianisation
Examples for synthesis methods:
– Boosting, Categorisation (valid for all methods, and their combinations)
Joerg
Jan
Helge
Eckhard
Peter
23TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
We have a Users Guide !
Available on http://tmva.sf.net
TMVA Users Guide142pp, incl. code examples
arXiv physics/0703039
24TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
T MVA tutorial
U s i n g T M V A
A typical TMVA analysis consists of two main steps:
1. Training phase: training, testing and evaluation of classifiers using data samples with known signal and background composition
2. Application phase: using selected trained classifiers to classify unknown data samples
25TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
T MVA tutorial
Code Flow for Training and Application
26TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Can be ROOT scripts, C++ executables or python scripts (via PyROOT), or any other high-level language that interfaces with ROOT
T MVA tutorial
Code Flow for Training and Application
27TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
T MVA tutorial
Strong Methods need Strong Evaluation
A lot of evaluation information is already provided in the logging output of the training
Simple GUIs provide access to evaluation plots and tools for single and multi-class classification and regression
28TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
T MVA tutorial
Involved Methods need Thorough Evaluation
29TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
T MVA tutorial
Involved Methods need Thorough Evaluation
30TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
T MVA tutorial
Involved Methods need Thorough Evaluation
average no. of nodes before/after pruning: 4193 / 968
31TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Join the tutorial this afternoon !!
T MVA tutorial
32TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Two Wishes to Users :
T MVA tutorial
33TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
﹣ A detector element may only exist in the barrel, but not in the endcaps
﹣ A variable may have different distributions in barrel, overlap, endcap regions
Ignoring this dependence may reduce performance and creates correlations between variables, which must be learned by the classifier
﹣ Classifiers such as the projective likelihood, which do not account for correlations, significantly loose performance if the sub-populations are not separated
Categorisation means splitting the data sample into categories defining disjoint data samples with the following (idealised) properties:
﹣ Events belonging to the same category are statistically indistinguishable
﹣ Events belonging to different categories have different properties
Multivariate training samples often have distinct sub-populations of data
34TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
MethodCategory is Your Friend !
It provides fully transparent support for categorisation of your input data,
applicable to any TMVA method
See Peter’s talk later today
35TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
The HEP community has already a lot of experience with MVA classification
In particular for rare events searches, O(all) mature experiments use it
They increase the experiment’s sensitivity, and may reduce systematic errors due to a smaller background component
MVAs are not black boxes, but (possibly involved) Rn ® R mapping functions
Should acquire more experience in HEP with multivariate regression
Our calibration schemes are often still quite simple: linear or simple functions, look-up-table based, mostly depending on few variables (e.g., η, pT)
Non-linear multivariate regression may significantly boost calibration and corrections applied, in particular if it is possible to train from data
Available since TMVA 4 for: LD, FDA, k-NN, PDERS, PDEFoam, MLP, BDT
36TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
S t a t u s & O u t l o o k
• 2-class classification supported by all methods
• Multi-class classification supported by MLP (NN), BDTG, FDA
• Single-target regression: PDE-RS, PDE-Foam, k-NN, LD, FDA, MLP, BDT
• Multi-target regression: PDE-Foam, k-NN, MLP
• All methods support categorised classification and generalised boosting
• Priority on to-do list for future releases﹣ Automatic self-optimisation of parameter settings for all methods
﹣ For this: full support of cross validation
﹣ Increase support of multi-dimensional classification and regression
﹣ Individual improvements of methods (see, eg, MLP talk by Jan)
﹣ Introduction of unsupervised learning
37TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
• Several similar data mining efforts with rising importance in most fields
of science and industry
• Important for HEP:﹣ Parallelised MVA training and evaluation pioneered by Cornelius package (BABAR)
﹣ Also frequently used: StatPatternRecognition package by I. Narsky (Cal Tech)
﹣ Many implementations of individual classifiers exist
• TMVA is open source software
• Use & redistribution of source permitted according to terms in BSD license
Contributed to TMVA have: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland).
Copyrights & Credits
38TMVA Workshop, 21 Jan 2011 Andreas Hoecker – Introduction
Software packages for Multivariate Data Analysis/Classification:
Individual classifier software: e.g. “JETNET” C.Peterson, T. Rognvaldsson, L.Loennblad,many, many other packages!
“All inclusive” packagesStatPatternRecognition: I.Narsky, arXiv: physics/0507143http://www.hep.caltech.edu/~narsky/spr.html
TMVA: Hoecker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039http://tmva.sf.net or every ROOT distribution
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
Huge data analysis library available in “R”: http://www.r-project.org/
Literature:
T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, Springer 2001C.M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006
Conferences: PHYSTAT, ACAT,…
A few References