Helge Voss, Zürich, 8th November 2006, Particle and Astrophysics Seminar: TMVA Toolkit for MultiVariate Analysis
TMVA: A Toolkit for (Parallel) MultiVariate Data Analysis
Andreas Höcker (ATLAS), Helge Voss (LHCb), Kai Voss (ex. ATLAS), Joerg Stelzer (ATLAS)
http://tmva.sourceforge.net/
apply a (large) variety of sophisticated data selection algorithms to your data sample
have them all trained and tested
choose the best one for your selection problem and “go”
Outline
Introduction: MVAs, what / where / why
the MVA methods available in TMVA
demonstration with toy examples
(some “real life” example)
Summary/Outlook
Introduction to MVA
At the beginning of each physics analysis:
select your event sample, discriminating against background; event tagging; …
Or even earlier: e.g. particle identification, pattern recognition (e.g. ring finding in a RICH), trigger applications
MVA -- MultiVariate Analysis: nice name, means nothing else but:
Use several observables from your events to form ONE combined variable and use this in order to discriminate between “signal” and “background”
One always uses several variables in some sort of combination:
Introduction to MVAs: sequence of cuts vs. multivariate methods
a sequence of cuts is easy to understand and offers easy interpretation
but sequences of cuts are often inefficient!! e.g. events might be rejected because of just ONE variable, while the others look very “signal like”
MVA: several observables → ONE selection criterion
e.g. likelihood selection: calculate, for each observable in an event, a probability that its value belongs to a “signal” or a “background” event, using reference distributions (PDFs) for signal and background
then cut on the combined likelihood of all these probabilities
[Figure: “expected” signal and background distributions of a variable Et; Et(event 1) is background-like, Et(event 2) is signal-like]
MVA Experience
MVAs have certainly made their way through HEP (e.g. the LEP Higgs analyses), but simple “cuts” are probably still the most accepted and used approach
widely understood and accepted: likelihood (probability density estimators, PDE)
often disliked, but more and more used (LEP, BABAR): Artificial Neural Networks
much used in BABAR and Belle: Fisher discriminants
used by MiniBooNE and recently by BABAR: Boosted Decision Trees
Often people use custom tools; MVAs are tedious to implement: few true comparisons between methods!
All interesting methods … but often met with skepticism, due to:
black boxes! how to interpret the selection?
what if the training samples incorrectly describe the data?
how can one evaluate systematics?
would like to try … but which MVA should I use, and where do I find the time to code it?
→ just look at TMVA!
What is TMVA?
Toolkit for Multivariate Analysis (TMVA): C++, ROOT
‘parallel’ processing of various MVA techniques to discriminate signal from background samples
easy comparison of different MVA techniques: choose the best one
TMVA presently includes:
Rectangular cut optimisation
Likelihood estimator (PDE approach)
Multi-dimensional likelihood estimator (PDE range-search approach)
Fisher discriminant
H-Matrix approach (χ² estimator)
Artificial Neural Network (3 different implementations)
Boosted Decision Trees
upcoming: RuleFit, Support Vector Machines, “Committee” methods
common pre-processing of input data: de-correlation, principal component analysis
The TMVA package provides training, testing and evaluation of the MVAs; each MVA method provides a ranking of the input variables
MVAs produce weight files that are read by a Reader class for MVA application
TMVA Methods
Cut Optimisation
Simplest method: cut in a rectangular volume using the Nvar input variables:
x_cut(i_event) ∈ {0,1}, where the event is accepted only if x_{v,min} ≤ x_v(i_event) ≤ x_{v,max} for every variable v
Usually the training files in TMVA do not contain realistic signal and background abundances → one cannot optimise for the best significance S/√(S+B)
instead: scan the signal efficiency in [0,1] and maximise the background rejection
do this in the normal variable space or in the de-correlated variable space
Technical problem: how to perform the optimisation
random sampling: robust (if not too many observables are used) but suboptimal
new techniques: Genetic Algorithm and Simulated Annealing
Huge speed improvement by sorting the training events in a Binary Search Tree (for 4 variables we gained a factor 41)
Preprocessing the Input Variables: Decorrelation
Commonly realised for all methods in TMVA (centrally in the DataSet class):
Removal of linear correlations by rotating the variable space of the PDEs
Determine the square-root C′ of the covariance matrix C, i.e. C = C′C′
compute C′ by diagonalising C: D = S^T C S ⇒ C′ = S √D S^T
transform from the original (x) into the de-correlated variable space (x′) by: x′ = C′⁻¹ x
Various ways to choose the diagonalisation (also implemented: principal component analysis)
Note that this “de-correlation” is only complete if:
the input variables are Gaussian
the correlations are linear only
in practice the gain from de-correlation is often rather modest, or even harmful
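As an illustration of the recipe above (not the TMVA code), the square root C′ of a 2×2 covariance matrix can be computed analytically via its eigen-decomposition; `matrixSqrt` is a hypothetical helper name:

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Square root C' of a symmetric positive-definite 2x2 matrix C,
// via diagonalisation: D = S^T C S  =>  C' = S sqrt(D) S^T.
Mat2 matrixSqrt(const Mat2& C) {
    const double a = C[0][0], b = C[0][1], c = C[1][1];
    const double disc = std::sqrt((a - c) * (a - c) / 4.0 + b * b);
    const double l1 = (a + c) / 2.0 + disc;   // larger eigenvalue
    const double l2 = (a + c) / 2.0 - disc;   // smaller eigenvalue
    // Unit eigenvector (v1, v2) for l1; handle the already-diagonal case.
    double v1, v2;
    if (b != 0.0)      { v1 = b;   v2 = l1 - a; }
    else if (a >= c)   { v1 = 1.0; v2 = 0.0; }
    else               { v1 = 0.0; v2 = 1.0; }
    const double n = std::sqrt(v1 * v1 + v2 * v2);
    v1 /= n; v2 /= n;
    // S has columns (v1, v2) and the orthogonal (-v2, v1); C' = S sqrt(D) S^T:
    const double s1 = std::sqrt(l1), s2 = std::sqrt(l2);
    Mat2 R;
    R[0][0] = s1 * v1 * v1 + s2 * v2 * v2;
    R[0][1] = (s1 - s2) * v1 * v2;
    R[1][0] = R[0][1];
    R[1][1] = s1 * v2 * v2 + s2 * v1 * v1;
    return R;
}
```

The de-correlation step would then solve C′x′ = x for each event (for Nvar variables one would diagonalise numerically instead).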
Projected Likelihood Estimator (PDE Approach)
Combine the probability density distributions of the individual variables into a likelihood estimator; the likelihood ratio for event i_event is
y_L(i_event) = ∏_v p_v^signal(x_v(i_event)) / Σ_species ∏_v p_v^species(x_v(i_event))
with the discriminating variables x_v, the species signal and background, and the reference PDFs p_v
Assumes uncorrelated input variables
in that case it is the optimal MVA approach, since it contains all the information
usually this is not true → development of different methods
Technical problem: how to implement the reference PDFs; 3 ways:
function fitting: difficult to automate
parametric fitting (splines, kernel estimators): easy to automate, but can create artefacts; TMVA uses splines of order 0-5
counting: automatic and unbiased, but suboptimal
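A minimal sketch of the projected likelihood ratio, assuming toy Gaussian reference PDFs in place of TMVA's fitted reference histograms (all names and numbers invented):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy per-variable reference PDF: a Gaussian with the given mean and sigma.
// The normalisation is omitted because it cancels in the ratio below
// (all variables here share the same sigma).
double refPdf(double x, double mean, double sigma) {
    const double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z);
}

// Projected likelihood ratio for one event:
//   y_L = prod_v p_v^S(x_v) / (prod_v p_v^S(x_v) + prod_v p_v^B(x_v))
double likelihoodRatio(const std::vector<double>& x,
                       const std::vector<double>& meanS,
                       const std::vector<double>& meanB, double sigma) {
    double ls = 1.0, lb = 1.0;
    for (size_t v = 0; v < x.size(); ++v) {
        ls *= refPdf(x[v], meanS[v], sigma);
        lb *= refPdf(x[v], meanB[v], sigma);
    }
    return ls / (ls + lb);
}
```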
Multidimensional Likelihood Estimator
Generalisation of the 1D PDE approach to Nvar dimensions (Carli-Koblitz, NIM A501, 576 (2003))
Optimal method, in theory, since the full information is used
Practical challenges:
the parameterisation of the multi-dimensional phase space needs huge training samples
implementation of the Nvar-dim. reference PDF with kernel estimates or counting
TMVA implementation follows the Range-Search method:
count the number of signal and background events in the “vicinity” of a data event
the “vicinity” is defined by a fixed or adaptive Nvar-dim. volume size
adaptive: rescale the volume size to achieve a constant number of reference events
volumes can be rectangular or spherical
use multi-D kernels (Gaussian, triangular, …) to weight events within a volume
speed up the range search by sorting the training events in Binary Trees
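The range-search counting can be sketched as a brute-force loop (TMVA speeds this up with a binary tree, and a kernel weight could replace the plain 0/1 counting; `pdeRS` and the toy events are invented for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Event { std::vector<double> x; };

// Count reference events inside a spherical "vicinity" of radius r
// around the test point (brute force; no binary-tree speed-up here).
int countInVicinity(const std::vector<Event>& refs,
                    const std::vector<double>& point, double r) {
    int n = 0;
    for (const auto& e : refs) {
        double d2 = 0.0;
        for (size_t v = 0; v < point.size(); ++v) {
            const double d = e.x[v] - point[v];
            d2 += d * d;
        }
        if (d2 <= r * r) ++n;
    }
    return n;
}

// PDE range-search discriminant: fraction of signal among the neighbours.
double pdeRS(const std::vector<Event>& sig, const std::vector<Event>& bkg,
             const std::vector<double>& point, double r) {
    const double nS = countInVicinity(sig, point, r);
    const double nB = countInVicinity(bkg, point, r);
    return (nS + nB > 0.0) ? nS / (nS + nB) : 0.5;
}
```

The adaptive variant would grow or shrink `r` until a fixed number of reference events is enclosed.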
Fisher Discriminant (and H-Matrix)
Well-known, simple and elegant MVA method: event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions
In words instead of equations: an axis is determined in the (correlated) hyperspace of the input variables such that, when the output classes (signal and background) are projected onto this axis, they are pushed as far apart as possible, while events of the same class are confined to a close vicinity. The linearity of the method is reflected in the metric with which “far apart” and “close vicinity” are measured: the covariance matrix of the discriminant variable space.
• optimal for linearly correlated Gaussians with equal RMS and different means
• no separation if equal means and different RMS (shapes)
Computation of the Fisher MVA couldn’t be simpler:
y_Fisher(i_event) = Σ_v F_v · x_v(i_event), with the “Fisher coefficients” F_v
H-Matrix estimator (correlated χ²): a poor man’s variation of the Fisher discriminant
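For two variables the Fisher coefficients can be written down directly as F = W⁻¹(μ_S − μ_B), with W the common within-class covariance matrix; a sketch under that assumption (not the TMVA code, helper names invented):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Fisher coefficients for two variables: F = W^{-1} (muS - muB), where W is
// the common within-class covariance matrix.  The resulting axis maximises
// the separation of the projected signal and background means.
std::array<double, 2> fisherCoefficients(
        const std::array<std::array<double, 2>, 2>& W,
        const std::array<double, 2>& muS, const std::array<double, 2>& muB) {
    const double det = W[0][0] * W[1][1] - W[0][1] * W[1][0];
    const std::array<double, 2> d{muS[0] - muB[0], muS[1] - muB[1]};
    // 2x2 inverse applied to the mean difference.
    return {( W[1][1] * d[0] - W[0][1] * d[1]) / det,
            (-W[1][0] * d[0] + W[0][0] * d[1]) / det};
}

// The discriminant itself is the linear combination y_Fisher = F . x.
double fisherValue(const std::array<double, 2>& F,
                   const std::array<double, 2>& x) {
    return F[0] * x[0] + F[1] * x[1];
}
```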
Artificial Neural Network (ANN)
Feed-forward Multilayer Perceptron: 1 input layer with the Nvar discriminating input variables x_i^(0), i = 1..Nvar, k hidden layers with M_1..M_k nodes, and 1 output layer with the 2 output classes x_{1,2}^(k+1) (signal and background)
[Figure: network architecture with node weights w_ij]
The response of node j in layer k is
x_j^(k) = A( w_{0j}^(k) + Σ_i w_{ij}^(k) · x_i^(k-1) )
with the “activation” function A(x) = 1 / (1 + e^{-x})
TMVA ANN implementations (all feed-forward Multilayer Perceptrons):
1. Clermont-Ferrand ANN: used for the ALEPH Higgs analysis; translated from FORTRAN
2. TMultiLayerPerceptron interface: the ANN implemented in ROOT
3. MLP: “our” own ANN; similar to the ROOT ANN, faster and more flexible, however without automatic validation. A recent comparison with the Stuttgart NN in ATLAS tau-ID confirmed its excellent performance!
ANNs are non-linear discriminants (Fisher = ANN without a hidden layer) → they seem better adapted to realistic use cases than Fisher and Likelihood
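A forward pass through such a perceptron is only a few lines; the weights below are hand-picked for illustration, whereas a real network learns them from the training sample:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sigmoid "activation" function A(x) = 1 / (1 + e^{-x}).
double A(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One feed-forward layer: out_j = A(w0_j + sum_i w_ij * in_i).
// weights[j] holds {w0, w1, ..., wN} for output node j (w0 is the bias).
std::vector<double> layer(const std::vector<double>& in,
                          const std::vector<std::vector<double>>& weights) {
    std::vector<double> out;
    for (const auto& wj : weights) {
        double s = wj[0];
        for (size_t i = 0; i < in.size(); ++i) s += wj[i + 1] * in[i];
        out.push_back(A(s));
    }
    return out;
}

// A minimal 2 -> 2 -> 1 perceptron with hand-picked weights.
double mlp(const std::vector<double>& x) {
    const std::vector<std::vector<double>> hidden = {{0.0, 1.0, 1.0},
                                                     {0.0, -1.0, -1.0}};
    const std::vector<std::vector<double>> output = {{0.0, 2.0, -2.0}};
    return layer(layer(x, hidden), output)[0];
}
```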
Decision Trees
Decision Trees: a sequential application of “cuts” which splits the data into nodes, where the final nodes (leaves) classify an event as signal or background
Training: start with the root node
split the training sample at each node into two, using a cut on the variable that gives the best separation
continue splitting until: a minimal number of events is reached, or maximal purity is reached
leaf nodes classify events as S or B according to the majority of the training events they contain (or give a S/B probability)
Testing/Application: a test event is “filled in” at the root node and classified according to the leaf node where it ends up after the “cut” sequence
[Figure: example decision tree of depth 3, splitting at each node on cuts such as Var_i > x_1, with leaf nodes labelled signal (N_signal > N_bkg) or bkg (N_signal < N_bkg)]
Bottom-up pruning: remove statistically dominated nodes
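The per-node split search can be sketched with a Gini-index separation criterion (one of several possible separation measures; this is an illustration, not TMVA's exact criterion or code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Ev { double x; bool isSignal; };

// Size-weighted Gini index n * p * (1 - p) of a node with nS signal and
// nB background events: small when the node is pure.
double gini(int nS, int nB) {
    const int n = nS + nB;
    if (n == 0) return 0.0;
    const double p = double(nS) / n;
    return p * (1.0 - p) * n;
}

// Scan all cut positions on a single variable and return the cut value
// that minimises the summed Gini index of the two daughter nodes,
// i.e. the cut with the best separation.
double bestCut(std::vector<Ev> evs) {
    std::sort(evs.begin(), evs.end(),
              [](const Ev& a, const Ev& b) { return a.x < b.x; });
    int totS = 0, totB = 0;
    for (const auto& e : evs) (e.isSignal ? totS : totB)++;
    double best = 1e30, cut = evs.front().x;
    int nS = 0, nB = 0;
    for (size_t i = 0; i + 1 < evs.size(); ++i) {
        (evs[i].isSignal ? nS : nB)++;
        const double q = gini(nS, nB) + gini(totS - nS, totB - nB);
        if (q < best) { best = q; cut = (evs[i].x + evs[i + 1].x) / 2.0; }
    }
    return cut;
}
```

Growing a tree repeats this search over all variables at every node until the stopping conditions above are met.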
Boosted Decision Trees
Decision Trees: used for a long time in general “data-mining” applications, less known in HEP (although very similar to “simple cuts”)
Advantages:
easy to interpret: independently of Nvar, the tree can always be visualised as a 2D tree diagram
independent of monotone variable transformations; rather immune against outliers
immune against the addition of weak variables
Disadvantages:
instability: small changes in the training sample can give large changes in the tree structure
Boosted Decision Trees (1996): combine several decision trees (a forest), derived from one training sample via the application of event weights, into ONE multivariate event classifier by performing a “majority vote”
e.g. AdaBoost: wrongly classified training events are given a larger weight
alternatives: bagging (random weights, re-sampling with replacement)
Boosted Decision Trees: two different interpretations
give events that are “difficult to categorize” more weight, to get better classifiers
or: see each tree as a “basis function” of a possible classifier
boosting or bagging is then just a means to generate a set of “basis functions”
a linear combination of the basis functions gives the final classifier, or: the final classifier is an expansion in the basis functions:
F(α, x) = Σ_i α_i T_i(x)  (sum over the trees in the forest)
AdaBoost: normally, in the forest, each tree is weighted by α_i = ln((1 − f_err)/f_err), with f_err = error fraction
then AdaBoost can be understood as minimising the coefficients α_i according to an “exponential loss criterion”: the expectation value of exp(−y F(α, x)), where y = −1 (bkg) and y = +1 (signal)
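One AdaBoost step, re-weighting the misclassified events, can be sketched as follows (illustrative, not the TMVA code; conventions such as an extra ½ factor vary between formulations):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One AdaBoost step: events misclassified by the current tree get their
// weight multiplied by (1 - f_err) / f_err, then all weights are
// renormalised so the total weight stays fixed.
// Returns the tree weight alpha = ln((1 - f_err) / f_err).
double adaBoostStep(std::vector<double>& w,
                    const std::vector<bool>& misclassified) {
    double sum = 0.0, err = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        sum += w[i];
        if (misclassified[i]) err += w[i];
    }
    const double fErr = err / sum;                 // weighted error fraction
    const double boost = (1.0 - fErr) / fErr;
    double newSum = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        if (misclassified[i]) w[i] *= boost;
        newSum += w[i];
    }
    for (double& wi : w) wi *= sum / newSum;       // renormalise
    return std::log(boost);
}
```

Repeating this for each tree in the forest produces the weights α_i of the expansion F(α, x) above.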
RuleFit
Ref: Friedman-Popescu, http://www-stat.stanford.edu/~jhf/R-RuleFit.html
The model is a linear combination of rules, where a rule is a sequence of cuts; the score assigned to the data x is
F(x) = a_0 + Σ_{m=1}^{M} a_m r_m(x)
with coefficients a_m and rules r_m, e.g. r_i(x) = I(x_5 > 3) · I(x_2 < 2) · …, where I(…) = 1 if the cut is satisfied, 0 otherwise
The problem to solve is:
1. create the rule ensemble from a set of decision trees
2. fit the coefficients: “gradient directed regularization” (Friedman et al.)
Fast, robust and good performance
Test on 4k simulated SUSY + top events with 4 discriminating variables; methods tested:
1. Multi Layer Perceptron (MLP)
2. Fisher
3. RuleFit
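Evaluating a RuleFit score is just the linear combination above; a minimal sketch with an invented example rule (fitting the coefficients, the hard part, is not shown):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A rule r_m(x) is a product of indicator functions I(cut): it evaluates
// to 1 only if the event satisfies every cut in the sequence, else 0.
using Rule = std::function<double(const std::vector<double>&)>;

// RuleFit score: F(x) = a0 + sum_m a_m * r_m(x).
double ruleFitScore(double a0, const std::vector<double>& a,
                    const std::vector<Rule>& rules,
                    const std::vector<double>& x) {
    double f = a0;
    for (size_t m = 0; m < rules.size(); ++m) f += a[m] * rules[m](x);
    return f;
}
```

An invented rule such as `[](const std::vector<double>& x){ return (x[0] > 3.0 && x[1] < 2.0) ? 1.0 : 0.0; }` would correspond to the cut sequence I(x_1 > 3) · I(x_2 < 2).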
A Purely Academic Example
Simple toy to illustrate the strength of the de-correlation technique:
4 linearly correlated Gaussians, with equal RMS and shifted means between S and B
TMVA output:
--- TMVA_Factory: correlation matrix (signal):
---           var1    var2    var3    var4
--- var1:   +1.000  +0.336  +0.393  +0.447
--- var2:   +0.336  +1.000  +0.613  +0.668
--- var3:   +0.393  +0.613  +1.000  +0.907
--- var4:   +0.447  +0.668  +0.907  +1.000
--- TMVA_MethodFisher: ranked output (top variable is best ranked)
---  Variable :  Coefficient :  Discr. power:
---  var4     :  +8.077      :  0.3888
---  var3     :   3.417      :  0.2629
---  var2     :   0.982      :  0.1394
---  var1     :   0.812      :  0.0391
Likelihood Distributions
[Figure: likelihood reference PDFs]
ANN Convergence Test
[Figure: MLP convergence during training]
ANN Architecture
[Figure: MLP architecture]
Quality Assurance: Decision Tree Structure
[Figure: decision tree structure BEFORE pruning]
Quality Assurance: Decision Tree Structure
[Figure: decision tree structure AFTER pruning]
The Output
[Figure: TMVA output distributions for Fisher, Likelihood, BDT and MLP]
For this case, the Fisher discriminant provides the theoretically ‘best’ possible method
the de-correlated Likelihood performs the same
Cuts, Decision Trees and Likelihood without de-correlation are inferior
Note: almost all realistic use cases are much more difficult than this one
Academic Example (II)
Simple toy to illustrate the difficulties of some techniques with correlations: 2×2 variables with circular correlations; within each set, equal means and different RMS
[Figure: circularly correlated signal and background distributions in the x-y plane]
TMVA output:
--- TMVA_Factory: correlation matrix (signal):
---           var1    var2    var3    var4
--- var1:   +1.000  +0.001  −0.004  −0.012
--- var2:   +0.001  +1.000  −0.020  +0.001
--- var3:   −0.004  −0.020  +1.000  +0.012
--- var4:   −0.012  +0.001  +0.012  +1.000
Academic Example (II): The Output
… resulting in the following “performance”:
highly nonlinear correlations: Decision Trees outperform the other methods by far
(to be fair, the ANN just used default parameters)
Some “real life example”
I got a sample (LHCb Monte Carlo) with simply a list of 21 possible variables
“played” for just 1 h:
looked at all the variables (they get plotted “automatically”)
filled them in as “log(variable)” where the distributions were too “steep”
removed some variables that seemed to have no obvious separation
then removed variables with extremely large correlation
Tuned the pruning parameters a little bit
Had a result:
Unfortunately… I don’t know how it compares to “his” current selection performance
TMVA Technicalities
TMVA releases: sourceforge (SF) package to accommodate world-wide access
code can be downloaded as a tar file, or via anonymous cvs access
home page: http://tmva.sourceforge.net/
SF project page: http://sourceforge.net/projects/tmva
view CVS: http://cvs.sourceforge.net/viewcvs.py/tmva/TMVA/
mailing lists: http://sourceforge.net/mail/?group_id=152074
ROOT::TMVA class index: http://root.cern.ch/root/htmldoc/TMVA_Index.html
part of the ROOT package (since development release 5.11/06)
TMVA features:
automatic iteration over all selected methods for training/testing etc.
each method has specific options that can be set by the user for optimisation
ROOT scripts are provided for all relevant performance analyses
We enthusiastically welcome new users, testers and developers; we now have 20 people on the developers list! (http://sourceforge.net/project/memberlist.php?group_id=152074)
Concluding Remarks
TMVA is still a young project! first release on sourceforge March 8, 2006
now also part of the ROOT package; new development release: Nov. 22
it is an open source project: new contributors are welcome
TMVA provides the training and evaluation tools, but which method is best certainly depends on the use case → train several methods in parallel and see what is best for YOUR analysis
most methods can be improved over the defaults by optimising the training options
aimed to be an easy-to-use tool giving access to many different “complicated” selection algorithms
we already have a number of users, but still need more real-life experience
need to write a manual (for now, use the “How To…” on the web page and the examples in the distribution)
Outlook
We will continue to improve the selection methods already implemented:
implement event weights for all methods (missing: Cuts, Fisher, H-Matrix, RuleFit)
improve kernel estimators
improve Decision Tree pruning (automatic, cross validation)
improve de-correlation techniques
New methods are under development:
RuleFit
Support Vector Machines
“Committee” method: combination of different MVA techniques
Using TMVA
Curious? Want to have a look at TMVA?
Nothing easier than that:
get ROOT version 5.12.00 (better: wait for 22 Nov and get the new development version)
go into your working directory (you need write access)
then start: root $ROOTSYS/tmva/test/TMVAnalysis.C
(toy samples are provided; look at the results/plots, and then at how you can implement your own stuff in the code)
Web presentation on SourceForge.net: http://tmva.sourceforge.net/
Two ways to download the TMVA source code:
1. get the tgz file by clicking on the green button, or directly from here: http://sourceforge.net/project/showfiles.php?group_id=152074&package_id=168420&release_id=397782
2. via anonymous cvs: cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/tmva co -P TMVA