RooFit & RooStats tools for data modeling and statistical analysis in ROOT

Wouter Verkerke (NIKHEF)

Overview of this talk

• Talk overview

– Recently added RooFit features

– The RooStats project

• Current release cycle

– Have started major new RooFit development cycle in ROOT development release 5.17, RooFit 2.23

– Stable version to be delivered in ROOT production 5.18/00. Release date Dec 12, RooFit v2.30. Deadline for last code tomorrow.

New features – Core engineering

• Core engineering

– Complete rewrite of the algorithms that optimize likelihood calculations

– Recent versions of classes like RooAddPdf and RooProdPdf make extensive use of caching of composite function objects that represent partial results for given integration/normalization configurations. Creation of cache objects is usually deferred until first use, and multiple configurations are handled simultaneously

– Old optimization code was not equipped to handle optimization and client/server link reconnection of cached objects well

– New support class RooObjCacheManager transparently takes care of all caching and optimization logic for cached function objects

– Many specialized hooks and support functions that worked around limitations of the old code have now disappeared. Code is much cleaner and more maintainable for the future

– Can in principle do more optimizations than before, but improved robustness in handling certain conditions adds some overhead. Speed is expected to be within ~5% of original RooFit, with fluctuations depending on application

– Next version of RooFit will have significant speedups of (complex) plot projections, as the new optimization engine can also be applied to plot projections (works in principle, but not enabled yet)
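
The deferred, per-configuration caching described on this slide can be pictured with a small stand-alone sketch. This is not RooFit code and does not show how RooObjCacheManager works internally: the class name LazyCache, the string used as configuration key, and the builder callback are all hypothetical, chosen only to illustrate "create on first use, keep one entry per configuration".

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a cache of partial results (for example
// normalization integrals), keyed by a string that names the
// integration/normalization configuration.
class LazyCache {
public:
  explicit LazyCache(std::function<double(const std::string&)> builder)
    : builder_(std::move(builder)) {
    assert(static_cast<bool>(builder_));  // a callable builder is required
  }

  // Return the cached value for 'config'; build it on the first request only.
  double get(const std::string& config) {
    auto it = cache_.find(config);
    if (it == cache_.end()) {
      ++nBuilt_;
      it = cache_.emplace(config, builder_(config)).first;
    }
    return it->second;
  }

  // Number of entries that were actually built so far.
  int nBuilt() const { return nBuilt_; }

private:
  std::function<double(const std::string&)> builder_;
  std::map<std::string, double> cache_;
  int nBuilt_ = 0;
};
```

Nothing is computed until get() is called, and repeated requests for the same configuration reuse the stored entry, while different configurations live side by side in the same cache.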

New features – RooMsgService

• All RooFit messaging now routed through new RooMsgService interface

• New service has interface that allows detailed control over what messages are printed. Can filter on

– Message severity (DEBUG, INFO, WARNING, ERROR, FATAL)

– Message topic (Plotting, Integration, Generation, …)

– Originating object class (RooGaussian etc…)

– Originating object name ("MySignalPdf" etc…)

– Tags applied to object (arg->setLabel("DebugMeLabel"))

• Control through RooMsgService::instance()

– Default configuration

root [0] RooMsgService::instance().Print("v")
All Message streams
[0] MinLevel = WARNING Topic = Any
[1] MinLevel = INFO Topic = Generation Minization Plotting Fitting Caching Optimization
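
The filtering rules listed on this slide amount to "a message is printed if at least one stream accepts it". The stand-alone sketch below illustrates that logic only; it is not the RooMsgService implementation. The types Level, Topic and Stream and the functions isPrinted and defaultStreams are hypothetical names, and only severity, topic and object name are modeled (class and tag filters are left out).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Severity levels in increasing order, as listed on the slide.
enum Level { DEBUG, INFO, WARNING, ERROR, FATAL };

// Topics as bit flags so that one stream can accept several topics.
enum Topic { Generation = 1, Minimization = 2, Plotting = 4, Fitting = 8,
             Integration = 16, Caching = 32, Optimization = 64, AnyTopic = 127 };

// One message stream: minimum severity, accepted topics, optional object name.
struct Stream {
  Level minLevel;
  int topics;              // bitwise OR of Topic values
  std::string objectName;  // empty string = accept any object
};

// A message is printed if at least one stream accepts it.
bool isPrinted(const std::vector<Stream>& streams, Level level, Topic topic,
               const std::string& objectName) {
  for (const Stream& s : streams) {
    const bool levelOk = level >= s.minLevel;
    const bool topicOk = (s.topics & topic) != 0;
    const bool nameOk = s.objectName.empty() || s.objectName == objectName;
    if (levelOk && topicOk && nameOk) return true;
  }
  return false;
}

// The two default streams from the Print("v") listing on this slide.
std::vector<Stream> defaultStreams() {
  return { { WARNING, AnyTopic, "" },
           { INFO, Generation | Minimization | Plotting | Fitting | Caching | Optimization, "" } };
}
```

With these defaults an INFO message on Integration is suppressed, because stream [0] requires WARNING and stream [1] does not list Integration. Appending a stream such as { INFO, Integration, "MyPdf" } makes it visible for that one object only.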

New features – RooMsgService

• Add new streams as you like, e.g.

RooMsgService::instance().addStream(kINFO,Topic(kIntegration),ObjectName("MyPdf"))

• A lot of new INFO level messages have been added on the topics of Integration and Generation

– These explain how RooFit arrives at its decisions on how to perform integration, generation etc…

• Note

– Adding streams with object-specific messages may affect performance. Mostly intended for your debugging convenience

– Adding any stream with DEBUG level messages, even one that is not object specific, affects performance significantly. Again, these exist for your debugging convenience.

• Disabling all message streams will make RooFit completely silent (in case you care…)

New features – GraphViz support

• You can draw graphs of RooFit object trees of arbitrary complexity using the open-source GraphViz tools for graph visualization

ROOT> pdf->graphVizTree("pdf.dot")
UNIX> dot -Tps -o pdf.ps pdf.dot (directed graph algorithm)
UNIX> fdp -Tps -o pdf.ps pdf.dot (spring model algorithm)

[Figure: the same object tree rendered with 'dot' and with 'fdp']

New features – RooClassFactory

• Code factory for RooFit classes, writes skeleton class for RooAbsPdf, RooAbsReal

– Example that writes function ready to be compiled

RooClassFactory::makeFunction("RooDilution",             // class name
                              "w,w_p0,w_p1",             // name of variables
                              "1-2*(w_p0+(1-w_p1)*w)") ; // function expression
.L RooDilution.cxx+                                      // load class

• Can also immediately instantiate code

RooAbsReal* f = RooClassFactory::defineFunction("f","D*(1-2*w)",RooArgSet(D,w)) ;

– Returns pointer to a dedicated compiled function object

– Fast replacement of RooFormulaVar

• Many more options

– Can also specify optional analytical integrals in extra argument

– Can also create functions with RooCategory arguments

New features – Modular extension of RooMCStudy

• New version of RooMCStudy has hooks to insert a chain of modules into the study. Modules can intervene before and after each generation and fit step to customize behavior

• Two standard modules provided:

– RooDLLSignificanceMCSModule calculates significance with the delta(-log(L)) method in a given parameter. Result is added to the RooDataSet with output

– RooRandomizeParamMCSModule randomizes the generation value of a given parameter before each generation (uniform or Gaussian)

– Abstract base class for modules allows you to write your own

• Example use

RooDLLSignificanceMCSModule sigModule(*nsig,0) ;

RooRandomizeParamMCSModule randModule ;
randModule.sampleSumUniform(param,loVal,HighVal) ;

RooMCStudy mcs(*model,*mjjj) ;
mcs.addModule(sigModule) ;
mcs.addModule(randModule) ;

New operator PDFs – Numeric convolution through FFT

• New generic convolution operator PDF RooFFTConvPdf that can numerically convolve any two p.d.f.s using FFT techniques

– Uses the (free) FFTW3 Fourier transform engine (www.fftw.org)

– Must build ROOT with --enable-fftw

• Example code

RooRealVar x("x","x",-10,20) ;
x.setBins(1000) ; // Binning controls FFT sampling density. Use at least 1000 for good precision
RooGaussian gx("gx","gx",x,mx,sx) ;
RooLandau lx("lx","lx",x,ml,sl) ;
RooFFTConvPdf gxlx("gxlx","gx (X) lx",x,gx,lx) ;

• Amazing speed and precision, ~100x faster than RooNumConvPdf, few numerical stability issues

– Unbinned ML fit of Bmix (x) Gauss to 20000 events with dm,tau,D floating = 30 seconds (about the same as the analytical calculation)

– Performance will drop if per-event errors are used, as the FFT calculation precalculates the p.d.f. in one operation for all observable values. Efficient when the p.d.f. is evaluated at many points for one set of parameters, not efficient when the p.d.f. is only evaluated once.

• Future versions will support >1 convolution as well
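
To make the role of the binning concrete, here is a stand-alone sketch of a numeric convolution sampled on a uniform grid. It deliberately uses the plain O(n^2) direct sum rather than an FFT, and it is not how RooFFTConvPdf is implemented; FFT-based methods evaluate sums of this kind in O(n log n). The names Func, gauss, binCentre and convolve are hypothetical, and two Gaussians are used instead of Gaussian and Landau so that the result can be checked analytically (the variances add).

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

using Func = std::function<double(double)>;

// Unit-normalized Gaussian density as a callable.
Func gauss(double mean, double sigma) {
  assert(sigma > 0.0);
  const double norm = 1.0 / (sigma * std::sqrt(2.0 * std::acos(-1.0)));
  return [=](double x) {
    const double z = (x - mean) / sigma;
    return norm * std::exp(-0.5 * z * z);
  };
}

// Centre of bin k for n equal bins on [lo, hi].
double binCentre(double lo, double hi, int n, int k) {
  return lo + (k + 0.5) * (hi - lo) / n;
}

// Convolution sampled at the bin centres x_k:
//   h(x_k) = sum_j f(x_j) * g(x_k - x_j) * binWidth
// The number of bins n sets the sampling density and hence the precision.
std::vector<double> convolve(const Func& f, const Func& g,
                             double lo, double hi, int n) {
  assert(n > 0 && hi > lo);
  const double w = (hi - lo) / n;
  std::vector<double> fx(n), h(n, 0.0);
  for (int j = 0; j < n; ++j) fx[j] = f(binCentre(lo, hi, n, j));
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      h[k] += fx[j] * g((k - j) * w) * w;
  return h;
}
```

For example, convolve(gauss(5.0, 1.0), gauss(0.0, 2.0), -10.0, 20.0, 1000) samples a curve whose variance should come out close to 1 + 4 = 5 on the same range and binning as the slide's observable x.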

New pdfs – Generic n-Dim KEYS p.d.f

• Designed as replacement of Roo2DKeysPdf (NB: several bugs were discovered in Roo2DKeysPdf)

– Written by Max Baak for ATLAS Higgs analysis

– Works in any number of dimensions.

– Takes correlations of input data into account in shape of kernel

– Implementation has optimizations for speed (work best at higher dimensions)

– Analytical integration and analytical partial integrals

[Figure: projection with partial analytical integral]

Other miscellaneous new features

• New version of class RooProduct (product of any number of RooAbsReal objects)

– Support for factorizing (analytical) integration of product expression, analogous to RooProdPdf

– Provided by Gerhard Raven

• New class RooProfileLL that represents the profile likelihood for a given likelihood

– Example: given a p.d.f F with parameters p1,p2,p3, construction of the likelihood (= function of p1,p2,p3)

RooNLLVar nll("nll","nll",px,*d) ;

– Construction of profile likelihood in p1 (=likelihood minimized w.r.t all parameters except p1)

RooProfileLL pnll1("pnll","profile ll",nll,p1) ;

– Expensive function (MINUIT is called for every evaluation)

– Plotting / scanning of profile likelihood will give correct error estimate on p1
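
What "likelihood minimized w.r.t. all parameters except p1" means can be shown on a toy problem without ROOT. The sketch below is an illustration under stated assumptions, not the RooProfileLL implementation: the model is a single Gaussian with parameter of interest mu and nuisance parameter sigma, the function names nll and profileNll are hypothetical, and a golden-section search stands in for MINUIT.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Negative log-likelihood of a Gaussian(mu, sigma) for the sample x,
// dropping the constant 0.5*log(2*pi) per event.
double nll(const std::vector<double>& x, double mu, double sigma) {
  assert(sigma > 0.0);
  double sum = 0.0;
  for (double xi : x) {
    const double z = (xi - mu) / sigma;
    sum += 0.5 * z * z + std::log(sigma);
  }
  return sum;
}

// Profile NLL in mu: for each mu, minimize nll over the nuisance parameter
// sigma. One minimization is run per evaluation, which is why a profile
// likelihood is an expensive function.
double profileNll(const std::vector<double>& x, double mu,
                  double sLo = 0.01, double sHi = 100.0) {
  const double g = 0.5 * (std::sqrt(5.0) - 1.0);  // golden ratio conjugate
  double a = sLo, b = sHi;
  for (int i = 0; i < 200; ++i) {
    const double c = b - g * (b - a);
    const double d = a + g * (b - a);
    if (nll(x, mu, c) < nll(x, mu, d)) b = d; else a = c;
  }
  return nll(x, mu, 0.5 * (a + b));
}
```

Scanning profileNll over mu and reading off where it rises by 0.5 above its minimum gives the usual one-sigma interval on mu with the nuisance parameter accounted for.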

New concept – RooWorkspace

• One of the main missing features in RooFit is a tool to organize complex projects

– A container for composite p.d.f objects, multiple datasets

• New class RooWorkspace provides basic infrastructure for complex project management

– Container class for p.d.fs, datasets, functions etc…

– Controlled interface: cannot insert duplicates with same name.

– Automatic reconnects: if a pdf f(x,p) is inserted and an internal RooRealVar x already exists, the copy that is inserted is automatically connected to the copy in the workspace

– Tools for conflict resolution on insertion: can rename nodes on the fly upon insertion

RooWorkspace::import(pdf,RenameConflictNodes("_v2")) ;

– Tools for variable renaming on insertion

RooWorkspace::import(pdf,RenameVariable("x","y")) ;
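
The "controlled interface" idea (no duplicate names, optional renaming on conflict) can be sketched with a tiny stand-alone container. This is only an illustration of the bookkeeping, not of RooWorkspace itself: the class MiniWorkspace and its methods add, contains and get are hypothetical, stored values are plain doubles, and the automatic reconnection of shared variables is not modeled.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical miniature of a named-object container.
class MiniWorkspace {
public:
  // Insert 'value' under 'name'. A duplicate name is rejected unless a
  // non-empty 'conflictSuffix' is given, in which case the new entry is
  // stored under name + conflictSuffix. Returns the name actually used,
  // or an empty string if nothing was inserted.
  std::string add(const std::string& name, double value,
                  const std::string& conflictSuffix = "") {
    assert(!name.empty());
    std::string key = name;
    if (items_.count(key) != 0) {
      if (conflictSuffix.empty()) return "";  // duplicate rejected
      key += conflictSuffix;
      if (items_.count(key) != 0) return "";  // renamed key is taken as well
    }
    items_[key] = value;
    return key;
  }

  bool contains(const std::string& name) const { return items_.count(name) != 0; }
  double get(const std::string& name) const { return items_.at(name); }

private:
  std::map<std::string, double> items_;
};
```

Adding "x" twice leaves the first entry untouched and reports the rejection, while adding it again with the suffix "_v2" stores the newcomer as "x_v2", which mirrors the intent of the RenameConflictNodes option shown on the slide.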

New concept – RooWorkspace

• New RooWorkspace can be persisted entirely

– Allows p.d.f.s to be saved in addition to data

• Important new concept

– Sharing data between individual physicists, working groups, or experiments is relatively easy: ROOT TTrees and THx histograms are an almost universal standard

– Sharing functions (likelihood / probability density) is generally much more difficult due to the lack of a common language

– RooFit makes sharing (probability density) functions very easy: functions can be persisted in ROOT files (NEW)

• Many potential benefits

– Easy sharing of results, ideas

– Simplifies cross checks, debugging and result combinations

– Combined fits for CP parameters easily executed by combining likelihoods from multiple workspaces

Persistence of models

• Elementary use case

RooAbsPdf& g ; // any p.d.f you made
RooAbsData& d ; // any data you made

// Create the workspace container object
RooWorkspace w("w","my workspace") ;
w.import(g) ; // import p.d.f
w.import(d) ; // import data

// Use standard ROOT I/O to store the workspace
TFile f("myresult.root","RECREATE") ;
w.Write() ;
f.Close() ;

• Both data and p.d.f. are now stored in file!

• Works for p.d.f.s of arbitrary complexity, e.g. complicated fit with multiple side bands, full Higgs combination

A look at the workspace

• What is in the workspace?

w.Print() ;

RooWorkspace(w) my workspace contents

variables
---------
(x,m,s)

p.d.f.s
-------
RooGaussian::g[ x=x mean=m sigma=s ] = 0

datasets
--------
RooDataSet::d(x)

• Typed accessors to conveniently retrieve contents

RooRealVar* x = w.var("x") ;
RooAbsPdf* g = w.pdf("g") ;
RooAbsData* d = w.data("d") ;

Using & adapting persisted p.d.f.s.

• Using both data & p.d.f from file

// Make plot of data and p.d.f
TFile f("myresult.root") ;
RooWorkspace* w = (RooWorkspace*) f.Get("w") ;
RooPlot* xframe = w->var("x")->frame() ;
w->data("d")->plotOn(xframe) ;
w->pdf("g")->plotOn(xframe) ;

// Fit p.d.f to other data outside the workspace:
// p.d.f.s in workspace work with any data
w->pdf("g")->fitTo(*myData) ;

// Alternatively import data in workspace:
// naming conflicts or mismatches easily resolved by importing all objects in wspace
w->import(*myData,RenameVariable("y","x")) ;

A more complex example

• Combining toy 'ATLAS' and 'CMS' results from persisted workspaces

// Read ATLAS workspace
TFile* f1 = new TFile("atlas.root") ;
RooWorkspace* atlas = (RooWorkspace*) f1->Get("atlas") ;

// Read CMS workspace
TFile* f2 = new TFile("cms.root") ;
RooWorkspace* cms = (RooWorkspace*) f2->Get("cms") ;

// Construct combined LH
RooAddition nllCombi("nllCombi","nll CMS&ATLAS",
                     RooArgSet(*cms->function("nll"),*atlas->function("nll"))) ;

// Construct profile LH in mHiggs
RooProfileLL pllCombi("pllCombi","pll",nllCombi,*atlas->var("mHiggs")) ;

// Plot ATLAS, CMS, combined profile LH
RooPlot* mframe = atlas->var("mHiggs")->frame(-3.5,-2.5) ;
atlas->function("nll")->plotOn(mframe) ;
cms->function("nll")->plotOn(mframe,LineStyle(kDashed)) ;
pllCombi.plotOn(mframe,LineColor(kRed)) ;
mframe->Draw() ; // result on next slide

NB: You can publish your actual likelihood in digital form in this way
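
The reason adding two -log(L) functions yields a combined measurement can be checked on a toy case without ROOT. In the sketch below each "experiment" is reduced to a parabolic NLL in one shared parameter; the type alias Nll, the functions gaussianNll, combine and scanMinimum, and all numbers are hypothetical and unrelated to the workspaces on the slide.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

using Nll = std::function<double(double)>;

// Toy NLL of one experiment: a parabola with best-fit value 'mHat' and
// uncertainty 'sigma' (made-up numbers, for illustration only).
Nll gaussianNll(double mHat, double sigma) {
  assert(sigma > 0.0);
  return [=](double m) {
    const double z = (m - mHat) / sigma;
    return 0.5 * z * z;
  };
}

// Combined NLL of independent experiments = sum of the individual NLLs.
Nll combine(const Nll& a, const Nll& b) {
  return [=](double m) { return a(m) + b(m); };
}

// Locate the minimum of an NLL with a simple grid scan over [lo, hi].
double scanMinimum(const Nll& f, double lo, double hi, int nSteps = 10000) {
  double best = lo;
  for (int i = 0; i <= nSteps; ++i) {
    const double m = lo + (hi - lo) * i / nSteps;
    if (f(m) < f(best)) best = m;
  }
  return best;
}
```

For two parabolic NLLs the minimum of the sum sits at the inverse-variance weighted mean of the two best-fit values, so combining gaussianNll(1.0, 1.0) with gaussianNll(2.0, 0.5) should give a minimum near (1/1 + 2/0.25) / (1/1 + 1/0.25) = 1.8.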

ROOT, RooFit & RooStats

[Diagram: functionality provided by each layer]

• ROOT: C++ command line interface & macros, data management & histogramming, graphics interface, I/O support, MINUIT

• RooFit: ToyMC data generation, data/model fitting, data modeling, model visualization

• RooStats: statistical analysis (Neyman construction, Bayesian posterior, profile likelihood)

RooFit is an extension to ROOT – (almost) no overlap with existing functionality

The RooStats project – common statistics tools for LHC

• Initiative by Rene & Kyle to organize suite of common tools in ROOT

– Propose to build tools on top of RooFit, following survey of existing software and user community

– Idea is to have a few core developers maintaining the framework and a mechanism for users/collaborations to contribute concrete tools

– Necessary groundwork in RooFit for support of RooStats mostly done

• What should be in there?

– There are a few major classes of statistical techniques:

– Likelihood: All inference from likelihood curves

– Bayesian: Use prior on parameter to compute P(theory|data)

– Frequentist: Restricted to statements of P(data|theory)

• Even within one of these classes, there are several ways to approach the same problem.

– Aim to collect them all in one set of consistent tools

Designing the framework

• Kyle & I met early 2007 to discuss how to implement a few statistical concepts on top of RooFit

– Want class structure to map onto statistical concepts

– Successfully worked out a few of the methods

• The first examples were

– Bayesian Posterior

– Profile likelihood ratio

– Acceptance Region

– Ordering Rule

– Neyman Construction

– Confidence Interval

• Many concepts already have an appropriate class in RooFit

– New RooWorkspace class is a key component of the interface

RooStats progress

• Kyle has done several successful pilot studies to test out feasibility of concept

– E.g. multi-channel Higgs sensitivity study

• Now starting with construction of concrete tools

– First candidate is real world Tevatron example with input from Tom Jun

– Aiming for first functional release in course of 5.18 (spring 2008)

RooFit Developments & Future plans – Overview

• Quite a bit of new code developed in 2007, with more to come in 2008. Will cover this later

• Manpower

– Interest in and use of RooFit in ATLAS, CMS, LHCb is increasing.

– I continue to develop and support RooFit at the ~10-20% level (which has been the support level for 5 years).

– I intend to continue at this level for the foreseeable future.

• Access to code, bundling with ROOT

– Development copy of RooFit moved from SourceForge to the ROOT Subversion repository. Simplifies updates to ROOT

– ROOT Subversion allows me to easily make development branches

– Intend to make use of more ROOT/CERN facilities for support

– File your bug reports in the ROOT Savannah tracker

– Ask your questions on the ROOT forums.