Post on 21-Dec-2015
Preliminaries
• Be sure to get a parking permit from Lorie Neuman (room 4218, X 2180)
• Wireless: Tom Rockwell can help if you can’t get access; you should just get a direct connection to the outside world
– DHCP with an address starting with 10.
– If you need to print something, email to linnemann@pa.msu.edu
• Introductions
Why you?
• You, the developers, can actually change things!
• I would personally like a better analysis environment for HEP.
• I keep hearing about R from statisticians!
• I am convinced astronomers and HEP together will get something better than either has alone.
• And maybe we will have some things that statisticians can use, too.
– Suggested to Brad Efron using arxiv.org for statistics
• I subscribe to the “right people in a room” theory.
What Can We Accomplish?
• We won’t convince anyone to drop what they do now and adopt product xxx instead!
• But we might benefit from seeing different development cultures, work styles, or interesting ideas
• We might find ways to make interfaces across projects, or identify common projects
• If this starts to look interesting, we can spend more time on sharpening this up
• The “agenda” can be revised at any time!
Sociology
• HEP experiments: own data reduction software (C++)
– Usually develop common tools used by whole collaboration
– Use more generic software as tools, and final data analysis
– Particle Astrophysics similar, but more Fortran/C
• HEP lab-dominated in cross-experiment software
• CERN, Fermilab, SLAC, DESY, KEK, Brookhaven
– Some instances of cross-lab collaboration
– Grid computing is one of few non-lab major software projects
– Some tools are university based (specific simulations)
– Typically free to community, but not gnu…
– Smaller packages: repositories not that well developed
• Not much commercial software
– Office; mathematica/maple; some mathcad/matlab/kaleidagraph
• IDL much less used than in astronomy: not as image-oriented
– Latex; ghostview; gnuplot-like
• Statistics: more distributed?
• Astronomy: more large software grants?
Some Possible Goals
• Repository sponsorship
• Web or Python interfaces to libraries
• Root user package repository?
• Interfaces between R and Root
– GUI for R?
– R scripting in Root? R libraries in Root?
– Handling of larger datasets in R?
HEP Small Packages
• Example: calculation of significance, limits from observed counts, estimated background, uncertainties, efficiencies, etc.
• Several competing procedures
– Some are published (PHYSTAT; NIM)
– Standard programs not on public, recognized web sites: know the author, or someone in collaboration implements and maybe posts or puts in local repository
• Programs not collected by Particle Data Group
– publishes generally-recognized methods review
Questions to see differences:
• Goals + strengths
• What would you like to add next?
• User community: Who? How many? Platforms?
• User interface: GUI, Scripting, Web, link library, code?
• Documentation: how? Quality?
• How big is developer community?
• How are contributions made/tested/integrated?
• Releases and bug tracking mechanisms
• Implementation language(s)
• Licensing/distribution
Proposed Presentations
• Rene Brun: Root data mining in HEP
• Eric Feigelson: VOSTATS R in astronomy?
• Luke Tierney: R (and omegastats?)
• Who?: Frustrating Examples
• Sherry Towers: TerraFerMA classification in HEP
• Adam Lyon: Using R in HEP
• Scott Snyder: Alternative Root Interfaces
• Tim Beers: Rostat robust legacy code
• Right Order? Space out or bunch?
• First pass quickly to survey, then reconsider?
• Discussion during presentation or after?
Other possible activities
• Discussion/panel:
– What do users want?
– How could projects reinforce one another?
– Selecting achievable goals
– What are options for Fermilab projects?
• Technical Working Group(s)
– Specifics, e.g. root/R interface (brass tacks)
– Planning of joint projects?
– Planning of further workshops?
• Developer or user oriented?
• Post Talks to web?
– Semi-private (developer use)?
– Or public, with publicity to users
Some projects that got away
Particularly Python-based
• StatPy—Tom Loredo
• Python interface to Root—Harrison Prosper
• Orange and related: Python--Aleks Jakulin
• Jas—Java analysis framework
Restaurant: Villegas 6:15pm
N. to Grand River; E 3.2 mi. past Okemos Rd, Marsh Rd
1735 W Gr River, 347-2080 (on right before Dobie)
Dessert: Jim & Ruth Linnemann
1217 Ascot Pl 349-6138
Continue E (right) on Grand River
Left at Cornell Rd (1 mi)
Right at Ascot Place (3rd right; 2 miles or so)
1st drive on right of Ascot
Example 1: 2 sample classification
• Plot signal efficiency vs background rejection curves (ROC)
• Selection based on a set of variables (or combinations of variables).
• Click on an efficiency value to find the corresponding selection criterion in the original variables.
• Superimpose curves for several candidate variable selections.
• Data:
• Look in a coordinated fashion at two separate data sets with related but non-identical data structures
• HEP data usually tree-structured:
– many instances, each including variable number of lower-level objects
• Typically 2 or more levels down
• I might analyze these by forming a variable number of derived variables from the low level objects.
• Much of this process is algorithmic, but I wind up re-doing it by hand each time I try it.
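A minimal sketch of Example 1, assuming a hypothetical event layout (a dict holding a variable-length list of jet pT values) and made-up numbers. It derives one variable per event from the low-level objects, scans cuts on it to get signal efficiency vs background rejection, and looks up the cut for a chosen efficiency. The names `leading_pt`, `roc_points` and `cut_for_efficiency` are illustrative only, not part of Root or R.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: each event carries a variable number of jets.
Event = Dict[str, List[float]]
RocPoint = Tuple[float, float, float]  # (threshold, signal eff, background rej)


def leading_pt(event: Event) -> float:
    """Derived variable: highest jet pT in the event (0.0 if there are no jets)."""
    return max(event["jet_pt"], default=0.0)


def roc_points(signal: List[float], background: List[float]) -> List[RocPoint]:
    """Scan the selection 'value >= threshold' over every observed value."""
    points = []
    for t in sorted(set(signal) | set(background)):
        eff = sum(v >= t for v in signal) / len(signal)
        rej = sum(v < t for v in background) / len(background)
        points.append((t, eff, rej))
    return points


def cut_for_efficiency(points: List[RocPoint], target: float) -> Optional[RocPoint]:
    """Tightest cut whose signal efficiency is still at least `target`."""
    passing = [p for p in points if p[1] >= target]
    return max(passing, key=lambda p: p[0]) if passing else None


# Toy samples (invented numbers, for illustration only).
signal_events = [{"jet_pt": [50.0, 30.0]}, {"jet_pt": [80.0]},
                 {"jet_pt": [45.0, 20.0, 10.0]}, {"jet_pt": [65.0, 40.0]}]
background_events = [{"jet_pt": [20.0, 15.0]}, {"jet_pt": []},
                     {"jet_pt": [35.0]}, {"jet_pt": [55.0, 25.0]}]

sig = [leading_pt(e) for e in signal_events]
bkg = [leading_pt(e) for e in background_events]
points = roc_points(sig, bkg)
best = cut_for_efficiency(points, 1.0)
print(best)
```

Curves for several candidate variables could be superimposed by calling `roc_points` once per derived variable.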
Ex 2: No integrated repository
• End of an analysis: sample of data events, and an expected set of possible backgrounds, each with an uncertainty.
• Want to calculate a statistical significance (or 90% CL) for these.
• Usually have to extract these numbers and then find a completely separate piece of software, either in someone's private area, or on the web, or, if I'm really lucky, in a macro someone's written.
• There aren't good central mechanisms (repositories or interactive web sites) for sharing such algorithms, either.
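A minimal sketch of the kind of small calculation Ex 2 has in mind, assuming the simplest case: a single Poisson counting experiment with a perfectly known background. It is not one of the competing published procedures and ignores background uncertainty and efficiencies. The function names are illustrative.

```python
import math
from statistics import NormalDist


def poisson_cdf(n: int, mean: float) -> float:
    """P(N <= n) for a Poisson distribution with the given mean."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(n + 1))


def poisson_p_value(n_obs: int, background: float) -> float:
    """P(N >= n_obs) if only background (known exactly) is present."""
    return 1.0 - poisson_cdf(n_obs - 1, background)


def z_value(p: float) -> float:
    """One-sided Gaussian significance for a p-value with 0 < p < 1."""
    return NormalDist().inv_cdf(1.0 - p)


def upper_limit(n_obs: int, background: float, cl: float = 0.90) -> float:
    """Classical upper limit on the signal mean s: P(N <= n_obs | s + b) = 1 - cl.

    Returns 0.0 if even s = 0 is excluded at this confidence level.
    """
    if poisson_cdf(n_obs, background) <= 1.0 - cl:
        return 0.0
    lo, hi = 0.0, 10.0 + 10.0 * (n_obs + 1)  # hi is comfortably above the root
    for _ in range(100):  # plain bisection
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n_obs, mid + background) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy numbers: 5 events observed, 1.0 expected from background.
p = poisson_p_value(5, 1.0)
print(p, z_value(p), upper_limit(5, 1.0))
```

For very small p-values the subtraction `1.0 - poisson_cdf(...)` loses precision; a careful implementation would sum the upper tail directly.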
Ex 3: New Statistical Methods
• While the environment I'm used to is good at exploring and fitting large data sets, the number of statistical methods that are part of that framework is limited.
• I'd like to be able to apply many of the tests I might find in a textbook to comparing two distributions.
• Or I’d like to perform bootstrap calculations or “ensemble tests” without writing from scratch a “toy Monte Carlo”: to identify the statistical uncertainty of my fitting results with simulated experiments.
• These tests exist in R, but my data is in Root.
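A minimal sketch of the bootstrap idea from Ex 3, assuming a toy Gaussian sample and using the sample mean as a stand-in for a fit result. It resamples the data with replacement and takes the spread of the re-estimated values as the statistical uncertainty. `bootstrap_std_error` is an illustrative name, not an R or Root function.

```python
import random
import statistics
from typing import Callable, List, Sequence


def bootstrap_std_error(sample: Sequence[float],
                        estimator: Callable[[List[float]], float],
                        n_resamples: int = 2000,
                        seed: int = 12345) -> float:
    """Standard deviation of the estimator over resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


# Toy data (invented): 50 draws from a Gaussian with mean 10 and width 2.
data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

se_boot = bootstrap_std_error(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5  # textbook cross-check
print(se_boot, se_formula)
```

Any function of the sample can be passed as `estimator`, so the same loop would wrap a real fit, at the cost of refitting once per resample.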
Root: key features
• GUI for presentation graphics and selection (“cuts”)
• I/O for tree-structured data: scales to petabytes
• Histogram as base metaphor (akin to vector)
• Sophisticated nonlinear fitting
• C++ at command line, macros, compiled macros
R: key features
• Elegant data manipulation: S language:
– command prompt and macros
– interpreted, heading to byte-compilation
– GUI: only now building hooks
– most users satisfied with command line
– Standard tool of professional research statisticians
• Sophisticated graphics
– standard statistical plots not used in HEP
– missing histograms with error bars
– Links to further multidimensional graphics (Ggobi)
• Data in virtual memory
– Data frames: vectors are a basic metaphor (cf. histogram in Root)
– interfaces to databases (postgres; mysql)
– Parallel computation under development
• Broad package library, with trivial download