DC2 at GSFC - 28 Jun 05 - T. Burnett 1
DC2 C++ decision trees
Toby Burnett
Frank Golf
• Quick review of classification (or decision) trees
• Training and testing
• How Bill does it with Insightful Miner
• Applications to the “good-energy” trees: how does it compare?
DC2 at GSFC - 28 Jun 05 - T. Burnett 2
Quick Review of Decision Trees
Introduced to GLAST by Bill Atwood, using InsightfulMiner
Each branch node is a predicate, or cut on a variable, like CalCsIRLn > 4.222
If true, this defines the right branch, otherwise the left branch.
If there is no branch, the node is a leaf; a leaf contains the purity of the sample that reaches that point
Thus the tree defines a function of the event variables used, returning a value for the purity from the training sample
[Diagram: tree of “predicate” branch nodes ending in “purity” leaves; true selects the right branch, false the left]
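A minimal sketch of that idea in C++ (hypothetical Node type and names, not the actual classifier-package classes):

#include <map>
#include <string>

// Each branch node cuts on one variable; each leaf stores the purity
// of the training sample that reaches it.
struct Node {
    std::string variable;      // e.g. "CalCsIRLn"
    double cut = 0;            // e.g. 4.222
    Node* left = nullptr;      // predicate false
    Node* right = nullptr;     // predicate true
    double purity = 0;         // meaningful only at a leaf
    bool isLeaf() const { return !left && !right; }
};

// The tree as a function of the event variables: walk from the root,
// taking the right branch when the predicate is true, and return the
// purity stored at the leaf that is reached.
double evaluate(const Node* node, const std::map<std::string, double>& event)
{
    while (!node->isLeaf())
        node = event.at(node->variable) > node->cut ? node->right : node->left;
    return node->purity;
}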
DC2 at GSFC - 28 Jun 05 - T. Burnett 3
Training and testing procedure
Analyze a training sample containing a mixture of “good” and “bad” events: I use the even-numbered events, in order to have an independent set for testing
Choose a set of variables and find, for each, the optimal cut such that the left and right subsets are purer than the original. Two standard criteria for this: “Gini” and entropy. I currently use the former.
W_S: sum of signal weights
W_B: sum of background weights
Gini = 2 W_S W_B / (W_S + W_B)
Thus Gini wants to be small. Actually maximize the improvement:
Gini(parent) − Gini(left child) − Gini(right child)
Apply this recursively until too few events remain (100 for now). Finally, test with the odd-numbered events: measure the purity for each node.
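As a hedged illustration of this cut search, the sketch below scans a sorted, weighted sample for the cut that maximizes the Gini improvement (the Event type and its fields are assumptions, not the package's API):

#include <algorithm>
#include <vector>

struct Event { double value; double weight; bool signal; };

// Gini = 2 W_S W_B / (W_S + W_B); zero for a pure sample.
double gini(double wS, double wB)
{
    double w = wS + wB;
    return w > 0 ? 2.0 * wS * wB / w : 0.0;
}

// Scan candidate cuts on one variable; return the cut maximizing
// Gini(parent) - Gini(left) - Gini(right). Assumes >= 2 events.
double bestCut(std::vector<Event> events)
{
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.value < b.value; });

    double totS = 0, totB = 0;
    for (const Event& e : events) (e.signal ? totS : totB) += e.weight;

    const double parent = gini(totS, totB);
    double leftS = 0, leftB = 0, best = 0, cut = events.front().value;
    for (size_t i = 0; i + 1 < events.size(); ++i) {
        (events[i].signal ? leftS : leftB) += events[i].weight;
        double improvement = parent - gini(leftS, leftB)
                                    - gini(totS - leftS, totB - leftB);
        if (improvement > best) {
            best = improvement;
            cut = 0.5 * (events[i].value + events[i + 1].value); // midpoint cut
        }
    }
    return cut;
}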
DC2 at GSFC - 28 Jun 05 - T. Burnett 4
Evaluate by Comparing with IM results
From Bill’s Rome ’03 talk: the “good cal” analysis. [Plots: CAL-Low and CAL-High CT probability distributions, each shown for all, good, and bad events]
DC2 at GSFC - 28 Jun 05 - T. Burnett 5
Compare with Bill, cont.
Since Rome: three energy ranges, three trees:
• CalEnergySum: 5-350; 350-3500; >3500
The resulting decision trees are implemented in Gleam by a “classification” package; results are saved to the IMgoodCal tuple variable.
Development until now for training and comparison: the all_gamma run from v4r2, 760 K events, E from 16 MeV to 160 GeV (uniform in log(E)) and 0 < θ < 90° (uniform in cos θ). Contains IMgoodCal for comparison.
Now:
DC2 at GSFC - 28 Jun 05 - T. Burnett 7
Another way of looking at performance
[Plot: “Performance of Atwood GoodCal Trees”; bad fraction vs. good efficiency for the LowCal, MedCal, and HighCal trees]
Define efficiency as the fraction of good events surviving a given cut on node purity; determine the bad fraction for that same cut.
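A small illustrative sketch of how such a curve can be traced from the tree's purity output (hypothetical Scored type; not the code that produced the plot):

#include <cstdio>
#include <vector>

struct Scored { double purity; bool good; };

// For each purity cut: efficiency = fraction of good events passing,
// bad fraction = fraction of bad events passing the same cut.
void purityCurve(const std::vector<Scored>& events)
{
    double nGood = 0, nBad = 0;
    for (const Scored& e : events) (e.good ? nGood : nBad) += 1;

    for (double cut = 0.0; cut <= 1.0; cut += 0.05) {
        double passGood = 0, passBad = 0;
        for (const Scored& e : events)
            if (e.purity > cut) (e.good ? passGood : passBad) += 1;
        std::printf("cut %.2f  efficiency %.3f  bad fraction %.3f\n",
                    cut, passGood / nGood, passBad / nBad);
    }
}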
DC2 at GSFC - 28 Jun 05 - T. Burnett 8
The classifier package
Properties:
• Implements Gini separation (entropy now implemented)
• Reads ROOT files
• Flexible specification of variables to use
• Simple and compact ASCII representation of decision trees
• Recently implemented: multiple trees, boosting
• Not yet: pruning
Advantages vs. IM:
• Advanced techniques involving multiple trees, e.g. boosting, are available
• Anyone can train trees, play with variables, etc.
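The talk does not show the package's actual file format; purely as an illustration of how compact such a representation can be, one plausible scheme writes one node per line in preorder, reusing the hypothetical Node type sketched earlier:

#include <fstream>

// Illustrative only, not the classifier package's real format:
// "N <variable> <cut>" for a branch node, "L <purity>" for a leaf,
// nodes written in preorder so the tree can be rebuilt on reading.
void write(std::ofstream& out, const Node* node)
{
    if (node->isLeaf()) {
        out << "L " << node->purity << '\n';
    } else {
        out << "N " << node->variable << ' ' << node->cut << '\n';
        write(out, node->left);   // preorder: left subtree first
        write(out, node->right);
    }
}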
DC2 at GSFC - 28 Jun 05 - T. Burnett 9
Growing trees
For simplicity, run a single tree: initially try all of Bill’s classification tree variables (list below).
The run over the full 750 K events, trying each of the 70 variables, takes only a few minutes!
"AcdActiveDist", "AcdTotalEnergy", "CalBkHalfRatio", "CalCsIRLn", "CalDeadTotRat", "CalDeltaT",
"CalEnergySum", "CalLATEdge", "CalLRmsRatio", "CalLyr0Ratio", "CalLyr7Ratio", "CalMIPDiff",
"CalTotRLn", "CalTotSumCorr", "CalTrackDoca", "CalTrackSep", "CalTwrEdge", "CalTwrGap",
"CalXtalRatio", "CalXtalsTrunc", "EvtCalETLRatio", "EvtCalETrackDoca", "EvtCalETrackSep",
"EvtCalEXtalRatio", "EvtCalEXtalTrunc", "EvtCalEdgeAngle", "EvtEnergySumOpt", "EvtLogESum",
"EvtTkr1EChisq", "EvtTkr1EFirstChisq", "EvtTkr1EFrac", "EvtTkr1EQual", "EvtTkr1PSFMdRat",
"EvtTkr2EChisq", "EvtTkr2EFirstChisq", "EvtTkrComptonRatio", "EvtTkrEComptonRatio",
"EvtTkrEdgeAngle", "EvtVtxEAngle", "EvtVtxEDoca", "EvtVtxEEAngle", "EvtVtxEHeadSep",
"Tkr1Chisq", "Tkr1DieEdge", "Tkr1FirstChisq", "Tkr1FirstLayer", "Tkr1Hits",
"Tkr1KalEne", "Tkr1PrjTwrEdge", "Tkr1Qual", "Tkr1TwrEdge", "Tkr1ZDir", "Tkr2Chisq", "Tkr2KalEne",
"TkrBlankHits", "TkrHDCount", "TkrNumTracks", "TkrRadLength", "TkrSumKalEne", "TkrThickHits",
"TkrThinHits", "TkrTotalHits", "TkrTrackLength", "TkrTwrEdge", "VtxAngle", "VtxDOCA", "VtxHeadSep",
"VtxS1", "VtxTotalWgt", "VtxZDir"
DC2 at GSFC - 28 Jun 05 - T. Burnett 10
Performance of the truncated list
"Goodcal" classification tree performance(all energies)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.2 0.4 0.6 0.8 1
good fraction
bad
frac
tion
GLAST classifier
Insightful Miner
Note: this is cheating: the evaluation uses the same events as the training.
1691 leaf nodes; 3381 nodes total
Name                 Improvement
CalCsIRLn                  21000
CalTrackDoca                3000
AcdTotalEnergy              1800
CalTotSumCorr                800
EvtCalEXtalTrunc             600
EvtTkr1EFrac                 560
CalLyr7Ratio                 470
CalTotRLn                    400
CalEnergySum                 370
EvtLogESum                   370
EvtEnergySumOpt              310
EvtTkrEComptonRatio          290
EvtCalETrackDoca             220
EvtVtxEEAngle                140
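The improvement column is presumably the Gini improvement summed over all nodes that cut on each variable; a minimal sketch of that bookkeeping (hypothetical, the classifier package's internals may differ):

#include <map>
#include <string>

// Accumulate each accepted split's Gini improvement under the
// variable it cuts on; sorting the totals yields a ranking like
// the table above.
class ImportanceTally {
public:
    void recordSplit(const std::string& variable, double giniImprovement) {
        total_[variable] += giniImprovement;
    }
    const std::map<std::string, double>& totals() const { return total_; }
private:
    std::map<std::string, double> total_;
};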
DC2 at GSFC - 28 Jun 05 - T. Burnett 11
Separate tree comparison: IM vs. classifier [side-by-side plots]
DC2 at GSFC - 28 Jun 05 - T. Burnett 12
Current status: boosting works!
[Plot: “Boosting: tb-mu”; background vs. efficiency for 1, 5, 10, 15, and 20 trees]
Example using D0 data
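A hedged AdaBoost-style sketch of the reweighting idea behind boosting multiple trees (the talk does not spell out the exact variant used; the train callback and the ±1 labels are assumptions):

#include <cmath>
#include <functional>
#include <utility>
#include <vector>

struct Sample { std::vector<double> vars; int label = 1; double weight = 1; };
using Classifier = std::function<int(const std::vector<double>&)>;

// Grow nTrees trees in sequence; after each, boost the weights of
// misclassified events so the next tree concentrates on them.
// Assumes 0 < weighted error < total for every tree.
std::vector<std::pair<Classifier, double>>
boost(std::vector<Sample>& events, int nTrees,
      Classifier (*train)(const std::vector<Sample>&))
{
    std::vector<std::pair<Classifier, double>> forest;
    for (int t = 0; t < nTrees; ++t) {
        Classifier tree = train(events);   // grow tree on current weights
        double err = 0, total = 0;
        for (const Sample& e : events) {
            total += e.weight;
            if (tree(e.vars) != e.label) err += e.weight;
        }
        double alpha = 0.5 * std::log((total - err) / err); // tree's vote weight
        // Reweight: misclassified events gain weight, correct ones lose it.
        for (Sample& e : events)
            e.weight *= (tree(e.vars) != e.label) ? std::exp(alpha)
                                                  : std::exp(-alpha);
        forest.emplace_back(tree, alpha);
    }
    // The final classifier is the sign of the alpha-weighted vote.
    return forest;
}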
DC2 at GSFC - 28 Jun 05 - T. Burnett 13
Current status, next steps
• The current classifier “goodcal” single-tree algorithm applied to all energies is slightly better than the three individual IM trees. Boosting will certainly improve the result.
• Done: one-track vs. vertex: which estimate is better?
• In progress in Seattle (as we speak): PSF tail suppression, 4 trees to start.
• In progress in Padova (see F. Longo’s summary): good-gamma prediction, or background rejection.
• Switch to new all_gamma run, rename variables.