MULTIVARIATE METHODS FOR HADRONIC FINAL STATES IN
ELECTRON-POSITRON COLLISIONS AT √s = 500 GeV
Saurav Pathak
A DISSERTATION
in
Physics and Astronomy
Presented to the Faculties of the University of Pennsylvania
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
2005
Robert Hollebeek, Supervisor of Dissertation
Randy Kamien, Graduate Group Chair
© Copyright 2005
by
Saurav Pathak
To my parents
Acknowledgements
The use of data mining techniques in high-energy physics is a novelty, and the solu-
tions that they offer to the challenges posed by the next generation of accelerators are
now being appreciated. I was very fortunate to participate in the National Scalable
Cluster Project (NSCP) that explored this area under Robert Hollebeek, an early
proponent of this approach to high-energy physics data. This thesis is clearly inspired by his vision and knowledge. Prof. Hollebeek opened up this exciting world and made it possible for me to explore the limits of newer computational techniques: their amazing abilities as well as their limitations.
I wish to thank Kevin Sterner for maintaining a strong interest in my work and
setting high standards. His guidance on the design of FastCal was crucial for its
success. I shall ever be grateful for the revisions and editorial scrutiny that he has
subjected this thesis to. Though I have benefited much from this, the errors if any
are my own.
Pavlos Protopapas has been a friend and mentor throughout my graduate life at the University of Pennsylvania. His interest and expertise in computational work have been a major influence on this thesis.
I wish to thank Gary Bower at SLAC for his encouragement especially for the
development of the CJNN neural network package. I have benefited much from my
interaction with him.
I would like to offer special thanks to Turgut Durduran, Marc Llaguno, Regine
Choe and Eylem Ozkaramanli for enriching my graduate life. Turgut has been very
helpful with his advice, especially on computer matters, during emergencies and otherwise.
This thesis was possible because of the strong, unconditional emotional support that I received from my parents, my soon-to-be in-laws and my vast extended family in distant Assam, India. Their interest and their faith in me were an engine that drove me toward the completion of this thesis.
I have no words to convey my thanks for the love and support of Maina, who has
shared with me much of the joys and sorrows of graduate life. Thank you Maina, for
this wonderful journey.
ABSTRACT
MULTIVARIATE METHODS FOR HADRONIC FINAL STATES IN
ELECTRON-POSITRON COLLISIONS AT √s = 500 GeV
Saurav Pathak
Supervisor: Robert Hollebeek
We approach hadronic final state events at a future linear collider at √s = 500 GeV from the knowledge discovery (data mining) point of view. We present FastCal, a fast, configurable calorimeter Monte Carlo simulator for linear collider detector simulations that produces data at 3000 times the rate of full simulation. Neural networks based on earlystopping are designed for the jet-combinatorial problem. CJNN, a neural network package, is presented for use in the linear collider analysis environment. Neural network performance is optimized by implementing
an ensemble of neural networks. A binary tree is used to obtain novel automatic cuts
on physics variables. Data visualization is introduced as a crucial component of data
analysis, and principal component analysis is used to understand data distributions
and structures in multiple dimensions. Finally, cluster analyses with fuzzy c-means
and demographic clustering are used to partition data automatically in an unsuper-
vised regime, and we show that for fruitful use of these algorithms, understanding
the data structures is crucial.
Contents
Acknowledgements iv
Abstract vi
Contents vii
List of Tables xiii
List of Figures xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 New Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Multijet events in LC . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Multivariate Analysis and Kinematics . . . . . . . . . . . . . . . . . . 5
1.3 Knowledge Discovery as an approach . . . . . . . . . . . . . . . . . . 6
1.4 Mining for the Z boson – CDF data . . . . . . . . . . . . . . . . . . . 6
1.4.1 Demographic Clustering . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Mining for Z . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Plan of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Knowledge Discovery and Physics 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 A short history of Data Mining . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Digital databases . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Classical Statistics . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.4 Meeting together . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Knowledge Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 An example - Binary Tree . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Description of KDD . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Knowledge discovery and scientific discovery . . . . . . . . . . . . . . 24
2.5 Knowledge discovery and high-energy physics . . . . . . . . . . . . . 25
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Neural Networks 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 The problem – Separation of Z from background . . . . . . . . . . . . 28
3.3 Rationale for a neural network . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 The F -measure . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 Comparison with some other methods . . . . . . . . . . . . . 30
3.4 A discussion on neural networks . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 The neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 The network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Neural Network Training . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.1 Data representation . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.2 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.3 Bias-variance dilemma and earlystopping . . . . . . . . . . . . 36
3.6 Neural Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.1 The number of layers . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.2 The number of units in each layer . . . . . . . . . . . . . . . . 39
3.6.3 Fixing ε . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6.4 Root mean square weights . . . . . . . . . . . . . . . . . . . . 40
3.7 What do neural networks do . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 FastCal: A fast Monte Carlo simulator for the LCD calorimeter 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Simulated data and real data . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Rationale for FastCal . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Description of existing FastMC . . . . . . . . . . . . . . . . . 48
4.4 Description of FastCal . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.1 Design philosophy and approach . . . . . . . . . . . . . . . . . 49
4.4.2 Geometry of Calorimeter . . . . . . . . . . . . . . . . . . . . . 50
4.5 FastCal simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.1 Particle Propagation . . . . . . . . . . . . . . . . . . . . . . . 52
4.5.2 High-energy physics processes . . . . . . . . . . . . . . . . . . 56
4.6 Single Particle Comparison . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6.1 Low-energy physics processes . . . . . . . . . . . . . . . . . . 63
4.6.2 A synopsis of the hadronic particle simulations . . . . . . . . . 69
4.7 FastCal gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Neural Networks – comparison between GISMO and FastCal 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Calorimeter deposition . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 Cluster level comparison . . . . . . . . . . . . . . . . . . . . . 75
5.2.2 Jet level comparison . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.3 Jet-Quark Association . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Neural network training . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 FastCal-GISMO comparison . . . . . . . . . . . . . . . . . . . 81
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Neural Network – results 84
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Jet-Boson Association . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Ensemble of neural networks . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.1 Ensemble Results . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4 Classifying Z and W jet-pairs . . . . . . . . . . . . . . . . . . . . . . 90
6.4.1 Distinguishing W and Z . . . . . . . . . . . . . . . . . . . . . 93
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7 Unsupervised Methods 97
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2.1 The combinatorial problem and Principal Component Analysis 100
7.2.2 PCA to 2 dimensions . . . . . . . . . . . . . . . . . . . . . . . 101
7.2.3 The density plot . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.4 PCA at the quark level . . . . . . . . . . . . . . . . . . . . . . 109
7.3 Clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.1 Fuzzy clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3.2 Demographic clustering on the combinatorial problem . . . . . 117
7.3.3 Comparing clustering result . . . . . . . . . . . . . . . . . . . 119
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8 Conclusion and Future work 121
8.1 The goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
8.2 The Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2.1 FastCal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2.2 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2.3 Exploratory Data Mining . . . . . . . . . . . . . . . . . . . . 124
8.2.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.3 Data mining and the future . . . . . . . . . . . . . . . . . . . . . . . 126
8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A Using FastCal in JAS 128
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.2 Use of the FastCal package . . . . . . . . . . . . . . . . . . . . . . . . 129
B CJNN – Neural Network GUI package for JAS 131
B.1 A brief description of CJNN . . . . . . . . . . . . . . . . . . . . . . . 131
B.2 Use of CJNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.4 Training interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.5 Error Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
B.6 Application of CJNN in code . . . . . . . . . . . . . . . . . . . . . . 136
C Demographic Clustering – an example 138
C.1 Condorcet’s Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
C.2 Demographic Clustering example . . . . . . . . . . . . . . . . . . . . 139
D Binary Trees 141
D.1 SPRINT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
D.1.1 Split Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
E Fuzzy Clustering 143
E.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
F Principal Component Analysis and Multidimensional Scaling 145
F.1 Classical Multidimensional Scaling . . . . . . . . . . . . . . . . . . . 145
F.2 MDS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Glossary 147
Bibliography 151
List of Tables
2.1 Alternative cuts for Z reconstruction. CUTS II are obtained from a
decision tree modeled on the jet pair invariant mass mjj. . . . . . . . 23
3.1 F -measure (Equation 3.3) for different classifiers. . . . . . . . . . . . 31
3.2 The binary tree rule for one of the branches trained on the neural network result. . . . 44
4.1 Detector Parameters ldmar01, a design specification for the LD design,
used in this study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Table of parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Table of lengths. Tyvek is not included in the calculation. Since it
forms such a thin layer in comparison to the others, it is expected to
produce minor corrections. . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 A comparison between FastCal and GISMO. Four jet events are those
in which both the Z’s decay hadronically, and the events have at least
four jets. Good pairs are those pairs of the four jets that can be
associated with the quarks from the same Z. . . . . . . . . . . . . . 71
5.1 The F -measures (Equation 3.3) for the three data sets. . . . . . . . 82
6.1 Comparison of the Boson Content and the Angular Proximity methods
in Z pair identification. 1 denotes a good jet pair and 0 denotes a bad
jet pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1 The transformation matrix. Note that the transformation is between
the scaled and standardized (zero mean, unit variance) values of vari-
ables in the left column and the principal components . . . . . . . . . 102
7.2 Fuzzy clustering for 2 clusters (c = 2). The cluster 1 is interpreted as
the incorrect jet-pair cluster. The F -measure is calculated for cluster
0 and interpreted as the correct jet-pair cluster. . . . . . . . . . . . . 114
7.3 Fuzzy clustering for 3 clusters (c = 3). The F -measure for cluster 0
does not improve with the increase of c from 2 to 3. . . . . . . . . . 115
7.4 The centroids of the three clusters. Note that the variables are stan-
dardized to zero mean and unit variance. . . . . . . . . . . . . . . . . 115
7.5 Comparison between fuzzy c-means and demographic clustering. . . 119
List of Figures
1.1 Invariant mass distribution m of jet pairs, for events with a good e+e−
pair (electrons with energies greater than 20 GeV). . . . . . . . . . . 9
1.2 Clustering of jet pairs with a pair of e+e−. Clustering is performed on the following variables: jet-jet invariant mass (m_ij), jet-jet opening angle (dphi), the transverse momenta of the first and second jets (pt_i and pt_j, respectively) and the rapidities of the first and second jets (y_i and y_j, respectively). The two jets in each jet pair are ordered so that the first jet has the higher energy. The gray filled histograms denote the distributions of the entire dataset, whereas the histograms with thick lines denote distributions for the specific cluster. 10
1.3 Invariant mass distribution m of jet pairs with lowered e+e− threshold
energy (5 GeV). Note that the Z bump clearly visible in Figure 1.1 is
washed out. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Clustering of jet pairs with lowered e+e− threshold energy, representing
Z. The variables are those given in Figure 1.2. . . . . . . . . . . . . 12
1.5 Clustering of jet pairs with lowered e+e− threshold energy, representing
W±. The variables are explained in Figure 1.2 . . . . . . . . . . . . . 13
1.6 The Z cluster in the 2 maximum energy jet variables. The variables
are the same as in Figure 1.2, but the subscripts (1,2) now represent
the two massive jets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 The Z cluster in the 2 maximum energy jet variables. The variables are the same as in Figure 1.2, but the subscripts (1,2) now represent the two massive jets. The threshold jet energy is lowered to 5 GeV. The distance unit is set for the invariant mass m_12 (15 GeV), the transverse momenta pt_1 (5 GeV) and pt_2 (5 GeV). . . . 15
2.1 The graph on the left is one obtained from cuts. The one on the right
is from cuts obtained from the binary tree method. . . . . . . . . . . 22
3.1 Performance of different classifiers. For a quantitative comparison, see
Table 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 The distribution of the two output neural units trained on output data
that is either (1,0) or (0,1). The straight line is not a fitted line, but
the output distribution of the two neural units, showing the sum of the
outputs is unity, as is required for a probabilistic interpretation. There
are a few outliers present, though, which shows that some regions of
the variable space were not adequately represented and these regions
were under-trained. . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 (a) With ε = 0.01, the weights decay too fast for the neural network
to learn. This is denoted by the flat error curve in the figure. The
difference in the error values on the training and the validation datasets
is due to the difference in size. (b) With ε = 0.00001, the error curves
exhibit the same behavior as for ε = 0. . . . . . . . . . . . . . . . . . 41
3.4 The RMS weights of neural units in the first and second hidden layers.
In (a), the low-valued RMS weights are clearly isolated, and all are less than 1. In (b), the low-valued RMS weights are not isolated. So we
follow (a) and delete the neural units that have RMS weights less than
1.0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 With a √⟨W²⟩ (RMS weight) cutoff of 1.0, a 6-21-22-2 network is obtained, which, after the initial spike in the learning curve, continues in the expected fashion.
Also note that the error on the validation dataset is lower compared
to the error from the validation set on the older network. This indi-
cates that the generalization capability has improved with pruning. By
contrast in the more drastic pruning (b), we see that after the spike,
the error curves do not follow the behavior of the unpruned network,
thus exhibiting poor learning. This also reinforces our assertion that
we require at least two hidden layers to solve the problem. . . . . . . 43
4.1 Plot for Hadronic Barrel Calorimeter for neutral hadrons. Each circle
represents an intersection point of a neutral particle with the inner
surface of a barrel calorimeter. . . . . . . . . . . . . . . . . . . . . . 54
4.2 Plot for Hadronic Barrel Calorimeter for neutral hadrons. . . . . . . 54
4.3 Plot for Hadronic Barrel Calorimeter for charged hadrons. . . . . . . 55
4.4 Plot for Hadronic Barrel Calorimeter for charged hadrons. . . . . . . 56
4.5 Event with charged final state particle trajectories in FastCal and
GISMO for comparison. Figure (a) is obtained from FastCal and Fig-
ure (b) is the same event simulated by GISMO and viewed using LCD-
Wired. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Calorimeter response to single e− events at 50 GeV. The left column of histograms shows the FastCal response, the right column the GISMO response. The top figures show the total energy deposit (E_HCAL + E_ECAL). The middle figures show the energy in the ECAL (E_ECAL), and the bottom figures the HCAL response (E_HCAL). Note that there is no energy leakage from the ECAL in FastCal, represented by the flat distribution of HCAL energy in the bottom left figure. . . . 58
4.7 Calorimeter response to single π− particle at 50 GeV without any fluc-
tuation. The placement of the histograms is the same as that described
in Figure 4.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8 Calorimeter response to single π− particle at 50 GeV with fluctuations
in the shower origin. The placement of the histograms is the same as
that described in Figure 4.6. . . . . . . . . . . . . . . . . . . . . . . 62
4.9 Calorimeter response to single π− particle at 50 GeV with fluctuations
in the shower origin. The placement of the histograms is the same as
that described in Figure 4.6. . . . . . . . . . . . . . . . . . . . . . . 64
4.10 Calorimeter response to single π− particle at 50 GeV with additional
fluctuations in the shower length. The placement of the histograms is
the same as that described in Figure 4.6. . . . . . . . . . . . . . . . 65
4.11 The distribution of w. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.12 Calorimeter response to single π− particle at 50 GeV with additional
fluctuations in the w. The placement of the histograms is the same as
that described in Figure 4.6. . . . . . . . . . . . . . . . . . . . . . . 67
4.13 The quadratic fit for dE/dx. . . . . . . . . . . . . . . . . . . . . . . 68
4.14 Comparison of low energy depositions in the ECAL for 50 GeV negative
π with minimum ionization and δ-ray simulations. Note that the lowest
energy deposit matches well due to the fitting on the minimum energy
given by Equation 4.15. The spread to the right is due to the δ-ray
simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.15 The algorithm for FastCal energy deposition in ECAL due to hadronic
showers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 The ECAL energy deposition obtained in (a) FastCal and (b) GISMO.
The two distributions are in qualitative agreement. . . . . . . . . . . 74
5.2 The HCAL energy deposition obtained in (a) FastCal and (b) GISMO.
The two distributions are in qualitative agreement. . . . . . . . . . . 75
5.3 The total calorimeter energy deposition obtained in (a) FastCal and
(b) GISMO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4 The total energy per event in final state e+, e−, γ and hadrons (gen-
erator level). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 (a) The comparison of cluster energy distribution in FastCal and GISMO
(FullSim). There is a general agreement between FastCal and GISMO,
except for low energies. (b) The first bin in (a) blown up to show
GISMO has a high number of low energy clusters (<1 GeV). The very
low energy “garbage” clusters do not impact the jet finding algorithms. 77
5.6 Jet energy distribution in FastCal and GISMO (FullSim). The numbers of jets match, except for low energy jets. . . . 78
5.7 (a) The comparison of cluster energy distribution in FastCal and GISMO.
There is a general agreement between FastCal and GISMO, except for
low energies. (b) The first bin in (a) blown up to show GISMO has a
high number of low energy clusters (<1 GeV). . . . . . . . . . . . . . 78
5.8 The jet-quark energy plot for GISMO events with one Z decaying hadronically, while the other decays into muons and neutrinos. The most energetic jets are then associated with the quarks, and a match is established if the larger of the two jet-quark opening angles is less than 0.3 radians. . . . 80
5.9 The jet-quark energy plot for FastCal events with one Z decaying hadronically, while the other decays into muons and neutrinos. The most energetic jets are then associated with the quarks, and a match is established if the larger of the two jet-quark opening angles is less than 0.3 radians. . . . 80
5.10 The neural network training graph. The lowest validation error is
reached at the 780-th training cycle. The network at this stage is used
in the testing phase. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1 The energy fraction of each jet for the first boson, given by Equation 6.1 86
6.2 The ε-p graph illustrating the ensemble result. See Section 6.3.1. . . 89
6.3 The ε-p curves display the performance of a neural network trained on
e+e− → W+W− data and tested (1) on e+e− → ZZ data (Z NN on
W data) and (2) on e+e− → W+W− data (W NN on W data). Note
that the network trained on W data performs equally well on Z data. . . 91
6.4 The ε-p curves display the performance of a neural network trained on
e+e− → ZZ data and tested (1) on e+e− → ZZ data (Z NN on Z data)
and (2) on e+e− → W+W− data (W NN on Z data). As in Figure 6.3,
this shows that the networks trained on right and wrong jet pairs are
picking up the features of a general heavy boson. . . . . . . . . . . . 92
6.5 The ε-p curve for the Z jet pairs against a background of W jet pairs is given by the dark line. For comparison, the ε-p curve for the Z jet pairs against the background of wrong jet pairs is given. The dark line is straighter, denoting a poor classification performance in the mid probabilities (See Figure 6.6). . . . 94
6.6 The neural network output of the first unit (probability of Z jet pair).
Since there are two classes, the probability for W is complementary.
The classification is not optimal between the output values of 0.15
and 0.7, and it is particularly suboptimal between 0.15 and 0.5. The
peaks most likely denote subclasses of the jet pairs that have particular
kinematic variable distributions. . . . . . . . . . . . . . . . . . . . . 95
7.1 The surface matrix plot for the four variables mentioned in Section 7.2.1. . . . 102
7.2 The scree plot, which displays the eigenvalues of the principal components. . . . 103
7.3 The two orthogonal components with the highest standard deviations
after principal component analysis. In this plot, the Z jet-pairs are
given in black and the wrong combinations are given in gray. . . . . 104
7.4 The surface plot for the density distribution, calculated using the Gaussian-like kernel described above. Notice the peaks. The higher of the two peaks represents the signal. . . . 107
7.5 The contour plot of the density distribution. . . . . . . . . . . . . . 107
7.6 The efficiency-purity curve. The two curves compare the result of (a) the classical multidimensional scaling procedure and (b) the neural network. In (a), a density cutoff is used to identify the Z. The density is obtained in a 2-dimensional reduced subspace obtained from a classical scaling on the following variables: E_1E_2, m_12, θ_12 and δm²_12. The neural network is trained on E_1, E_2, cos θ_1, cos θ_2, m_12 and θ_12. . . . 108
7.7 The PCA plot using the same transformation as the jets but on quark
information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.8 The right and wrong combinations at the quark level. . . . 111
7.9 Four perspectives of three-dimensional PCA at the quark level. The dark (black) points are Z and the light (green) points are wrong combinations. The two classes form different distributions. See Section 7.2.4 for a description. . . . 112
7.10 The plot in the first two principal components on the fuzzy cluster data.
Note the structures in the data distribution. The dark datapoints are
the correct jet-jet pairs, and the lighter datapoints are the incorrect
jet-jet pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.11 Demographic Clustering: The purity as a function of the distance parameter on the θ_12 variable. . . . 118
7.12 Demographic Clustering: The efficiency-purity graph obtained by varying the distance parameter on the θ_12 variable. Comparison is made with fuzzy clustering, a θ_12 cut and an ensemble of neural networks. . . . 120
B.1 The network setup panel. The working directory sets the folder in which the data files exist and to which all CJNN-created files are copied. Network configuration configures the basic network architecture. The neural network can alternatively be read from an existing configuration file. Further, the configuration of a network can be written to a file for later use. . . . 133
B.2 The second panel in the GUI sets up the training parameters. eta is
the learning coefficient and alpha is the momentum coefficient term.
Rescaling the data transforms variables such that they have zero mean
and unit variance along columns. The training parameters can be
written to and read from a file. . . . . . . . . . . . . . . . . . . . . . 134
B.3 The third panel, under Weights Initialization, sets the weights in the
network. This can be done either from a file, or the weights can be
randomized. Once the weights are set, the iterative training can begin. 135
Chapter 1
Introduction
1.1 Motivation
The experimental program in high-energy physics has seen tremendous achievements
and successes in recent times. In particular, a unified electroweak model, the Standard
Model (SM) [36, 87, 77] was put in place by the discovery of the W [6] and the
Z [9] bosons. Numerous precision tests at e+e− colliders have further vindicated this model, and it is now generally accepted that it does indeed describe the physics at the energy scales accessible to present particle colliders.
Though the Standard Model has seen such success in precision tests, not all of its assertions have been verified.
The Standard Model, which has a SU(2)×U(1) symmetry, has a spectrum of mass-
less particles. The masses are generated by what is known as the Higgs mechanism
which requires a boson multiplet to provide the mass term via vacuum expectation
values, thus breaking the symmetry. Though there have been suggestions of evidence for the Higgs boson, its existence has not yet been proven. Further, it is still not proven that the Higgs boson indeed provides the masses of all the fermions and gauge bosons
in the SM particle spectrum. A lighter Higgs is presently favored, and with the indication of a Higgs at about 115 GeV from Large Electron-Positron Collider (LEP) runs, and a lower bound of about 113 GeV at 95% CL [10], it is now highly probable that the Higgs will be discovered either at the Large Hadron Collider (LHC) or sooner at the Tevatron. To prove that the Higgs is responsible for the gauge boson masses as well as the fermion masses, it is essential to obtain the absolute couplings of these particles to the possible extended Higgs sector. This will be possible only at the linear collider [3].
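For concreteness, the mechanism sketched above can be summarized by the standard textbook scalar potential of the minimal SM Higgs sector (a reminder of well-known results, not a derivation specific to this thesis):

```latex
% Minimal SM Higgs sector: one complex SU(2) doublet \phi.
V(\phi) = -\mu^2\,\phi^\dagger\phi + \lambda\,(\phi^\dagger\phi)^2,
\qquad
\langle\phi\rangle = \frac{1}{\sqrt{2}}\begin{pmatrix}0\\ v\end{pmatrix},
\qquad
v = \frac{\mu}{\sqrt{\lambda}} \simeq 246~\mathrm{GeV}.
% Expanding about this vacuum breaks SU(2) x U(1)_Y down to U(1)_EM and
% yields m_W = g v / 2, while the Yukawa couplings y_f give the fermions
% masses m_f = y_f v / \sqrt{2}.
```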
The SM Higgs prescription, in which a single complex scalar doublet breaks the gauge symmetry and then provides the finite fermion masses through Yukawa couplings, can be further extended. Supersymmetry (SUSY), in particular, demands at least two complex doublets.
Therefore, one of the major concerns in high-energy physics is the discovery of the Higgs boson and of the mechanism of mass generation. But the discovery of a Standard Model Higgs boson is itself not sufficient: a single fundamental Higgs boson has a theoretical problem of its own, known as the hierarchy problem [83].
1.1.1 New Physics
The hierarchy problem is characterized by the need to fine-tune the renormalized
coupling constants and masses to 26 decimal places, if the Standard Model is to be
considered as an effective theory of a more fundamental unified theory. The objection
against fine-tuning comes not from formal mathematics, but from the naturalness of
the gauge theory.
To solve the hierarchy and other theoretical problems that are inherent in the
Standard Model, many alternative extensions have been suggested. A particularly
popular solution is provided by supersymmetry, which relates fermions with bosons.
Under this symmetry, all particles in nature have a supersymmetric partner that has
identical properties except spin and R-parity. Supersymmetry solves the hierarchy
problem elegantly by producing spontaneous symmetry breaking in the electroweak
sector as a by-product of the supersymmetry breaking in the soft terms. Supersym-
metric particles have not been observed because the supersymmetry breaking itself
gives the super-partners masses in a range that has not been reached by the particle
colliders built to date. At higher energies, it is possible that super-partners will
be produced, and their indirect signals will be observed.
Supersymmetry does not provide a single prescription for the problems of the Standard
Model Higgs sector; instead it offers a number of individual supersymmetric
models. Other, non-supersymmetric models have also been
suggested: models based on new gauge interactions, or on extra spatial dimensions and
quantum gravity.
At the linear collider, it is expected that the Higgses will be produced by various
production processes, depending upon the correct model. But for a light Higgs, the
Higgsstrahlung process is the most favored, which produces a Z boson and a Higgs.
The expected backgrounds are the SM ZZ and WW processes. A lighter Higgs is
expected to decay mostly to bb, with WW and ZZ decays being important. For a
heavier Higgs, the WW and ZZ decays become dominant. In the decoupling limit of
the minimal supersymmetric model (MSSM), the light CP-even (charge-parity) Higgs
has properties close to the SM Higgs. For lighter mA0, all the physical Higgs bosons
would be produced at the linear collider [3].
1.1.2 Accelerators
Some of these issues that confront high-energy physics will be addressed in the new
accelerators that are being built. The next high energy frontier machine is the Large
Hadron Collider which will operate at 14 TeV and collide protons. Although there
have been indications of a Standard Model Higgs at LEP, it is expected to be conclu-
sively found at either the Tevatron or the LHC. The LHC, with a much higher energy
reach, will also begin to explore the new physics.
Although the LHC is expected to discover the Higgs and signals of the new physics,
it will have to be complemented by a program of precision high-energy physics
experiments at an e+e− linear collider (LC). It is expected that a 500 GeV linear collider
with luminosity above 10^34 cm^−2 sec^−1, and energy extensible to the TeV range, will
be required for the precision experiments that will be performed [3].
1.1.3 Multijet events in LC
A linear collider running at high energy will mostly produce heavy particles: Z, W±,
and heavy quarks. The heavy quarks produce multiple jets. The massive gauge
bosons decay hadronically with a branching ratio of about 70%. Therefore, physics
processes in the linear collider at 500 GeV produce copious multi-jet events. With
the decay particles carrying high energy, the jets are well collimated and well defined.
The physics experiments that will be carried out in the linear collider will deal largely
with jets.
For example, for the Higgsstrahlung process, e+e− → Zh, which is the main
Higgs boson discovery process, the background is the double Z production process:
e+e− → ZZ [2]. In the hadronic decay of both the Z and the h in the two processes,
it is not clear which jets arise from which particle.
High jet multiplicity is a feature of many of the processes that will be studied at the
linear collider. In the tt process there are at least 6 jets; in the tth process there are
at least 8 jets; and in the AH process there are at least 12 jets.
A linear collider running at 500 GeV or higher will thus be characterized by
multi-jet events. A primary issue in such events is associating jets with decaying
particles. In a collider that is expected to see many novel processes never before seen,
methods are required that identify jets and associate them with decaying particles
with minimum theoretical bias.
In the scenarios held up by the different possible models, the Higgs decays to the
heavy gauge bosons and to quarks are significant. Since the W and Z themselves decay
predominantly into hadrons, the study of hadronic jets becomes of paramount impor-
tance. Typically, events will have multiple hadronic decays, and the problem will be
to classify the event and associate the jets with the parent particle.
1.2 Multivariate Analysis and Kinematics
The problem mentioned above is essentially a multivariate problem. As we try to
tackle the issues mentioned in the previous section, we are forced to handle many
different measured quantities simultaneously. The traditional approach in high-energy
physics analysis is to perform cuts one variable at a time, which is a monovariate
approach. This can be done in a more efficient way by recognizing the multivariate
nature of the problem, and using multivariate methods.
One popular method presently used in the high-energy physics (HEP) community
is the feedforward neural network. We propose to use neural networks as the
benchmark and to develop multivariate algorithms to efficiently classify events, associate jets,
and understand the algorithms themselves.
In this thesis, we propose to use kinematics alone. Although for specific studies
additional signals like muon distribution, strange particles and jet charges will provide
significant inputs, in general kinematics will be sensitive to new physics. Therefore,
we will confine the use of multivariate methods to the use of kinematics.
We attempt here to develop an approach to multijet events that is based on
knowledge discovery, which is introduced in the following section.
1.3 Knowledge Discovery as an approach
The knowledge discovery methods and approach are those that have developed within
the discipline of machine learning outside traditional statistics. In general, these data-
centric methods require an iterative learning process by which algorithms produce a
set of rules. These rules, which are the results of the data mining algorithm, constitute
the discovered knowledge; and the entire process, including the algorithmic learning
process, constitutes the discovery phase. It should be noted here that even though
the discovery phase possibly uses some prior domain knowledge (in this case physics
knowledge), the algorithmic learning process does not use any domain knowledge.
The knowledge discovery approach is further discussed in Chapter 2. Often the
phrase “data mining” is used interchangeably with “knowledge discovery in databases”
(KDD). Furthermore, we use the phrase “knowledge discovery approach” to mean the
use of the data-centric approach of data mining.
To illustrate the knowledge discovery approach, we use the Demographic Clustering
method in the following section to search for Z bosons. The Demographic
Clustering algorithm is explained in Appendix C.
1.4 Mining for the Z boson – CDF data
In a typical particle collision, particles interact at high energies giving off many other
particles. As collider energies increase, the number of particles produced in every
event and the number of processes involved increase, resulting in very big datasets.
The problem of finding the relevant events with the desired process and particles is,
therefore, a formidable one. As the data produced by the colliders increase, there
exists a need for efficient algorithms that can sieve out the events a physicist might
be interested in. One such approach to retrieving relevant data from large datasets is
Data Mining. Data Mining techniques are being increasingly used to look for patterns
in large datasets. In addition to the data that one might have been looking for, these
algorithms are also capable of finding patterns in the data that are hitherto unknown.
As a first illustration, we will examine the efficacy of such a technique in detecting
the Z boson in Collider Detector at Fermilab (CDF) events. The pp collisions pro-
duce Z bosons, which decay via the hadronic and leptonic modes. The hadronic decay
modes predominate, as the branching ratios show. In leptonic decays, neutrinos escape
detection and muons leave a minimal signal in the calorimeters. The dominant tau
particle decay modes include neutrinos, which make them difficult to use for recon-
struction. Therefore, the electron-positron decay channel is the only leptonic decay
mode that could be used for Z boson reconstruction from calorimeter information.
Here we use Data Mining to look for significant events that have Z boson decays.
As an initial step, jet-pair data for events with good e+e− pairs are used for mining.
Then the threshold energies of the e+e− are lowered to include more events. Finally
we introduce even more noise by considering only the two most massive jets with
energies above certain threshold energies.
1.4.1 Demographic Clustering
The Data Mining algorithm used is demographic clustering [37]. In Appendix C an
example is given to illustrate the algorithm. We use this method to partition the data
into different clusters in the n-dimensional phase space of the chosen jet parameters.
Clusters are groups of data points that have the following two properties [41]:
(1) homogeneity within clusters, i.e., data that belong to the same cluster should be
as similar as possible; and (2) heterogeneity between clusters, i.e., data that belong
to different clusters should be as different as possible. We examine here whether
the clusters that demographic clustering provides us have physically significant
characteristics. We shall also examine which choice of parameters is best
suited for this kind of analysis.
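These two criteria can be made concrete with a toy numerical check that is independent of any particular clustering algorithm (the Demographic Clustering method itself, described in Appendix C, uses its own voting-based criterion). The function names below are illustrative, not from the thesis code: a partition is good when the mean pairwise distance within each cluster is small and the distance between cluster centroids is large.

```python
import math

def mean_pairwise_distance(points):
    # Average Euclidean distance over all unordered pairs of points.
    n = len(points)
    total, pairs = 0.0, 0
    for a in range(n):
        for b in range(a + 1, n):
            total += math.dist(points[a], points[b])
            pairs += 1
    return total / pairs if pairs else 0.0

def cluster_quality(clusters):
    # Homogeneity: small mean distance within each cluster.
    within = [mean_pairwise_distance(c) for c in clusters]
    # Heterogeneity: large mean distance between cluster centroids.
    centroids = [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]
    between = mean_pairwise_distance(centroids)
    return within, between
```

For two well-separated toy clusters, `between` comes out much larger than every entry of `within`, which is the signature of a good partition.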
1.4.2 Mining for Z
In the leptonic decay mode, the Z boson decays into an energetic back-to-back lepton
and anti-lepton pair. The invariant mass of the Z is calculated from the four-momenta
of the electron/positron pairs. Since the mass of the Z boson is high, we expect the
transverse momenta to be high, and the rapidities to have a narrow distribution.
Therefore, as a first choice, we choose the following jet parameters for clustering: the
invariant mass (mij), transverse momenta (pTi,j) of the two jets, their rapidities (yi,j)
and the angle (δφ) between the jet pairs. The subscript i (j) represents the jet with
higher (lower) energy. Note that the choice of these variables is the use of domain
(physics) knowledge.
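A sketch of how these clustering variables follow from the jet four-momenta is given below; the function and variable names are illustrative, not from the analysis code, and the jets are assumed to be energy-ordered as in the text.

```python
import math

def jet_pair_variables(jet_i, jet_j):
    # jet_i, jet_j: four-momenta (E, px, py, pz) in GeV, with jet_i the
    # higher-energy jet; we assume E > |pz| so the rapidity is defined.
    Ei, pxi, pyi, pzi = jet_i
    Ej, pxj, pyj, pzj = jet_j
    # Invariant mass of the pair: m^2 = (Ei + Ej)^2 - |p_i + p_j|^2
    m2 = (Ei + Ej) ** 2 - ((pxi + pxj) ** 2 + (pyi + pyj) ** 2
                           + (pzi + pzj) ** 2)
    m_ij = math.sqrt(max(m2, 0.0))
    # Transverse momenta
    pt_i = math.hypot(pxi, pyi)
    pt_j = math.hypot(pxj, pyj)
    # Rapidities y = (1/2) ln((E + pz)/(E - pz))
    y_i = 0.5 * math.log((Ei + pzi) / (Ei - pzi))
    y_j = 0.5 * math.log((Ej + pzj) / (Ej - pzj))
    # Azimuthal opening angle, folded into [0, pi]; the thesis histograms
    # span [0, 2*pi], but a back-to-back pair sits near pi either way.
    dphi = abs(math.atan2(pyi, pxi) - math.atan2(pyj, pxj))
    if dphi > math.pi:
        dphi = 2 * math.pi - dphi
    return m_ij, pt_i, pt_j, y_i, y_j, dphi
```

For a Z-like pair, this yields a large invariant mass, high transverse momenta, small rapidities, and dphi near π, exactly the profile the clusters below are judged against.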
Events with good electrons
As a preliminary first step, the jet+2 events are clustered, with 2 “good” electrons
having energies above 20 GeV. Figure 1.1 shows the mass distribution of the jet
pairs for events with exactly one electron and one positron. Because of the strong
QCD background, the bump in the jet invariant masses is feeble compared to that
in the electron invariant masses, but the bump at 91 GeV is still clearly visible
against the QCD background.
Figure 1.2 shows the distribution of the six different variables for one cluster
Figure 1.1: Invariant mass distribution m of jet pairs, for events with a good e+e−
pair (electrons with energies greater than 20 GeV).
among many. The distributions indicate that this cluster is a good candidate for
the Z boson. Both the transverse momenta are high and the rapidities are low.
The invariant mass distribution has a peak at around 90 GeV and the two decaying
particles are back-to-back. The cluster is not a pure Z sample because it also contains
a low-mass component with small opening angles. The clustering algorithm was thus
able to isolate a physically significant class of phenomena, though it was contaminated
by another phenomenon. A strong candidate for the contaminating phenomenon is
photon conversion to an electron-positron pair.
Events with lowered threshold energy
As a second step, more noise is introduced into the data by lowering the electron
energy threshold (including electrons with energies above 5 GeV). Figure 1.3 shows the
Figure 1.2: Clustering of jet pairs with a pair of e+e− (ge00, Cluster 3, 10.3% of the
population). Clustering is performed on the following variables: jet-jet invariant mass
(m_ij), jet-jet opening angle (dphi), the transverse momenta of the first and second
jets (pt_i and pt_j respectively) and the rapidities of the first and second jets (y_i
and y_j respectively). The two jets in the jet pairs are ordered so that the first jet
has higher energy than the second. The gray filled histograms denote the distributions
of the entire dataset, whereas the histograms with thick lines denote distributions for
the specific cluster.
resulting jet invariant mass distribution, within which the 91 GeV bump is signifi-
cantly washed out.
Figure 1.4 shows the 91 GeV clustering result on the loose-cut sample. Significantly,
the low-mass, small-angle jets separate out and form a different cluster.
This cluster can be identified with the Z decay. The result is shown in the table
below.
       Mean        Std Dev
mij    95.89 GeV   21.23
δφ     3.09 rad    0.51
Figure 1.3: Invariant mass distribution m of jet pairs with lowered e+e− threshold
energy (5 GeV). Note that the Z bump clearly visible in Figure 1.1 is washed out.
Figure 1.5 shows another cluster, with mij = 89.14 GeV and σ = 36.57 GeV. It shows
that one of the jets (the j-jet) has high transverse momentum and low rapidity. This
indicates that this cluster is a good candidate for the W± boson, which decays into
an electron and a neutrino. The standard deviations are rather large, which could be
because the distance unit has not been properly set, so that the clustering is not
tight enough to sieve out some spurious data.
Clustering of the two massive jets
Since the Z boson decays into a pair of energetic e+e−, there exists a high probability
that they are picked up as the two most massive jets in the event. But when only the
two most massive jets are used to reconstruct Zs, a lot of QCD background is also
Figure 1.4: Clustering of jet pairs with lowered e+e− threshold energy, representing
the Z (ge33, Cluster 6, 3.8% of the population). The variables are those given in
Figure 1.2.
picked up, and any information regarding the invariant mass bumps is completely
washed out.
With the distance unit set to 30 GeV for the invariant mass m12 (the subscript 1
represents the jet with higher energy), a cluster is found with the same characteristics
as that for the Z boson in the other cases: high transverse momenta, low rapidity and
back-to-back angle [Figure 1.6]. This cluster is a strong candidate for Z decay.
       Mean      Std Dev
mij    115.11    50.84
δφ     3.06      0.66
Even though the mass of the Z falls within the standard deviation, the error is
very high. To increase the confidence level that only Z decay events are included
Figure 1.5: Clustering of jet pairs with lowered e+e− threshold energy, representing
the W± (ge33, Cluster 8, 3.3% of the population). The variables are explained in
Figure 1.2.
in the cluster, further clustering is required to pick out the significant events.
For clustering with more noise, the distance parameter becomes very important.
In Figure 1.7, the clustering result is shown, with the threshold of the jet energies
lowered to 5 GeV. The distance parameters are set for the invariant mass m12
(15 GeV) and the transverse momenta pt1 (5 GeV) and pt2 (5 GeV). Cluster
number 16 can be identified with the Z boson. The cluster is 0.5% of the whole data
set, which is partitioned into more than 60 clusters. The average invariant mass
(m12) is 119.97 GeV,
with a standard deviation of 38.43 GeV. Though this data set had more noise, the
clustering resulted in a lower standard deviation, which underlines the importance of
the distance parameter.
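The role of the distance unit can be sketched with a toy similarity vote: two records count as similar on a field when their values lie within that field's distance unit, so tightening the units makes fewer pairs similar and the resulting clusters tighter. This is a simplified illustration, not the actual Demographic Clustering implementation, and the sample jet-pair numbers are hypothetical.

```python
def similarity_votes(rec_a, rec_b, units):
    # One vote per field: +1 if the two values lie within that field's
    # distance unit, -1 otherwise (a simplified pairwise voting score).
    score = 0
    for a, b, u in zip(rec_a, rec_b, units):
        score += 1 if abs(a - b) <= u else -1
    return score

# Two hypothetical jet pairs described by (m_12, pt_1, pt_2) in GeV:
pair_a = (119.0, 40.0, 35.0)
pair_b = (140.0, 43.0, 38.0)
loose = (30.0, 10.0, 10.0)   # wide distance units
tight = (15.0, 5.0, 5.0)     # the tighter units used in Figure 1.7
```

With the loose units the two pairs agree on all three fields; with the tight units they disagree on the invariant mass, so they are less likely to land in the same cluster, which is how a smaller mass unit can yield a smaller cluster standard deviation.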
Figure 1.6: The Z cluster in the two maximum-energy jet variables (gj12x, Cluster 0,
8.6% of the population). The variables are the same as in Figure 1.2, but the
subscripts (1, 2) now represent the two most massive jets.
1.4.3 Discussion
The clustering method used above illustrates a naive application of the knowledge
discovery method. Although the domain knowledge input in the example above was
made in the choice of the physically significant variables, the Demographic Clustering
method is unaware of this. It performs the partition of the input data into clusters
solely on the basis of the distribution in the six-dimensional variable space. The in-
variant mass of the jet pair is explicitly used in the dataset since it is the strongest
discriminant of parent particles. The Demographic Clustering method orders it con-
sistently as the most significant discriminator in all the clusters seen above. It lists
the opening angle δφ as the second most important discriminant. The significance of
this result is that the clustering method could automatically find a physically
significant cluster and, further, could identify the most important discriminant
among the six variables.

Figure 1.7: The Z cluster in the two maximum-energy jet variables (gj33x (15,5,5),
Cluster 16, 0.5% of the population). The variables are the same as in Figure 1.2,
but the subscripts (1, 2) now represent the two most massive jets. The threshold
jet energy is lowered to 5 GeV. The distance unit is set for the invariant mass m_12
(15 GeV) and the transverse momenta pt_1 (5 GeV) and pt_2 (5 GeV).
1.5 Plan of thesis
This thesis is structured in the following way. Chapter 2 discusses the knowledge
discovery approach and how it has been applied in this work. Chapter 3 discusses
the neural network approach to the problem. Chapter 4 discusses a fast Monte Carlo
Simulation of the Linear Collider Detector Calorimeter, as designed by the Ameri-
can Linear Collider Working Group. Chapter 5 compares the use of two simulation
schemes FastCal and GISMO (a package for particle transport and detector simula-
tion) and validates the use of FastCal. Chapter 6 provides results on neural networks.
Chapter 7 discusses the unsupervised approach to the problem. In Chapter 8, we
summarize and conclude the work.
1.6 Conclusion
In this chapter we have laid out the context in which the next generation of particle
accelerators will operate and the physics interests that they hold. We have asserted
that these particle accelerators are expected to produce copious well-collimated and
well-defined jets on which important physics studies will be done. We have further
asserted that this situation poses a multivariate problem, that could be gainfully
addressed in the data mining realm. We have tried to motivate the use of data
mining tools by applying the demographic clustering method to the search for the Z
boson, and we have explained the promising result.
Chapter 2
Knowledge Discovery and Physics
2.1 Introduction
Our approach to high-energy physics data is that of knowledge discovery. Knowledge
discovery, as defined more precisely below, is an approach to data analysis that has
developed at the confluence of many different fields, namely artificial intelligence,
statistics, database management and others. This approach has been successfully
applied in business, economics, and in a more modest way in the sciences. In this
thesis, we adopt the approach and the tools of knowledge discovery to analyze linear
collider data. This chapter elaborates on this approach and its application to our
problem, and compares and contrasts the approach to the traditional high-energy
physics approach.
2.2 A short history of Data Mining
Data Mining, and knowledge discovery as an approach, have been possible because
of the confluence of three largely independent traditions: large digital databases,
classical statistics and artificial intelligence via machine learning. In this section we
give a brief description of these fields.
2.2.1 Digital databases
According to Moore's Law [63, 64], the density of integrated circuits in electronic
chips increases exponentially over time. This prediction has proved correct: the
density doubled every 12 months in the 1970s and every 18 months thereafter.
Though it is likely that the present approach to miniaturization will meet a physical
barrier, this does not guarantee the end of the law. Moore's Law has been observed
over many different substrates, from electro-mechanical devices to vacuum tubes to
transistors and finally to integrated circuits [49].
Moore's Law has been extended to data storage devices. Just as the microprocessor
is being miniaturized, digital storage devices too have seen an exponential
increase in capacity. Furthermore, this has been accompanied by an even faster in-
crease in the storage capacity per unit price. As a result, the availability of digital
data has seen an exponential growth.
The nature of this accumulating data is as varied as the sources from which they
come. The data could be textual, pictorial, categorical or numerical, and frequently,
these data are noisy and contaminated. In extracting meaningful information from
digital databases of such varied nature, the traditional approach to data, classical
statistics, falls short.
2.2.2 Classical Statistics
The traditional approach to understanding data has been classical statistics [30].
Classical statistics is closely associated with the notion of an experiment in which
each event produces an outcome according to an underlying probability distribution.
Each outcome is considered independent and the total set of outcomes is considered as
the sample space. Classical statistics is thus narrowly defined, and most scientific
experiments are designed on its principles [29].
In a typical scientific situation, an experiment is designed with a definite hy-
pothesis that has to be tested, or a parameter that has to be measured. In this
hypothetico-deductive paradigm, the exercise begins with a lack of data. An adequate
amount of data is then collected, either to prove or disprove the hypothesis or to
measure the quantity. This is the standard approach of most
laboratory based scientific endeavors where controlled experiments are conducted.
In an alternative paradigm, the data are already available not from a measurement
but in the form of a record of an event. This data could be a business transaction,
the image recorded during an astronomical observation of the sky or the record of
an event in a particle detector. These records do not fall neatly in the design of
single-phenomenon experiments that are so well described by classical statistics [39].
A typical dataset can be visualized as an n×p matrix, with n rows or records, and
p columns or simply variables. The output for each row could be a further r variables,
called the response. The large datasets that are available today are typically large in
both n and p. This poses special problems in classical statistics. For very large n,
methods that scale higher than O(n) are better replaced by adaptive or sequential
methods [39]. For large p, the explanatory variable space becomes sparse, rendering
structures impossible to identify, a problem called the “curse of dimensionality” [12]
or COD.
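The curse of dimensionality can be made concrete with a standard volume argument: the fraction of a unit hypercube occupied by its inscribed ball, π^(d/2) (1/2)^d / Γ(d/2 + 1), collapses toward zero as the dimension d grows, so uniformly spread data become sparse and "central" neighborhoods become nearly empty. The sketch below is an illustration of this well-known fact, not part of the thesis analysis.

```python
import math

def inscribed_sphere_fraction(d):
    # Volume of the radius-1/2 ball inscribed in the unit hypercube,
    # as a fraction of the cube's volume:
    #   V = pi^(d/2) * (1/2)^d / Gamma(d/2 + 1)
    return math.pi ** (d / 2) * 0.5 ** d / math.gamma(d / 2 + 1)
```

In d = 2 the fraction is π/4, about 79%; by d = 10 it is already below 1%, which is why structures in a large-p explanatory variable space become so hard to identify.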
Further, the data that are available in digital form are often categorical and not
numerical. Digital data also include images, audio, text and geographical data that
are not easily handled by classical statistics [39]. Furthermore, the data are contaminated
or noisy. These data characteristics pose special challenges to classical statistics.
2.2.3 Machine Learning
An alternative to the hypothetico-deductive paradigm of classical statistics is the
inductive approach that includes the techniques that have been developed under the
classification of machine learning [62]. Machine learning techniques owe their origins
to initial research on artificial intelligence. These methods are typically based on
a learning mechanism. That is, they are iterative techniques that improve in their
predictive abilities as the number of iterations increases.
Some of the well-known techniques in use under machine learning include decision
trees, neural networks, genetic algorithms, etc. These techniques are usually based
on a variety of learning procedures. Some of the different learning procedures are
supervised learning, unsupervised learning and reinforced learning. In this thesis, the
kind of learning procedure that is employed is discussed in the context.
Among the notable features of machine learning methods is their robustness in the
presence of noise. Methods based on classical statistics perform poorly in the
presence of noise, whereas neural networks can actually improve in the presence of
noise [4].
2.2.4 Meeting together
The three major trends: the growth of large digital recorded datasets, the inadequacy
of classical statistics and the growth of machine-learning-based algorithms, have
resulted in the development of the core data mining techniques and approach. These
techniques consist of such methods as neural networks, clustering, etc., resulting in a
corpus of tools that are able to tackle modern and large datasets. Machine learning,
which began in the womb of Artificial Intelligence, is now being recast in the language
of statistics, and statistics is now being extended to include heuristic processes
(Statistical Learning Theory [85]). This has further resulted in the development of
such methods as Support Vector Machines [23], which have formally brought neural
networks into the realm of statistics [70].
Together, these techniques are now grouped under the heading of data mining,
in recognition of the original intention of mining already available data. These tech-
niques have given rise to a unique approach to data, which is now called Knowledge
Discovery in Databases, or KDD for short.
2.3 Knowledge Discovery
There are various definitions suggested for knowledge discovery. A general one is [31]:
Knowledge discovery is the nontrivial extraction of implicit, previously
unknown and potentially useful information from data. Specifically, given
a set of facts (data) F , a language L, and some certainty C, we define a
pattern as a statement S in L that describes relationships among a subset
Fs of F with certainty c, such that S is simpler than the enumeration of
all facts in Fs.
Below, we illustrate this definition with the use of the binary tree algorithm in
the search for Z bosons.
2.3.1 An example - Binary Tree
Consider an example of the Z boson reconstruction in the linear collider context. In
e+e− → ZZ simulated events with √s = 500 GeV that have two hard muons or large
(a) Traditional Cuts (b) Binary Tree Cuts
Figure 2.1: The graph on the left is one obtained from cuts. The one on the right isfrom cuts obtained from the binary tree method.
missing energy (two neutrinos), all the energy deposited in the electromagnetic and
hadronic calorimeters is due to the decay of a single Z boson. Using the calorimeter
depositions, jets are constructed using the JADE algorithm, which yields in general
not two but a variable number of jets. Using the two most energetic jets to recon-
struct the Z boson will not yield a clean Z peak. Using a traditional fiducial cut of
cos θjj < 0.8 and an opening-angle cut of θjj > 0.5, we obtain a much cleaner sample
of Z (Figure 2.1(a)).
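The JADE scheme mentioned above can be sketched as follows: the pairwise measure y_ij = 2 E_i E_j (1 − cos θ_ij) / E_vis² is computed for all pairs, and the closest pair is merged until every remaining pair exceeds a cut y_cut. This is a minimal illustration, not the thesis implementation; in particular, the recombination convention shown (plain four-momentum addition, the "E scheme") is an assumption.

```python
import math

def jade_cluster(particles, y_cut):
    # particles: list of four-momenta (E, px, py, pz).
    # Repeatedly merge the pair with the smallest JADE measure
    #   y_ij = 2 * Ei * Ej * (1 - cos theta_ij) / E_vis^2
    # until every pair exceeds y_cut; the survivors are the jets.
    evis2 = sum(p[0] for p in particles) ** 2
    jets = [list(p) for p in particles]
    while len(jets) > 1:
        best = None
        for i in range(len(jets)):
            for j in range(i + 1, len(jets)):
                Ei, Ej = jets[i][0], jets[j][0]
                pi, pj = jets[i][1:], jets[j][1:]
                dot = sum(a * b for a, b in zip(pi, pj))
                den = (math.sqrt(sum(a * a for a in pi))
                       * math.sqrt(sum(a * a for a in pj)))
                cos_t = dot / den if den else 1.0
                y = 2 * Ei * Ej * (1 - cos_t) / evis2
                if best is None or y < best[0]:
                    best = (y, i, j)
        y, i, j = best
        if y > y_cut:
            break
        # merge j into i by four-momentum addition (E-scheme assumption)
        jets[i] = [a + b for a, b in zip(jets[i], jets[j])]
        del jets[j]
    return jets
```

Because soft and collinear fragments merge first, the number of surviving jets depends on y_cut rather than being fixed at two, which is exactly why the text notes that the algorithm "yields in general not two but a variable number of jets."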
Alternatively, we may use a binary tree to partition the data to give us automatic
cuts. We use SLIQ/SPRINT [58, 79] to partition the data to obtain a cleaner sample
(see Appendix D). The data consists of the following discriminant variables: cos θj1,2 ,
θjj and δjj with the invariant jet-jet mass, mjj as the classification label. The algo-
rithm partitions the data at each node of the binary tree using the gini coefficient
(see Appendix D) to choose the partition variable and the split point. On examining
the different leaves that the binary tree yields, the Z boson can be easily picked up
(Figure 2.1(b)).
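The gini-based partitioning can be illustrated with a sketch: SLIQ/SPRINT-style splitting (see Appendix D) scans candidate thresholds on a numeric attribute and keeps the one that minimizes the size-weighted gini index of the two resulting partitions. The function names and the toy "b"/"z" class labels below are illustrative, not from the thesis code.

```python
def gini(labels):
    # gini(S) = 1 - sum_k p_k^2 over the class frequencies p_k.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((m / n) ** 2 for m in counts.values())

def best_split(values, labels):
    # Scan candidate thresholds (midpoints between sorted values) on one
    # numeric attribute and return the split point minimizing the
    # size-weighted gini of the two partitions.
    order = sorted(zip(values, labels))
    best = (None, float("inf"))
    for k in range(1, len(order)):
        thr = 0.5 * (order[k - 1][0] + order[k][0])
        left = [c for v, c in order[:k]]
        right = [c for v, c in order[k:]]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if g < best[1]:
            best = (thr, g)
    return best
```

Applied recursively at each node, with the best attribute chosen the same way, this yields leaves whose cut boundaries read like the statement S above, which is what makes the binary tree's discovered knowledge so interpretable.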
This is an example of the knowledge discovery method. We began with a given
dataset, (F ), from which we obtained a subset of the data Fs, described by the
statement 0.75 < θjj ≤ 1.04 (S).
                                            mZ              σ
CUTS I    cos θ1, cos θ2 < 0.8; θ12 > 0.5   90.83 ± 0.71    18.13 ± 0.68
CUTS II   0.74 < θ12 ≤ 1.04                 90.27 ± 0.64    13.78 ± 0.52

Table 2.1: Alternative cuts for Z reconstruction. CUTS II is obtained from a decision
tree modeled on the jet-pair invariant mass mjj.
Note that the statement S in this case is a high-level and physically significant
statement, that is very similar in language to the one that an expert (physicist)
uses while applying cuts to data. The statements S obtained from most data mining
techniques are not stated as clearly. For example, training a neural network results
in fitting a model to the training dataset, and the statement analogous to the one
from the binary tree in the example above is an implicit, complex decision boundary.
2.3.2 Description of KDD
The knowledge discovery process thus begins with data. In the example above, though
we have used some knowledge of physics (domain knowledge), we did not make any
assumption on the relationship between the different discriminant variables.
Typically, the knowledge discovery process begins at the data exploration stage
in which the data are prepared. This involves using domain knowledge to identify
relevant discriminatory variables, such as cos θj, θjj, mjj, etc., or it could involve
statistical methods that reduce dimensionality, like principal component analysis, or
other feature extraction methods.
The next stage involves using an algorithm that would perform the model building.
The algorithms used at this stage could be a binary tree, as above, or neural networks,
or a clustering method, which have been developed as machine learning techniques.
The model is validated using a separate dataset. Since the performance of the model
is dependent on the dataset, it is possible that the first stage is revisited and the
second stage is repeated with a new dataset.
The third stage is the application of the model on new data.
The three stages along with the use of the iterative, and non-linear algorithms
constitute the knowledge discovery approach to data.
2.4 Knowledge discovery and scientific discovery
Though the basic process of scientific discovery is similar to that of knowledge
discovery, the two differ greatly in content. Chiefly, in scientific discovery, the data are
taken under very controlled conditions to eliminate unwanted parameters and isolate
desired ones. The data are, therefore, neither general nor copious. In the analysis
of the data, the domain knowledge and the methods used for the analysis are rarely
automated, and frequently not based on computers.
Though the aim and methods of scientific and knowledge discovery differ in gen-
eral, certain areas of science have successfully used the knowledge discovery process.
Sky Image Cataloging and Analysis Tool (SKICAT) [28] is an automated system
for cataloging the objects in the sky. It involves a first stage of feature extraction
from digital images, and then the use of a binary classifier to catalog the objects as
either stars or galaxies. In addition to the broad classification, clustering was able
to provide two sub-classes of galaxies (with and without cores). Uses of knowledge
discovery have been extended to geology and biogenetics [27].
2.5 Knowledge discovery and high-energy physics
The data produced by high-energy physics colliders are indeed copious and general.
These data sets are large and have high dimensionality, the two chief criteria
that invite knowledge discovery methods. The large amount of data requires an
efficient data management system. The large dimensionality of high-energy physics data
makes it appropriate for the use of knowledge discovery techniques.
With newer particle colliders, the combined database of real and simulated data
is in hundreds of terabytes (1 terabyte = 2^40 bytes), and possibly in petabytes
(1 petabyte = 2^50 bytes). The DØ experiment is accumulating 100 terabytes of
data every year, and the forthcoming Large Hadron Collider is expected to collect
petabytes of data every year during its operational time [13]. Though collider events
are controlled in certain ways and devised to isolate relevant parameters, the data
obtained are general enough to be used for different kinds of analyses. A typical
analysis consists of a preselection of events and the calculation of quantities
based on this preselection. This is precisely the exercise that is possible
in the knowledge discovery process.
Some of the well-known tools used in knowledge discovery are already in use in
high-energy physics. Hardware based neural networks are often used in trigger circuits
at the data collection stage. In the data analysis stage, neural networks have been
used for jet identification [55] and jet feature extraction [53].
In this thesis, we will examine the use of these techniques on the problem of
boson classification using kinematic variables. In particular, we will examine the use
of the neural network (Chapter 3), and means to improve its performance. We will
also examine different multivariate methods to improve the discriminant ability of
the variables using principal component analysis and multidimensional scaling along
with clustering methods (Chapter 7).
2.6 Conclusion
In this chapter we have introduced the concept of knowledge discovery. The origin of
this approach was identified as the confluence of three disparate disciplines: database
management, statistics and machine learning, which in turn owes its origin to artificial
intelligence. The salient features of this approach were described and discussed, and
the relevance to high-energy physics data emphasized.
Chapter 3
Neural Networks
3.1 Introduction
In high-energy physics, neural networks have been used for on-line as well as off-line
applications. The on-line applications have been in hardware implementations in trig-
gers in the CDF, H1 and LHC experiments. For off-line applications, neural networks
have been widely used for many different purposes. In both types of applications,
neural networks have been almost exclusively used for classification purposes.
In this chapter, we discuss the design of a neural network used for the purpose of
identifying the correct combination of jet-pairs from kinematic variables in e+e− →
ZZ events in which both the Z’s decay hadronically.
The neural network package written for this work is called CJNN, which has
been developed in both C++ and Java languages. CJNN is a general-purpose neural
network package easily configurable and reusable. The Java version has been imple-
mented for use in the Java Analysis Studio (JAS) environment and designed so that
trained neural networks could be included as part of a bigger analysis package. The
CJNN graphical interface for training is described in Appendix B. This chapter is in
part a description of this neural network package.
3.2 The problem – Separation of Z from background
In e+e− → ZZ events we face a combinatorial problem as the decay products of
the two Z bosons can easily be interchanged. This problem is particularly acute
when both the Z bosons decay hadronically, which happens more than half the time.
Hadronic decays result in jets, and because of gluon radiation, jet misidentification
and contamination, the number of jets that are observed in such events could be
something other than four. In the case of exactly four decay products, reconstructing
the second Z becomes trivial, once the first is reconstructed. Reconstructing the first
Z then becomes a combinatorial problem of one in three. In the case of hadronic
decays into an arbitrarily large number of jets, the combinatorial problem becomes
worse. Moreover, the successful reconstruction of one Z does not trivially lead to the
reconstruction of the second Z.
The approach we adopt here is that the four highest-energy jets carry the max-
imum amount of information about the decay products, and so we look for the Z
bosons in these four jets. Thus, we get six combinations of jets, out of which at most
two are the right combinations. We would try to reconstruct the Z from the kinematic
information of each pair of jets, and expect to find at most two right combinations
per event.
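The enumeration of candidate pairings is a small exercise in combinatorics: four jets yield C(4,2) = 6 distinct pairs. A minimal sketch in Python, representing each jet by its energy (the values below are purely illustrative):

```python
from itertools import combinations

def candidate_pairs(jet_energies):
    """Enumerate all distinct pairs among the four highest-energy jets.

    For four jets this yields C(4,2) = 6 candidate pairings, of which at
    most two can be correct Z -> jet-jet combinations in a given event.
    """
    leading = sorted(jet_energies, reverse=True)[:4]  # four highest-energy jets
    return list(combinations(leading, 2))

# a five-jet event: only the leading four enter the pairing
pairs = candidate_pairs([120.0, 95.0, 80.0, 60.0, 12.0])
print(len(pairs))  # 6
```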
Each jet, constructed from the calorimeter energy and position of cluster four-
vectors with masses assumed to be zero, has three independent kinematic quantities.
Each jet pair therefore has at most six independent kinematic quantities.
In this chapter, we design and optimize neural networks to identify the correct jet
pairs. For this purpose we use a sample of e+e− → ZZ events in which both the Z
bosons decay hadronically (produce jets). The training data are pre-classified using a
jet-quark association rule described later in Section 5.2.3. The neural network takes
six variables as input: the two jet energies, Ej; the two cosines of polar angles, cos θj;
the jet-jet opening angle θjj and the jet-jet invariant mass, mjj.
3.3 Rationale for a neural network
An important characteristic of neural networks is the ability to perform robust non-
linear modeling. In contrast to traditional statistical non-linear modeling, a user of
neural networks need not specify, a priori, the exact nature of non-linear dependence.
Therefore, they are particularly appropriate for modeling unknown dependencies.
Further, as the number of independent variables increases, statistical multivariate
methods become unstable while neural networks do not.
Neural networks further provide a robust method for classification problems. That
is, neural networks do not degrade when noise is introduced. On the contrary, noise
adds to the generalization capabilities of the network, which a user can exploit to get
better results.
Neural networks typically perform better than other classifiers. Therefore, in this
thesis, we use them as a benchmark while we discuss and compare other classification
methods. In the rest of the section, we define a measure of comparison and then
demonstrate the assertion that neural networks do indeed perform better than certain
others.
3.3.1 The F -measure
The two performance criteria generally used for comparing classification algorithms
are the efficiency (ε) and purity (p). They are defined as follows:
ε = TP / (TP + FN),   (3.1)
p = TP / (TP + FP).   (3.2)
Here TP, FN and FP are the numbers of true positives, false negatives and false
positives, respectively. Therefore, the efficiency is the fraction of the signal that
the classifier identifies, and the purity is the fraction of true signal in the
identified sample. These two numbers are displayed in a two-dimensional plot.
On their own, neither ε nor p are good measures of performance. Therefore, we use
the harmonic mean of the efficiency and purity to reduce the two-number comparison
to a one-number comparison. The F-measure [52] is defined as

F = 2εp / (ε + p).   (3.3)

The higher the F-measure, the better the network performance. The measure is
bounded between zero, reached when either the efficiency or the purity is zero,
and unity, reached by a perfect classifier for which both efficiency and purity
are unity.
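The three quantities above follow directly from the confusion counts; a small sketch (the counts below are illustrative, not results from the thesis datasets):

```python
def efficiency_purity_F(tp, fn, fp):
    """Efficiency (Eq. 3.1), purity (Eq. 3.2) and F-measure (Eq. 3.3)."""
    eps = tp / (tp + fn)   # fraction of the true signal that is identified
    p = tp / (tp + fp)     # fraction of the identified sample that is signal
    return eps, p, 2 * eps * p / (eps + p)

# illustrative counts
eps, p, F = efficiency_purity_F(tp=60, fn=40, fp=20)
print(round(eps, 3), round(p, 3), round(F, 3))  # 0.6 0.75 0.667
```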
3.3.2 Comparison with some other methods
A neural network is compared with two other classifiers in Figure 3.1. The two
classifiers are part of the Intelligent Miner (IM) suite of data mining tools. They
are (1) a decision tree classification (IM-DT) [79, 58] and (2) a feedforward neural
network (IM-NN), with learning via backpropagation. They are trained and tested
on the same datasets. We also compare the performance with the cuts obtained from
Section 2.3.1 (θjj-cuts).

[Figure: efficiency vs. purity for the five classifiers of Table 3.1.]
Figure 3.1: Performance of different classifiers. For a quantitative comparison, see Table 3.1.
Classifier                            F-measure
Ensemble Network (NN)                 0.5773
Enhanced Signal (NN)                  0.5719
Decision Tree (Intelligent Miner)     0.538
Neural Network (Intelligent Miner)    0.447
θjj-cut                               0.4309

Table 3.1: F-measure (Equation 3.3) for different classifiers.
3.4 A discussion on neural networks
The biological brain inspired the neural network [44]. An early challenge was to
develop a mathematical neuron that would mimic the biological one.
Having been inspired by the brain, the newly emerging neural network soon began
moving away from its strictly neurological groundings. With the first simulation
of a neural network on a digital computer [71], and the demonstration of its
universal computational abilities [65, 80], researchers, usually physicists and
engineers, began using neural networks to solve problems in non-biological fields.
3.4.1 The neuron
One of the early successes was the binary McCulloch-Pitts neuron [57], sometimes
called the MP neuron. This neuron calculates a weighted sum of its input and has two
states: active and inactive. The neuron is active when the sum of the inputs times
their weights crosses a threshold value. This is formally performed by an activation
function, the Heaviside function, with the threshold named the bias of the neuron.
This mimics the activity of a real neuron, with the active state representing the neural
firing, and the inactive state representing the quiescent state.
This design of a neuron, a weighted sum of inputs followed by an activation
function, was a major breakthrough. Though the neuron itself is simple, networks
of such neurons are capable of complex behavior [40].
The MP neuron is a binary unit on account of the Heaviside function, and the
non-linearity of the network is attributed to the use of this function. The MP neuron
can be generalized with the use of a non-linear, continuous function instead of the
discontinuous Heaviside. In practice, monotonically increasing sigmoid (S-shaped)
functions are used; the popular choices are the logistic function and the
hyperbolic tangent.
The neuron in a network performs the following basic computational function:
g(a) = 1 / (1 + exp(−∑_i W_i x_i)),   (3.4)
where W is the vector of weights, including the threshold value, and x is the vector
of inputs to the neuron, including the constant-valued (-1) input for the threshold.
g(a) is the output of the neuron. Equation 3.4 is the definition of a logistic activation
function.
An important consequence of a network based on the logistic activation function
(Equation 3.4) and trained by the backpropagation of a squared error function is that
the output can be interpreted as the a posteriori Bayesian class probability [88, 75,
86, 69]. This makes it possible to construct combinations of networks and attribute
errors to neural network classifications. Figure 3.2 demonstrates this probability
interpretation.
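A minimal sketch of the logistic neuron of Equation 3.4, following the text's convention of appending a constant −1 input for the threshold (the weight and input values are illustrative):

```python
import math

def neuron_output(weights, inputs):
    """Logistic neuron of Eq. 3.4: g(a) = 1 / (1 + exp(-sum_i W_i x_i)).

    The last entry of `weights` is the threshold (bias); a constant -1
    input is appended to pair with it.
    """
    x = list(inputs) + [-1.0]                      # threshold input
    a = sum(w * xi for w, xi in zip(weights, x))   # weighted sum
    return 1.0 / (1.0 + math.exp(-a))

# zero weights and zero threshold put the neuron at its midpoint
print(neuron_output([0.0, 0.0, 0.0], [0.3, -1.2]))  # 0.5
```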
3.4.2 The network
There are many kinds of neural networks, each suited for solving different problems.
The broad classes of neural networks are: multilayered perceptrons, recurrent net-
works and self-organizing maps. Multilayered perceptrons (MLP) consist of neural
nodes [73] that are arranged in more than one layer [61], with connections restricted
to the adjoining layers. In these kinds of networks, the input is provided to one of the
end layers, and the output is obtained at the other end. These networks are suited
for function fitting, optimization and classification problems. The recurrent network
is similar to the MLP, but it is characterized by connections between the outputs of
subsequent layers to neural nodes in the upstream layers [25]. This kind of network
is suited for problems with sequential or time-series data. The last broad category
of neural network is the self-organizing map, or Kohonen map [45, 46]. This kind of
neural network is suited for feature extraction or dimension reduction, and follows
an unsupervised learning algorithm.

[Figure: scatter of the two output units, lying along the line where their sum is unity.]
Figure 3.2: The distribution of the two output neural units trained on output data that is either (1,0) or (0,1). The straight line is not a fitted line, but the output distribution of the two neural units, showing that the sum of the outputs is unity, as is required for a probabilistic interpretation. A few outliers are present, though, which shows that some regions of the variable space were not adequately represented and were under-trained.
In this work, we wish to use neural networks to classify pairs of jets as the decay
product of either a heavy boson or a wrong-combination pair. We will be using the
jet kinematic variables to provide the training data. Thus, the network of choice for
our problem is the MLP.
This MLP consists of layers of neural units that are fully connected with the neural
units of adjacent layers. CJNN, the implementation of the MLP, allows us to adjust
the number of layers and neural units and optimize the network.
3.5 Neural Network Training
3.5.1 Data representation
The input data are continuous in all the variables and there are no missing values.
Nevertheless, variables have different variances. For instance, values for mjj, the
jet-jet invariant mass, range from 0 to a few hundred GeV, whereas values for cos θj
range from -1 to 1. Though the raw data could be used for network training, the high
input values are likely to introduce pathologies in the values of weights. Therefore,
we normalize the input data as follows:
x_i^n → (x_i^n − x̄_i) / σ_i,   (3.5)

where x̄_i and σ_i are the mean and standard deviation of the distribution of variable
x_i on the training data set. In this form, all variables in the training data have unit
variance and zero mean. Since the transformation is linear, we have not introduced
any artifact into the data.
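The standardization of Equation 3.5 can be sketched as follows (the helper name and the sample values are illustrative; the mean and standard deviation are taken over the training set, as in the text):

```python
import math

def standardize(columns):
    """Transform each variable to zero mean and unit variance (Eq. 3.5).

    `columns` maps variable name -> list of training values; returns the
    scaled columns plus the (mean, sigma) pairs needed to apply the same
    training-set transformation to new data.
    """
    scaled, stats = {}, {}
    for name, values in columns.items():
        mean = sum(values) / len(values)
        sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        scaled[name] = [(v - mean) / sigma for v in values]
        stats[name] = (mean, sigma)
    return scaled, stats

scaled, stats = standardize({"mjj": [80.0, 90.0, 100.0]})
print([round(v, 3) for v in scaled["mjj"]])  # [-1.225, 0.0, 1.225]
```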
The output class label data are encoded using the 1-of-c encoding scheme. For
example, in the combinatorial problem (Section 3.2) we have two classes: (1) jet pairs
are Zs and (2) jet pairs are wrong combinations. Therefore, we require that there are
two output neural nodes. The Z jet pair is encoded (1,0) and the wrong combination
jet pair is encoded (0,1).
3.5.2 Learning Algorithm
CJNN uses the classical backpropagation method for training neural networks [76].
Backpropagation methods, coupled with earlystopping (see Section 3.5.3) perform
better for bigger networks and do not overfit, as opposed to faster learning algorithms
like conjugate gradient [20].
To prevent the training from languishing in a local minimum, simulated annealing
is implemented. A random noise added to the weights at every update shakes up the
network, and throws it out of any local minimum it might be trapped in [72]. Thus,
the change in the weight is given by
∆w = ∆w0 + σ,   (3.6)
∆w0 = −η∇E,   (3.7)
σ = [1 / (1 + exp(α(Nc − β)))] × ∆w0 × ran.   (3.8)
Above, α and β are numbers, η the learning rate, Nc the learning cycle (or epoch num-
ber) and ran a random number with a Gaussian distribution. E is the generalization
error.
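A sketch of one such noisy weight update, following Equations 3.6-3.8; the values of η, α and β below are illustrative placeholders, not the settings used in CJNN:

```python
import math
import random

def weight_update(grad, n_cycle, eta=0.1, alpha=0.01, beta=500.0, rng=None):
    """One weight change with simulated-annealing noise (Eqs. 3.6-3.8).

    dw0 = -eta * grad is the plain gradient step; the noise sigma is the
    step scaled by a Gaussian draw and by a logistic envelope in the cycle
    number n_cycle, so the shaking dies away as training proceeds.
    """
    rng = rng or random.Random(0)
    dw0 = -eta * grad
    envelope = 1.0 / (1.0 + math.exp(alpha * (n_cycle - beta)))
    sigma = envelope * dw0 * rng.gauss(0.0, 1.0)
    return dw0 + sigma

early = weight_update(grad=1.0, n_cycle=0)       # noisy step early in training
late = weight_update(grad=1.0, n_cycle=10_000)   # essentially pure gradient step
```

Early in training the envelope is close to one and the update is strongly shaken; late in training the envelope vanishes and the update reduces to ∆w0.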
3.5.3 Bias-variance dilemma and earlystopping
A central issue of effective neural network training is the bias-variance trade-off [35].
Bias is defined as the average difference between the neural network result and the
function the network is trying to model. The variance on the other hand is defined
as the sensitivity of the result to the dataset in use [16]. These two terms occur
additively in the error term but behave differently. The goal of network training is to
reduce both the terms simultaneously.
Theoretically it is possible to progressively train a sequence of networks with
increasing size and training dataset that will reduce both the bias and variance to-
gether [88]. But computationally, this is an expensive procedure, on account of large
networks and large datasets taking a very long time to train. From this point of view,
the goal of neural training is to provide a network that is optimized on network size
and data size. Therefore, the goal of network training in this chapter is not only to
decrease the generalization error but also to provide a choice for an optimized network
size as well as training data size.
Validation dataset
The concept of the bias-variance dilemma can be used to obtain optimized neural
network training. The calculation of the error on the training dataset indicates only
the bias of the network and not the variance. This is rectified with the use of a
validation dataset—a dataset that is independent of the training data sample.
The validation dataset performs the function of a general dataset. As the neural
network is being trained on the training dataset, the error is calculated alongside on
the validation dataset. A network that is closely fitted on the training dataset is
likely to pick up the idiosyncratic nature of the dataset thus giving rise to overfitting,
a condition with low bias but high variance. The validation dataset is used to
monitor the training: an increase in the variance results in an increase in the
error on the validation set. The network for which the validation error is the lowest has the
largest generalization capability and the best bias-variance trade-off. In practice,
the network state at this minimum is saved, and training continues beyond the
minimum point to make sure that it is indeed the minimum. The use of this saved
network state is called earlystopping [78].
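The earlystopping procedure can be sketched as a training loop that remembers the best-validating state and runs a fixed number of cycles past the minimum before restoring it; all names, signatures and the toy error surface below are illustrative, not the CJNN implementation:

```python
def train_with_earlystopping(train_step, val_error, max_cycles, patience=50):
    """Earlystopping sketch.

    `train_step(cycle)` performs one training cycle and returns the current
    weights; `val_error(weights)` scores them on the validation set. The
    weights with the lowest validation error are kept, and training runs
    `patience` cycles past the minimum to confirm it.
    """
    best_err, best_weights, since_best = float("inf"), None, 0
    for cycle in range(max_cycles):
        weights = train_step(cycle)
        err = val_error(weights)
        if err < best_err:
            best_err, best_weights, since_best = err, weights, 0
        else:
            since_best += 1
            if since_best >= patience:   # trained past the minimum
                break
    return best_weights, best_err

# toy run: validation error falls, reaches a minimum, then rises (overfitting)
w, e = train_with_earlystopping(lambda c: c, lambda w: (w - 100) ** 2, 10_000)
print(w, e)  # 100 0
```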
For optimal performance of earlystopping, it is important to make sure that the
variance (overfitting), does not increase in the early phase of the training, and the
decrease in the error is solely due to a decrease in the bias. Since overfitting is due to
fitting the training dataset too closely, the network is characterized by large values of
assorted weights. Therefore, the networks are initialized by a random set of weights
that are very close to zero. In this work, we use random numbers between -0.01 and
0.01 to initialize weights and thresholds of the neurons.
3.6 Neural Network Architecture
The first task in setting up a neural network is deciding on an architecture—fixing the
number of layers and the number of neural units in each layer. There are some guide-
lines available [51], but in most applications, the architecture is invariably problem-
specific and empirical. The chief trade-off in fixing an architecture is between accuracy
and overfitting. A neural network with an insufficient number of neural units or in-
sufficient number of layers will be unable to learn a problem with sufficient accuracy.
But on the other hand, a neural network with too many neural units and layers would
easily overfit a problem, leading to inaccuracies again, in addition to taking longer to
train.
3.6.1 The number of layers
It has been shown theoretically that a single hidden layer with a sufficient number
of neural units is a universal approximator [42, 33]. It has been heuristically shown
that for continuous numeric data, a neural network with a minimum of two hidden
layers of sigmoids is sufficient to model any R^n → R^m mapping [50], provided
that each layer has a sufficient number of neural units. Though MLPs with a single
hidden layer are capable of universal approximations, additional hidden layers (and
the simultaneous removal of neural units) add to the computational efficiency of the
network by decreasing the number of weights in the network [54]. Thus a network
with two hidden layers is sufficient for training on data for which the nature of the
decision boundary is not known.
Even when a second hidden layer is not warranted by the decision boundary,
adding an extra layer, and decreasing the number of units in the first hidden layer
might lead to a decrease in the net number of weights, which would result in a faster
network [21].
3.6.2 The number of units in each layer
Though there exist theoretical results and heuristic proofs on the required number
of hidden layers, no such result exists for the number of hidden units [16]. Getting
an optimal number of neural units is important because it has a bearing on the
bias-variance trade-off [35], discussed above. Computationally, a network with an
insufficient number of neural units will be unable to model the problem adequately,
and a larger network will be computationally slower.
Therefore, we require a method to fix an optimal size of the network.
Pruning networks
There are two approaches one may take in deciding the optimum number of units in
each layer. The first is to begin with a large network and prune the number of units.
The other is to start with a smaller network and progressively add more units.
One way of pruning a big network is to let the weights decay. We let the connection
strengths of neural units decay with each iteration. The decay is done using the
following prescription:
w_ij^new = (1 − γ) w_ij^old,   (3.9)
γ = ε′ / (1 + ∑ w_ij²)².   (3.10)
Here, ε′ is the product of the learning rate (η = 0.1) and a parameter ε. The summa-
tion in the denominator is over all weights of the input connections for a particular
neural unit.
Since weights with lower absolute values connect neural units weakly, they play a
diminished role in the output of the network. Therefore the prescription above tends
to decay the neural units faster when the sum of squared weights per unit is low.
This decay acts against the iterative updates, so that unreinforced weights
decrease faster.
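A sketch of the decay prescription of Equations 3.9 and 3.10 for a single neural unit (the weight values below are illustrative):

```python
def decay_weights(weights, eta=0.1, eps=1e-5):
    """Weight decay of Eqs. 3.9-3.10 for one neural unit.

    `weights` are the unit's input-connection weights; eps' = eta * eps.
    gamma is largest when the summed squared weights are small, so weakly
    reinforced units decay fastest.
    """
    eps_prime = eta * eps
    s = sum(w * w for w in weights)
    gamma = eps_prime / (1.0 + s) ** 2
    return [(1.0 - gamma) * w for w in weights]

small = decay_weights([0.01, -0.02])   # weak unit: decays relatively fast
large = decay_weights([3.0, -4.0])     # strong unit: barely decays
```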
3.6.3 Fixing ε
To fix the value of the parameter ε, we note that too large a value would decrease the
weights faster than they can be reinforced during network training (Figure 3.3(a)).
Therefore, the choice of ε should be large enough so that redundant neural weights
decrease during regular training, but the overall error during the training process
shows similar behavior as for ε = 0. Using this empirical requirement, we find that
ε = 0.00001 is a good value (Figure 3.3(b)).
3.6.4 Root mean square weights
Next the weights of each neural unit are examined to ascertain important ones. We
use the root-mean-squared (RMS) value of all the weights for any particular neural
unit to represent the overall value of the neural unit. Neural units with smaller RMS
weights are considered as less important than those with larger values, since they
have a much smaller effect on the next layer of neurons on average.
The distributions of RMS of weights for the two hidden layers are shown in Fig-
ures 3.4(a) and 3.4(b). In the first layer, some neural units show low RMS of weights.
The set of neural units with low RMS of weights is consistent over training cycles.
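The RMS criterion can be sketched as follows, mirroring the pruning used on the hidden layers (the weight vectors below are illustrative):

```python
import math

def prune_by_rms(layer_weights, cutoff=1.0):
    """Keep only neural units whose RMS input weight exceeds `cutoff`.

    `layer_weights` is a list of weight vectors, one per neural unit;
    returns the indices of the surviving units.
    """
    survivors = []
    for i, ws in enumerate(layer_weights):
        rms = math.sqrt(sum(w * w for w in ws) / len(ws))
        if rms > cutoff:
            survivors.append(i)
    return survivors

units = [[0.1, -0.2, 0.05], [2.0, -1.5, 0.8], [0.0, 0.01, -0.02]]
print(prune_by_rms(units))  # [1]
```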
[Figure: training and validation error vs. training cycle for two values of ε, compared with ε = 0.]
Figure 3.3: (a) With ε = 0.01, the weights decay too fast for the neural network to learn, as shown by the flat error curve. The difference in the error values on the training and the validation datasets is due to the difference in size. (b) With ε = 0.00001, the error curves exhibit the same behavior as for ε = 0.
This suggests that a good cutoff for the RMS weights is 1.
Using this cutoff, we obtain a new network: 6–21–22–2. This network has 677
parameters (weights and biases). This is approximately half of that in the original 6–
34–24–2 network. The errors in the training and validation sets are given in Figure 3.5.
We observe an initial spike when the network is pruned, but the training follows closely
that of the unpruned network. The errors in the validation set show that the pruning
probably has an effect on the generalization capability of the network, as is expected.
Since computation time scales with W, the number of weights and biases, the
reduction of W by about a half is a significant gain.
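The parameter counts quoted above follow from a simple count of weights and biases in a fully connected network, assuming one bias per non-input unit; a quick check:

```python
def mlp_parameters(layers):
    """Count weights and biases in a fully connected MLP.

    `layers` lists the unit counts per layer, e.g. (6, 21, 22, 2); every
    non-input unit carries one bias.
    """
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    biases = sum(layers[1:])
    return weights + biases

print(mlp_parameters((6, 34, 24, 2)))  # 1128 parameters in the original network
print(mlp_parameters((6, 21, 22, 2)))  # 677 parameters after pruning
```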
[Figure: histograms of RMS weights for neural units in the two hidden layers.]
Figure 3.4: The RMS weights of neural units in the first and second hidden layers. In (a), the low-valued RMS weights are clearly isolated and are all less than 1. In (b), the low-valued RMS weights are not isolated, so we follow (a) and delete the neural units that have RMS weights less than 1.0.
3.7 What do neural networks do
What does a neural network do? As will be seen, neural networks do not offer rules
that are easily understood. The information they provide is low-level and does not
constitute "knowledge", as defined in knowledge discovery. The decision boundary
that a sufficiently trained network constructs is in general complicated, and
cannot be expressed in terms of any object other than the network itself.
Here we employ a binary tree on data that has been classified by a neural net-
work. Binary trees offer rectangular decision boundaries in the form of cuts and
limits on variables, which are closest in form to those employed in traditional physics
classifications. Therefore, a binary tree approximation of the neural network decision
boundary could be instructive.
[Figure: learning curves for the pruned networks. (a) √(W²) < 1.0 yields a 6-21-22-2 network; (b) √(W²) < 2.3 yields a 6-21-2 network.]
Figure 3.5: With a √(W²) cutoff of 1.0, a 6-21-22-2 network is obtained, which after the initial spike in the learning curve continues in the expected fashion. Also note that the error on the validation dataset is lower than the error from the validation set on the older network. This indicates that the generalization capability has improved with pruning. By contrast, in the more drastic pruning (b), the error curves after the spike do not follow the behavior of the unpruned network, thus exhibiting poor learning. This also reinforces our assertion that we require at least two hidden layers to solve the problem.
In the combinatorial problem, one of the branches offers cuts (Table 3.2) for the
correct Z boson. It offers a cut that is similar to the one obtained by a binary tree
alone (see Section), in addition to two cuts on the cosine of the polar angles which
are very close to the fiducial cuts in traditional physics, as well as a cut on the jet-jet
invariant mass.
0.6 ≤ φij < 1.38
cos θi < 0.76
cos θj < 0.78
mij ≥ 81.28 GeV

Table 3.2: The binary-tree rule for one of the branches, trained on the neural network result.
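The rule in Table 3.2 is just a conjunction of rectangular cuts; a sketch (the function name and the test values are invented for illustration):

```python
def passes_branch_cuts(phi_ij, cos_i, cos_j, m_ij):
    """Apply the rectangular cuts of Table 3.2, one binary-tree branch
    approximating the neural-network decision boundary."""
    return (0.6 <= phi_ij < 1.38
            and cos_i < 0.76
            and cos_j < 0.78
            and m_ij >= 81.28)

print(passes_branch_cuts(1.0, 0.3, -0.5, 91.2))  # True: Z-like jet pair
print(passes_branch_cuts(1.0, 0.9, -0.5, 91.2))  # False: fails the cos cut
```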
3.8 Conclusion
In this chapter, we have motivated the use of the neural network for solving the
combinatorial problem. We have discussed the different types of neural networks that
are possible and picked the multilayered perceptron as the most appropriate network
for our problem. We solved the first hurdle in network design by fixing the number
of layers from empirical arguments and the number of neural units by a weight
decay and pruning algorithm. Finally, we gave a brief solution for approximating the
complicated and unknown neural network decision boundary with the help of a binary tree.
Chapter 4
FastCal: A fast Monte Carlo
simulator for the LCD calorimeter
4.1 Introduction
This chapter describes FastCal, a fast linear collider detector (LCD) simulation pack-
age developed to provide data for this thesis. The central feature of this package is the
hadronic shower parameterization to provide a fast implementation of the detector
simulator. The main goal of this package is to provide a replacement for a full
simulator (for example, GISMO and GEANT) that offers data at the cluster level
rather than at the level of calorimeter cells, and is suited for statistics-limited
studies that do not
depend on finer details of shower development.
Section 4.2 compares simulated and real data, and emphasizes the need for simu-
lated data. Section 4.3 provides the rationale for the development of a fast simulator.
Section 4.4 provides a general description and the design philosophy of FastCal.
Section 4.5 describes the process that is being simulated. Section 4.6 compares hadron
showers from FastCal with those from GISMO. Section 4.8 concludes this chapter with
a summary of FastCal features and results.
4.2 Simulated data and real data
Simulated data are limited by extant knowledge of physics on which the simulations
are based. In spite of this limitation, they play a very important role in the design
and development of accelerator equipment and data analysis algorithms. When real data
are not available, simulated data provide the necessary input for development efforts.
Further, for the development of data analysis routines, simulated data provide internal
information about physics processes that is not available in real data. In this thesis, we
make use of the secondary heavy boson and quark information to create training and
testing data sets. Training and testing data created from real data will be biased by
the methods used to classify the training sample. To circumvent this, simulated data
are indispensable.
4.3 Rationale for FastCal
Monte Carlo simulations of detectors have become integral to every part of experi-
mental high-energy physics, from detector design to data analysis. They have become
the primary tools by which the theories of particle physics are turned into testable,
quantitative predictions. The first hurdle in a detailed simulation is generally the
cost in terms of CPU time, though this is becoming less of an issue with the availabil-
ity of ultra-fast modern processors and multi-processor systems. Two such detailed
simulation packages (generically called full simulation) in use today in the LCD com-
munity are GISMO [19, 7] and GEANT4 [81]. In these full simulation schemes, a high
fraction of the CPU time is expended in the detailed simulation of the calorimeter.
The most time-consuming part of these full simulations is the simulation of
shower development. Energetic particles passing through matter lose energy by
producing more particles. These particles in turn produce other particles, thus forming
cascades of particles which are called particle showers. There are two major classifi-
cation of showers:
Electromagnetic showers At energies higher than 1 GeV, electrons and positrons
lose energy mainly by the process of Bremsstrahlung, producing additional high-energy
photons. Photons lose energy predominantly by the process of electron-positron
pair production. Therefore, incident electrons, positrons and photons
produce a cascade of particles containing mainly these three kinds of particles.
Hadronic showers Hadrons undergo a variety of processes, including hadron
production, nuclear de-excitation, and pion and muon decays, resulting in a
cascade of hadronic particles that constitutes the hadronic shower. In these
processes, a significant fraction of the secondary particles produced are π0's,
which do not take part in hadronic interactions. Since the predominant π0
decay mode is to a photon pair, hadronic showers have electromagnetic shower
components.
Full simulation methods simulate all the processes that contribute to the elec-
tromagnetic and hadronic showers, all the way down to individual particles. That
is, new particles are created, their paths are calculated and their interactions with
the calorimeter components simulated. A significant fraction of the computation is
attributable to the simulation of particle showers.
In general, particles moving through calorimeters deposit energy in regions which
are spread over many calorimeter cells. Particles are reconstructed from the energy
deposits in these cells by clustering them. For studies that do not require individual
particle identification, clusters could be used directly to form jets.
In applications such as these, which require large statistics, a faster simulation
method is needed. FastCal provides us with such a simulation package. It replaces the
most compute-intensive part of the calorimeter simulation by parameterized functions
to be evaluated quickly. As a result, an enormous statistical improvement is obtained.
The LCD collaboration has devised such a software package, a fast Monte Carlo
(FastMC), to supplement its full Monte Carlo system. It is used for roughly establish-
ing the physics reach of the several LCD prototype designs for a wide array of physics
channels. One of the major drawbacks of FastMC is that it lacks a facility for simu-
lating calorimeter responses. Another fast Monte Carlo package, also named FastMC
and part of the LCD suite of tools for the ROOT environment (LCDROOT) [43]
provides smearing of particle energy and merging of clusters.
4.3.1 Description of existing FastMC
The current FastMC package can be executed either from the JAS or the LCDROOT
environments. In the JAS environment, FastMC uses generator-level particle lists,
along with a detector geometry file, to produce information that can be handled by
the JAS analysis package.
This information must be fully reconstructed before being passed to the analysis
stage, meaning that FastMC is responsible for track, cluster and vertex reconstruction.
Generator-level particle lists are provided to FastMC in the form of StdHep [34]
files. They can be produced by any number of standard generator packages. For
the purposes of FastCal development, we used files generated by the Pandora-Pythia
package, which is based upon Pythia. Because these generator files are also used
for the GISMO-based LCD full simulation, this enabled us to do direct comparisons
between the full simulation and FastCal.
4.4 Description of FastCal
FastCal is a fast Monte Carlo simulation of the electromagnetic (ECAL) and hadronic
(HCAL) calorimeter responses to linear collider events in the √s = 500 GeV range. It
aims to simulate the physical response of calorimeters in as short a time as possible by
forgoing as much detail as necessary, while still yielding sufficiently accurate results
so that kinematic properties of jets from FastCal match closely with those from
GISMO. It does this by offering energy depositions at the level of clusters, with
each particle forming at most one cluster each in the ECAL and HCAL, as opposed
to GISMO, which yields information at the calorimeter cell level.
The simulation is written in Java, for use in the JAS environment. For input, it
can accept data from event generators (e.g. Pandora-Pythia) [82, 67], in the StdHep
format [34].
4.4.1 Design philosophy and approach
The design goal for FastCal was that the running time should be comparable to that
of the existing LCD FastMC. In order to achieve this, we had to adhere to certain
policies.
The first policy was to avoid “swimming” particles through the detector (i.e.,
moving them along their trajectories in a series of fixed steps). At each step, rather,
we calculate the next “significant event” in each particle’s journey (e.g., initiating
a shower or crossing a detector boundary) and analytically solve for the particle’s
position and energy at that point. This approach speeds up the process by eliminating
the loop over many small step sizes.
The second policy was to avoid iterative numerical integration whenever possible.
The third policy was not to create new particles. All decays, like the V decays,
are handled by the generator (Pandora-Pythia).
The fourth policy is to parameterize physical processes.
The fifth policy is not to simulate the calorimeter in its granularity (different layers
of materials and segments), but to use average properties of calorimeters.
Not only does this latter policy save compute time directly, but it also allows
us to avoid performing cluster reconstruction and recognition on detector hits.
Each shower directly becomes a cluster that can be used in the JAS analysis.
Though this introduces the error of removing the artifacts of cluster recognition
and reconstruction, the effect is acceptable here, since use is made of jets
constructed from clusters, and the individual clusters are not important in
these examples.
4.4.2 Geometry of Calorimeter
There are a number of detector designs under development for the LC experiments [1].
These detector designs are described in a generic format using the
eXtensible Markup Language (XML). This enables unified and rapid simulation
and testing of these detector designs. In keeping with this philosophy, FastCal
calorimeter geometrical information is not hardcoded, but is read from the relevant
detector files.
Three reference designs are currently used for simulation studies in the Next Lin-
ear Collider (NLC) context: a large tracking volume detector to optimize tracking
precision for the high-energy interaction region with a magnetic field strength of
3 Tesla (LD); a silicon tracking detector to optimize Particle Flow calorimetry in the
high-energy interaction region with a field strength of 5 Tesla (SD); and a precision
detector for Giga-Z in a low-energy interaction region (P) [3].
Both the LD and the SD detectors have similar geometries. At the inner core
reside the vertex detector and tracking chambers. The calorimeters consist of an
inner ECAL, and an outer HCAL. The magnetic coil lies between the ECAL and the
HCAL in the SD design, whereas it lies outside the HCAL in the LD design.
                      ECAL         HCAL
Barrel Inner Radius   196.0 cm     233.4 cm
Barrel Outer Radius   220.0 cm     365.4 cm
EndCap Z Inner        297.5 cm     334.0 cm
EndCap Z Outer        321.5 cm     466.0 cm
EndCap Inner Radius   29.0 cm      31.0 cm
Number of layers      10           3
Active material       Polystyrene  Polystyrene
Inactive material     Lead         Lead
Field Strength        3.0 T        3.0 T

Table 4.1: Detector parameters ldmar01, a design specification for the LD design, used in this study.
The calorimeters in both the detector designs consist of a cylindrical barrel and
annular end cap calorimeters. The calorimeters are sampling calorimeters that consist
of stacked layers of active and inactive materials.
4.5 FastCal simulation
For FastCal we adopt the following picture for simulation. Incident e+, e− and γ
trigger electromagnetic showers in the ECAL very close to the inner surface. As such,
ECAL completely contains these showers. For our scheme in which single particles
produce single clusters, there is no need to simulate the electromagnetic shower ex-
plicitly. Instead, all the energy of the particle is fluctuated according to the energy
resolution of the ECAL and deposited in the calorimeter. Since electromagnetic show-
ers are initiated close to the inner surface of the ECAL, the energy is deposited at
the intersection of the particle trajectory and the inner surface itself.
For other leptons, the neutrinos escape detection in the calorimeter. Muons escape
with a minimum ionization deposit. The τ -leptons decay into other particles before
entering the calorimeter. The incident hadrons trigger hadronic showers, not all of
which are contained in the ECAL. The hadronic shower parameterization, described
below, is used to calculate the energy deposited in the ECAL. The rest of the energy
is deposited in the HCAL. We assume that ECAL and HCAL contain the entire
hadronic shower.
4.5.1 Particle Propagation
Particles in the Monte Carlo tables (obtained from Pandora-Pythia event generation)
that are flagged “final state” are made to propagate from their initial positions to the
calorimeter surfaces. Charged particles follow a helical trajectory in the magnetic field
and neutral particles follow straight trajectories. The trajectories depend upon the
initial momenta and positions of the particles. We assume that the particles passing
through the tracking chambers do not undergo any process that results in any loss of
energy, or creation of new particles. The significant events for each particle trajectory
are the points of intersection with the different calorimeter detector surfaces: the inner
and outer walls of the barrel calorimeter for both ECAL and HCAL; the inner and
outer walls of the two annular detectors at the positive and negative ends for both
ECAL and HCAL and the beam pipe circular holes.
Neutral particles
For neutral particles, the z coordinate at which a particle hits the barrel is the
solution of a quadratic equation:

z = \left(z_0 - \frac{B}{A}\right) + h_z \sqrt{\left(\frac{B}{A} - z_0\right)^2 - \left[z_0\left(z_0 - \frac{2B}{A}\right) + \frac{x_0^2 + y_0^2 - r^2}{A}\right]}   (4.1)

where r is the radius of the barrel calorimeter surface, h_z is the sign of p_z, and

A = \frac{p_x^2 + p_y^2}{p_z^2}   (4.2)

B = \frac{p_x x_0 + p_y y_0}{p_z}   (4.3)
Figures 4.1 and 4.2 display the intersections of neutral particles with one of the
calorimeter surfaces and demonstrate that the analytic solution for the intersection,
given in Equation 4.1, is correct.
Charged particles
For charged particles, the x and y positions along the helical trajectory are given as
a function of z as follows:

x = x_0 + R_H\left[\cos\left(\Phi_0 + h_q\,\frac{z - z_0}{R_H \tan\lambda}\right) - \cos\Phi_0\right]   (4.4)

y = y_0 + R_H\left[\sin\left(\Phi_0 + h_q\,\frac{z - z_0}{R_H \tan\lambda}\right) - \sin\Phi_0\right]   (4.5)

where

\tan\lambda = \frac{p_z}{p_\perp}   (4.6)
Figure 4.1: Plot for Hadronic Barrel Calorimeter for neutral hadrons. Each circle represents an intersection point of a neutral particle with the inner surface of a barrel calorimeter.
Figure 4.2: Plot for Hadronic Barrel Calorimeter for neutral hadrons.
Figure 4.3: Plot for Hadronic Barrel Calorimeter for charged hadrons.
\tan\Phi_0 = \frac{\sqrt{x_0^2 + y_0^2}}{z_0}   (4.7)

R_H = \frac{p \cos\lambda}{|\kappa q B|}   (4.8)

Here κ is a constant, q is the charge of the particle and B is the field in the positive
z direction; h_q is the sign of the charge.
Figures 4.3 and 4.4 display the charged particle trajectory intersections with the
inner surface of the HCAL. The big circle consists of small circles, each representing
an intersection of a charged particle with the inner surface of the barrel calorimeter.
This shows that the analytic solution for the helical path of a charged particle, given
in Equations 4.4 and 4.5, yields a correct intersection point.
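A sketch of Equations 4.4 and 4.5 follows; the helper name is hypothetical, and Φ0, RH, tan λ and hq are taken as precomputed inputs:

```python
import math

def helix_xy(x0, y0, z0, phi0, RH, tan_lambda, hq, z):
    """(x, y) of a charged-particle helix at longitudinal position z
    (Eqs. 4.4-4.5); hq is the sign of the charge."""
    dphi = hq * (z - z0) / (RH * tan_lambda)
    x = x0 + RH * (math.cos(phi0 + dphi) - math.cos(phi0))
    y = y0 + RH * (math.sin(phi0 + dphi) - math.sin(phi0))
    return x, y
```

By construction every point on the curve stays at distance RH from the helix axis, which is a quick self-check of the formulas.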
Figure 4.5 displays the trajectories of the final state particles of the same example
event in FastCal and GISMO for comparison. Note that the trajectories match
in orientation and in propagation distance. This shows that the match between
Figure 4.4: Plot for Hadronic Barrel Calorimeter for charged hadrons.
particle paths in FastCal and GISMO is good.
4.5.2 High-energy physics processes
Electrons/positrons and photons
During their passage through matter, electrons and positrons lose energy to ionization.
In addition, photons are produced by Bremsstrahlung. Prompt photons, as well as
those from Bremsstrahlung, convert by pair production predominantly to electrons
and positrons. Therefore, electrons, positrons and photons together produce a cascade
of particles, referred to as an electromagnetic shower [59]. The longitudinal shower
development depends on the density of electrons in the material and, therefore, can
be described in terms of the radiation length (X0) [26]. The thickness of the ECAL is
∼ 25 X0, and as a result most of the electromagnetic (EM) shower is contained in the
ECAL. Therefore, in the first approximation, we assume there is no leakage of the
Figure 4.5: Event with charged final state particle trajectories in FastCal and GISMO for comparison. Figure (a) is obtained from FastCal and Figure (b) is the same event simulated by GISMO and viewed using LCDWired.
EM shower, and deposit the entire energy from electrons, positrons and photons in
the ECAL (Figure 4.6). Any correction to this will contribute a very small correction
to the jets, which are constructed from the cluster-level calorimeter energy response
in both the hadronic and electromagnetic showers.
Hadronic Showers
Hadronic particles passing through matter interact with the nuclei in a series of
inelastic nuclear interactions. These interactions result in more hadronic particles, in
a cascade called a hadronic shower. A significant fraction of the particles produced
is π0 particles, which decay mostly into photons. Therefore hadronic showers have a
significant electromagnetic component.
The depths at which hadronic showers originate depend on a single parameter, the
interaction length. The energy deposited in the ECAL is calculated from the longitudinal
profile of the hadronic shower development, parameterized after Bock et al. [17].
Figure 4.6: Calorimeter response to single e− events at 50 GeV. The left column of histograms shows the FastCal response, the right column the GISMO response. The top figures show the total energy deposit (EHCAL + EECAL). The middle figures show the energy in the ECAL (EECAL), and the bottom figures the HCAL response (EHCAL). Note that there is no energy leakage from the ECAL in FastCal, represented by the flat distribution of HCAL energy in the bottom left figure.
Longitudinal profiles of hadronic showers are simulated in FastCal by the well-known
parameterization [17, 8, 89]

\int_0^x dE = E_0\left(w\,P(a, bt) + (1 - w)\,P(c, du)\right)   (4.9)
The parameterization consists of two parts. The first part, which depends on the
radiation length, is due to the electromagnetic decay of the π0 produced, whereas
the second part is due to the purely hadronic part of the shower. The proportion of
the two components is controlled by the parameter w. The normalization constant
is proportional to the energy of the particle at the shower origin. The parameters
a, b, c and d are taken from [17] (Table 4.2), and are obtained by fitting Equation 4.9
to incident π− test beam data obtained from the WA1, 379 FNAL and the UA1
experiments.
t = x/X0
u = x/λ
a = 0.6165 + 0.3183 log E0
b = 0.2198
c = a
d = 0.9099 − 0.0237 log E0
w = 0.4634

Table 4.2: Table of parameters
             X0 (cm)   λ (cm)   x_i (cm)
Pb           0.56      17.09    1.60
Air          30420     69600    0.32
Polystyrene  34.40     79.36    0.40
Tyvek        47.9      82.5     0.08

Table 4.3: Table of lengths. Tyvek is not included in the calculation. Since it forms such a thin layer in comparison to the others, it is expected to produce minor corrections.
The total energy deposited in the calorimeter is calculated by performing the
integration in Equation 4.9 from the shower origin to the end of the calorimeter along
the particle trajectory. In Equation 4.9, the P(a, x)s are partial gamma functions [68]

P(a, x) = \frac{\int_0^x t^{a-1} e^{-t}\,dt}{\int_0^\infty t^{a-1} e^{-t}\,dt}   (4.10)

that are evaluated numerically.
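As a sketch of this numerical evaluation (an assumed implementation; FastCal's actual routine is not shown in the text), P(a, x) of Equation 4.10 can be computed from its power series, after which Equation 4.9 follows directly. The helper names are hypothetical, and the series form is adequate for moderate x:

```python
import math

def reg_gamma_p(a, x, eps=1e-12):
    """Regularized lower incomplete gamma function P(a, x) (Eq. 4.10),
    evaluated by its power series (adequate for moderate x)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    s = term
    n = 0
    while term > eps * s:
        n += 1
        term *= x / (a + n)
        s += term
    # Prefactor x^a * e^{-x} / Gamma(a), computed in log form for stability.
    return s * math.exp(-x + a * math.log(x) - math.lgamma(a))

def shower_fraction(t, u, a, b, c, d, w):
    """Fraction of shower energy deposited up to depths t (in X0) and u
    (in lambda), per Eq. 4.9: w*P(a, b*t) + (1 - w)*P(c, d*u)."""
    return w * reg_gamma_p(a, b * t) + (1.0 - w) * reg_gamma_p(c, d * u)
```

With the parameters of Table 4.2, the fraction rises monotonically from 0 at the shower origin toward 1 deep in the calorimeter.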
In the calculations above, the radiation length and the interaction length of the
calorimeter are obtained by the weighted average of the values for each material
according to the thickness in each layer.
\frac{1}{X_{\mathrm{ECAL,HCAL}}} = \frac{\sum_i \frac{1}{X_i}\,\delta_i}{\sum_i \delta_i},   (4.11)
where X_i is the interaction/radiation length of the i-th calorimeter material and δ_i the thickness of its layer.
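Equation 4.11 amounts to a thickness-weighted average of the inverse lengths; a minimal sketch (hypothetical helper name):

```python
def effective_length(lengths_cm, thicknesses_cm):
    """Effective radiation/interaction length of a layered calorimeter
    (Eq. 4.11): thickness-weighted average of the inverse lengths."""
    inv_sum = sum(d / X for X, d in zip(lengths_cm, thicknesses_cm))
    total = sum(thicknesses_cm)
    return total / inv_sum
```

With the Table 4.3 values for lead and polystyrene, the effective length lands between the two pure-material lengths, dominated by the shorter one.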
The energy that is obtained from Equation 4.9 is deposited in the ECAL. The
shower continues to develop in the HCAL. We assume that the entire hadronic shower
is contained in the HCAL. Therefore there is no need to develop the hadronic shower
explicitly. Instead the remaining energy is deposited in the HCAL according to the
HCAL energy resolution, given below.
\frac{\delta E}{E} = \frac{43\%}{\sqrt{E}} \oplus 4\%   (4.12)
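Here ⊕ denotes addition in quadrature. A minimal sketch of the corresponding Gaussian smearing (hypothetical helper name):

```python
import math
import random

def smear_hcal_energy(e_gev, rng=random):
    """Gaussian smearing of an HCAL deposit with resolution
    sigma/E = 43%/sqrt(E) added in quadrature with 4% (Eq. 4.12)."""
    sigma = e_gev * math.hypot(0.43 / math.sqrt(e_gev), 0.04)
    return rng.gauss(e_gev, sigma)
```

At 50 GeV the stochastic term dominates, giving a spread of roughly 3.6 GeV per deposit.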
4.6 Single Particle Comparison
To match FastCal against GISMO more closely, a single-particle comparison is carried
out using negative pions at 50 GeV. In particular, fluctuations are introduced in the
hadronic shower model parameters to obtain a realistic energy deposition.
Figure 4.7: Calorimeter response to a single π− particle at 50 GeV without any fluctuation. The placement of the histograms is the same as that described in Figure 4.6.
Hadronic shower parametrization fluctuations
The energy deposits in FastCal for π particles without any fluctuations are given in
Figure 4.7.
Shower origin fluctuation: For hadronic particles, hadronic shower origins are
first obtained with a randomized depth distribution of the form e^{−s/λ}, where s is
measured from the ECAL inner surface. Here λ is the nuclear interaction length, the
values of which are obtained from the Particle Data Group tables [38]. The inverse of
the interaction length is averaged for the ECAL according to Equation 4.11. With the
addition of fluctuations in the shower origin, the energy deposit is given in Figure 4.8.
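The exponential depth distribution can be drawn by inverse-transform sampling; a minimal sketch (hypothetical helper name):

```python
import math
import random

def shower_origin_depth(lambda_cm, rng=random):
    """Depth s of the hadronic shower origin behind the ECAL inner
    surface, sampled from e^{-s/lambda} by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], so the log argument is never zero.
    return -lambda_cm * math.log(1.0 - rng.random())
```

The sample mean converges to λ, e.g. about 17 cm for lead (Table 4.3).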
Figure 4.8: Calorimeter response to a single π− particle at 50 GeV with fluctuations in the shower origin. The placement of the histograms is the same as that described in Figure 4.6.
Energy deposit fluctuation: The deposited energy is fluctuated according to
the calorimeter energy resolution,

\frac{\delta E}{E} = \frac{17\%}{\sqrt{E}} \oplus 0\%   (4.13)

The fluctuation is given by a Gaussian distribution with mean equal to the
parameterized energy E and standard deviation δE. This spread completely
accounts for the spread in the total energy deposit in both calorimeters
(Figure 4.9). The values displayed in Equation 4.13 are obtained from the detector
design files.
Shower length scaling and fluctuation: The shower length fluctuation arises
because of the fluctuation of the shower center of gravity from the shower origin. Here
we adopt the distribution obtained in [66] for lead, and π− beams at 13 and 20 GeV.
Thus in Equation 4.9, t → t/f and u → u/f .
π0/π+/− fluctuation: This is achieved by fluctuating w in Equation 4.9. The
distribution is adopted from the one used by the CDF collaboration [32]: a
uniform probability between 0.001 and 0.4; above this, the probability is given by a
Gaussian with mean 0.4 and standard deviation 0.25 (Figure 4.11). w is not sampled
above 0.99.
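A sketch of sampling w from this piecewise distribution by accept/reject; the helper name is hypothetical, and the relative normalization of the plateau and the Gaussian tail (taken to join continuously at w = 0.4) is an assumption:

```python
import math
import random

def sample_w(rng=random):
    """Draw the EM-fraction parameter w: flat on [0.001, 0.4], joined
    continuously to a Gaussian tail (mean 0.4, sigma 0.25), truncated
    at 0.99. Accept/reject against the piecewise density; the relative
    normalization of the two pieces is an assumption."""
    lo, hi = 0.001, 0.99
    while True:
        w = rng.uniform(lo, hi)
        density = 1.0 if w <= 0.4 else math.exp(-0.5 * ((w - 0.4) / 0.25) ** 2)
        if rng.random() < density:
            return w
```

Under this normalization, slightly more than half of the draws fall on the flat plateau below 0.4.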
4.6.1 Low-energy physics processes
In this section, processes are described that account for the low-value end of the
ECAL energy distribution of 50 GeV π− particles. The minimum energy deposit is
simulated via the ionization due to the charge of the particle, and the spread in the
energy is simulated with delta rays.
Figure 4.9: Calorimeter response to a single π− particle at 50 GeV with fluctuations in the shower origin. The placement of the histograms is the same as that described in Figure 4.6.
Ionization energy loss
Charged particles passing through the ECAL and HCAL deposit a small amount of
energy on account of ionization. The rate of energy loss for a moderately relativistic
charged particle is given by [15]
-\frac{dE}{dx} = K z^2 \frac{Z}{A}\,\frac{1}{\beta^2}\left[\frac{1}{2}\log\frac{2 m_e c^2 \beta^2 \gamma^2 T_{\max}}{I^2} - \beta^2\right]   (4.14)
Figure 4.10: Calorimeter response to a single π− particle at 50 GeV with additional fluctuations in the shower length. The placement of the histograms is the same as that described in Figure 4.6.
Figure 4.11: The distribution of w.
where the meaning of each variable is as given in the above-mentioned reference.
Note that, in this form, the energy loss does not include the density correction.
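As a rough numerical check of Equation 4.14, a sketch follows. The helper name is hypothetical; the lead Z, A and mean excitation energy I are assumed values, and Tmax uses the standard kinematic expression, which is not given in the text:

```python
import math

ME_C2_MEV = 0.511        # electron rest energy
K_MEV_CM2_MOL = 0.307075 # 4*pi*N_A*r_e^2*m_e*c^2, MeV cm^2/mol

def bethe_bloch(beta_gamma, M_mev, z=1, Z=82, A=207.2, I_mev=823e-6):
    """Mean ionization loss -dE/dx in MeV cm^2/g (Eq. 4.14, no density
    correction). Defaults sketch a singly charged particle in lead;
    I ~ 823 eV for Pb is an assumed value."""
    gamma = math.sqrt(1.0 + beta_gamma**2)
    beta = beta_gamma / gamma
    me_over_M = ME_C2_MEV / M_mev
    tmax = (2 * ME_C2_MEV * beta_gamma**2 /
            (1 + 2 * gamma * me_over_M + me_over_M**2))
    log_arg = 2 * ME_C2_MEV * beta**2 * gamma**2 * tmax / I_mev**2
    return (K_MEV_CM2_MOL * z**2 * (Z / A) / beta**2 *
            (0.5 * math.log(log_arg) - beta**2))
```

For a muon near minimum ionization (βγ ≈ 3) this gives on the order of 1.1 MeV cm²/g in lead, with the characteristic relativistic rise at higher βγ.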
To obtain a close match with GISMO at low energies, FastCal energy deposits
from muons are obtained from Equation 4.14 and fitted to the energies obtained from
GISMO. Figure 4.13 represents this fit, where each data point represents the energy
loss of muons with a given initial energy-momentum. The best fit is quadratic, given
by Equation 4.15.

E_{\mathrm{MIP}} = 0.0092 + 0.37E + 9E^2   (4.15)
Continuous energy loss of hadrons
The ionization energy loss accounts for the minimum energy deposits, but does not
account for the low-energy spread seen in GISMO. The high-energy
Figure 4.12: Calorimeter response to a single π− particle at 50 GeV with additional fluctuations in w. The placement of the histograms is the same as that described in Figure 4.6.
[Figure 4.13 panel: dE/dx (GISMO) [GeV/cm] versus dE/dx (FastCal) [GeV/cm], with the quadratic fit χ²/ndf = 2.306/4, p0 = 0.009219 ± 0.0001755, p1 = 0.3752 ± 0.03858, p2 = 9.78 ± 2.112.]
Figure 4.13: The quadratic fit for dE/dx.
tail with a minimum peak in the GISMO data indicates that the charged particles
are losing energy to additional processes. We approximate this spread by simulating
delta rays (the production of knock-on electrons) [15, 74].
Figure 4.14 displays the energy distribution in FastCal and GISMO at low energies
for π− at 50 GeV. The good match in the minimum deposit is due to the energy fit
to muon data in the last section. The spread in the energy is due to delta rays in
FastCal, and the FastCal and GISMO energies agree well in the low energy deposits.
Both physical processes, the minimum ionization by charged hadrons and the
simulation of δ-rays, make small energy deposits in the ECAL that do not significantly
alter the overall energy deposition in the calorimeters, or the jet-level results that we
compare in the next chapter. This section, however, provides a close match between
FastCal and GISMO for single-particle comparisons.
Figure 4.14: Comparison of low energy depositions in the ECAL for 50 GeV negative π with minimum ionization and δ-ray simulations. Note that the lowest energy deposit matches well due to the fitting of the minimum energy given by Equation 4.15. The spread to the right is due to the δ-ray simulation.
4.6.2 A synopsis of the hadronic particle simulations
Figure 4.15 displays the flowchart for hadronic particle energy deposition in FastCal,
described below:
• Particle paths are traced to the front of the ECAL.
• From the front of the ECAL, a random shower origin is chosen according to an
e^{−s/λ} distribution.
• The particle path is traced to the back of the ECAL.
• If the particle is charged and the hadronic shower originates in the ECAL,
minimum ionization and δ-ray simulations are performed from the front of ECAL
to the shower origin, and the energy is deposited in the ECAL. The deposited
energy is removed from the particle.
• If the particle is charged and the hadronic shower does not originate in the
ECAL, the minimum ionization and δ-ray simulations are performed from the
front of the ECAL to the back of the ECAL and the energy is deposited in the
ECAL. The deposited energy is removed from the particle.
• If the hadronic shower originates in the ECAL, hadronic shower parameteriza-
tion is performed with the remaining energy. The energy obtained is deposited
in the ECAL, and that amount is removed from the particle.
• The remaining particle energy is deposited in the HCAL after fluctuating it
according to the HCAL energy resolution.
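The flow above can be sketched as follows. All names and the stand-in numbers (ECAL depth, effective λ, MIP loss rate) are assumptions for illustration, and the shower-profile term here is a simple exponential stand-in for the Bock parameterization of Equation 4.9:

```python
import math
import random

ECAL_THICKNESS_CM = 24.0  # assumed radial depth of the ECAL
LAMBDA_ECAL_CM = 30.0     # assumed averaged interaction length (Eq. 4.11)

def deposit_hadron(energy_gev, charged, rng=random):
    """Split one hadron's energy between ECAL and HCAL following the
    FastCal flow of Section 4.6.2 (simplified stand-in parameterizations).
    Returns (ECAL deposit, HCAL deposit) in GeV."""
    e_ecal = 0.0
    e_left = energy_gev
    # Random shower origin behind the ECAL front face, ~ e^{-s/lambda}.
    s = -LAMBDA_ECAL_CM * math.log(1.0 - rng.random())
    in_ecal = s < ECAL_THICKNESS_CM
    if charged:
        # Ionization (MIP) deposit up to the shower origin or ECAL back.
        path = min(s, ECAL_THICKNESS_CM)
        mip = min(0.01 * path, e_left)  # assumed ~10 MeV/cm effective loss
        e_ecal += mip
        e_left -= mip
    if in_ecal:
        # Shower deposit in the remaining ECAL depth (stand-in profile).
        frac = 1.0 - math.exp(-(ECAL_THICKNESS_CM - s) / LAMBDA_ECAL_CM)
        shower = frac * e_left
        e_ecal += shower
        e_left -= shower
    # Remaining energy goes to the HCAL, smeared by its resolution (Eq. 4.12).
    if e_left > 0.0:
        sigma = e_left * math.hypot(0.43 / math.sqrt(e_left), 0.04)
        e_left = max(rng.gauss(e_left, sigma), 0.0)
    return e_ecal, e_left
```

On average the two deposits sum to the incident energy, since only the HCAL remainder is smeared.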
4.7 FastCal gain
FastCal has yielded results that show very good agreement with those from GISMO,
for single particle tests with electrons and negative pions. The optimization scheme,
implemented to speed up simulation, yields very good results, as Table 4.4 demon-
strates.
4.8 Conclusion
In this chapter we have motivated and described FastCal, the fast parameterized
Linear Collider Detector calorimeter simulator. FastCal simulates the calorimeter
response to final state e+, e−, γ, µ and hadrons produced by the event generator Pandora-
Figure 4.15: The algorithm for FastCal energy deposition in the ECAL due to hadronic showers.
                              FastCal          GISMO
Events                        9,900            9,900
Hadronic events               4,877            4,877
Four Jet Events               4,750            4,809
Good Pairs                    6,752            6,380
Generation (100 events)       8.125 sec (PII)  8.125 sec (PII)
Full Simulation (100 events)  –                12,276 sec (SLAC)
Clustering (100 events)       4.11 sec (PII)   61.27 sec (PII)
Total (100 events)            4.11 sec (PII)   12,337.27 sec (PII)

Table 4.4: A comparison between FastCal and GISMO. Four jet events are those in which both the Z's decay hadronically, and the events have at least four jets. Good pairs are those pairs of the four jets that can be associated with the quarks from the same Z.
Pythia. Hadronic showers are simulated explicitly. The single electron test events
indicate that most of the electromagnetic showers are contained in the ECAL and for
single cluster simulation, explicit simulation of showers is not required.
For hadronic showers, it was shown using single π− particles that shower origin and
energy deposit fluctuations alone are not sufficient to obtain a good match between
FastCal and GISMO. In addition, fluctuations in the shower peak (shower length)
and the electromagnetic shower component are important. All four fluctuations,
taken together, yield a good match between FastCal and GISMO.
For a GISMO/FastCal match at low ECAL energy deposits, ionization energy loss and
the simulation of delta rays are important. For a closer fit with GISMO, muons are
used to fit FastCal energies to GISMO energies.
Finally, FastCal as implemented provides a significant gain in time, performing the
simulation in about 0.03% of the time required by GISMO.
Chapter 5
Neural Networks – comparison
between GISMO and FastCal
5.1 Introduction
In Chapter 3, an optimal neural network was designed with a training data set of
80,000 pairs of jets. This corresponds to about 50,000 events for each neural network.
A full-fledged neural network training program requires many times that number of
events for efficient training. Since full simulation under GISMO is CPU intensive,
we employ FastCal to generate detector data.
This chapter compares the simulated data provided by GISMO and FastCal and
their application in data analysis. For the tests in this chapter, the GISMO data used
are from the SLAC repository. The GISMO data comprise a full simulation of the
ldmar01 detector design, and the FastCal data are based on the corresponding Pandora-
Pythia StdHep events. Thus GISMO and FastCal are two different simulations of the
same underlying events with the same detector.
Figure 5.1: The ECAL energy deposition obtained in (a) FastCal and (b) GISMO. The two distributions are in qualitative agreement.
5.2 Calorimeter deposition
The ECAL depositions are shown in Figure 5.1 for both FastCal and GISMO. For
FastCal, the energy per event is the sum of the energy deposited in the ECAL. For
GISMO, the energy is the sum of the ECAL calorimeter hit energies times a scaling
factor. There is good qualitative and quantitative agreement between the two distributions.
Figure 5.2 shows the HCAL energy depositions for FastCal and GISMO, and
they are obtained similarly to the depositions in the ECAL. Though the FastCal and
GISMO depositions are similar for the HCAL, they differ from the deposition in the
ECAL in the absence of the double hump.
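The two per-event energy definitions just described can be sketched as follows; the hit record layout and the scaling-factor value are illustrative assumptions, not the actual hep.lcd interfaces:

```python
def fastcal_event_energy(ecal_deposits):
    """FastCal: the event energy is simply the sum of the ECAL deposits."""
    return sum(ecal_deposits)

def gismo_event_energy(ecal_hits, scale=1.0):
    """GISMO: sum the ECAL CalorimeterHit energies and apply a scaling
    factor; the factor's value is detector-dependent (1.0 is a placeholder)."""
    return scale * sum(hit["energy"] for hit in ecal_hits)
```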
The total calorimeter energy deposition is compared for FastCal and GISMO in
Figure 5.3. The simulated energy depositions can be compared with the total energy
of the final state e+, e−, γ and hadrons in Figure 5.4.
Figure 5.2: The HCAL energy deposition obtained in (a) FastCal and (b) GISMO. The two distributions are in qualitative agreement.
5.2.1 Cluster level comparison
The particles passing through the calorimeter deposit energy in calorimeter cells. For
a detector simulated by GISMO, these are represented by the CalorimeterHit objects
in hep.lcd.event. The clusters are constructed from the hits by the ClusterCheater
algorithm, a proximity algorithm that uses the particle Monte Carlo table to identify
the particle content of each calorimeter hit. The energy of a cluster is calculated by
multiplying the energy recorded for the cluster by the sampling fraction.
In the FastCal simulation of the calorimeters, there are no finer details of the
calorimeter beyond the average property. Since no new particles are created beyond
the ones created by the event generator (Pandora-Pythia), each event generator final
state particle passing through the calorimeter produces at most a single cluster. Since
the calorimeter simulated is not a sampling calorimeter, no sampling correction is
required.
The cluster energy distributions thus obtained in FastCal and GISMO are compared
in Figure 5.5(a). There is a general agreement in the distributions except in
the low energy range: Figure 5.5(b) shows an order of magnitude higher number of
clusters with energy <1 GeV in GISMO. The difference is too large to be explained by
the particles created by GISMO during the detector simulation, and is possibly an
artifact.

Figure 5.3: The total calorimeter energy deposition obtained in (a) FastCal and (b) GISMO.

Figure 5.4: The total energy per event in final state e+, e−, γ and hadrons (generator level).
Figure 5.5: (a) The comparison of the cluster energy distribution in FastCal and GISMO (FullSim). There is a general agreement between FastCal and GISMO, except for low energies. (b) The first bin in (a) blown up to show that GISMO has a high number of low energy clusters (<1 GeV). The very low energy "garbage" clusters do not impact the jet finding algorithms.
5.2.2 Jet level comparison
The clusters obtained as described above in FastCal and GISMO are then used in the
construction of jets. The JadeEJetFinder algorithm available in the hep.lcd package
is used with ycut = 0.005.
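A minimal from-scratch sketch of Jade clustering in the E scheme, in the spirit of (but not taken from) JadeEJetFinder: the pair with the smallest y_ij = m_ij^2 / E_vis^2 is merged by adding four-momenta until no pair remains below ycut.

```python
import numpy as np

def jade_e_jets(particles, ycut=0.005):
    """Jade E-scheme jet clustering sketch.

    particles: list of four-vectors (E, px, py, pz).
    Pairs with y_ij = m_ij^2 / E_vis^2 below ycut are merged
    (four-momenta added) until the smallest y_ij exceeds ycut.
    """
    jets = [np.asarray(p, dtype=float) for p in particles]
    e_vis = sum(p[0] for p in jets)
    while len(jets) > 1:
        best = None
        # find the pair with the smallest y_ij
        for i in range(len(jets)):
            for j in range(i + 1, len(jets)):
                p = jets[i] + jets[j]
                m2 = p[0] ** 2 - p[1] ** 2 - p[2] ** 2 - p[3] ** 2
                y = m2 / e_vis ** 2
                if best is None or y < best[0]:
                    best = (y, i, j)
        y, i, j = best
        if y >= ycut:
            break
        jets[i] = jets[i] + jets[j]
        del jets[j]
    return jets
```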
Figure 5.6 compares the jet energy distribution in FastCal and GISMO. There is
a general agreement in the two distributions with a slight discrepancy in the number
of low energy jets.
In Figure 5.7(a), the number of jets per event in FastCal and GISMO are compared.
As expected, the GISMO distribution peaks at a higher value. Fitting a Gaussian to
each and comparing the means, we obtain a difference of approximately half a jet
between FastCal and GISMO. This is seen in Figure 5.7(b). Further, with increasingly
higher jet energy cuts, the difference progressively decreases, with no statistically
significant difference in the jets per event between FastCal and GISMO at a cut of
40 GeV.
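The cut scan itself is straightforward to sketch; the Gaussian fitting shown in the figure is omitted here, and the function and argument names are hypothetical:

```python
import numpy as np

def mean_jets_vs_cut(jet_energies_per_event, cuts):
    """For each jet-energy cut, count the jets above the cut in every event
    and return the mean number of surviving jets per event."""
    means = []
    for cut in cuts:
        counts = [sum(1 for e in jets if e > cut)
                  for jets in jet_energies_per_event]
        means.append(float(np.mean(counts)))
    return means
```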
Figure 5.6: Jet energy distribution in FastCal and GISMO (FullSim). The numbers of jets match, except for low energy jets.
Figure 5.7: (a) The number of jets per event in FastCal and GISMO (FullSim). (b) The means of Gaussian fits to the jets-per-event distributions as a function of the jet energy cut; the difference between FastCal and GISMO decreases as the cut increases.
5.2.3 Jet-Quark Association
Jets are constructed using the Jade [11] jet finding algorithm (JadeE) available in JAS.
For GISMO data, JadeE is applied to the clusters constructed by the ClusterCheater
algorithm, which builds clusters from neighboring hits using a proximity rule. In FastCal,
since the data are available at the cluster level directly, no cluster builder algorithm is used.
For this chapter, the jet-quark association is done using the following algorithm:
• Consider the four most massive jets.
• Compute their opening angles with the quarks.
• Associate the jet with the quark closest to it.
• Break the association if the opening angle is larger than 0.3 radians.
• Break the association if another jet is closer to the quark.
• Break the association if the energy of the quark is less than 10 GeV.
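The rules above can be sketched as follows; the jet/quark record layout is a hypothetical choice for illustration:

```python
import math

def opening_angle(u, v):
    """Opening angle between two 3-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def associate(jets, quarks, max_angle=0.3, min_quark_energy=10.0):
    """Associate each jet with its nearest quark under the rules above.

    jets, quarks: lists of dicts with 'energy' (GeV) and 'p' (3-momentum).
    Returns {jet_index: quark_index} for surviving associations.
    """
    # consider only the four leading jets
    order = sorted(range(len(jets)), key=lambda i: jets[i]["energy"],
                   reverse=True)[:4]
    pairs = {}
    for ji in order:
        angles = [opening_angle(jets[ji]["p"], q["p"]) for q in quarks]
        qi = min(range(len(quarks)), key=lambda k: angles[k])
        if angles[qi] > max_angle:
            continue                      # too far apart
        if quarks[qi]["energy"] < min_quark_energy:
            continue                      # quark too soft
        # break the association if another jet is closer to this quark
        closer = any(
            opening_angle(jets[oj]["p"], quarks[qi]["p"]) < angles[qi]
            for oj in order if oj != ji)
        if not closer:
            pairs[ji] = qi
    return pairs
```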
Using these association rules, we have a jet-quark association only if the jet is
closest to the quark, the opening angle is less than 0.3 radians, and the quark energy
is at least 10 GeV. The rule given above is used later to associate jets and quarks in ZZ → qqqq
events, and can be used for association in events in which one of the Zs decays
hadronically while the other decays into muons or neutrinos. Since neutrinos and
muons leave negligible energy depositions in the calorimeters, these events show up
in the calorimeter as single Z decay events. For such events, we may associate the jets
and quarks as follows. For the two most energetic jets, the jet-quark opening angle
is examined, and the pairs are associated if the greater of the two angles is less than
0.3 radians. Figure 5.8 displays the energies for GISMO jets. Figure 5.9 displays the
result for FastCal jets.
Figure 5.8: The jet-quark energy plot for GISMO events with one Z decaying hadronically, while the other decays into muons and neutrinos. The two most energetic jets are then associated with the quarks, and a match is established if the bigger of the two jet-quark opening angles is less than 0.3 radians.
Figure 5.9: The jet-quark energy plot for FastCal events with one Z decaying hadronically, while the other decays into muons and neutrinos. The two most energetic jets are then associated with the quarks, and a match is established if the bigger of the two jet-quark opening angles is less than 0.3 radians.
5.3 Neural network training
5.3.1 FastCal-GISMO comparison
Here, a comparison is made of the FastCal and GISMO neural network training.
For a head-to-head comparison of FastCal and GISMO, see Table 4.4. The
data consist of the 9,900 generated and fully simulated e+e− → ZZ events available
in the SLAC repository. Hadronic events are those events in which both Z's decay
hadronically; they are identified by the particle decays in the Monte Carlo tables.
The Z bosons decay into quarks that hadronize and form jets. It is assumed that
the four most energetic jets represent the fragmentation and hadronization of the
four daughter quarks of the Z boson. To find the right combination, each jet pair
(six, from four jets) is compared with both quark pairs and matched. The jet-quark
association algorithm is given in Section 5.2.3.
In addition to the SLAC generated data, we compare the results from newly
generated Pandora-Pythia data. Pandora-Pythia events are generated using Pandora-Pythia
V3.2. This interface uses Pandora V2.2 (with the patches available) and the
Pythia in the CERNLIB 2002 package. The data consist of e+e− → ZZ events with
√s = 500 GeV. The electrons and positrons are unpolarized; each beam carries half
the center-of-mass energy of 500 GeV. For use with FastCal, the following particles
are decayed: (1) K0S, (2) K0L, (3) Λ, (4) Σ+, (5) Σ−, (6) Ξ0, (7) Ξ− and (8) Ω−.
Neural Network
A 6-21-22-2 neural network is implemented, with inputs from jet pairs. The input
variables are jet-jet invariant mass mjj, the jet direction cosines cos θj, the jet energies
Ej and the jet-jet opening angle θjj. There are two hidden layers with 21 and 22 nodes
respectively. The two output nodes represent the probabilities for the right and the
wrong jet pair combinations.

Figure 5.10: The neural network training graph. The lowest validation error is reached at the 780th training cycle. The network at this stage is used in the testing phase.
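The six input variables can be computed directly from the two jet four-vectors; a sketch, with the (E, px, py, pz) tuple convention as an assumption:

```python
import math

def jet_pair_inputs(j1, j2):
    """Build the six network inputs (E1, cos_theta1, E2, cos_theta2,
    m_jj, theta_jj) from two jet four-vectors (E, px, py, pz)."""
    def cos_theta(p):
        return p[3] / math.sqrt(p[1]**2 + p[2]**2 + p[3]**2)

    def angle(a, b):
        dot = a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
        na = math.sqrt(a[1]**2 + a[2]**2 + a[3]**2)
        nb = math.sqrt(b[1]**2 + b[2]**2 + b[3]**2)
        return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

    e = j1[0] + j2[0]
    px, py, pz = (j1[i] + j2[i] for i in (1, 2, 3))
    m_jj = math.sqrt(max(0.0, e**2 - px**2 - py**2 - pz**2))
    return (j1[0], cos_theta(j1), j2[0], cos_theta(j2), m_jj, angle(j1, j2))
```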
ε-p comparison
Since neither efficiency nor purity alone is a good measure of neural network
performance, we compare the F-measure instead. Table 5.1 displays the F-measure
for three sets of data: SLAC-GISMO, SLAC-FastCal and Pandora-Pythia FastCal.
SLAC-FastCal has a slightly higher performance, which may not be significant but
could possibly be due to the non-decay of the V0 particles. The Pandora-Pythia
FastCal F-measure is the mean over thirteen testing datasets, with a spread of 0.004.
                          F-measure
SLAC GISMO                0.858
SLAC FastCal              0.865
Pandora-Pythia FastCal    0.864
Table 5.1: The F -measures (Equation 3.3) for the three data sets.
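Equation 3.3 is not reproduced in this chapter; assuming it is the usual harmonic mean of efficiency ε and purity p, the F-measure can be computed as:

```python
def f_measure(efficiency, purity):
    """Harmonic mean of efficiency and purity (assumed form of Eq. 3.3)."""
    return 2.0 * efficiency * purity / (efficiency + purity)
```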
5.4 Conclusion
In this chapter, we have compared the results from FastCal at the cluster and jet
levels with those from GISMO and have shown that FastCal can appropriately be
used in our analysis routines in lieu of GISMO. FastCal and GISMO produce
qualitatively and quantitatively similar ECAL and HCAL depositions. At the cluster
and jet levels, the energy distributions agree qualitatively and quantitatively, though
there is some divergence at low energies. This divergence does not impact the studies
done here, since the four most energetic jets are used in the analysis. The neural
network training on FastCal and GISMO data shows similar behavior and results.
Thus, as developed for this study, FastCal provides a fast calorimeter simulator that
replaces GISMO with a significant increase in simulation speed but without loss of
physical realism.
Chapter 6
Neural Network – results
6.1 Introduction
In this chapter, the neural network designed in Chapter 3 is used to solve and explore
the combinatorial problem in ZZ production events in which both bosons decay
hadronically, using kinematic variables. In addition, a neural network is also used to
distinguish between the Z and W bosons using kinematic variables. The data used in
this chapter are simulated using the FastCal detector simulator that was introduced
in Chapter 4, the results of which were compared to those of the full simulation
in Chapter 5.
6.2 Jet-Boson Association
In order to train a neural network, the training program needs to know the true
identity of each jet. This is provided by the Monte Carlo internal particle tables that
document quark fragmentations and hadronizations. In the previous examples the
jets were associated with the quarks that the bosons decayed into with the use of an
angular proximity rule described in Section 5.2.3. In that method for jet identification,
Monte Carlo generator level information, in the form of quark energy-momenta, is
used directly, and the jet content is ignored. This introduces two sources of error
into the training data. First, because of the strict proximity rule, it ignores jets that
legitimately originate from a boson but have been deflected from the original
path by physics processes like gluon radiation. Second, it does not
provide any control or measure of jet contamination from clusters that have different
origins.
In this chapter, this deficiency in the angular proximity rule is rectified by the use
of an alternative boson content rule. Using the Monte Carlo table from the generator
level, we define the fractional boson content of a jet as follows:
f_b = e_b / (Σ_b' e_b' + e_unknown),    (6.1)
where eb is the sum of energies of clusters in the jet that originate from a particular
boson. The denominator is the total jet energy, which is the sum of energies from
all decaying bosons in the event as well as energies from other sources. Figure 6.1
displays the distribution of the fractional boson content f_b of the first Z over all jets in
e+e− → ZZ events. Since there are two bosons in the events, the plot is symmetric.
In this fractional boson content scheme, a jet is associated with a boson if more
than 65% of the jet energy comes from that particular boson. The 65% cutoff ignores
the relatively flat portion in the middle of the graph; the boson content begins to rise
sharply above 65%. This allows a small percentage of the jets to be identified as
of unknown origin, since both bosons contribute to their energy nearly equally.
This association scheme, therefore, provides a measure of jet contamination and
is not dependent on angular proximity of the jet to the original quark.
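A sketch of the boson content rule of Equation 6.1 with the 65% cutoff; the bookkeeping of clusters as (origin, energy) pairs is an assumption for illustration:

```python
def fractional_boson_content(cluster_energies, boson):
    """Equation 6.1: fraction of the jet energy contributed by `boson`.

    cluster_energies: list of (origin, energy) pairs for the clusters in
    the jet, where origin is a boson label or 'unknown'.
    """
    total = sum(e for _, e in cluster_energies)
    e_b = sum(e for b, e in cluster_energies if b == boson)
    return e_b / total

def associate_jet(cluster_energies, bosons, cutoff=0.65):
    """Associate the jet with a boson only if it supplies more than
    65% of the jet energy; otherwise the jet is of unknown origin."""
    for b in bosons:
        if fractional_boson_content(cluster_energies, b) > cutoff:
            return b
    return None
```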
Figure 6.1: The energy fraction of each jet for the first boson, given by Equation 6.1.
In Table 6.1, a comparison is made between jet-boson association (based on boson
content) and jet-quark association (based on angular proximity). It shows that the
angular proximity and the boson content rules give a generally similar association,
represented by large values of true positives and true negatives. Of the jets associated
by angular proximity, about 10% are contaminated, and more than 20% of the correct
jet-boson associations are lost due to the strict proximity rule.
                      Boson Content
Angular Proximity      1       0
               1     425      54
               0     133    1388
Table 6.1: Comparison of the Boson Content and the Angular Proximity methods in Z pair identification. 1 denotes a good jet pair and 0 denotes a bad jet pair.
86
For our training and testing purposes, we use the boson-content based jet-boson
association scheme.
6.3 Ensemble of neural networks
In Chapter 3, we designed a neural network, adopted a training scheme and imple-
mented them. In this section, we suggest a means to improve the performance of
neural networks in the classification of Z bosons, and their separation from a back-
ground of wrong jet combinations.
It has been suggested by many authors that to improve the performance of neural
networks, an ensemble of neural networks should be used instead of a single neural
network ([56] and references therein). The accuracy of such ensembles is seen to
improve under two conditions: that the classifiers are themselves as different from
each other as possible, and that each classifier is itself a very accurate classifier on its
own [48], but perhaps in a restrictive domain.
Here we consider an ensemble of neural networks, each of which has the same
architecture as the optimized network designed in Chapter 3. The networks differ in
the initial weights with which they begin training. Following the best-classifier
criterion mentioned above, the training is made to go on for much longer than
earlystopping (see Section 3.5.3) would require. As we shall see, this is crucial for a
good result. For such an ensemble of networks, the ensemble average is given
by:
O = (1/N_e) Σ_i O_i,    (6.2)

where O_i is the output vector of the i-th neural network and N_e is the number of
networks in the ensemble.
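Equation 6.2 amounts to a component-wise mean of the member networks' output vectors, e.g.:

```python
import numpy as np

def ensemble_output(outputs):
    """Equation 6.2: average the output vectors of the N_e member networks."""
    return np.mean(np.asarray(outputs, dtype=float), axis=0)
```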
In preparing the training data sample, we adopt a variation of the bagging
technique [18]. The bagging technique consists of resampling, with replacement, the
training dataset for each individual network. This means that the training data for
individual networks are a subset of the entire dataset with some records repeated.
Thus in a given training cycle the network training is reinforced on the repeated
records. The bagging technique is particularly appropriate for unstable networks
that produce widely different results for a small change in the training dataset. Here,
the neural network, which has been designed to classify the correct and incorrect jet
combinations, is not unstable. Instead, the finite size of the network imposes a
restriction on the size of the training dataset on which it can train. Since we are
interested in improving the performance of a stable network by overcoming the
limitations on the size of the training data imposed by the network architecture, we
generate the individual datasets for each network by resampling 80,000 records
without replacement.
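The sampling variant described above (without replacement, one subset per network) can be sketched as follows; the function name and signature are hypothetical:

```python
import random

def make_training_sets(dataset, n_networks, sample_size=80000):
    """Draw one training set per network by sampling without replacement,
    so each network trains on a distinct (possibly overlapping) subset
    with no repeated records inside any one subset."""
    return [random.sample(dataset, min(sample_size, len(dataset)))
            for _ in range(n_networks)]
```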
6.3.1 Ensemble Results
Figure 6.2 displays the result of two kinds of ensemble neural networks: (1) networks
that were stopped early using a validation dataset, and (2) networks that were trained
for a much longer period until the error on the training dataset was sufficiently low.
The error in the training dataset does not reach a minimum, and so a qualitative
judgment is required for stopping the training.
There are a number of inferences that can be drawn from the graph. First, the two
kinds of networks show different characteristics. The earlystopped
neural networks are grouped at a higher efficiency-purity performance than the
group of networks that were trained to a lower error on the training data set.
This indicates that the assertion made in Section 3.5.3 is correct. Earlystopping of
individual neural networks improves performance.

Figure 6.2: The ε-p graph illustrating the ensemble result. See Section 6.3.1.
We notice however that though individual networks that were not stopped early
performed worse than those that were stopped early, in ensembles they performed
better. We may understand this in terms of overfitting. Overfitting is the situation
when the neural network fits the given data well, but at the same time it creates
features in the model that are not actually present. These features are introduced at
random. Therefore, each individual neural network creates its own random features
which are washed out when the averaging takes place, i.e., the overfitted features are
averaged out. Alternatively, since each dataset represents a fraction of the original
dataset, individual neural networks trained on such datasets without earlystopping
can be considered to be highly fitted to the region in variable space the sample dataset
represents. The averaging of the neural network outputs thus represents the average
features of these overlapping and highly fitted regions.
This also means that earlystopping has some limitations. As the training pro-
gresses and the neural network fits the data, the overfitted features are also getting
stronger. As a result, the earlystopping condition is a point where the effect of over-
fitting overwhelms the fitting. That is not the point where perfect fitting has taken
place. Therefore, an ensemble average gives us a better result than earlystopping.
It is to be expected that if this is indeed the case, then a significant overlap of the
training datasets of individual neural networks could be crucial.
Furthermore, the ensemble of neural networks, as implemented here, can be used
to circumvent the limit on the size of the dataset that neural networks impose. In
many high-energy physics applications, it is likely that a large simulated data sample
is available. Using the training sample procedure described here, it becomes possible
to utilize a larger data set for training.
6.4 Classifying Z and W jet-pairs
In this section, we wish to explore the classification of the W and Z bosons from
hadronic decays using kinematic variables. This is a harder problem, since the two
objects, the Z and the W, are kinematically very similar in their hadronic decay
modes.
The neural network has the (6-21-22-2) architecture as in the previous example,
described in Chapter 3. The training data consist of events of the type e+e− →
W+W− where jet pairs are characterized by the variables used earlier (Ej1, cos θj1,
Ej2, cos θj2, mjj, θjj). The variables in the training data are standardized to zero mean
and unit variance according to Equation 3.5. The jet pairs are classified according to
the jet-boson association scheme described in Section 6.2.
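The standardization of Equation 3.5, assumed here to be the usual z-score transform, can be sketched as:

```python
import numpy as np

def standardize(x):
    """Scale each column (training variable) to zero mean and unit variance;
    the per-column mean and std are returned so the same transform can be
    applied to the test data."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / sigma, mu, sigma
```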
Figure 6.3: The ε-p curves display the performance of a neural network trained on e+e− → W+W− data and tested (1) on e+e− → ZZ data (Z NN on W data) and (2) on e+e− → W+W− data (W NN on W data). Note that the network, trained on W data, performs equally well on Z data.
Figure 6.4: The ε-p curves display the performance of a neural network trained on e+e− → ZZ data and tested (1) on e+e− → ZZ data (Z NN on Z data) and (2) on e+e− → W+W− data (W NN on Z data). As in Figure 6.3, this shows that the networks trained on right and wrong jet pairs are picking up the features of a general heavy boson.
Figure 6.3 displays the efficiency-purity curves for the neural network distinguishing
the correct jet pairs from the incorrect jet pairs in ZZ and WW events. Trained
on e+e− → W+W− data, the neural network performs nearly equally on both Z and
W data. This behavior is repeated in neural networks trained on e+e− → ZZ data
(Figure 6.4). For this network too, the performance of the network on W test data
is only marginally inferior to its performance on Z test data.
Thus the neural network designed for the combinatorial problem has picked up
features for the identification of a general heavy boson against a background of wrong
jet combinations, but it does not distinguish W from Z.
6.4.1 Distinguishing W and Z
In this section, we examine in more detail the performance of the network in distin-
guishing between W and Z. The training data now have 40,000 jet-boson-associated
jet pairs identified as W and an equal number of jet-boson-associated jet pairs iden-
tified as Z. The neural network has a fully connected 6-21-22-2 architecture trained
on the same six variables as above.
Figure 6.5 displays the efficiency-purity curve for the network that distinguishes
the Z from the W jet pairs. Since there are two classes in the data, the efficiency-purity
curve for the same network distinguishing the W from the Z jet pairs
is complementary and is not shown. In comparison to the Z boson classification
against the wrong jet pair combinations, the neural network does not perform as
well. Moreover, the curve is straighter and does not have a well-defined concave
shape.
This straighter shape in Figure 6.5 can be understood from the neural network
probability output shown in Figure 6.6. It shows that the neural network output
between 0.15 and 0.7 is equally distributed between the Z and W jet pairs, following
identical distributions, indicating that this mid region is not a good discriminator.
The straighter efficiency-purity curve for the Z from Z,W test data in Figure 6.5 is
attributable to the low performance of the neural network in this region.
6.5 Conclusion
In this chapter, we have designed a training scheme for an ensemble of neural net-
works. It was seen that an ensemble of neural networks performs better than a single
neural network, though the performance does not improve drastically. There are two
crucial factors determining the success of this scheme. The first is that the neural
networks should be trained on bagged data; the second is that the networks should
not be regularized by methods like earlystopping.

Figure 6.5: The ε-p curve for the Z jet pairs against a background of W jet pairs is given by the dark line. For comparison, the ε-p curve for the Z jet pairs against the background of wrong jet pairs is also given. The dark line is straighter, denoting a poorer classification performance in the mid probabilities (see Figure 6.6).

Figure 6.6: The neural network output of the first unit (probability of Z jet pair). Since there are two classes, the probability for W is complementary. The classification is not optimal between the output values of 0.15 and 0.7, and it is particularly suboptimal between 0.15 and 0.5. The peaks most likely denote subclasses of the jet pairs that have particular kinematic variable distributions.
The variation of the bagging method used in the ensemble of networks is a means
by which neural networks can be trained on larger datasets. The neural network
architecture imposes a restriction on the size of the dataset on which the networks
can optimally train. When a large dataset is available for training, the ensemble
technique can be used to improve the performance of neural networks.
We have also examined the training of a neural network to classify heavy bosons
into Z and W separately. A naive implementation of such a classifier does not operate
perfectly, and there is a range of neural network output (i.e., Z probability output
between 0.15 and 0.7) where the classifier does not work. For an optimal classification,
a better understanding of the kinematic variable space is needed. This exploration of
the kinematic variable space is done in the next chapter.
Chapter 7
Unsupervised Methods
7.1 Introduction
The training of the feedforward neural network used so far is an example of supervised
learning. Under this training paradigm, the classifier is given a dataset with records
that contain training variables as well as the pre-determined class each record be-
longs to. The classifier implements an iterative learning procedure and learns on this
training set. In neural networks this learning procedure occurs with the iterative ad-
justments of the connection strengths (weights) between neural units. In binary trees,
the learning procedure occurs via repeated binary partition of the data according to
an optimizing criterion that depends on the specific binary tree implementation.
An alternative form of learning is unsupervised training. In this scheme, the
dataset on which the training takes place does not contain information on the correct
classification of each record. In this scenario, the learning algorithm learns the data
without supervision and produces the partitions (classes) spontaneously.
There are two basic advantages unsupervised learning methods have over super-
vised learning methods. First, unsupervised methods are not susceptible to the bias
of the training data sample that is characteristic of data samples containing a
pre-determined classification. In the high-energy physics context, the training datasets
determined classification. In the high-energy physics context, the training datasets
are generally obtained from simulation studies. A bias is introduced into the train-
ing data sample by the method, usually domain knowledge based (see Jet-Boson
association in Section 6.2), that determines the classes that each record belongs to.
Unsupervised methods, which do not train on data that have been pre-classified, are
therefore not limited by the domain knowledge inherent in training datasets.
The other advantage is the possible discovery of novel signals in the data. Since
supervised methods are trained on a fixed number of classes, any novel signal in
new data will likely be missed by supervised methods. Unsupervised methods, not
confined by this restriction of supervised methods, are more sensitive to novel signals
in new data.
From a computational point of view, a third advantage is that unsupervised meth-
ods are faster than supervised methods like neural networks. Combined with the first
two advantages, unsupervised methods are more suited for examining large datasets
for the presence of novel signals.
These advantages of unsupervised methods are offset by the ability of supervised
methods to model data in greater detail and thus offer greater classification perfor-
mance.
In the previous chapter we encountered the problem of a neural network
not performing optimally in the classification of the Z against the W. A possible
explanation is that the problem was ill posed; that is, the variables on which
the training took place were not adequate to separate the two different species of
jet pairs. A significant portion of the Z and W data overlap, and the neural network
was unable to separate them in that region.
The solution would be to look at the distribution of the data variables (perhaps
using visualization techniques) and then to try to find better variables. To understand
the data distributions, we first provide an exercise in data visualization and explo-
ration. Most unsupervised methods rely on finding features in the data distributions.
In Section 7.2, we look at Principal Component Analysis as a tool to understand
and visualize data distributions. We also present a density cutoff procedure based on
principal components as an alternative method of signal separation. In Section 7.3
we introduce two clustering methods and examine their efficacy in separating signals
and backgrounds.
7.2 Visualization
Data visualization is important in high-energy physics, since it is the first step toward
data analysis. A popular form of one-dimensional data visualization and presentation
is the histogram. This idea can be extended to two dimensions, and then it is referred
to as a lego plot. This type of visualization cannot be extended to higher dimensions.
Another form of visualization aid is the scatter plot for two or at most three
dimensions. Scatter plots are particularly helpful in the search for structures in the
data distribution and classification. Scatter plots are sensitive to the size of the
dataset. A very large dataset would render the plot entirely black and thus of no use.
Too small a dataset would produce too sparse a scatter plot, again of no use. It is
also easy for significant clusters to hide within larger clusters.
Where the number of dimensions is more than three, visualization of the data
becomes more difficult. In particular, it becomes important to choose the variables
that best characterize the data and exhibit the separation between the classes.
A priori, it is not possible to order variables in importance without specific domain
knowledge. In the examples earlier, the choice of the invariant mass, an important
discriminating variable, was motivated by domain knowledge. Even if an ordering of
variables is possible, there are possibly correlations within the chosen variables that
hide the particular distribution of the data that would best display the structure of
the data.
We discuss below a method, Principal Component Analysis (PCA), that addresses
these issues, enabling us to visualize the data optimally.
7.2.1 The combinatorial problem and Principal Component
Analysis
In the combinatorial problem, we have data for all possible pairs of jets in an event,
from which we would like to classify the correct pairs. The question is: can we distin-
guish the correct pair from these kinematic quantities using a visualization method?
The four variables we examine are the product of the two jet energies (E1E2),
the jet pair invariant mass (m12), the jet opening angle (θ12) and the error in the
invariant mass (δm^2_12). These variables are known to be good discriminants. Ideally,
a knowledge discovery tool should be able to pick up the relevant discriminant on its
own. However, if the discriminant is a nonlinear function of several variables, this is
very unlikely to happen. Since the method we examine here, Principal Component
Analysis, is a linear method, we need to include non-linear discriminants explicitly.
Principal Component Analysis
PCA is a linear transformation from the p-dimensional space to another p-dimensional
space. The components in the new space are ordered according to the proportion of
the variance in the data in that direction. This can be explained geometrically as
follows: the first component is a straight-line fit to the data in the original
p-dimensional space. Therefore, it is the direction along which the data is most
widely spread. The second component is orthogonal to the first and accounts for the
next biggest spread in the data. The third component accounts for the third highest
variance in the data and it is orthogonal to the first two components. And so on.
Since the components are now ordered according to the variance, the structure
of the data becomes obvious in the first few components, now called the principal
components. Using the complete set of new p components does not yield anything
new, since the data in the new space has as much information as the data in the
old space. The advantage with the principal components lies in the fact that they
are ordered according to the variance along the components. For data with a large
number of variables (dimensions), the principal components that account for the
smallest measure of the variance in the data can be dropped. This results in a loss
of some information, but this loss is compensated by a gain because of the decrease
in the dimensionality of the data [12].
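The geometric description above corresponds to diagonalizing the covariance matrix of the standardized data and sorting the eigenvectors by eigenvalue. The following is a minimal sketch of that procedure, not the analysis code used in this thesis; the function and variable names are illustrative:

```python
import numpy as np

def pca(X, n_components=2):
    """Project standardized data onto its leading principal components.

    X is an (n, p) data matrix.  Columns are standardized to zero mean
    and unit variance, the covariance matrix is diagonalized, and the
    eigenvectors are sorted by decreasing eigenvalue (variance).
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(Xs, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # returned in ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xs @ eigvecs[:, :n_components]     # coordinates in PC space
    explained = eigvals / eigvals.sum()         # fraction of variance per PC
    return scores, explained
```

Dropping the trailing columns of `scores` is exactly the dimension-reduction step described above: the discarded components carry the smallest variance fractions.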
7.2.2 PCA to 2 dimensions
Figure 7.1 shows the distribution of the combinations used for our unsupervised training
in a scatter matrix plot, with the variables plotted against each other. The sample consists of both
Z boson pairs and wrong combinations. We are trying to address the problem of
visualizing the separation of the two kinds of pairs. To do this we apply PCA.
The transformation matrix of the four variables given in Section 7.2.1 that results
in PC1, PC2, PC3 and PC4 is given in Table 7.1, where PC1, PC2, PC3 and PC4
are the principal components, given in the order of maximum variance to minimum
variance. Their relative importances are displayed in the scree¹ diagram (Figure 7.2).
¹A scree is characterized by a sharply rising region and a flat region. The point where the two regions meet (sometimes called the knee) can be used as a cutoff on some parameter. Here, we use the scree diagram to identify the number of significant principal components.
Figure 7.1: The scatter matrix plot for the four variables mentioned in Section 7.2.1.
                 PC1          PC2          PC3          PC4
E1E2 (COM)   0.5008767   -0.4829708   -0.6611164    0.28068994
m12          0.5233797   -0.2480016    0.2479882   -0.77657627
θ12          0.4426680    0.8397824   -0.3069293   -0.06786034
δm212        0.5284328   -0.0000699    0.6381390    0.55994413

Table 7.1: The transformation matrix. Note that the transformation is between the scaled and standardized (zero mean, unit variance) values of the variables in the left column and the principal components.
The knee of the scree diagram suggests that the first two principal components are
the most important. The transformation is on normalized variables.
Figure 7.3 shows the distribution of the two most important principal components
in a scatter plot. This view shows the data in the two dimensions in which it has the
maximum spread. The background wrong combinations of the jets are more spread
out, and the Z jet-pairs are considerably localized.
Figure 7.2: The scree plot, which displays the eigenvalues of the principal components.
Interpretation of the transformation
A major reason for the use of principal component analysis is the ordering of compo-
nents, a property that makes it attractive for dimension reduction. We use the order-
ing property here to visualize the data and later as inputs to unsupervised training.
The two components shown in Figure 7.3 account for nearly 90% of the variance in
the data.
The transformation equations, including the initial rescaling and the subsequent
principal component transformation, are shown in Equations 7.1 and 7.2. In the
equations below, PC1 denotes the first principal component and PC2 denotes the
second principal component.
PC1 = 6.56×10⁻⁵ E1E2 + 6.25×10⁻³ m12 + 4.98×10⁻¹ θ12 + 3.43×10⁻³ δm212 − 2.87    (7.1)

PC2 = 6.33×10⁻⁵ E1E2 + 3.00×10⁻³ m12 − 9.38×10⁻¹ θ12 + 6.49×10⁻⁶ δm212 + 1.00    (7.2)

Figure 7.3: The two orthogonal components with the highest standard deviations after principal component analysis. In this plot, the Z jet-pairs are given in black and the wrong combinations are given in gray.
Here the transformations are from the unscaled and unstandardized variables as op-
posed to the transformations in Table 7.1, which transform from scaled and standard-
ized variables (Section 3.5.1).
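The relation between the two forms of the transformation is just the algebra of the rescaling: PC = Σᵢ wᵢ (xᵢ − μᵢ)/σᵢ = Σᵢ (wᵢ/σᵢ) xᵢ − Σᵢ wᵢ μᵢ/σᵢ, so the coefficients on the unscaled variables are the Table 7.1 loadings divided by the per-variable standard deviations, and the constant term (such as the −2.87 in Equation 7.1) collects the means. In the sketch below, the means and standard deviations are hypothetical placeholders, not the values used in the thesis:

```python
import numpy as np

# Loadings of PC1 in the standardized space (first column of Table 7.1).
w = np.array([0.5008767, 0.5233797, 0.4426680, 0.5284328])

# Hypothetical means and standard deviations of the raw variables
# (E1E2, m12, theta12, dm12^2); the thesis does not list them here.
mu = np.array([6.0e3, 90.0, 1.5, 120.0])
sigma = np.array([7.6e3, 84.0, 0.89, 154.0])

# PC1 = w . (x - mu)/sigma  =  (w/sigma) . x  -  w . (mu/sigma)
coeff = w / sigma                   # coefficients on the unscaled variables
offset = -np.dot(w, mu / sigma)    # constant term, analogous to the -2.87

x = np.array([5.0e3, 91.0, 1.4, 100.0])       # one raw data point
pc1_raw = np.dot(coeff, x) + offset           # via the unscaled form
pc1_std = np.dot(w, (x - mu) / sigma)         # via the standardized form
assert abs(pc1_raw - pc1_std) < 1e-12         # the two routes agree
```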
In this space, the Z jet pairs are localized in a smaller region, whereas the wrong
jet pairs are spread out. The wrong combinations do not occupy the entire space, but
are themselves bounded. These bounds are due to kinematic constraints.
Multidimensional scaling
Multidimensional scaling [22] (MDS) is another technique for transforming the vari-
ables. Given an n×p dataset, this can be used to map from a p-dimensional space to
n-dimensional space with the variables ordered according to the discriminating power
as in PCA. Traditionally MDS is used to map to a 2-dimensional space. The chief
characteristic of MDS is that the inter-record distance in the dataset is preserved as
best as possible. Multidimensional scaling methods that are based on the Euclidean
distance give a result that is equivalent to that from PCA. Thus the interpretation
of MDS, namely that the transformation maintains the relative distances between data
points, can be extended to the PCA transformation above.
Though the Euclidean-based MDS can be extended further by using different
distance measures to explore different kinds of data distributions, MDS methods do
not scale well with increasing dataset size, because the method requires a distance
matrix D, which for an n × p data matrix is an n × n matrix. Thus MDS methods
scale as O(n²) and are therefore much slower than PCA, and they require considerably
more memory. Though the extensions to the basic MDS
might provide more insight into the data, we do not consider MDS methods further
in this thesis.
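The classical-scaling equivalence mentioned above can be sketched directly: classical (Torgerson) MDS double-centers the squared distance matrix and eigendecomposes it, and with Euclidean distances the recovered coordinates agree with the PCA projection up to rotation and sign. The sketch below (illustrative, not the thesis code) also makes the O(n²) cost explicit, since the full n × n distance matrix is built:

```python
import numpy as np

def classical_mds(X, k=2):
    """Classical (Torgerson) MDS from pairwise Euclidean distances.

    Builds the full n x n squared-distance matrix -- the O(n^2) cost
    noted in the text -- double-centers it, and recovers k coordinates
    from the top eigenpairs.  With Euclidean distances this reproduces
    the inter-point distances exactly for intrinsically k-dimensional
    data, matching the PCA projection up to rotation and sign.
    """
    n = X.shape[0]
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # top k eigenpairs
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```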
7.2.3 The density plot
As seen in Figure 7.3, the signal (Z jet pairs) is localized against the background of
noise (wrong jet-pair combinations). Therefore, we explore the viability of using a
density cutoff to identify the Z jet pairs in this section.
A density plot is obtained using a Gaussian-like kernel of the following form:

K(d, c) = (1 − (d/3c)²)²  for d < c,
K(d, c) = 0               for d ≥ c.    (7.3)

The kernel is normalized by πc², and c is the radius of an arbitrary circle. Figures 7.4
and 7.5 show the plot of this density. The density peak is located in the region of
the Z boson jet pairs. A cutoff on the density can thus be used to define Z. With
pairs falling within a grid square with density above the cutoff identified as Z pairs,
we obtain the efficiency-purity curve given in Figure 7.6.
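The density evaluation behind Figures 7.4 and 7.5 can be sketched as follows, taking the kernel of Equation 7.3 at face value; the radius c = 0.3, the grid, and the data in the usage below are illustrative, not the values used in the thesis:

```python
import numpy as np

def kernel_density(points, grid, c=0.3):
    """Average density at each grid location from the truncated kernel.

    The kernel is taken from Equation 7.3:
    K(d) = (1 - (d/3c)^2)^2 for d < c and 0 for d >= c, with each
    kernel normalized by pi*c^2 as stated in the text.
    """
    norm = np.pi * c**2
    dens = np.zeros(len(grid))
    for i, g in enumerate(grid):
        d = np.linalg.norm(points - g, axis=1)       # distances to grid point
        k = np.where(d < c, (1.0 - (d / (3.0 * c))**2)**2, 0.0)
        dens[i] = k.sum() / (len(points) * norm)     # mean kernel value
    return dens
```

A cutoff on the returned density then plays the role described in the text: grid squares above the cutoff are identified as the localized Z region, and scanning the cutoff traces out an efficiency-purity curve like the one in Figure 7.6.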
Figure 7.4: The surface plot for the density distribution, calculated using the Gaussian-like kernel described above. Notice the peaks. The higher of the two peaks represents the signal.
Figure 7.5: The contour plot of the density distribution.
Figure 7.6: The efficiency-purity curve. The two curves compare the result of (a) the classical multidimensional scaling procedure and (b) the neural network. In (a), a density cutoff is used to identify the Z. The density is obtained in a 2-dimensional reduced subspace obtained from a classical scaling on the following variables: E1E2, m12, θ12 and δm212. The neural network is trained on E1, E2, cos θ1, cos θ2, m12 and θ12.
PCA and Density Cutoff
Figure 7.6 shows the efficiency-purity curve of the PCA density-cutoff procedure and
compares it with the supervised neural network result. The neural network gives a
better result, for two reasons. First, the neural network is a non-linear procedure,
as opposed to PCA, which is a linear one. Second, the neural network is a supervised
training procedure, whereas the PCA is unsupervised.
Even though the PCA procedure lacks the precision of the neural network, it is
expected to generalize better: it should perform well on a minimum-bias sample and
on samples unlike those it was developed on.
7.2.4 PCA at the quark level
To understand better the distribution of the Z and the wrong-combination jet pairs
in the Principal Component Analysis above, the PCA transformations are applied at
the quark level. In this dataset, there are exactly four quarks in each event, giving
six pairings, of which two are correct Z pairs and four are wrong combinations. The
corresponding first two principal component distributions for the quark-level
information are given in Figure 7.7.
absence of the energy spread at the quark level, which is unavoidable at the jet level,
Figure 7.7 lets us examine the nature of the Z decay against the wrong combinations
more clearly. We observe that the general distribution of the wrong combinations is
very similar to the one observed in Figure 7.3. The distribution of the correct quark
pairs shows a much sharper and significant shape, which stands out against the wrong
combinations. Figure 7.8 displays the correct and wrong quark-pair distributions
separately to bring out the shapes of these distributions better.
In Figure 7.8(a), the sharp upper bound is the lower bound on the opening angle
Figure 7.7: The PCA plot using the same transformation as for the jets, but on quark information.
of the quarks from the decaying Z. The long tail denotes the spread in the opening
angle.
The quark-level distributions in the two principal components indicate that there
exist some structures in the data distribution that are not clearly displayed in the
two-dimensional plot. To examine the distribution further, we add the third principal
component and look at the three-dimensional distribution of the data in the following
section.
PCA to three dimensions at the quark level
With the additional view of the third principal component, some more structure
becomes visible. In Figure 7.9, several views are shown. The blue denotes the
wrong-combination quark pairs, whereas the red denotes the Z quark pairs.

(a) Right combinations (b) Wrong combinations
Figure 7.8: The right and wrong combinations at the quark level.
The blue data points fall on a curved surface and the red data points do not fall
on that surface. This is an indication that the two data classes are separable on the
basis of these four variables alone. This is significant because the data now show a
structure in the three dimensions.
7.3 Clustering methods
Since an important task in high-energy physics data analysis is classification, our goal is to
be able to use classification methods in an automatic manner. Visualization tech-
niques described above are a means to look at preliminary features in the data. The
visualization technique, based on Principal Component Analysis, was extended for
classification. In this section we examine the use of clustering methods to classify
data automatically.
Figure 7.9: Four perspectives of the three-dimensional PCA at the quark level. The dark (black) points are Z and the light (green) points are wrong combinations. The two classes form different distributions. See Section 7.2.4 for a description.
7.3.1 Fuzzy clustering
Here we employ a simple fuzzy clustering [41] method called fuzzy C-means or FCM
(see Appendix E). FCM is a k-means-style classifier, which means that the number
of clusters has to be specified at the outset. The salient feature of this classifier is its
fuzziness, which is defined by the membership attribute of each data point calculated
for each cluster. The membership attribute in fuzzy methods is a number between
0 and 1, in contrast to the regular k-means method, where the membership attribute is
either 0 or 1, thus forming a hard boundary. The fuzzy attribute offers soft boundaries
and higher interpretability.
The method consists of selecting a predetermined number of clusters c based on our
understanding of the problem; in the jet combinatorial problem there are good and
bad combinations of jet pairs, so ideally c = 2. The clusters are initialized by
randomly selecting the centroids in the parameter space. The membership of data
point k in cluster i is given by
mki = dki⁻¹ / Σj dkj⁻¹    (7.4)
where dki is the Euclidean distance between the datapoint k and cluster centroid i.
The soft membership of the datapoints, as implemented in the equation above,
enables the algorithm to handle situations in which the clusters overlap at the bound-
aries. The cluster centroids are calculated by
Ci = Σk mki Pk / Σk mki    (7.5)
where Pk is the k-th datapoint.
The membership attributes are calculated for each datapoint with respect to each
centroid using Equation 7.4. The new centroids are then calculated using Equa-
tion 7.5. This is iterated until the centroids converge. The convergence criterion is
decided by a tolerance of 10⁻⁸ on the Euclidean distance change in each centroid. This
tolerance is acceptable, since decreasing it further does not change the result.
Fuzzy cluster application
The FCM method is applied to the combinatorial problem. Since there are two
classes in the problem, the algorithm is initialized with c = 2. The results are given
in Table 7.2. Fuzzy clustering with Euclidean distance d and membership weight 1/d is
performed for two and three clusters. The clustering algorithm has been able to
Cluster index   Z-pair   non-Z-pair   F-measure
0               11,379       17,759        0.56
1                   66       20,796

Table 7.2: Fuzzy clustering for 2 clusters (c = 2). Cluster 1 is interpreted as the incorrect jet-pair cluster. The F-measure is calculated for cluster 0, which is interpreted as the correct jet-pair cluster.
partition the data into two clusters, one containing nearly all the correct (Z) jet
pairs (cluster 0). This cluster is the Z-pair cluster, which has a very high efficiency
but lower than 50% purity. The other cluster (cluster 1), has a high purity in the
incorrect pair, and just over 50% efficiency. In effect, the clustering for c = 2 has
been successful in identifying a little over 50% of the incorrect pairs.
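The efficiency, purity, and F-measure quoted here can be computed directly from the cluster counts. Assuming the standard definitions (efficiency as the fraction of all signal pairs captured by the cluster, purity as the signal fraction within the cluster, and F-measure as their harmonic mean), the counts for cluster 0 of Table 7.2 reproduce the quoted F-measure of 0.56:

```python
def efficiency_purity_f(n_signal_in, n_signal_total, n_cluster):
    """Efficiency, purity and F-measure of a single cluster.

    efficiency: fraction of all signal pairs captured by the cluster,
    purity:     fraction of the cluster that is signal,
    F-measure:  harmonic mean of efficiency and purity.
    """
    eff = n_signal_in / n_signal_total
    pur = n_signal_in / n_cluster
    return eff, pur, 2 * eff * pur / (eff + pur)

# Cluster 0 of Table 7.2: 11,379 of the 11,379 + 66 Z pairs,
# alongside 17,759 wrong combinations.
eff, pur, f = efficiency_purity_f(11379, 11379 + 66, 11379 + 17759)
```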
The effect of partitioning the data into more than two clusters is examined. The
clusters for c = 3 are displayed in Table 7.3. Comparing the c = 2 and c = 3
clusters, it is seen that the additional cluster does not change the partition
drastically. Instead, the third cluster is created out of cluster 1 of the c = 2
clustering (Table 7.2). Increasing the number of clusters c did not partition the
cluster 0 of c = 2 any further, which would have been required for a better efficiency
and purity for the correct jet pair (Z).

Cluster index   Z-pair   non-Z-pair   F-measure
0               11,380       17,595        0.56
1                    4           86
2                   61       20,874

Table 7.3: Fuzzy clustering for 3 clusters (c = 3). The F-measure for cluster 0 does not improve with the increase of c from 2 to 3.
The cluster centroids are given in Table 7.4. Note that in the FCM algorithm the
centroids cannot be interpreted as the average of the datapoints belonging to that
cluster, since each centroid is a weighted average of all datapoints.
Cluster index        E1      cos θ1         E2      cos θ2        m12        θ12
0             -1.96e-05    1.41e-06  -2.36e-05   -9.21e-07  -3.59e-05  -3.35e-05
1              3.37e-06   -3.46e-07   3.99e-06    3.35e-07   6.05e-06   5.62e-06
2              1.62e-05   -1.07e-06   1.97e-05    5.85e-07   2.98e-05   2.79e-05

Table 7.4: The centroids of the three clusters. Note that the variables are standardized to zero mean and unit variance.
Table 7.2 demonstrates that this clustering method is unsuitable for the separation
of the Z-pair jets from the non-Z-pair jets. This is because the data have non-spherical
structures, whereas FCM methods based on Euclidean distance work well only for spherical
distributions. Figure 7.10 displays the first two principal components of the data.
The signal forms a high-density localized distribution overlapping another localized
distribution of the wrong jet-jet pairs.
So while this method can enhance the sample, it is not an efficient classifier.
Figure 7.10 shows the non-spherical distribution of the Z boson jet-pair data.
With an appropriate choice of a non-standard norm in the distance metric, an ellipse,
with the major axis along the longitudinal axis of the Z jet-pair distribution, can be
[Figure 7.10: scatter plot of the first two principal components; axes labelled X and Y.]
.. .
.
.
.
.
.
.
..
..
.
.
.
.
.
.
...
.
.
..
.
... .
..
.
.
.
.
.
.
.
.
.
.
. ...
...
.
.
.
.
.
.
.
.
.
.
.
. ..
..
..
..
..
. .
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
. .
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
..
.
.
..
..
.
.
. .
.
..
..
.
..
.
.
..
.
.
.
.
..
..
..
...
.
.
.
.
.
..
..
..
..
..
..
.
.. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
. ...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.. ..
.
..
..
. .
.
.
.
.
.
.
..
.
..
.
..
.
.
..
.
..
..
.
.
.
.
.
.
..
.
.
.
.
.
..
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
. ..
.
.
.
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
..
..
.
..
. .
..
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
..
..
.. .
.
..
..
..
.
.
..
..
.
.
..
..
.
.
.. ..
..
.. .
.
.
.
.
.
.
..
..
.
..
.
.
..
..
.
.
..
.
.
.
.
..
.
. ...
..
..
..
.
..
.
.
..
...
....
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
...
...
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
.
.
.
. ..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
..
.
.
.
.
..
.
.
..
.
.
..
.
... .
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.. .
..
..
.
..
.
.
..
..
..
..
..
..
..
.
.
.
.
.
.
..
..
..
.
..
..
.
.
.
.
.
.
..
.
.
.
.
..
.
. .
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
.
.
.
..
.
..
..
.
..
.
.
.
..
. .
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
. .
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
..
..
..
.
.
..
.
.
.
.
.
.
..
..
.. .
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
..
..
...
.. .
..
.
.
.
.
.
..
..
..
.
.
.
.
.
..
.
.
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
.
...
. ..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
...
.. .
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
. .
.. ..
.
.
..
.
.
.
.
.
..
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
..
..
. .
..
.
..
..
.
..
..
..
.
.
.
.
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
..
. .
.
.
.
..
.
.
.
.
..
.
.
..
.
..
.
..
.
.
..
.
.
.
.
.. .
.
.
.
.. .
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
.. ..
..
.. ..
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
..
... .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
. .
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
..
..
...
..
...
..
..
..
.
.
.
.
.
.
...
.. .
..
..
..
.
..
..
.
..
..
...
.
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
. ..
... .
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
..
.
.
.
..
.
...
.
.
. . .
.
.
.
.
.
...
...
..
..
..
..
..
..
..
.
.
..
. .
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
...
. .
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
.
.
..
...
...
..
..
..
.
.
.
.
.
.
..
.
.
..
..
..
..
..
.
.
..
..
..
..
.
..
..
.
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
. .
.
.
.
..
..
.
..
.
.
.. .
.
.
.
.
.
.
.
.
.
..
..
..
..
.
...
.
.
..
..
..
..
. .
....
.
.
..
.
.
.
.
.
.
..
..
..
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
..
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
..
.
.
..
.
.
.
.
.
. ..
.
.
. .
.
.
.
.
.
..
.
.
.
.
. ..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
....
.
.
. .
.
..
..
.
.
.
.
.
..
.
.
.
.
.
. ..
..
..
...
..
.
...
...
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
..
..
..
.
.
.
.
.
.
..
..
...
.
.
.
.
.
.
.. ..
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
...
... .
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
. ..
.
.
.
.
.
.
..
.
.
..
...
...
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
.
...
..
.
.
..
.
.
...
. ..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.. ..
.
.
..
..
..
..
.
..
..
.
.
.
.
.
.
..
.
.
.
.
...
..
..
..
.
.
..
.
.
.
.
.
. .
.
.
.
.
.
..
.
.
..
..
..
..
.
.
..
.
.
..
.
.
..
..
..
..
..
.
.
.. ..
..
..
..
..
..
.
.
.
.
.
.
.
.
..
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
...
...
..
.
.
. .
..
.
.
...
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
. .
.
..
..
. ..
.
.
..
..
.
..
.
.
.
.
.
.
.
..
..
..
..
.
.
..
..
. .
..
.. .
.. .
..
. .
..
.
.
.
.
.
.
..
. .
...
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
.. .
.
..
.
.
..
.
.
..
.
.
.
.
.
..
.
.
.
..
.
.
.
.
..
.
. .
..
.
.
.
.
.
.
. ...
...
..
..
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.. .
.
..
.
. .
.
..
..
.
.
..
.
.
.
.. ..
.
.
.
.
.
.
.
..
.
.
..
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
. .
. .
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
..
.
.
..
. .
.. .
.
.
.
.
.
..
..
..
.
.
..
.
.
.
.
.
.
.
. ..
..
..
.
..
..
.
...
...
.
.
.
.
.
.
..
..
..
.
..
. .
.
..
.
.
..
..
..
..
..
.
.
..
...
. ..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
. .
....
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
. .
.. ..
.
.
..
..
. .
..
..
.
.
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
. .
..
..
..
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
..
..
..
..
..
...
..
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
..
..
.
..
. .
..
.
.
.
.
.
.
..
.
.
...
.
..
..
..
. .
..
..
..
..
...
... .
....
. .
.. ..
.
.
.
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
. ..
..
..
.
.
.
.
.
.
..
.
.
..
.
..
..
.
..
..
..
.
..
..
.
...
...
..
..
.
.
..
.
.
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
...
.
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
..
..
. .
..
.
.
..
.
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
..
.
..
.
.
..
.
....
.
.
.
.
.
.
.
..
.
.
..
..
..
.. .
.
.
.
.
.
.
....
..
....
.
..
..
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
..
.
.
..
.
.
..
..
..
.. .
.
.
.
.
.
.
.
.
.
.
.
.
...
.
.
.
.
.
.
.
.
..
. .
.. ..
..
..
.
.
.
.
.
.
.
.
.
.
.
. .
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
..
.
.
..
..
.
.
..
.
.
..
.
.
..
.
.
..
..
.
.
.. .
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
.. ..
.
.
..
.
.. .
.
.
.
.
.
.
.. .
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
. .
..
. .
..
.
.
.
.
.
.
.
.
..
..
..
. .
..
.
.
.
.
.
.
..
.
.
. .
.
.
.
.
..
.
. . ..
.
..
..
....
.
.
..
.
.
.
.
.
.
..
..
...
.
.
.
.
.
..
.
..
.
...
...
..
.
.
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
..
.
.
..
.
.
..
. .
..
.
..
...
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
..
..
..
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
. . .
.
.
.
.
.
.
.
.
.
.
. ..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
..
..
..
..
..
..
.
..
..
.
.
..
..
.
..
..
..
.
.
. .
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
..
..
..
.
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
..
.
..
..
.
.
.
.
.
.
..
..
. .
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.. .
...
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
....
.
.
.
.
.
.
.
..
..
. .
.
..
..
.
..
..
..
..
. .
..
..
. .
..
.
.. .
.
...
.
.
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
.
..
.
.
..
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
...
...
..
..
..
..
. .
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
. ...
.
..
..
..
..
.
.
. .
.
.
.
.
.
.
...
.. .
..
.
.
.
.
.
.
.
.
.
.
..
. .
. .
..
..
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
. ..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
. .
.
.
..
..
.
.
.
..
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
....
. .
..
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
.
.
.
..
..
...
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
...
...
.
..
. .
.
..
..
..
..
..
..
..
..
..
..
.
.
..
.
.
.
.
.
...
. .
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
...
..
..
..
.
.
..
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
. .
.. .
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
..
...
.
..
..
..
..
..
..
.
.
. .
.
.
.
.
.
.
..
. .
.
.
.
.
. .
.
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
..
.. .
... .
...
.
.
.
.
.
.
.
..
.
.
.
.. .
.
.
.
.
.
...
...
..
..
..
.
. .
..
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
..
..
.
.
.
.
.
.
..
..
..
.
.
. .
.
.
.
.
.
.
.
.
..
..
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
..
.
. .
..
..
.
..
..
..
..
..
..
..
.
.
..
.
.
.
.
.
....
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
...
.
..
.
.
.
...
.. ..
.
.
.
.
.
.
....
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
...
... .
.
.
.
.
.
..
..
.. .
.
.
.
.
.
..
.
.
..
.
.
.
.
.
..
.
.
.
..
.
.
.
.
.
.
.
..
..
.
..
..
...
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....
.
..
..
..
.....
. .
.. ..
.
.
.
.
.
.
.
.
.
.
.
.
. ...
. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
..
..
.
..
..
..
.
.
.
.
.
.
..
..
..
..
..
..
..
. .
..
..
..
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
.
.
.
.
..
. .
..
.
..
. .
..
..
..
.
..
..
. .
..
..
..
.
.
.
.
.
.
..
.
...
..
.
.
..
.
.
.
.
.
.
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
. .
..
.
.
.
.
.
..
.
.
.
.
...
..
..
..
..
..
.
.
.
.
.
.
..
..
..
..
..
..
..
.
.
.. ..
..
.. .
.
.
.
.
..
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
.
. ...
...
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
..
..
..
....
.
.
.
. .
.
.
.
..
..
.
..
..
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
.
.
..
..
. .
..
..
..
. ....
...
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
..
.
.
..
..
.
.
..
..
.
..
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
.. ..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
.
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... .
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
..
..
..
..
..
..
..
..
..
..
...
...
.
.
.
.
.
.
.
...
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
..
.
.
.
..
..
..
..
..
.. ..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
..
..
.
...
...
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
..
.
.
..
..
.
.
..
..
.
.
..
.
.
.
.
.
. ..
.
.
..
..
..
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
...
.
.
.
.
..
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.. .
...
...
...
.
.
.
.
.
.
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
..
.
.
..
..
..
..
.
..
.
.
.
..
..
....
..
....
. .
..
.
..
..
.
.
.
.
.
.
.
...
...
..
..
..
.
.
.
.
.
.
...
.. .. .
.
.
..
. .
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
. .
...
.
.
.
.
.
.
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
.
.
. .
.
.
.
.
.
.
..
..
..
..
. .
....
. .
...
.
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
...
.
.
.
.
.
..
.
.
..
.
..
. .
..
..
..
.
.
.
. .
.
. .
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
..
.
.
.
.
.
.
.
.
..
.
.
.
..
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
..
..
. .
.
...
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
. .
..
...
.. .
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
...
...
.
.
.
.
.
.
...
....
.
.
.
.
.
..
..
..
..
.
.
. .
.
..
..
.
.
.
.
.
.
.
.
..
..
.
..
..
...
..
...
..
. .
. .
..
..
..
.
.
.
.
.
.
.
..
..
.
..
..
.
.
.
.
..
.
. .
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
.. .
.
..
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
..
..
.
.
.
..
.
.
...
...
...
... ..
.
.
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
.
.. ..
.
.
.
.
.
.
.
..
..
..
..
. .
..
..
. .
..
.
... .
.
.
.
.
.
.
.
.
.
.
.
.
.
...
.. .
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
..
..
..
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
. ..
.
.
.
.
.
.
..
.
.
..
...
...
...
. ..
..
.
.
..
..
..
..
.
.
.
.
.
..
.
.
.
..
..
. .
..
..
.
.
..
.
.
.
.
.
.
..
..
..
.
..
..
.
..
..
...
..
..
.
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
..
.
.
.
.
.
.
.
.
.
.. ..
..
... .. ...
...
..
..
..
..
..
...
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
...
.. .
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
. .
...
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
..
.. ...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
..
.
.
.
..
.
.
..
.
.
..
.
.
.
....
.
..
..
...
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
....
..
.
.
.
.
.
..
.
.
.
..
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
. .
.
.
.
.
.
..
..
..
.
.
.
.
.
..
.
..
.
.
..
.
.
..
..
.
. ..
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
..
..
...
.
.
.
.
..
.
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
...
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
. ..
.
....
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
.
.
. .
.
.
.
.
.
.
..
..
..
..
..
...
.
.
.
.
. ..
.
.
..
.
..
..
.
.
.
.
.
.
...
. .
..
..
. .
..
.
. .
..
.
..
.
.
..
.
.
.
.
.
.
.
..
..
. .
..
..
..
..
..
.
..
. .
. .
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
.
..
.
...
..
..
..
.
.
.
.
.
. ..
.
.
..
..
.
.
...
.
.
.
..
..
.
.
..
..
..
.. ..
..
..
.
.
.
.
.
. ..
..
..
.
.
.
.
.
..
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....
.
..
..
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
.. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
..
.
.
. . ..
.
..
.
..
.
.
..
.
.
.
.
.
..
.
.
.
..
..
..
.
.
.
.
..
.
.
.. .
...
.
..
..
.
...
...
.
.
.
.
.
.
...
...
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
..
..
.
.
.
.
.
.
.
...
.. .
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
..
..
..
..
..
..
..
..
.
.
.
.
.
.
..
.
. .
.
.
.
.
.
.
.. ..
.
.
.
..
.
.
.
.
.
.
.
.
...
...
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
. . .
.
.
.
.
.
...
...
.
..
..
. .
.
.
.
.
.
.
.. ..
..
..
..
.
..
..
....
.
.
..
..
. .
..
..
..
..
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.... .
.
.
.
.
.
.
. .
.
.
.
.
.
..
.
.
...
..
..
.
..
..
..
..
..
..
.
.
.
.
.
.
..
.
.
..
..
.
.
...
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
. ..
.
.
..
..
..
.. ..
.
.
..
.
.
.
.
.
.
.
..
...
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
...
... .
.
.
.
..
..
..
....
..
.. .
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
...
.
.
.
.
.
..
.
...
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
..
..
..
.
.
..
..
. .
..
..
..
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
. ..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
..
..
..
. .
..
.
.
.
.
.
.
.
.
. .
.
.
..
.
.
..
.
.
.
.
.
.
..
. .
..
..
..
..
..
.
.
..
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
...
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
..
..
.
.
.
.
.
.
.
..
..
....
.
.
..
..
..
..
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
..
..
..
.
.
.
.
.
.
.
..
..
.
...
. .. ..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
. .
..
...
...
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.. ..
.
.
.
.
.
.
.
.
..
..
..
..
. ..
..
.
.
..
.
....
.
...
...
..
..
..
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
..
..
...
.
.
.
.
.
.
.
.
.
.
.
.
.
... .
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
...
.
.
.
..
.
..
..
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
....
..
.. ..
..
..
.
.
.
.
.
.
.
... .
.
...
..
.
..
..
..
.
.
.
.
.
..
.
.
.
.
.
..
..
..
..
. .
..
...
. ...
..
..
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....
..
...
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
. ...
.
..
. .
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. ..
..
..
..
..
..
.
..
..
.
..
..
..
..
.
.
..
.
.
.
.
.
..
.
.
.
.
.
..
.
.
..
.
.
.
.
.
..
.
.
.
.
.
..
.
..
. ..
.
.
..
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
. .
..
. .
..
.
.
.
.
.
. ..
. .
..
..
..
..
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
..
. .
.. .
.
.
.
.
.
..
.
.
. .
.
..
..
.
.
.
.
.
.
.
..
..
.. ..
. .
. .
.
..
..
.
..
.
.
..
.
.
.
.
.
..
.
.
.
.
. ..
.
.
..
.
..
..
.
..
.
.
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
..
..
.
.
..
..
.
.
..
.
.
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
..
.
....
.
.
..
..
.
.
.
.
.
.
.
..
. .
..
..
..
..
.
.
..
.. ..
.
...
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
..
.
..
.
.
.
.
.
..
..
..
.
.
.
.
.
.
...
... .
.
..
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
..
..
.
..
..
..
...
...
..
.
.
..
.
.
.
.
.
.
.
.. ..
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
. .
. .
..
..
..
.. ...
...
..
..
...
.
.
.
.
.
.
.
.
.
.
. ..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
. .
..
..
.
.
.
.
.
.
..
..
.. .
.
.
.
.
.
.
..
..
.
...
...
.
.
..
..
...
... .
..
..
.
.
..
. .
.
..
.
.
..
..
..
..
.
....
.
..
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
...
.
.
.. .
.
.
.
.
.
.
.
.
.
.
.
...
...
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
... .
.
...
...
.
.
.
.
.
.
..
.
.
..
.
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
..
.
.
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
. .
.
.
.
.
.
..
..
...
.
.
.
..
.
.
.
.
.
.
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
..
..
..
.
.
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
.
..
.
..
..
..
..
..
..
..
.
.
..
..
..
..
.
.
.
.
.
. .
..
..
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
.
..
. .
.. .
.
..
.
.
.
.
.
.
.
.
..
. .
..
..
. .
..
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
..
..
. . ..
.
.
..
.
..
. .
. ..
.
.. .
.
.
.
.
.
..
.
.
.
.
.
..
.
.
...
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
...
... .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
..
...
...
..
..
...
..
..
.
..
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.. ..
.
.
..
.
.
.
.
.
.
..
..
..
.
..
..
.
...
...
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
...
.
..
...
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
.
.
..
..
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
..
.
.
..
..
.
..
.
..
..
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
..
.
.
.
..
..
..
...
..
..
.
.
.
.
.
.
.
..
.
.
..
.
.. ..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
..
.
.
.
..
...
... .
.
.
.
.
..
.
.
.
.
.
..
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
. . .
.
. .
.
..
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.. .
.
..
..
..
.
..
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
. .
..
..
..
..
.
.
.
.
.
.
.
.
.
.
..
..
..
.
..
. .
.
..
.
.
..
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
...
...
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
. .
..
..
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
.
.
.
.
..
..
.. .
.
..
.
..
.
.
.
..
.
.
.
.
.
.
..
.
.
..
.
..
. .
.
.
.
..
.
.
.
.
.
.
.
.
..
..
..
..
.
.
..
..
. ..
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
..
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
. ...
.
.
.
.
.
..
.
.
.
.
.. .
.
..
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
...
...
.
..
..
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
...
..
..
.
.
.
.
.
.
.
.
.
.
.
. ...
...
..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
..
..
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
...
...
..
.
.
..
...
. ..
.
.
.
.
.
.
..
.
.
..
..
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
.
.. .
.
.
.
.
.
.
.
..
.
.
.
..
..
.
.
..
..
.
..
..
..
..
..
..
.
.. ..
.
.
.
.
.
.
..
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
...
...
..
.
.
...
.
.
.
.
.
..
.
.
..
..
.
.
..
..
..
..
..
..
..
. .
.
...
..
..
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
.
..
.
.
..
..
. .
.
.
.
.
.
..
.
.
...
.
.
.
.
.
..
.
...
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
.. ..
..
..
.
.
.
.
.
.
..
..
..
..
. .
..
...
...
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
. .
...
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
..
.
.
..
.
.
.
.
..
.
..
..
.
.
.
. .
.
. .
.
.
..
.
.
.
.
.
.
.
..
..
..
..
..
. . .
.
.
.
.
.
..
.
.
..
...
. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.. ..
.
.
.. ..
..
..
..
.
.
..
...
. ..
..
..
..
..
.
.
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
..
. .
..
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
..
..
.. .
.
.
.
.
.
..
.
.
..
..
.
.
..
...
...
.
.
.
.
.
..
.
.
.
.
.
..
.
.
..
.
.
..
.
.
...
.. .
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
. .
.
.
.
.
.
.
...
. ..
..
..
. .
.
.
.
.
.
.
..
..
..
.
.
..
.
.
..
.
.
..
.
.
. ..
.
.
.
.
.
.
.
..
..
...
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
. .
.
.
..
. .
..
..
.
.
..
..
..
..
.
.. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
... .
.
.
.
.
.
.
.
.
.
..
.
. .
.
.
.
.
.
.
.. .
.
.
..
..
.. .
.
.
.
..
. .
.
.
..
..
..
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
...
..
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
...
..
.
.
.
.
.
.
.
..
. .
..
..
..
...
..
..
.
..
.
.
..
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
..
.
.
..
..
..
....
..
. .
..
..
..
.
.
..
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. ..
..
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
..
..
.
.
.
.
.
..
.
..
..
.
.
.
.
..
..
.
.
..
..
.
.
..
..
.
.
..
..
..
..
..
.
.
..
..
.
.
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
.
..
. .
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
..
...
...
.
.
.
.
.
.
..
..
.. ..
. .
..
.
.
.
.
.
.
..
..
..
.
..
..
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
. .
..
..
.
.
..
..
.
.
..
..
..
..
...
...
.
.
.
.
.
. .
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
.
.
..
..
..
..
...
...
.
. .
..
. .
..
..
.
.
.
.
.
.
.
.
....
.
...
...
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
...
...
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
..
..
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.. ..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
..
.
.
..
.
.
..
.
.
.
. .
..
.
..
.
.
..
.
.
.
.
.
.
..
..
. .
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.. ..
..
.
.
.
.
.
..
.
.
..
.
..
. .
.
.
..
..
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
..
. .
..
.
.
.
.
.
.
..
.
...
..
.
.
..
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
. ..
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
...
...
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
..
. .
..
..
.
.
..
..
..
.. ..
.
.
..
.
.
.
.
.
.
..
.
.
..
...
...
..
..
..
..
. .
..
.
....
. .
.
.
.
.
.
..
..
..
..
. .
..
.
.
.
.
.
.
..
.
..
.
..
.
.
.. .
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
.
.
. .
.
.
.
.
..
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
. .
.
. .
..
.
..
..
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
..
.
.
..
..
.
.
. .
..
..
..
.
.
.
.
.
.
..
..
..
.
..
..
. .
.
.
.
.
.
..
..
.
.
..
..
..
.
.
.
.
.
.
..
.
.
.. .
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
. ..
. .
.. .
.
..
.
.
..
. .
.. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
..
. .
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
.. ...
..
..
.
.
.
.
.
..
.
.
..
..
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
...
.
.
.
.
.
.
.
..
..
..
..
..
..
..
.
.
.
.
.
.
.
. ..
.
.
.. ..
. .
.. ..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
..
. .
..
.
.
.
..
..
.
.
..
.
.
..
.
.
..
.
...
..
. .
. ...
..
..
...
...
..
..
..
..
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
. .
....
.
.
...
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
..
. .
...
.
..
..
.
..
..
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
..
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
. ..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.. ...
...
.
..
..
.
..
.
.
..
..
.
.
..
..
.
.
..
.
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
...
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
...
..
.. ..
.
.
..
..
..
..
..
.
.
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
. . .
....
.
.
.
..
.
.
.
.
.
. ..
..
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
. ..
.
.
..
.
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.. .
.
.
.
.
.
.
...
.
.
..
..
. .
.
..
..
.
.
.
.
.
.
.
..
..
..
..
.
.
..
.
.
..
.
. .
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
..
.. ..
..
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
...
. ..
..
..
..
.
.. ..
.
..
.
.
..
..
..
. .
.
....
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
..
.
.
.
.
.
.
. .
.
.
.
..
.
..
...
...
.. .
..
.
.
..
...
...
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
. ..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
..
..
.. .
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
..
..
.
.
..
..
..
..
..
.
.
.
.
..
.
.
..
..
..
..
..
.
.
.. .
..
..
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
. .
..
..
....
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.. .
.
.
..
.
.
..
.
.
.
.
.
.
..
. .
..
.
..
..
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
.
. .
..
.
..
.
.
..
..
..
..
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
.
.
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
...
.
.
.
..
.
.
.
.
.
. ..
.
.
. .
.
.
.
..
.
.
.
..
..
.
.
.
.
.
...
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
..
.
.
.
.
..
.
.
..
..
..
.
.
..
.
.
.
..
..
.
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
. .
.
.
..
..
..
..
.
.
.
.
.
.
.
.
..
.
..
.
..
.
.
..
.
...
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
....
.
.
..
..
.
.
..
..
..
..
..
.
.
.. ..
.
.
..
...
.. .
..
..
.. ..
. .
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
...
..
.
..
. .
.. .
.
.
.
.
.
.
.
.
.
.
.
..
..
..
...
...
..
.
.
..
.
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
..
.
.
..
..
..
..
..
..
..
.
.
.
.
.
.
..
.
.
..
..
.
.
...
.
.
.
.. .
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
. .
..
..
.
.
..
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
. .
..
.
.
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.. .
..
.
..
. .
..
..
..
..
..
.
.
.. .
..
..
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
.. .
.
..
..
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
....
.
..
.
.
..
.
..
..
.
...
...
.
.
.
.
.
.
..
.
.
..
...
..
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
..
..
..
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
.
..
.
.
..
...
...
..
.
.
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
.
.
..
..
..
..
.
.
..
.
.
.
.
.
...
..
..
.
..
...
..
.
..
.
...
...
.
..
..
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
.
.
..
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
..
.
.
..
..
..
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
..
.
.
..
..
..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
.
...
..
.
.
..
...
...
.
.
.
.
.
.
.
..
. .
.
.
...
.
.
..
.
.
..
.
.
.
.
..
.
.
.
.
.
.
.
.. .
.
.
.
.
.
.
..
.
.
.
.
.
..
.
..
..
..
..
..
.. .
.. .
..
..
.. ..
.
.
..
.
.
.
.
.
.
..
..
..
..
. ..
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
..
..
...
.
.
.
.
.
..
..
..
..
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
.
...
..
.
..
.
.
..
.
.
.
.
.
.
.
....
.
..
.
.
..
.
.. ..
. ..
.
.
.
.
..
.
..
.
..
.
.
..
.
.
..
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
....
.
..
..
..
.
.
.
.
.
.
..
..
..
.
..
. .
..
.. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
..
.
.
..
..
...
...
.
.
.
.
..
..
..
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
..
.
..
..
.
..
..
..
.
.
..
.
.
..
.
.
..
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
...
..
..
..
..
..
..
..
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.
.
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
. ..
.
.
..
.
.
.
.
.
.
..
.
.
..
..
..
..
..
..
..
.
..
..
.
. .
..
..
..
..
..
..
.
.
.. .
.
.
.
..
..
. .
..
.
.
.
.
.
.
..
..
..
.
.
. .
..
.
..
..
.
..
..
..
.
.
.
.
.
.
.
..
..
. .
.
.
.
.
.
..
..
..
.
..
..
.
.
.
.
.
.
.
.
.
.
.
.
. ..
.
.
. .
.
.
.
.
.
.
..
..
..
.
.
.
.
..
.
..
..
.
..
.
.
..
..
..
..
..
. .
..
.
..
..
.
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
..
.
..
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
...
..
..
.. ..
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
.. ..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
..
..
..
.
.
..
.
..
..
.
.
..
...
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
..
..
..
..
..
..
.
.
..
.
..
..
.
..
.
.
.
.
..
.
.
..
..
.
.
. .
.
.
.
.
.
..
.
.
.
.
. ..
.
.
..
..
..
..
.
..
..
.
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
...
.
...
..
.
.
..
..
..
..
..
..
...
..
..
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
...
...
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
... . ..
..
.
.
..
.
....
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
..
.
.
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
...
.
.
..
..
..
.
.. .
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
..
..
.
...
..
..
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
. ..
.
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
.. ..
.
.
.. .
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...
..
..
.
.
.
.
.
.
.
..
..
.
..
.
.
..
.
.
.
.
.
.
...
...
.
.
..
.
.
.
.. .
.
.
.
.
.
.
.
.
...
. ..
.
.
..
.
.
.
.
.
.
.
..
.
.
.
.
.
..
..
..
..
.
.
..
...
...
.
.
..
.
.
..
. .
..
..
.
.
..
.
...
.
.
..
..
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
....
.
..
..
...
.
.
.
.
.
..
.
.
..
..
.
.
..
..
..
.. ..
. .
..
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
...
.
..
.
.
..
.
.
..
.
.
.
.
.
...
.
.
..
..
..
..
..
..
..
.
.
.
.
..
..
.
.
..
.
.
.
.
.
..
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
..
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
. .
..
..
.
.
.
..
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
... .
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
... .
.
.
.
.
.
.
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
..
.
.
..
..
..
.. ..
.
.
..
.
.
. .
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
..
.
.
..
.
.
.
.
.
.
..
.
.
. .
.
.
.
.
.
. .
.
.
.
.
.
.
.
.
.
.
.
..
. .
..
.
.
.
.
.
.
.
..
..
.
.
.
.
.
.
.
...
...
..
.
.
..
.
.
.
.
.
. .
.
.
.
.
.
.
.
..
.
.
.
.
..
.
.
..
..
..
.
.
.
.
.
.
..
. .
..
..
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
..
..
..
..
..
.
.
..
.
.
..
..
..
. .
..
.
.
.
.
.
.
..
..
..
...
...
.
..
.
.
.
.
..
..
.
..
.
..
.
..
..
..
..
..
..
.
.
.
.
.
.
..
.
.
..
..
..
..
.
.
.
.
.
.
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
.
.
..
.
.
.
.
Figure 7.10: The plot in the first two principal components on the fuzzy cluster data. Note the structures in the data distribution. The dark datapoints are the correct jet-jet pairs, and the lighter datapoints are the incorrect jet-jet pairs.
deformed into a circle. This would render the Z jet-pair distribution spherical, making the FCM more efficient. But this would be a non-general solution, appropriate only for this particular data distribution.
7.3.2 Demographic clustering on the combinatorial problem
In the last section, we examined the use of a classifier based on the Euclidean distance that is capable of enhancing the sample but incapable of efficiently classifying it. Here, we apply the demographic clustering algorithm [37] to the combinatorial problem. In contrast to the FCM algorithm, demographic clustering measures the proximity of datapoints through a voting mechanism.
The voting mechanism introduces the non-Euclidean element into this algorithm. Each pair of datapoints receives one vote per variable axis. If the separation along an axis is large (see the distance parameter below), the pair receives a vote against; if not, the pair receives a vote for. This is repeated for each axis in the problem (six in the present case). The six votes are then tallied: if the "for" votes outnumber the "against" votes, the pair belongs to the same cluster; otherwise, the pair belongs to different clusters. For a detailed description of this procedure, see Appendix C.
Thus the voting mechanism makes the demographic clustering method non-Euclidean
as well as non-linear.
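The per-axis voting rule can be sketched as follows. This is a minimal illustration of the tallying described above, not the full algorithm of [37]: the class and method names, and the fixed per-axis thresholds, are assumptions of the sketch (the actual distance parameters are optimized internally, as discussed below).

```java
// Sketch of the pairwise voting rule used in demographic clustering.
// For each variable axis, the pair gets a vote "for" if the separation
// along that axis is within the distance parameter d for that axis,
// and a vote "against" otherwise. More "for" than "against" votes
// places the pair in the same cluster.
public class DemographicVote {

    /** Decide whether two datapoints belong to the same cluster. */
    public static boolean sameCluster(double[] a, double[] b, double[] d) {
        int votesFor = 0, votesAgainst = 0;
        for (int axis = 0; axis < a.length; axis++) {
            if (Math.abs(a[axis] - b[axis]) <= d[axis]) {
                votesFor++;          // small separation: vote for
            } else {
                votesAgainst++;      // large separation: vote against
            }
        }
        return votesFor > votesAgainst;
    }

    public static void main(String[] args) {
        double[] d = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0};  // one threshold per axis
        double[] p = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
        double[] q = {0.5, 0.5, 0.5, 2.0, 2.0, 0.5};  // close on 4 of 6 axes
        System.out.println(sameCluster(p, q, d));     // true
    }
}
```

With six axes, as in the present problem, a pair that is close on at least four axes is placed in the same cluster.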
The separation between two datapoints is measured according to a distance parameter, d. This parameter is defined for each axis and is optimized by an internal algorithm by default, but it can also be specified externally.
The dataset is the same one used in the last section. The data are not standardized for this section: because the algorithm is based on voting counts rather than on a metric, standardizing the data affords no advantage.

Figure 7.11: Demographic Clustering: The purity as a function of the distance parameter on the θ12 variable.
θ12 parameter
The use of the demographic clustering algorithm is automatic: the algorithm internally optimizes all its parameters. Here we examine the optimization of the distance parameter d that the algorithm uses to assign voting scores along the different variable axes. An important variable in the classification of correct and incorrect jet pairs is the jet-jet opening angle θ12 (Section 2.3.1).
Figure 7.11 displays the result of varying the distance parameter dθ for θ12. Purity
is used to optimize the value of dθ because it is a measure of how well the cluster can
be identified with the signal. In this case we observe that the maximum in the purity
coincides with the maximum in the F -measure. Table 7.5 displays the classification
ability of the two clustering methods.
                        efficiency   purity   F-measure
Fuzzy c-means               0.99      0.39      0.56
Demographic Clustering      0.97      0.51      0.67
Table 7.5: Comparison between fuzzy c-means and demographic clustering.
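The F-measure in Table 7.5 is the harmonic mean of efficiency and purity, F = 2ep/(e + p). A quick check, with the class name invented for the illustration, reproduces the tabulated values after rounding to two decimals:

```java
// Harmonic mean of efficiency and purity, as used in Table 7.5.
public class FMeasure {

    public static double f(double efficiency, double purity) {
        return 2.0 * efficiency * purity / (efficiency + purity);
    }

    public static void main(String[] args) {
        System.out.printf("fuzzy c-means:          %.2f%n", f(0.99, 0.39)); // 0.56
        System.out.printf("demographic clustering: %.2f%n", f(0.97, 0.51)); // 0.67
    }
}
```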
7.3.3 Comparing clustering results
Figure 7.12 compares the partitions obtained from demographic clustering, from the θ12-cuts of Section 2.3.1, and from fuzzy clustering. As unsupervised classifiers, demographic clustering performs better than fuzzy clustering: its graph lies farther to the right and thus has a higher F-measure. Note, however, that the previous supervised method using an ensemble of neural networks performs much better than any of the unsupervised methods.
7.4 Conclusion
In this chapter, we have examined the use of Principal Component Analysis as a means to visualize data and conduct initial data exploration. PCA is helpful in high-dimensional problems because it selects the most significant coordinates in which to view the data. The signal is localized in the space defined by the first two principal components, which we exploited to identify signal using a density cutoff. The use of PCA at the quark level has shown that there are non-linear structures in the data distribution.
For automatic splitting of the data, fuzzy c-means clustering was used. The clustering method was able to enhance the signal in the dataset, but was unable to classify
119
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Purity
0.5
0.6
0.7
0.8
0.9
1
Eff
icie
ncy
θ-cutsDemographic ClusteringNN EnsembleFuzzy Cluster
Comparision Efficiency-Purity
Figure 7.12: Demographic Clustering: The efficiency-purity graph obtained by vary-ing the distance parameter on the θ12 variable. Comparison is made with fuzzyclustering, θ12-cut and an ensemble of neural networks.
it (that is, to produce a cluster with purity higher than 50%). This was attributed to the non-linear data distribution and the Euclidean distance measure used in the fuzzy clustering method. To overcome this shortfall, demographic clustering, based on a non-Euclidean distance parameter and a voting mechanism, was used, with improved results.
Chapter 8
Conclusion and Future work
8.1 The goal
The goal of this work is to examine multivariate methods in the study of high-energy physics data. High-energy physics data are manifestly multivariate, and we have analyzed them from a data mining point of view. Data mining techniques are well developed, but their application to specific problems is not trivial. With the LHC close to completion, and with research and development pushing the linear collider toward certainty as well, we are on the verge of a deluge of experimental data that could alter our understanding of the basic nature of our physical world. Data mining techniques, as we have demonstrated in this thesis, can provide tools for analyzing these data, with potentially useful insight into the physics.
We have chosen the context of a future linear collider to examine the use of data mining, since it is the context to which this approach is best suited. We restricted ourselves to kinematic quantities, since they are general and different phenomena are likely to have different kinematic signatures.
8.2 The Results
8.2.1 FastCal
Since the work was designed to meet the challenges of a future accelerator, whose experimental parameters are still under development, the study had to adjust to a changing environment. To address the dearth of data, a fast Monte Carlo simulator for the detector calorimeter, FastCal, was designed to provide voluminous data quickly. The central features of this Monte Carlo simulator are the parameterization of the hadronic shower and the averaging of calorimeter material properties. The simulator produces data about four orders of magnitude faster than the full simulation, yielding about 50 times the statistical power.
FastCal yields cluster-level information, with each particle producing at most one cluster in each of the two calorimeters (ECAL and HCAL). This level of granularity in the data is sufficient for jet-based studies, which do not require calorimeter cell-level information but do require very high statistics.
8.2.2 Neural Network
As classifiers, supervised neural networks are in general superior to the other methods considered. This is attributable to the non-linear sigmoidal activation function of the neural units and to the ability of neural networks to form complex class boundaries in variable space. In contrast, binary trees, which are also trained with a supervised learning algorithm, are limited to rectangular decision boundaries in variable space. Though this makes them less efficient classifiers, their decision boundaries are akin to the variable cuts of traditional physics data analysis and are therefore somewhat easier to understand.
The neural network models are difficult to understand and interpret in physics
terms. To understand neural models better, binary trees were used on neural network
results to approximate the complex decision boundaries with rectangular ones.
Though neural networks work efficiently, their optimum performance depends on many non-trivial specifications, for which we seek general solutions. Of immediate concern is the network architecture: too small a network cannot model the problem adequately, and too large a network introduces model artifacts and loses generalizing ability. These issues were addressed with a regularization scheme of weight decay, pruning based on a simple rule to minimize the network size, and an earlystopping mechanism to prevent overfitting.
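The regularization scheme can be sketched as follows: weight decay shrinks every weight toward zero at each gradient step, and a simple pruning rule then removes weights whose magnitude remains below a threshold. The decay rate, threshold, and names below are illustrative assumptions, not the CJNN internals.

```java
// Sketch of weight decay plus threshold pruning for network weights.
public class WeightDecay {

    /** One gradient step with weight decay: w <- w - eta * (grad + lambda * w). */
    public static double[] step(double[] w, double[] grad, double eta, double lambda) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            out[i] = w[i] - eta * (grad[i] + lambda * w[i]);
        }
        return out;
    }

    /** Simple pruning rule: zero out weights below a magnitude threshold. */
    public static double[] prune(double[] w, double threshold) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            out[i] = Math.abs(w[i]) < threshold ? 0.0 : w[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] w = {0.8, 0.01, -0.5};
        double[] g = {0.0, 0.0, 0.0};   // no error gradient: pure decay
        w = step(w, g, 0.1, 0.5);       // each weight shrinks by 5%
        w = prune(w, 0.05);             // the tiny weight is removed
        System.out.println(java.util.Arrays.toString(w));
    }
}
```

Decay keeps the surviving weights small, and pruning then discards connections that contribute little, reducing the network size.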
The size of the neural network further restricts the size of the training dataset on which it can optimally train. Too large or too small a training dataset results in poor performance.
An ensemble of neural networks was used to improve the classification performance. The individual networks are trained on separate training datasets with some overlap, and such ensembles perform better than individual networks. A crucial issue is the training of the individual networks: earlystopping decreases the efficiency of such ensembles, so it is important to continue training until a low error is obtained. It is also important to use a bagging technique when designing the datasets, so that some of the data are repeated across different networks and those features are reinforced.
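The bagging step can be sketched as drawing each network's training set by sampling with replacement from the full dataset, so that records overlap and recur across networks, with the ensemble output taken as the average of the member outputs. The sizes, seed, and names here are illustrative.

```java
import java.util.Random;

// Sketch of bagging: bootstrap training sets plus output averaging.
public class Bagging {

    /** Draw a bootstrap sample of indices (with replacement) for one network. */
    public static int[] bootstrapIndices(int datasetSize, int sampleSize, Random rng) {
        int[] idx = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            idx[i] = rng.nextInt(datasetSize);  // repeats allowed: overlap across sets
        }
        return idx;
    }

    /** Ensemble output: average of the individual network outputs. */
    public static double ensembleOutput(double[] memberOutputs) {
        double sum = 0.0;
        for (double o : memberOutputs) sum += o;
        return sum / memberOutputs.length;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Three networks, each trained on its own bootstrap sample of 1000 records
        for (int net = 0; net < 3; net++) {
            int[] sample = bootstrapIndices(1000, 1000, rng);
            System.out.println("network " + net + ": first index " + sample[0]);
        }
        // Averaging the member outputs gives the ensemble classification score
        System.out.println(ensembleOutput(new double[]{0.9, 0.8, 0.7}));
    }
}
```

Because each bootstrap sample repeats some records while omitting others, the member networks see overlapping but distinct views of the data, which is what makes the averaged output more robust than any single network.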
CJNN
For the study of neural networks, a neural network package was designed and implemented. The package provides a general, configurable neural network that can be used for problems beyond those addressed in this thesis. It was specifically designed for the JAS environment, so that neural networks can be trained and used within JAS, and a trained network can form part of a larger analysis package. The package is sufficiently general for neural network solutions to problems beyond the jet-combinatorial problem addressed here.
8.2.3 Exploratory Data Mining
Data mining approaches depend critically on the data distributions. For problems with more than three dimensions, the data distributions are hard to visualize. Data exploration was done using Principal Component Analysis, a linear method that rotates the axes and orders them by variance content. The axes with the highest variance were used to visualize the data distribution. When performing PCA, it is important to center the data and scale it to unit variance along each original axis.
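The centering and scaling step can be sketched as: for each original axis (column), subtract the mean and divide by the standard deviation, so that every axis has zero mean and unit variance before the PCA rotation. The class name is for the sketch only.

```java
// Sketch of the standardization step applied before PCA.
public class Standardize {

    /** Center each column to zero mean and scale it to unit variance. */
    public static double[][] standardize(double[][] data) {
        int n = data.length, p = data[0].length;
        double[][] out = new double[n][p];
        for (int j = 0; j < p; j++) {
            double mean = 0.0;
            for (double[] row : data) mean += row[j];
            mean /= n;
            double var = 0.0;
            for (double[] row : data) var += (row[j] - mean) * (row[j] - mean);
            double sd = Math.sqrt(var / n);
            for (int i = 0; i < n; i++) {
                out[i][j] = (data[i][j] - mean) / sd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two axes with very different scales end up comparable after scaling
        double[][] data = {{1.0, 10.0}, {2.0, 20.0}, {3.0, 30.0}};
        System.out.println(java.util.Arrays.deepToString(standardize(data)));
    }
}
```

Without this step, an axis with a large numerical range would dominate the variance and hence the leading principal components, regardless of its physical relevance.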
The data distributions in the principal components generally provide clues on how the analysis can be done, with interesting features appearing as regions of high density. As an example, a cutoff in the first two principal components was applied to distinguish correct jet pairs from incorrect jet pairs. The performance of such a classifier approaches that of the neural networks.
Of particular significance are the shapes of the data distributions in principal-component space. The kinematic limits on the wrong combinations result in distributions that are curved and widely dispersed. These non-linear structures are not adequately addressed by the linear transformations of the principal components.
8.2.4 Clustering
Visualization of data distributions is generally limited to three dimensions. Though PCA orders the axes according to variance, and thus importance, visualization does not aid the detection of structures in higher dimensions. For problems in higher dimensions, other multivariate methods are required; cluster analysis methods are one class that can be applied. A wide variety of such methods is available, and their application is non-trivial.
Fuzzy c-means (FCM) clustering was applied to the combinatorial problem. The fuzzy membership measure of a datapoint with respect to a cluster can be interpreted as a probability of membership, akin to the Bayesian output of neural networks. For c = 2, the FCM isolated a high-purity (but low-efficiency) cluster of wrong-combination jets, and was thus able to enhance the signal (the correct jet pairs) in the other cluster. The FCM, based on the Euclidean distance, was unable to identify the correct jet pairs adequately.
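The fuzzy membership just mentioned follows the standard FCM formula: for a datapoint x, the membership in cluster i is u_i = 1 / Σ_k (‖x − c_i‖ / ‖x − c_k‖)^(2/(m−1)), which lies in [0, 1] and sums to one over the clusters — the property that permits the probability interpretation. The fuzzifier m = 2 and the cluster centers below are illustrative.

```java
// Sketch of the fuzzy c-means membership computation for one datapoint.
// Assumes x does not coincide exactly with a cluster center.
public class FuzzyMembership {

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** FCM membership of x in each cluster, for fuzzifier m > 1. */
    public static double[] memberships(double[] x, double[][] centers, double m) {
        int c = centers.length;
        double[] u = new double[c];
        for (int i = 0; i < c; i++) {
            double di = dist(x, centers[i]);
            double sum = 0.0;
            for (int k = 0; k < c; k++) {
                sum += Math.pow(di / dist(x, centers[k]), 2.0 / (m - 1.0));
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {4.0, 0.0}};      // c = 2, as in the text
        double[] u = memberships(new double[]{1.0, 0.0}, centers, 2.0);
        System.out.println(u[0] + " " + u[1]);              // memberships sum to 1
    }
}
```

A point three times closer to one center than the other receives a membership of 0.9 in the nearer cluster, which is the graded, probability-like output referred to above.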
A possible correction for the non-spherical clusters is a non-standard norm for the distance metric. Though such norms can squeeze ellipsoidal distributions into spherical ones, making the data more amenable to Euclidean distance methods, they are non-general solutions. For instance, if all signal clusters have aligned ellipsoidal distributions (major axes parallel), a non-standard norm would improve detection; if the signal clusters are not aligned, then no single norm improves detection for all of them.
To improve the clustering ability, demographic clustering was used. This method, based on a voting mechanism, does not depend on the Euclidean distance, and on account of this property it performed better than FCM clustering: the correct jet-pair cluster had a purity higher than 50%, and the F-measure was 0.67.
Clustering methods, which are examples of unsupervised learning, are capable of distinguishing data distributions by their broad properties. This is in contrast to methods based on supervised learning, which can build models of the data that are finer in detail. As a result, cluster analyses do not yield results that match the performance of supervised learning algorithms such as neural networks.
8.3 Data mining and the future
The data mining process is iterative, analogous to the iterative processes that constitute the learning algorithms themselves, whether neural networks or cluster analysis. With each pass through the data, a better understanding of the problem is achieved, and the relevant domain knowledge is added to the next cycle of the data mining process, further improving the result.
The work in this thesis is a preliminary exploration toward a more systematic use
of data mining techniques in high-energy physics. One of the main results of this
thesis was the importance of the data distribution in kinematic variable space. The
Principal Component Analysis demonstrated that data distributions have structures.
For example, the data distribution of the wrong jet pairs in the three-dimensional principal component space is a curved hypersurface. Principal Component Analysis, which is a linear transformation of variables, cannot provide a transformation in which this hypersurface is flat rather than curved. Such a distribution would be better suited for separating the wrong jet pairs from the correct jet pairs.
To unfold curved hypersurfaces, non-linear transformations are required. Transformations like Isomap [84], which directly address the extraction of low-dimensional structures from high-dimensional data, could be used. Since Isomap uses geodesic distances, measured along curved surfaces, as opposed to the Euclidean distances used in PCA or MDS, the structures associated with the curved surfaces will be laid bare. With the curved surface well defined, signal distributions that occur away from these surfaces will stand out more easily.
Isomap outputs can be further supplemented with self-associative neural-network-based non-linear PCA (NPCA) [47]. Self-associative neural networks, trained on sufficient data, would yield a lower-dimensional, non-linearly transformed data sample.
8.4 Conclusion
The data mining approach to data analysis in high-energy physics offers an opportunity for experimentalists to address the issues of large datasets and the search for novel signals in the data. An iterative technique, this approach is data-centric and particularly suited to data analysis in an experimental realm, especially in the experimental environments of the Large Hadron Collider and the linear colliders of the future.
Appendix A
Using FastCal in JAS
A.1 Introduction
FastCal is a general, fast, parameterized Monte Carlo simulator for the linear collider detector, written in Java. It is designed to work in the JAS environment with the hep.lcd suite of packages, and to coexist and cooperate with other simulation packages. It works directly on the final-state particles in the particle tables of hep.lcd.event.LCDEvent objects. The preferred inputs are StdHep files generated by Pandora-Pythia.
The main class FastCal extends the class AbstractProcessor in the lcd package. The method process(LCDEvent event) processes every event and adds two ClusterList objects, containing the clusters in the ECAL and HCAL, to the LCDEvent data.
FastCal is available as a standalone Java archive (.jar) file, and also as a part of
the hep.lcd package under hep.lcd.mc.fast.cluster.saurav.
A.2 Use of the FastCal package
The package can be used standalone. In that case, the .jar file should be in the extensions folder in JAS version 3 (or in the classpath in JAS version 2). The package classes are imported with the following code:
import fastcal.*;
If FastCal is used from the hep.lcd package then no special installation is required,
and the classes can be imported via the following code:
import hep.lcd.mc.fast.cluster.saurav.*;
The FastCal simulation is added to the analysis code by specifying a detector and
adding a FastCal object as follows:
Detector.setCurrentDetector(new Detector("ldmar01"));
add(new FastCal(true));
add(new UserAnalysis());
The first line sets the detector type for use (in this case, detector ldmar01), and
the second line includes a FastCal object. The constructor has an input parameter of
type boolean. This parameter, when set to true, flips the direction of the magnetic
field in the detector. If it is set to false, the magnetic field is not flipped. This
parameter can be used to compare the FastCal results with those from GISMO (set it
to true). The third line adds the user analysis object UserAnalysis which contains
the user’s analysis code.
In the process() method in user’s analysis code, one can retrieve the two ClusterList
objects in the following way:
ClusterList hadCL = (ClusterList) event.get("HADFCalClusterList");
ClusterList emCL = (ClusterList) event.get("EMFCalClusterList");
These objects can now be used like any object with a ClusterList interface. The Cluster objects can be used in conjunction with other objects in the event data, such as tracks, or on their own.
Appendix B
CJNN – Neural Network GUI
package for JAS
CJNN is a neural network package that is designed to work in the JAS environment,
in both the JAS2 and JAS3 versions. The goals for CJNN were:
• A user-friendly graphical user interface for network training.
• An ability to use the trained network from within JAS in analysis codes written
in Java.
• An ability to include a trained network in a Java-archived analysis package.
CJNN can also be used as a standalone neural network program.
B.1 A brief description of CJNN
CJNN is a fully connected, multi-layered neural network package in Java. The number of layers and the number of neural units in each layer can be varied according to user needs. The neural units are based on the sigmoidal (logistic) function, with a bias term. Learning is via standard backpropagation on a standard error term, with a momentum term to speed it up, and the number of learning iterations is limited via earlystopping [78].
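The momentum term mentioned above adds a fraction α of the previous weight update to the plain gradient step, Δw(t) = −η ∂E/∂w + α Δw(t−1). A minimal sketch, with η and α values chosen only for illustration:

```java
// Sketch of a backpropagation weight update with a momentum term.
public class MomentumUpdate {

    /**
     * One weight update with momentum. Returns the new delta, which the
     * caller carries to the next iteration; the weight moves by that delta.
     */
    public static double delta(double grad, double prevDelta, double eta, double alpha) {
        return -eta * grad + alpha * prevDelta;
    }

    public static void main(String[] args) {
        double w = 0.5, prev = 0.0;
        double eta = 0.1, alpha = 0.9;
        // Repeated steps along a constant gradient accelerate: the momentum
        // term accumulates the contribution of the previous updates.
        for (int t = 0; t < 3; t++) {
            prev = delta(1.0, prev, eta, alpha);
            w += prev;
            System.out.println("step " + t + ": delta = " + prev);
        }
    }
}
```

Along a consistently downhill direction the step size grows geometrically toward −η/(1 − α) times the gradient, which is the speed-up the momentum term provides.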
B.2 Use of CJNN
The CJNN package is available as a .jar file that should be saved in the JAS3 extensions directory. The CJNN plugin adds a menu item to the toolbar in JAS; selecting this menu item launches the CJNN graphical user interface.
B.3 Data
CJNN requires two independent datasets—one for training and the other for validation. Training data are read from flat files, with data arranged in columns. For a training dataset that consists of p input variables, r output variables, and n records, the data are read from files with p + r columns and n rows: the first p columns are the input variables and the rest are the output variables. The validation data are read from flat files in the same format.
The training interface is graphics-based, and can be run from within the JAS
environment. Figures below illustrate the GUI.
B.4 Training interface
Figure B.1 shows the CJNN SetUp Network panel that lets the user choose the
working directory and the network architecture. The network architecture is encoded
in the form i-h1-h2-o, where i is the number of inputs, h1 and h2 are the number of
hidden units in the two hidden layers and o is the number of output units. There
Figure B.1: The network setup panel. The working directory sets up the folder in which the data files exist and to which all CJNN-created files are copied. Network configuration configures the basic network architecture. The neural network can alternatively be read from an already existing configuration file. Further, the configuration of a network can be written to a file for later use.
can be more hidden layers, and all units will be fully connected with all the units in
adjoining layers.
Figure B.2 shows the CJNN Training Params panel, which lets the user set η, the learning rate, and α, the momentum rate; default values are provided. The training and validation datasets are flat files selected from the working directory. Rescaling the training data is optional; it standardizes the input variables to zero mean and unit variance.
Figure B.3 displays the Train panel. It gives the user the option to initialize the
network. The network is initialized by assigning uniformly distributed random values
to the weights. The assigned values are between the specified limits. For efficient
network training, the random numbers should be small.
Figure B.2: The second panel in the GUI sets up the training parameters. eta is the
learning coefficient and alpha is the momentum coefficient. Rescaling the data
transforms variables such that they have zero mean and unit variance along columns.
The training parameters can be written to and read from a file.
Figure B.3: The third panel, under Weights Initialization, sets the weights in the
network. This can be done either from a file, or the weights can be randomized.
Once the weights are set, the iterative training can begin.
The network can also be initialized from a saved network. This option can be
used to retrain partially trained and previously saved networks (weights.last). The
network can be trained for a fixed number of cycles (epochs).
B.5 Error Display
CJNN uses the JAIDA graphics packages, and during training a real-time graph of
the training and validation errors is displayed. Early stopping is implemented by
saving the network configuration with the minimum error on the validation dataset
in a file weights.min.
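The early-stopping bookkeeping can be sketched as follows: track the validation error per epoch and remember the epoch at which it is smallest, which is the point where CJNN writes weights.min. The class and method names are illustrative only.

```java
// Minimal sketch of the early-stopping bookkeeping described
// above: keep the weights from the epoch with the lowest
// validation error. Names are illustrative, not CJNN's API.
public class EarlyStoppingSketch {

    public static int bestEpoch(double[] validationError) {
        int best = 0;
        for (int epoch = 1; epoch < validationError.length; epoch++) {
            if (validationError[epoch] < validationError[best]) {
                best = epoch;
                // In CJNN, this is where weights.min would be written.
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Validation error typically falls, then rises again as the
        // network starts to overfit the training data.
        double[] valErr = {0.90, 0.55, 0.40, 0.38, 0.41, 0.47};
        System.out.println("minimum at epoch " + bestEpoch(valErr));
    }
}
```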
B.6 Application of CJNN in code
The use of a trained CJNN neural network is illustrated by the following example code:
import cjnn.backprop.*;

public class TestCJNN {
    NetworkBasic net;

    TestCJNN() {
        // Create the network
        net = new NetworkBasic();
        // Set the working folder
        net.setWorkingFolder("/home/saurav/cjnn-test");
        // Configure the network from the default file.
        // The file is called <network.net>, and should have been
        // saved at the time of training.
        net.configureNetwork();
        // Read in the weights from the file.
        net.readWeights("weights.min");
        // Read in the normalization parameters. The parameters
        // are read from a file called <data.standard>.
        net.readStandardFile();
    }

    public void analysisCode() {
        double[] in = {1.588442e-04, 0, 1, 1.588442e-04};
        double[] out = net.ffOutStandard(in);
        for (int i = 0; i < out.length; i++)
            System.out.println(out[i]);
    }

    public static void main(String[] args) {
        TestCJNN cjnnTest = new TestCJNN();
        cjnnTest.analysisCode();
    }
}
Appendix C
Demographic Clustering – an
example
The demographic clustering method [37] partitions data with an algorithm that
maximizes the similarity between records within a cluster while minimizing the
similarity between records in different clusters. The criterion it optimizes is
based on the Condorcet solution from 1785.
C.1 Condorcet’s Criterion
Marquis de Condorcet (1743–1794) was a French philosopher, mathematician and
political scientist who devised a way to choose a candidate in a preferential voting
system. It is based on a ranking of candidates, but the winner is chosen not by
counting the highest preferential votes, but by a pairwise preference count against all
other candidates. That is, if more voters prefer candidate A over B, then A is the
winner against B. This rule is used to sort the list of candidates and find the ultimate
winner.
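Condorcet's pairwise rule can be sketched in a few lines; the ballot representation (candidate indices listed in order of preference) is an assumption for illustration.

```java
// Sketch of Condorcet's pairwise rule: candidate A beats B if
// more ballots rank A ahead of B. Each ballot lists candidate
// indices in order of preference; the setup is illustrative.
public class CondorcetVote {

    // Count ballots that rank candidate a ahead of candidate b.
    public static int prefers(int[][] ballots, int a, int b) {
        int count = 0;
        for (int[] ballot : ballots) {
            for (int c : ballot) {
                if (c == a) { count++; break; }
                if (c == b) break;
            }
        }
        return count;
    }

    // A Condorcet winner beats every other candidate pairwise.
    public static int winner(int[][] ballots, int nCandidates) {
        for (int a = 0; a < nCandidates; a++) {
            boolean beatsAll = true;
            for (int b = 0; b < nCandidates; b++)
                if (a != b && prefers(ballots, a, b) <= prefers(ballots, b, a))
                    beatsAll = false;
            if (beatsAll) return a;
        }
        return -1; // no Condorcet winner (preferences form a cycle)
    }

    public static void main(String[] args) {
        // Three voters ranking candidates 0, 1 and 2.
        int[][] ballots = {{1, 0, 2}, {1, 2, 0}, {0, 1, 2}};
        System.out.println("winner: candidate " + winner(ballots, 3));
    }
}
```

Note that a Condorcet winner need not exist: pairwise preferences can form a cycle, in which case the sketch returns -1.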
Inspired by this rule, Michaud [60] proposed the New Condorcet Criterion. This
rule is based on a "vote" dij between two records i and j: for each variable in which
the two records differ by more than a distance threshold, the vote increases by 1.
Therefore, dij is the count of the number of variables in which the two records differ.
If there are p variables, then p − dij is the count of variables in which the two records
agree.
For any partition P (the division of the dataset into clusters), the goodness crite-
rion is measured as:
G(P) = \sum_{k=1}^{N} \sum_{i \in C_k} \Bigl[ \sum_{j \in C_k,\, j \neq i} (p - d_{ij}) + \sum_{j \notin C_k} d_{ij} \Bigr], \qquad (C.1)

where N is the number of clusters and C_k is the k-th cluster.
The demographic clustering method is an implementation of this criterion: it seeks
the particular partition that maximizes G(P).
C.2 Demographic Clustering example
The algorithm used in the Intelligent Miner application can be understood from the
following example.
Let us consider three records, with three fields each
1 5.6 4.5 6.5
2 5.0 6.2 6.3
3 8.9 6.0 9.3
For each field we define a distance parameter d, which we fix at 1.0 for all three
fields. We consider the records pairwise, comparing the values of each field between
the two records. If the values in a particular field fall within d of each other, the pair
receives a score of 1 for; if not, it receives a score of 1 against. Since there are three
records, we may write down the scores of the three pairs of records:
For Against
1,2 2 1
2,3 1 2
1,3 0 3
We consider all the possible combinations of clusters and score them. The scoring
is done in the following way. If two records are kept in the same cluster, then their
scores, for and against, are the same as given in the table above. If they are in
different clusters, the scores get reversed. For example, consider the combination
(1,2)(3); where records 1 and 2 belong to one cluster and record 3 belongs to a
different cluster.
(1,2)(3) For Against
1,2 2 1
2,3 2 1
1,3 3 0
Total 7 2
Therefore, this combination receives a total score of 7/9. We may calculate the
scores for all the other combinations ((1,3)(2), (1)(2)(3), etc.) and pick the
combination with the highest score. The combination given above has the highest
score.
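The scoring in the worked example above can be reproduced with a short sketch. The class and method names are illustrative; the records and the threshold d = 1.0 are those of the example.

```java
// Sketch of the scoring in the worked example: pairs in the same
// cluster keep their (for, against) counts, pairs in different
// clusters have them reversed.
public class DemographicScore {

    static final double D = 1.0;          // per-field distance threshold
    static final double[][] RECORDS = {
        {5.6, 4.5, 6.5}, {5.0, 6.2, 6.3}, {8.9, 6.0, 9.3}
    };

    // Number of fields in which two records agree to within D.
    public static int votesFor(double[] a, double[] b) {
        int f = 0;
        for (int k = 0; k < a.length; k++)
            if (Math.abs(a[k] - b[k]) <= D) f++;
        return f;
    }

    // Total "for" votes of a partition, out of 9 possible.
    // cluster[i] is the cluster label assigned to record i.
    public static int score(int[] cluster) {
        int total = 0;
        for (int i = 0; i < RECORDS.length; i++) {
            for (int j = i + 1; j < RECORDS.length; j++) {
                int f = votesFor(RECORDS[i], RECORDS[j]);
                int p = RECORDS[i].length;
                // Same cluster: count agreements; different: disagreements.
                total += (cluster[i] == cluster[j]) ? f : p - f;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("(1,2)(3):  " + score(new int[]{0, 0, 1}) + "/9");
        System.out.println("(1,3)(2):  " + score(new int[]{0, 1, 0}) + "/9");
        System.out.println("(1)(2)(3): " + score(new int[]{0, 1, 2}) + "/9");
    }
}
```

Running this reproduces the 7/9 score for (1,2)(3), which beats the other partitions.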
If we add a fourth record, a similar table to the first one can be made; we will
thus have six pairs. With the first three records already partitioned as (1,2)(3),
there are three combinations to choose from for the fourth record: (1,2,4)(3),
(1,2)(3,4) and (1,2)(3)(4). As we go down the dataset and test each subsequent
record in this way, there is always a possibility that a new cluster is created.
Appendix D
Binary Trees
The binary tree is a supervised data mining algorithm. Given a training dataset,
the algorithm trains by recursively dividing the data into two parts until each part
consists of members of the same class. Because of this binary, recursive splitting
of the data, methods that employ this approach are called binary trees.
The particular implementation of the binary tree used in this thesis is called
SPRINT, which is a part of the tool set in Intelligent Miner from IBM.
D.1 SPRINT
The SPRINT algorithm is optimized for speed, large datasets and categorical as well
as continuous data. The exact implementation for this will not be discussed here (for
details, see [79]).
At each node of the binary tree, the algorithm seeks to divide the data into two
parts that separate the classes more cleanly. That is, if there are two classes on which
the algorithm is training, it seeks to divide the data so that each part contains data
from only one of the classes. It does this by imposing a cut on one of the variables
in the data.
Therefore, at each node the algorithm has to make two choices: first, which variable
to cut on; and second, where to place the cut on this variable.
D.1.1 Split Points
The split points are chosen with the help of the gini index. If S denotes the entire
dataset at the given node and there are n classes, then the gini index is given by
g(S) = 1 - \sum_{j=1}^{n} p_j^2, \qquad (D.1)

where p_j is the relative frequency of class j in S.
All possible splits are considered, and for each split the split gini index g_s(S) is
evaluated:

g_s(S) = \frac{n_1}{n}\, g(S_1) + \frac{n_2}{n}\, g(S_2), \qquad (D.2)

where S_1 and S_2 are the two parts, containing n_1 and n_2 records respectively,
out of n records in total. The split with the lowest g_s(S) is chosen.
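Equations D.1 and D.2 can be illustrated with a small sketch on a one-dimensional toy dataset; this is not the SPRINT implementation, only the gini bookkeeping it uses.

```java
// Sketch of the split-point search described above: for each
// candidate cut on a single continuous variable, compute the
// weighted gini index of the two parts and keep the lowest.
public class GiniSplit {

    // g(S) = 1 - sum_j p_j^2 over the class frequencies in labels.
    public static double gini(int[] labels, int nClasses) {
        if (labels.length == 0) return 0.0;
        int[] counts = new int[nClasses];
        for (int l : labels) counts[l]++;
        double g = 1.0;
        for (int c : counts) {
            double p = (double) c / labels.length;
            g -= p * p;
        }
        return g;
    }

    // Split gini g_s(S) = (n1/n) g(S1) + (n2/n) g(S2), with
    // values <= cut going into S1 and the rest into S2.
    public static double splitGini(double[] x, int[] y, double cut) {
        java.util.List<Integer> left = new java.util.ArrayList<>();
        java.util.List<Integer> right = new java.util.ArrayList<>();
        for (int i = 0; i < x.length; i++)
            (x[i] <= cut ? left : right).add(y[i]);
        int[] y1 = left.stream().mapToInt(Integer::intValue).toArray();
        int[] y2 = right.stream().mapToInt(Integer::intValue).toArray();
        double n = x.length;
        return (y1.length / n) * gini(y1, 2) + (y2.length / n) * gini(y2, 2);
    }

    public static void main(String[] args) {
        // A perfectly separable toy dataset: class 0 below 2.5.
        double[] x = {1.0, 2.0, 3.0, 4.0};
        int[] y = {0, 0, 1, 1};
        System.out.println("cut at 2.5: g_s = " + splitGini(x, y, 2.5));
        System.out.println("cut at 1.5: g_s = " + splitGini(x, y, 1.5));
    }
}
```

A cut that separates the classes perfectly drives g_s(S) to zero, which is why the lowest split gini is chosen.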
Appendix E
Fuzzy Clustering
The fuzzy clustering method is a k-means method of clustering: "k-means" denotes
the requirement that the number of clusters into which the data will be partitioned
must be given as an input. It is called fuzzy because the members are not given a
discrete (0,1) membership in a cluster but a probability of membership. The
membership probability is based on the inverse of the Euclidean distance from the
cluster centroids. This method is also called fuzzy c-means (FCM) [24, 14].
E.1 Algorithm
The algorithm is initiated with a randomized selection of cluster centroids. The
membership probability pij is calculated for each record by
p_{ij} = \frac{d_{ij}^{-1}}{\sum_k d_{kj}^{-1}}, \qquad (E.1)

where p_{ij} is the probability that record j belongs to the i-th cluster. The summation
in the denominator over all clusters normalizes the probability. The distance d_{ij} is
the Euclidean distance between the record data point and the cluster centroid.
The new centroid positions are found by the weighted average of the data points
\vec{c}_i^{\,\mathrm{new}} = \frac{\sum_j p_{ij}\, \vec{r}_j}{N}, \qquad (E.2)

where \vec{c}_i^{\,\mathrm{new}} is the new position vector of the i-th cluster centroid and \vec{r}_j is the
position vector of the j-th record; N is the total number of records in the dataset.
This repositioning of the cluster centroids is done iteratively until the change in
the position of every cluster falls below a tolerance level.
This algorithm implicitly minimizes the following objective function:

J = \sum_i \sum_j p_{ij}\, d_{ij}. \qquad (E.3)
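One update step of Eqs. E.1 and E.2 can be sketched in one dimension. The class is illustrative, and a small epsilon is added to the distances to guard against division by zero when a record sits exactly on a centroid.

```java
// One pass of the fuzzy c-means update described above:
// membership probabilities from inverse distances (Eq. E.1) and
// centroid repositioning by the weighted average over all N
// records (Eq. E.2). One-dimensional toy data for illustration.
public class FuzzyStep {

    // Eq. E.1: p_ij = d_ij^{-1} / sum_k d_kj^{-1}.
    public static double[][] memberships(double[] data, double[] centroids) {
        double[][] p = new double[centroids.length][data.length];
        for (int j = 0; j < data.length; j++) {
            double norm = 0.0;
            for (int i = 0; i < centroids.length; i++) {
                // Epsilon avoids 1/0 when a record equals a centroid.
                double d = Math.abs(data[j] - centroids[i]) + 1e-12;
                p[i][j] = 1.0 / d;
                norm += p[i][j];
            }
            for (int i = 0; i < centroids.length; i++) p[i][j] /= norm;
        }
        return p;
    }

    // Eq. E.2: c_i = sum_j p_ij r_j / N.
    public static double[] newCentroids(double[] data, double[][] p) {
        double[] c = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            for (int j = 0; j < data.length; j++) c[i] += p[i][j] * data[j];
            c[i] /= data.length;
        }
        return c;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 8.0, 9.0};
        double[][] p = memberships(data, new double[]{2.0, 8.0});
        double[] c = newCentroids(data, p);
        // Memberships for each record sum to 1 across clusters.
        System.out.println(p[0][0] + p[1][0]);
        System.out.println(c[0] + " " + c[1]);
    }
}
```

In a full run, these two steps would be repeated until the centroid shifts fall below the tolerance, as the text describes.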
Appendix F
Principal Component Analysis and
Multidimensional Scaling
F.1 Classical Multidimensional Scaling
The classical multidimensional scaling (MDS) technique transforms a distance
matrix into a two-dimensional space such that the relative distances between
records are preserved. It is equivalent in result to principal component analysis.
F.2 MDS Algorithm
Consider the configuration matrix, an n × p matrix that constitutes the entire
dataset: n is the number of records and p is the number of variates in each record.
Since the variates have different units and ranges, Euclidean distances between
record pairs are not directly meaningful. Therefore the variates of
the configuration matrix are normalized:
x' = \frac{x - \mu}{\sigma}. \qquad (F.1)
Here µ is the mean and σ is the standard deviation of the variate.
From the configuration matrix, the distance-squared matrix D2 is obtained. This
is an n × n matrix, with each element representing the Euclidean distance squared
between record pairs. This is a symmetric matrix. D2 is centered using the following
prescription:
B = -\frac{1}{2}\Bigl(I - \frac{1}{n}U\Bigr) D^2 \Bigl(I - \frac{1}{n}U\Bigr). \qquad (F.2)
Here I is the identity matrix and U is a matrix with all unit elements.
A singular value decomposition is performed. Since B is a symmetric and centered
matrix, the decomposition is of the form:
B = U W U', \qquad (F.3)

where W is a diagonal matrix and U' is the transpose of U.
From the decomposition, we may reconstruct the new configuration matrix Y as
follows:
Y = U W^{1/2}. \qquad (F.4)
Here Y is an n × n matrix, but we may keep just the first two columns and neglect
the rest, since they correspond to the most important of the variates and adequately
preserve the D^2 matrix. Thus we now have an n × 2 matrix, called Y_2, which
replaces our original configuration matrix X.
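As a sketch, the double-centering step can be computed directly from the row means, column means and grand mean of D², without forming the centering matrices explicitly; the SVD step is omitted, since a linear-algebra library would normally supply it. The class name is illustrative.

```java
// Sketch of the centering step of classical MDS described above:
// given the squared-distance matrix D2, compute the centered
// matrix B using row means, column means, and the grand mean,
// which is algebraically the same as the double-centering product.
public class MdsCentering {

    public static double[][] center(double[][] d2) {
        int n = d2.length;
        double[] rowMean = new double[n], colMean = new double[n];
        double grand = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                rowMean[i] += d2[i][j] / n;
                colMean[j] += d2[i][j] / n;
                grand += d2[i][j] / (n * n);
            }
        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                b[i][j] = -0.5 * (d2[i][j] - rowMean[i] - colMean[j] + grand);
        return b;
    }

    public static void main(String[] args) {
        // Squared Euclidean distances between three points on a
        // line at positions 0, 1 and 3.
        double[][] d2 = {{0, 1, 9}, {1, 0, 4}, {9, 4, 0}};
        double[][] b = center(d2);
        // Each row of the centered matrix sums to zero.
        System.out.println(b[0][0] + b[0][1] + b[0][2]);
    }
}
```

For one-dimensional points this B equals the outer product of the mean-centered coordinates, so its leading singular vector recovers the original configuration up to sign.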
Glossary
C
CDF Collider Detector at Fermilab.
CJNN A neural network package written in C++ and Java.
CP Charge-Parity.
E
ECAL Electromagnetic Calorimeter.
EM Electromagnetic.
F
FastMC A fast Monte Carlo simulator.
G
GISMO A Package for Particle Transport and Detector Simulation.
H
HCAL Hadronic Calorimeter.
HEP High-energy Physics.
I
IBM International Business Machines.
IM Intelligent Miner, a suite of data mining tools from IBM.
J
JAS Java Analysis Studio.
K
KDD Knowledge Discovery in Databases.
L
LC Linear Collider.
LCD Linear Collider Detector.
LCDROOT A suite of tools for LCD analysis in the ROOT framework.
LEP Large Electron-Positron Collider.
LHC Large Hadron Collider.
M
MDS Multidimensional Scaling.
MLP Multilayered Perceptron.
MP McCulloch-Pitts, a kind of artificial neuron.
MSSM Minimal Supersymmetric Model.
N
NLC Next Linear Collider.
NN Neural Network.
NPCA Non-linear Principal Component Analysis.
NSCP National Scalable Cluster Project.
P
PCA Principal Component Analysis.
R
RMS Root Mean Square.
ROOT An Object Oriented Data Analysis Framework.
S
SKICAT Sky Image Cataloging and Analysis Tool.
SLIQ Supervised Learning In Quest, where Quest is the Data Mining project at
the IBM Almaden Research Center.
SM Standard Model.
SPRINT Scalable Parallelizable Induction of Decision Tree.
X
XML eXtensible Markup Language.
Bibliography
[1] T. Abe et al. Detectors for the linear collider (chapter 15). [3]. Resource book
for Snowmass 2001, 30 Jun - 21 Jul 2001, Snowmass, Colorado.
[2] T. Abe et al. Higgs bosons at the linear collider (chapter 3). [3]. Resource book
for Snowmass 2001, 30 Jun - 21 Jul 2001, Snowmass, Colorado.
[3] T. Abe et al. Linear Collider Physics Resource Book for Snowmass 2001. Stan-
ford Linear Accelerator Center, Stanford, California, June 2001. Resource book
for Snowmass 2001, 30 Jun - 21 Jul 2001, Snowmass, Colorado.
[4] Guozhong An. The effects of adding noise during backpropagation training on a
generalization performance. Neural Computation, 8:643–674, 1996.
[5] J. A. Anderson and E. Rosenfeld, editors. Neurocomputing. MIT Press, 1988.
[6] G. Arnison et al. Experimental observation of isolated large transverse energy
electrons with associated missing energy at √s = 540 GeV. Phys. Lett.,
B122:103–116, 1983.
[7] W. B. Atwood, T. Burnett, R. Cailliau, D. R. Myers, and K. M. Storr. Gismo:
An object oriented program for high-energy physics event simulation and recon-
struction. Int. J. Mod. Phys., C3:459–478, 1992.
[8] P. Avery et al. MCFast: A fast simulation package for detector design studies. In
Proceedings of Computing in High-energy Physics (CHEP 97), Berlin, Germany,
7-11 Apr 1997, 1997.
[9] M. Banner et al. Observation of single isolated electrons of high transverse mo-
mentum in events with missing transverse energy at the CERN anti-p p collider.
Phys. Lett., B122:476–485, 1983.
[10] R. Barate et al. Search for the Standard Model Higgs boson at LEP. Physics
Letters B, 565:61–75, 2003.
[11] W. Bartel et al. JADE collaboration. Z. Phys., C33:23, 1986.
[12] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University
Press, New Jersey, 1961.
[13] S. Bethke, M. Cavetti, H. F. Hoffmann, D. Jacobs, M. Kasemann, and D. Linglin.
Report of the steering group of the LHC computing review. Technical Report
CERN/LHCC/2001-004, CERN, 2001.
[14] J. C. Bezdek. Fuzzy Mathematics in Pattern Recognition. PhD thesis, Applied
Math. Center, Cornell University, 1973.
[15] H. Bichsel, D. E. Groom, and S. R. Klein. Passage of particles through matter.
Phys. Rev., D66:010001–195–010001–206, 2002.
[16] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
New York, 1995.
[17] R. K. Bock, T. Hansl-Kozanecka, and T. P. Shah. Parametrization of the longitu-
dinal development of hadronic showers in sampling calorimeters. NIM, 186:533–
539, 1981.
[18] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[19] T. H. Burnett, W. B. Atwood, R. Cailliau, D. Myers, and K. M. Storr. The
GISMO project: Application of object oriented techniques to detector simu-
lation. In Proceedings, Data structures for particle physics experiments, pages
125–130, 1990.
[20] Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural networks:
Backpropagation, conjugate gradient, and early stopping. In Advances in Neural
Information Processing Systems, Denver, Colorado, 2000.
[21] D. L. Chester. Why two hidden layers are better than one. In Lawrence Erlbaum,
editor, Proceedings of the International Joint Conference on Neural Networks,
pages 265–268, 1990.
[22] Trevor Cox and Michael Cox. Multidimensional Scaling. Chapman & Hall,
London, U.K, 1994.
[23] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, 2000.
[24] Joseph C. Dunn. A fuzzy relative of ISODATA process and its use in detecting
compact, well separated clusters. Journ. Cybern., 3:95–104, 1973.
[25] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[26] C. W. Fabjan. Calorimetry in high-energy physics. NATO Adv. Study Inst. Ser.
B Phys., 128:281, 1985.
[27] U. Fayyad, D. Haussler, and P. Stolorz. KDD for science data analysis: issues and
examples. In Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining, Portland, Oregon, 1996. AAAI Press.
[28] U. M. Fayyad, N. Weir, and S. Djorgovski. SKICAT: A machine learning system
for automated cataloging of large scale sky surveys. In Proceedings of the Tenth
International Conference on Machine Learning, pages 112–119. ICML, 1993.
[29] R. A. Fisher. The Design of Experiments. Oliver and Boyd, London, 1925.
[30] R. A. Fisher. Statistical Methods and Scientific Inference. Oliver and Boyd,
London, 2 edition, 1956.
[31] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus. Knowledge discovery in
databases: An overview. AI Magazine, pages 213–228, Fall 1992.
[32] J. Freeman and A. Beretvas. A short review of the CDF electromagnetic and
hadronic shower simulation. In Proceedings of the 1986 Summer Study on the
Physics of the Superconducting Supercollider, pages 482–486, New York, NY,
1986. American Physical Society.
[33] K. Funahashi. On the approximate realization of continuous mappings by neural
networks. Neural Networks, 2:183–192, 1989.
[34] Lynn Garren. StdHep 5.01: Monte Carlo Standardization at FNAL, 2002.
[35] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the
bias/variance dilemma. Neural Computation, 4:1–58, 1992.
[36] S. L. Glashow. Partial symmetries of weak interactions. Nucl. Phys., 22:579–588,
1961.
[37] Johannes Grabmeier and Andreas Rudolf. Techniques of cluster algorithms in
data mining. Data Mining and Knowledge Discovery, 6:303–360, 2002.
[38] D. E. Groom. Atomic and nuclear properties of materials (rev.). Phys. Rev.,
D66:010001, 2002.
[39] David J. Hand. Data mining: Statistics and more? The American Statistician,
52(2), 1998.
[40] D. O. Hebb. The Organization of Behavior, chapter 4, pages 60–78. Wiley, New
York, 1949.
[41] Frank Hoppner, Frank Klawonn, Rudolf Kruse, and Thomas Runkler. Fuzzy
Cluster Analysis. John Wiley & Sons, New York, 1999.
[42] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are
universal approximators. Neural Networks, 2:359–366, 1989.
[43] Masako Iwasaki and Toshinori Abe. LCD ROOT simulation and analysis tools.
2001. hep-ex/0102015.
[44] William James. Psychology (Briefer Course), chapter XVI, pages 253–279. Holt,
New York, 1890.
[45] T. Kohonen. Self-organized formation of topologically correct feature maps.
Biological Cybernetics, 43:59–69, 1982. Reprinted in [5].
[46] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin,
third edition, 1989.
[47] M. A. Kramer. Nonlinear principal component analysis using autoassociative
neural networks. AIChe Journal, 37:233–243, 1991.
[48] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation
and active learning, volume 7 of Advances in Neural Information Processing
Systems. MIT, Cambridge, Massachussets, 1995.
[49] Ray Kurzweil. The law of accelerating returns. 2000.
http://www.kurzweilai.net/articles/art0134.html.
[50] A. Lapedes and R. Farber. How neural nets work. American Institute of Physics,
New York, 1988.
[51] Yann LeCun, Leon Bottou, Geneviere B. Orr, and Klaus-Robert Muller. Effi-
cient backprop. In Geneviere B. Orr and Klaus-Robert Muller, editors, Neural
Networks: Tricks of the Trade. 1998.
[52] David D. Lewis. Evaluating and optimizing autonomous text classification sys-
tems. 1995.
[53] L. Lonnblad, C. Peterson, Hong Pi, and T. Rognvaldsson. Self-organizing net-
works for extracting jet features. Computer Physics Communications, 67:193,
1991.
[54] L. Lonnblad, C. Peterson, and T. Rognvaldsson. Mass reconstruction with a
neural network. Physics Letters B, 278:181–186, 1992.
[55] L. Lonnblad, C. Peterson, and T. Rognvaldsson. Using neural networks to identify
jets. Nuclear Physics B, 349:675, 1991.
[56] Richard Maclin and David Opitz. An empirical evaluation of bagging and boost-
ing. In The Fourteenth National Conference on Artificial Intelligence, pages
546–551, Rhode Island, 1997. AAAI Press.
[57] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in
nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[58] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable
classifier for data mining. In Extending Database Technology, pages 18–32, 1996.
[59] H. Messel and D. F. Crawford. Electron–photon shower distribution function;
tables for lead, copper, and air absorbers. Pergamon Press, New York, 1970.
[60] Pierre Michaud. Clustering techniques. Future Gener. Comput. Syst., 13(2-
3):135–147, 1997.
[61] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969. Partially reprinted in
[5].
[62] Tom M. Mitchell. Does machine learning really work? AI Magazine, 18(3):11–20,
1997.
[63] Gordon Moore. Cramming more components onto integrated circuits. Electron-
ics, 38(8), 1965.
[64] Gordon Moore. No exponential is forever: but forever can be delayed! In IEEE
International Solid-State Circuit Conference, 2003.
[65] J. von Neumann. The Computer and the Brain, pages 66–82. Yale University
Press, New Haven, 1958.
[66] M. De Palma, C. Favuzzi, G. Maggi, E. Nappi, A. Ranieri, and P. Spinelli.
Measurement, parametrization and fast simulation of hadronic showers in lead.
Nucl. Inst. Meths. in Phys. Res., 219:87–96, 1984.
[67] Michael E. Peskin. Pandora: An object-oriented event generator for linear col-
lider physics. 1999. hep-ph/9910519.
[68] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flan-
nery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge
University Press, 1992.
[69] M. D. Richard and R. P. Lippmann. Neural network classifiers estimate Bayesian
a posteriori probabilities. Neural Computation, 3(4):461–483, 1991.
[70] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University
Press, 1996.
[71] N. Rochester, J. H. Holland, L. H. Haibt, and W. L. Duda. Tests on a cell
assembly theory of the action of the brain, using a large digital computer. IRE
Transactions on Information Technology, 1956.
[72] Thorsteinn Rognvaldsson. On Langevin updating in multilayer perceptrons.
Neural Computation, 6, 1994.
[73] F. Rosenblatt. The perceptron: a probabilistic model for information storage
and organization in the brain. Psychological Review, 65:386–408, 1958.
[74] B. Rossi. High-Energy Particles. Prentice Hall Inc., Englewood Cliffs, NJ, 1952.
[75] Dennis W. Ruck et al. The multilayer perceptron as an approximation to a
Bayes optimal discrimination function. IEEE Transactions on Neural Networks,
1(4):296–298, 1990.
[76] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal represen-
tations by error propagation. In D.E. Rumelhart and J.L. McClelland, editors,
Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
volume 1, pages 318–362. MIT Press, 1986.
[77] Abdus Salam. Weak and electromagnetic interactions. In Proc. of the 8th Nobel
Symposium on 'Elementary particle theory, relativistic groups and analyticity',
pages 367–377, Stockholm, Sweden, 1968.
[78] W. Sarle. Stopped training and other remedies for overfitting. In Proceedings
of the 27th Symposium on Interface of Computing Science and Statistics, pages
352–360. Interface Foundation, 1995.
[79] John C. Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable par-
allel classifier for data mining. In T. M. Vijayaraman, Alejandro P. Buchmann,
C. Mohan, and Nandlal L. Sarda, editors, Proc. 22nd Int. Conf. Very Large
Databases, VLDB, pages 544–555. Morgan Kaufmann, 1996.
[80] H. Siegelmann and E. Sontag. Neural nets are universal computing devices.
Technical Report SYCON-91-08, 1991.
[81] G. Simone. GEANT4: Simulation for the next generation of HEP experiments.
In International Conference on Computing in High-energy Physics (CHEP 95),
Rio de Janeiro, Brazil, 18-22 Sep 1995, 1995.
[82] Torbjorn Sjostrand, Leif Lonnblad, and Stephen Mrenna. Pythia 6.2: Physics
and manual. 2001. hep-ph/0108264.
[83] Leonard Susskind. The gauge hierarchy problem, technicolor, supersymmetry,
and all that. (talk). Phys. Rept., 104:181–193, 1984.
[84] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric
framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[85] Vladimir N. Vapnik. Statistical Learning Theory: Adaptive and learning systems
for signal processing, communications and control. John Wiley and Sons Inc.,
1998.
[86] E. A. Wan. Neural network classification: A bayesian interpretation. IEEE
Transactions on Neural Networks, 1(4):303–305, 1990.
[87] Steven Weinberg. A model of leptons. Phys. Rev. Lett., 19:1264–1266, 1967.
[88] H. White. Connectionists nonparametric regression: multilayer feedforward net-
works can learn arbitrary mappings. Neural Networks, 3:535–549, 1990.
[89] Julia Yarba. User’s guide for showering/calorimetry in MCFAST. 1999.