A New Approach to Fitting Linear Models in High Dimensional Spaces

Yong Wang

This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at The University of Waikato

November 2000

© 2000 Yong Wang
Abstract

This thesis presents a new approach to fitting linear models, called "pace regression", which also overcomes the dimensionality determination problem. Its optimality in minimizing the expected prediction loss is theoretically established, when the number of free parameters is infinitely large. In this sense, pace regression outperforms existing procedures for fitting linear models. Dimensionality determination, a special case of fitting linear models, turns out to be a natural by-product. A range of simulation studies are conducted; the results support the theoretical analysis.

Throughout the thesis, a deeper understanding is gained of the problem of fitting linear models. Many key issues are discussed. Existing procedures, namely OLS, AIC, BIC, RIC, CIC, CV(d), BS(m), RIDGE, NN-GAROTTE and LASSO, are reviewed and compared, both theoretically and empirically, with the new methods.

Estimating a mixing distribution is an indispensable part of pace regression. A measure-based minimum distance approach, including probability measures and nonnegative measures, is proposed, and strongly consistent estimators are produced. Of all minimum distance methods for estimating a mixing distribution, only the nonnegative-measure-based one solves the minority cluster problem, which is vital for pace regression.

Pace regression has striking advantages over existing techniques for fitting linear models. It also has more general implications for empirical modeling, which are discussed in the thesis.
Acknowledgements

First of all, I would like to express my sincerest gratitude to my supervisor, Ian Witten, who has provided a variety of valuable inspiration, assistance, and suggestions throughout the period of this project. His detailed constructive comments, both technical and linguistic, on this thesis (and lots of other things) are very much appreciated. I have benefited enormously from his erudition and enthusiasm in various ways in my research. He carefully considered my financial situation, often before I had done so myself, and offered research and teaching assistant work to help me out; thus I was able to study without any financial worry.

I would also like to thank my co-supervisor, Geoff Holmes, for his productive suggestions to improve my research and its presentation. As the chief of the machine learning project Weka, he financially supported my attendance at several international conferences. His kindness and encouragement throughout the last few years are much valued.

I owe a particular debt of gratitude to Alastair Scott (Department of Statistics, University of Auckland), who became my co-supervisor in the final year of the degree, because of the substantial statistics component of the work. His kindness, expertise, and timely guidance helped me tremendously. He provided valuable comments on almost every technical point, as well as the writing in the thesis; encouraged me to investigate a few further problems; and made it possible for me to use computers in his department and hence to implement pace regression in S-PLUS/R.

Thanks also go to Ray Littler and Thomas Yee, who spent their precious time discussing pace regression with me and offered illuminating suggestions. Kai Ming Ting introduced me to the M5 numeric predictor, the re-implementation of which was taken as the initial step of the PhD project and eventually led me to become interested in the dimensionality determination problem.

Mark Apperley, the head of the Department of Computer Science, University of Waikato, helped me financially, along with many other things, by offering a teaching assistantship, a part-time lectureship, funding for attending international conferences, and funding for visiting the University of Calgary for six months (an arrangement made by Ian Witten). I am also grateful to the Waikato Scholarships Office for offering the three-year University of Waikato Full Scholarship.

I have also received various kinds of help from many others in the Department of Computer Science, in particular, Eibe Frank, Mark Hall, Stuart Inglis, Steve Jones, Masood Masoodian, David McWha, Gordon Paynter, Bernhard Pfahringer, Steve Reeves, Phil Treweek, and Len Trigg. It was so delightful playing soccer with these people.

I would like to express my appreciation to many people in the Department of Statistics, University of Auckland, in particular, Paul Copertwait, Brian McArdle, Renate Meyer, Russell Millar, Geoff Pritchard, and George Seber, for allowing me to sit in on their courses, providing lecture notes, and marking my assignments and tests, which helped me to learn how to apply statistical knowledge and express it gracefully. Thanks are also due to Jenni Holden, John Huakau, Ross Ihaka, Alan Lee, Chris Triggs, and others for their hospitality.

Finally, and most of all, I am grateful to my family and extended family for their tremendous tolerance and support—in particular, my wife, Cecilia, and my children, Campbell and Michelle. It is a pity that I often have to leave them, although staying with them is so enjoyable. The time-consuming nature of this work looks so fascinating to Campbell and Michelle that they plan to go to their daddy's university first, instead of a primary school, after they leave their kindergarten.
Contents

Abstract iii
Acknowledgements v
List of Figures xi
List of Tables xiii

1 Introduction 1
1.1 Introduction . . . 1
1.2 Modeling principles . . . 4
1.3 The uncertainty of estimated models . . . 6
1.4 Contributions . . . 8
1.5 Outline . . . 12

2 Issues in fitting linear models 15
2.1 Introduction . . . 15
2.2 Linear models and the distance measure . . . 17
2.3 OLS subset models and their ordering . . . 19
2.4 Asymptotics . . . 21
2.5 Shrinkage methods . . . 22
2.6 Data resampling . . . 23
2.7 The RIC and CIC . . . 24
2.8 Empirical Bayes . . . 25
2.8.1 Empirical Bayes . . . 26
2.8.2 Stein estimation . . . 28
2.9 Summary . . . 29
3 Pace regression 31
3.1 Introduction . . . 31
3.2 Orthogonal decomposition of models . . . 32
3.2.1 Decomposing distances . . . 33
3.2.2 Decomposing the estimation task . . . 34
3.2.3 Remarks . . . 36
3.3 Contributions and expected contributions . . . 37
3.3.1 $h(A;\tau^2)$ and $H(A;\tau^2)$ . . . 38
3.3.2 $h(A;G)$ and $H(A;G)$ . . . 40
3.3.3 $H(A;G(\tau^2))$ . . . 41
3.3.4 $\hat{H}(A;\hat{G})$ . . . 44
3.4 The role of $h(A;G)$ and $H(A;G)$ in modeling . . . 46
3.5 Modeling with known $G(\tau^2)$ . . . 55
3.6 The estimation of $G(\tau^2)$ . . . 65
3.7 Summary . . . 67
3.8 Appendix . . . 68
3.8.1 Reconstructing the model from updated absolute distances . . . 68
3.8.2 The Taylor expansion of $H(A;G(\tau^2))$ with respect to $\tau^2$ . . . 69
3.8.3 Proof of Theorem 3.11 . . . 70
3.8.4 Identifiability for mixtures of $\chi^2_1(\tau^2)$ distributions . . . 73
4 Discussion 75
4.1 Introduction . . . 75
4.2 Finite $k$ vs. $k$-asymptotics . . . 75
4.3 Collinearity . . . 76
4.4 Regression for partial models . . . 77
4.5 Remarks on modeling principles . . . 78
4.6 Orthogonalization selection . . . 80
4.7 Updating signed $\sqrt{A_j}$? . . . 81
5 Efficient, reliable and consistent estimation of an arbitrary mixing distribution 85
5.1 Introduction . . . 85
5.2 Existing estimation methods . . . 89
5.2.1 Maximum likelihood methods . . . 89
5.2.2 Minimum distance methods . . . 90
5.3 Generalizing the CDF-based approach . . . 93
5.3.1 Conditions . . . 93
5.3.2 Estimators . . . 96
5.3.3 Remarks . . . 99
5.4 The measure-based approach . . . 100
5.4.1 Conditions . . . 101
5.4.2 Approximation with probability measures . . . 103
5.4.3 Approximation with nonnegative measures . . . 106
5.4.4 Considerations for finite samples . . . 108
5.5 The minority cluster problem . . . 111
5.5.1 The problem . . . 112
5.5.2 The solution . . . 113
5.5.3 Experimental illustration . . . 114
5.6 Simulation studies of accuracy . . . 117
5.7 Summary . . . 120
6 Simulation studies 123
6.1 Introduction . . . 123
6.2 Artificial datasets: An illustration . . . 125
6.3 Artificial datasets: Towards reality . . . 132
6.4 Practical datasets . . . 141
6.5 Remarks . . . 144

7 Summary and final remarks 145
7.1 Summary . . . 145
7.2 Unsolved problems and possible extensions . . . 147
7.3 Modeling methodologies . . . 149

Appendix Help files and source code 153
A.1 Help files . . . 155
A.2 Source code . . . 168

References 193
List of Figures

3.1 $h(A;\tau^2)$ and $H(A;\tau^2)$, for various values of $\tau^2$ . . . 43
3.2 Two-component mixture $H$-function . . . 53
4.1 Uni-component mixture . . . 82
4.2 Asymmetric mixture . . . 83
4.3 Symmetric mixture . . . 84
6.1 Experiment 6.1 . . . 127
6.2 Experiment 6.2 . . . 129
6.3 Experiment 6.3 . . . 131
6.4 $H(A;\hat{G})$ in Experiment 6.4 . . . 133
6.5 Mixture $H$-function ($\rho = 0.9$) . . . 135
6.6 Mixture $H$-function ($\rho = 0.5$) . . . 136
List of Tables

5.1 Clustering reliability . . . 116
5.2 Clustering accuracy: Mixtures of normal distributions . . . 119
5.3 Clustering accuracy: Mixtures of $\chi^2_1$ distributions . . . 120
6.1 Experiment 6.1 . . . 126
6.2 Experiment 6.2 . . . 128
6.3 Experiment 6.3 . . . 130
6.4 Experiment 6.4 . . . 134
6.5 $R^2$ in Experiment 6.5 . . . 136
6.6 Experiment 6.5 ($\rho = 0.9$) . . . 138
6.7 Experiment 6.5 ($\rho = 0.5$) . . . 139
6.8 Experiment 6.6 . . . 140
6.9 Practical datasets . . . 142
6.10 Experiments for practical datasets . . . 143
Chapter 1

Introduction

1.1 Introduction
Empirical modeling builds models from data, as opposed to analytical modeling, which derives models from mathematical analysis on the basis of first principles, background knowledge, reasonable assumptions, etc. One major difficulty in empirical modeling is the handling of the noise embedded in the data. By noise here we mean the uncertainty factors, which change from time to time and are usually best described by probabilistic models. An empirical modeling procedure generally utilizes a model prototype with some adjustable, free parameters and employs an optimization criterion to find the values of these parameters in an attempt to minimize the effect of noise. In practice, "optimization" usually refers to the best explanation of the data, where the explanation is quantified in some mathematical measure; for example, least squares, least absolute deviation, maximum likelihood, maximum entropy, minimum cost, etc.
The number of free parameters, more precisely the degrees of freedom, plays an important role in empirical modeling. One well-known phenomenon concerning these best data explanation criteria is that any given data can be increasingly explained as the number of free parameters increases—in the extreme case, when this number is equal to the number of observations, the data is fully explained. This phenomenon is intuitive, for any effect can be explained given enough reasons. Nevertheless, the model that fully explains the data in this way usually has little predictive power on future observations, because it explains the noise so much that the underlying relationship is obscured. Instead, a model that includes fewer parameters may perform significantly better in prediction.
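The saturated-model extreme is easy to reproduce. The following small R sketch (ours, with arbitrary simulated data; it does not appear in the thesis) fits k = n - 1 pure-noise variables, explains the training data perfectly, and still predicts worse than the null model:

    # Overfitting with k = n - 1 pure-noise predictors (illustration only).
    set.seed(1)
    n <- 20; k <- n - 1
    X <- matrix(rnorm(n * k), n, k)            # noise explanatory variables
    y <- rnorm(n)                              # response unrelated to X
    full <- lm(y ~ X)                          # saturated fit
    summary(full)$r.squared                    # essentially 1: data fully "explained"
    Xnew <- matrix(rnorm(n * k), n, k)         # fresh data from the same source
    ynew <- rnorm(n)
    mean((ynew - cbind(1, Xnew) %*% coef(full))^2)   # large prediction error
    mean((ynew - mean(y))^2)                         # the null model does better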
While free parameters are inevitably required by the model prototype, it seems natural to ask the question: how many parameters should enter the final estimated model? This is a classic issue, formally known as the dimensionality determination problem (or the dimensionality reduction, model selection, subset selection, variable selection, etc., problem). It was first brought to wide awareness by H. Akaike (1969) in the context of time series analysis. Since then, it has drawn research attention from many scientific communities, in particular statistics and computer science (see Chapter 2).
Dimensionality determination is a key issue in empirical modeling. Although most research interest in this problem focuses on fitting linear models, it is widely acknowledged that the problem plagues almost all types of model structure in empirical modeling.¹ It is inevitably encountered when building various types of model, such as generalized additive models (Hastie and Tibshirani, 1990), tree-based models (Breiman et al., 1984; Quinlan, 1993), projection pursuit (Friedman and Stuetzle, 1981), multivariate adaptive regression splines (Friedman, 1991), local weighted models (Cleveland, 1979; Fan and Gijbels, 1996; Atkeson et al., 1997; Loader, 1999), rule-based learning (Quinlan, 1993), and instance-based learning (Li, 1987; Aha et al., 1991), to name a few. There is a large literature devoted to the issue of dimensionality determination. Many procedures have been proposed to address the problem—most of them applicable to both linear and other types of model structure (see Chapter 2). Monographs that address the general topics of dimensionality determination, in particular for linear regression, include Linhart and Zucchini (1989), Rissanen (1989), Miller (1990), Burnham and Anderson (1998), and McQuarrie and Tsai (1998).

¹It does not seem to be a problem for the estimation of a mixing distribution (see Chapter 5), since a larger set of candidate support points does not jeopardize, if not improve, the estimation of the mixing distribution. This observation, however, does not generalize to every kind of unsupervised learning. For example, the loss function employed in a clustering application does take the number of clusters into account for, say, computational reasons.
So far this problem has often been considered an issue independent of modeling principles. Unfortunately, it turns out to be so fundamental in empirical modeling that it casts a shadow over the general validity of many existing modeling principles, such as the least squares principle, the (generalized) maximum likelihood principle, etc. This is because it takes but one counter-example to refute the correctness of a generalization, and all these principles fail to solve the dimensionality determination problem. We believe its solution relates closely to fundamental issues of modeling principles, which have been the subject of controversy throughout the history of theoretical statistics, without any consensus being reached.
R. A. Fisher (1922) identifies three problems that arise in data reduction: (1) model specification, (2) parameter estimation in the specified model, and (3) the distribution of the estimates. In much of the literature, dimensionality determination is considered as belonging to the first type of problem, while modeling principles belong to the second. However, the considerations in the preceding paragraph imply that dimensionality determination belongs to the second problem type. To make this point more explicit, we modify Fisher's first two problems slightly, into (1) specifying the candidate model space, and (2) determining the optimal model from the model space. This modification is consistent with modern statistical decision theory (Wald, 1950; Berger, 1985), and will be taken as the starting point of our work.
This thesis proposes a new approach to solving the dimensionality determination problem, which turns out to be a special case of general modeling, i.e., the determination of the values of free parameters. The key idea is to consider the uncertainty of estimated models (introduced in Section 1.3), which results in a methodology similar to empirical Bayes (briefly reviewed in Section 2.8). The issue of modeling principles is briefly covered in Section 1.2. Two principles will be employed, which naturally lead to a solution to the dimensionality determination problem, or more generally, to fitting linear models. We will theoretically establish optimality in the sense of $k$-asymptotics (i.e., an infinite number of free parameters) for the proposed procedures. This implies a tendency of increasing performance as $k$ increases and justifies the appropriateness of applying our procedures in high-dimensional spaces. Like many other approaches, we focus for simplicity on fitting linear models with normally distributed noise, but the ideas can probably be generalized to many other situations.
1.2 Modeling principles
In the preceding section, we discussed the optimization criterion used in empirical modeling. This relates to the fundamental question of modeling: among the candidate models, which one is the best? This is a difficult and controversial issue, and there is a large literature devoted to it; see, for example, Berger (1985) and Brown (1990) and the references cited therein.
In our work, two separate but complementary principles are adopted for successful modeling. The first one follows utility theory in decision analysis (von Neumann and Morgenstern, 1944; Wald, 1950; Berger, 1985), where utility is a mathematical measure of the decision-maker's satisfaction. The principle for best decision-making, including model estimation, is to maximize the expected utility. Since we are more concerned with loss—the negative utility—we rephrase this into the fundamental principle for empirical modeling as follows.

The Minimum Expected Loss Principle. Choose the model from the candidate model space with smallest expected loss in predicting future observations.
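Stated formally (the symbols here are our own shorthand, anticipating the decision-theoretic notation used later): choose

    $\hat{M} = \arg\min_{M \in \mathcal{M}} \mathrm{E}\,[L(M, \text{future observations})]$,

where $\mathcal{M}$ is the candidate model space and $L$ is the loss function supplied by the application.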
To minimize the expected loss over future observations, we often need to make assumptions so that future observations can be related to the observations which are used for model construction. By assumptions we mean statistical distributions, background knowledge, loss information, etc. Of course, the performance of the resulting model over future observations also depends on how far these assumptions are from future reality.

Further, we define that a better model implies smaller expected loss, and that equivalent models have equal expected loss. This principle is then applied to find the best model among all candidates.
It may seem that the principle does not address the issue of model complexity (e.g., the number of parameters), which is important in many practical applications. This is in fact not true, because the definition of the loss could also include loss incurred due to model complexity—for example, the loss for comprehension and computation. Thus model complexity can be naturally taken into account when minimizing the expected loss. In practice, however, this kind of loss is often difficult to quantify appropriately, and even more difficult to manipulate mathematically. When this is the case, a second, parsimony principle can become helpful.

The Simplicity Principle. Of the equivalent models, choose the simplest.
We will apply this principle to approximately equivalent models, i.e., when the difference between the expected losses of the models under consideration is tolerable. It is worth pointing out that, though similar, this principle differs from the generally understood Occam's razor, in that the former considers equal expected losses, while the latter equal data explanation. We believe that it is inappropriate to compare models based on data explanation, and return to this point in the next section.

"Tolerable" here is meant only in the sense of statistical significance. In practice, model complexity may be further reduced when other factors, such as data precision, prediction significance and the cost of model construction, are considered. These issues, however, are less technical and will not be used in our work.
Although the above principles may seem simple and clear, their application is complicated in many situations by the fact that the distribution of the true model is usually unknown, and hence the expected loss is uncomputable. Partly for this reason, a number of modeling principles exist which are widely used and extensively investigated. Some of the most important ones are the least squares principle (Legendre, 1805; Gauss, 1809), the maximum likelihood principle (Fisher, 1922, 1925; Barnard, 1949; Birnbaum, 1962), the maximum entropy principle (Shannon, 1948; Kullback and Leibler, 1951; Kullback, 1968), the minimax principle (von Neumann, 1928; von Neumann and Morgenstern, 1944; Wald, 1939, 1950), the Bayes principle (Bayes, 1763), the minimum description length (MDL) principle (Rissanen, 1978, 1983), Epicurus's principle of multiple explanations, and Occam's razor. Instead of discussing these principles here, we examine a few major ones retrospectively in Section 4.5, after our work is presented.
In contrast to the above analytical approach, model evaluation can also be undertaken empirically, i.e., by data. The data used for evaluation can be obtained from another independent sample or from data resampling (for the latter, see Efron and Tibshirani, 1993; Shao and Tu, 1995). While empirical evaluation has found wide application in practice, it is more an important aid in evaluating models than a general modeling principle. One significant limitation is that results obtained from experimenting with data rely heavily on the experiment design and hence are inconclusive. Further, they are often computationally expensive and only apply to finite sets of candidates. Almost all major existing data resampling techniques are still under intensive theoretical investigation—and some theoretical results in fitting linear models have already revealed their limitations; see Section 2.6.
1.3 The uncertainty of estimated models

It is known that in empirical modeling, the noise embedded in the data is propagated into the estimated models, causing estimation uncertainty. Fisher's third problem concerns this kind of uncertainty. An example is the estimation of a confidence interval. The common approach to decreasing the uncertainty of an estimated model is to increase the number of observations, as suggested by the Central Limit Theorem and the Laws of Large Numbers.
Traditionally, as mentioned above, a modeling procedure outputs the model that best explains the data among the candidate models. Since each estimated model is uncertain to some extent, the model that optimally explains the data is the winner in a competition based on the combined effect of both the true predictive power and the uncertainty effect. If there are many independent candidate models in the competition, the "optimum" can be significantly influenced by the uncertainty effect, and hence the optimal data explanation may be good only by chance. An extreme situation is when all candidates have no predictive power—i.e., only the uncertainty effect is explained. Then, the more a candidate explains the data, the farther it is away from the true model and hence the worse it performs over future observations. In this extreme case, the expected loss of the winner increases without bound as the number of candidates increases—a conclusion that can be easily established from order statistics. The competition phenomenon is also investigated by Miller (1990), who defines the selection or competition bias of the selected model. This bias generally increases with the number of candidate models.
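The order-statistics effect is simple to demonstrate numerically; in the R sketch below (ours, with arbitrary simulation sizes), the best of k pure-noise candidates looks increasingly impressive, growing like 2 log k:

    # The winner among k pure-noise candidates: its apparent explanation
    # (best squared z-statistic) grows like 2*log(k) by chance alone.
    set.seed(2)
    for (k in c(10, 100, 1000, 10000)) {
      winners <- replicate(200, max(rnorm(k)^2))
      cat(sprintf("k = %5d  mean max z^2 = %5.2f  2*log(k) = %5.2f\n",
                  k, mean(winners), 2 * log(k)))
    }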
This implies that a modeling procedure which is solely based on the maximization of data explanation is not appropriate. To take an extreme case, it could always find the existence of "patterns" in a completely random sequence, provided only that a large enough number of candidate models is used. This kind of modeling practice is known in unsavory terms as "data dredging", "fishing expeditions", "data mining" (in the old sense²), and "torturing the data until they confess" (Miller, 1990, p. 12).
An appropriate modeling procedure, therefore, should take the uncertainty effect into consideration. It would be ideal if this effect could be separated from the model's data explanation, so that the comparison between models is based on their predictive power only. Inevitably, without knowing the true model, it is impossible to filter the uncertainty effect out of the observed performance completely. Nevertheless, since we know that capable participants tend to perform well and incapable ones tend to perform badly, we should be able to gain some information about the competition from the observed performance of all participants, use this information to generate a better estimate of the predictive power of each model, and even update the model estimate to a better one by stripping off the expected uncertainty effect.

²Data mining now denotes "the extraction of previously unknown information from databases that may be large, noisy and have missing data" (Chatfield, 1995).
Eventually, the model selection or, more generally, the modeling task can be based on the estimated predictive power, not simply the data explanation. In short, the modeling procedure relies not only on random data, but also on random models.

This methodological consideration forms the basis of our new modeling approach. The practical situation may be so difficult that no general mathematical tools are available for providing solutions. For example, there may be many candidate models, generated by different types of modeling procedure, which are not statistically independent at all. The competition can be very hard to analyze.
In this thesis, this idea is applied to solving the dimensionality determination problem in fitting linear models, which seems a less difficult problem than the above general formulation. To be more precise, it is applied to fitting linear models in the general sense, which in fact subsumes the dimensionality determination problem. We shall transform the full model that contains all parameters into an orthogonal space to make the dimensional models independent. The observed dimensional models serve to provide an estimate of the overall distribution of the true dimensional models, which in turn helps to update each observed dimensional model to a better estimate, in expectation, of the true dimensional model. The output model is re-generated by an inverse transformation.
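Structurally, the approach can be pictured as a three-step pipeline; the R sketch below is ours and purely illustrative (the actual updating rule, based on the estimated distribution of the dimensional models, is developed in Chapter 3; shrink_update is a hypothetical placeholder):

    # Transform-update-invert pipeline (structural illustration only).
    pace_pipeline <- function(X, y, shrink_update) {
      qrX <- qr(X)
      Q <- qr.Q(qrX); R <- qr.R(qrX)   # orthogonalize the full model
      z <- crossprod(Q, y)             # independent dimensional components
      z_new <- shrink_update(z)        # update each component using all of them
      backsolve(R, z_new)              # inverse transformation to coefficients
    }
    # e.g. beta <- pace_pipeline(X, y, function(z) 0.9 * z)  # naive placeholder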
Although it started from a different viewpoint, i.e., that of competing models, our approach turns out to be very similar to the empirical Bayes methodology pioneered by Robbins (1951, 1955, 1964, 1983); see Section 2.8.
1.4 Contributions

Although the major ideas in this thesis could be generalized to other types of model structure, we focus on fitting linear models, with a strong emphasis on the dimensionality determination problem, which, as we will demonstrate, turns out to be a special case of modeling in the general sense.

As we have pointed out, the problem under investigation relates closely to the fundamental principles of modeling. Therefore, despite the fact that we would like to focus simply on solving mathematical problems, arguments in some philosophical sense seem inevitable. The danger is that they may sound too "philosophical" to be sensible.
The main contributions of the work are as follows.

The competition viewpoint. Throughout the thesis, the competition viewpoint is adopted systematically. This not only provides the key to fitting linear models quantitatively, including a solution to the dimensionality determination problem, but also helps to answer, at least qualitatively, some important questions in the general case of empirical modeling.

The phenomenon of competing models in fitting linear models has been investigated by other authors; for example, Miller (1990), Donoho and Johnstone (1994), Foster and George (1994), and Tibshirani and Knight (1997). However, as shown later, these existing methods can still fail in some particular cases. Furthermore, none of them achieves $k$-asymptotic optimality generally, a property enjoyed by our new procedures.

The competition phenomenon pervades various contexts of empirical modeling; for example, orthogonalization selection (see Section 4.6).
Predictive objective of modeling. Throughout the work, we emphasize the performance of constructed models over future observations, by explicitly incorporating it into the minimum expected loss principle (Section 1.2) and relating it analytically to our modeling procedures (see, e.g., Sections 2.2 and 3.6). Although this goal is often used in simulation studies of modeling procedures in the literature, it is rarely addressed analytically, which seems ironic. Many existing modeling procedures do not directly examine this relationship, although it is acknowledged that they are derived from different considerations.

Employing the predictive objective emphasizes that it is reality that governs theory, not vice versa. The determination of the loss function according to specific applications is another example.

In Section 5.6, we use this same objective for evaluating the accuracy of clustering procedures.
Pace regression. A new approach to fitting linear models, pace regression, is proposed, based on considering competing models, where "pace" stands for "Projection Adjustment by Contribution Estimation." Six related procedures are developed, denoted PACE$_1$ to PACE$_6$, that share a common fundamental idea—estimating the distribution of the effects of variables from the data and using this to improve modeling. The optimality of these procedures, in the sense of $k$-asymptotics, is theoretically established. The first four procedures utilize OLS subset selection, and outperform existing OLS methods for subset selection, including OLS itself (by which we mean the OLS-fitted model that includes all parameters). By abandoning the idea of selection, PACE$_5$ achieves the highest prediction accuracy of all. It even surpasses the OLS estimate when all variables have large effects on the outcome—in some cases by a substantial margin. Unfortunately, the extensive numerical calculations that PACE$_5$ requires may limit its application in practice. However, PACE$_6$ is a very good approximation to PACE$_5$, and is computationally efficient.

For deriving the pace regression procedures, some new terms, such as the "contribution" of a model and also the $h$- and $H$-functions, are introduced. They are able to indicate whether or not a model, or a dimensional model, contributes, in expectation, to a prediction of future observations, and are employed to obtain the pace regression procedures.
We realize that this approach is similar to the empirical Bayes methodology (see Section 2.8), which, to the author's knowledge, has not previously been used in this way for fitting linear models (or other model structures), despite the fact that empirical Bayes has long been recognized as a breakthrough in statistical inference (see, e.g., Neyman, 1962; Maritz and Lwin, 1989; Kotz and Johnson, 1992; Efron, 1998; Carlin and Louis, 2000).
Review and comparisons. The thesis also provides, in Chapter 2, a general review that covers a wide range of key issues in fitting linear models. Comparisons of the new procedures with existing ones are placed appropriately throughout the thesis.

Throughout the review, comparisons and some theoretical work, we also attempt to provide a unified view of most approaches to fitting linear models, such as OLS, OLS subset selection, model selection based on data resampling, shrinkage estimation, empirical Bayes, etc. Further, our work in fitting linear models seems to imply that supervised learning requires unsupervised learning—more precisely, the estimation of a mixing distribution—as a fundamental step for handling the competition phenomenon.
Simulation studies. Simulation studies are conducted in Chapter 6. The experiments range from artificial to practical datasets, illustration to application, high to low noise levels, large to small numbers of variables, independent to correlated variables, etc. The procedures used in the experiments include OLS, AIC, BIC, RIC, CIC, PACE$_2$, PACE$_4$, PACE$_6$, LASSO and NN-GAROTTE.

Experimental results generally support our theoretical conclusions.
Efficient, reliable and consistent estimation of a mixing distribution. Pace regression requires an estimator of the mixing distribution of an arbitrary mixture distribution. In Chapter 5, a new minimum distance approach based on a nonnegative measure is shown to provide efficient, reliable and consistent estimation of the mixing distribution. The existing minimum distance methods are efficient and consistent, yet not reliable, due to the minority cluster problem (see Section 5.5), while the maximum likelihood methods are reliable and consistent, yet not efficient in comparison with the minimum distance approach.

Reliable estimation is a finite sample issue but is vital in pace regression, since the misclustering of even a single isolated point could result in unbounded loss in prediction. Efficiency may also be desired in practice, especially when it is necessary to fit linear models repeatedly. Strong consistency of each new estimator is established.

We note that the estimation of a mixing distribution is an independent topic with its own value and has a large literature.
Implementation. Pace regression, together with some other ideas in the thesis, is implemented in the programming languages S-PLUS (tested with versions 3.4 and 5.1)—in a way compatible with R (version 1.1.1)—and FORTRAN 77. The source code is given in the Appendix, and also serves to provide full details of the ideas in the thesis. Help files in S format have been written for the most important functions.

All the source code and help files are freely available from the author, and may be redistributed and/or modified under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
1.5 Outline

The thesis consists of seven chapters and one appendix.

Chapter 1 briefly introduces the general idea of the whole work. It defines the dimensionality determination problem, and points out that a shadow is cast on the validity of many empirical modeling principles by their failure to solve this problem. Our approach to this problem adopts the viewpoint of competing models, due to estimation uncertainty, and utilizes two complementary modeling principles: minimum expected loss, and simplicity.

Some important issues in fitting linear models are reviewed in Chapter 2, in an attempt to provide a solid basis for the work presented later. They include linear models and distances, OLS subset selection, asymptotics, shrinkage, data resampling, the RIC and CIC procedures, and empirical Bayes. The basic method of applying the general ideas of competing models to linear regression is embedded in this chapter.

Chapter 3, the core of the thesis, formally proposes and theoretically evaluates pace regression. New concepts, such as the contribution and the $h$- and $H$-functions, are defined in Section 3.3 and their roles in modeling are illustrated in Section 3.4, respectively. Six procedures of pace regression are formally defined in Section 3.5, while $k$-asymptotic optimality is established in Section 3.6.

A few important issues are discussed in Chapter 4, some completing the definition of the pace regression procedures in special situations, others expanding their implications into a broader arena, and still others raising open questions.

In Chapter 5, the minimum distance estimation of a mixing distribution is investigated, along with a brief review of maximum likelihood estimation. The focus is on establishing consistency of the proposed probability-measure-based and nonnegative-measure-based estimators. The minority cluster problem, vital for pace regression, is discussed in Section 5.5. Section 5.6 further empirically investigates the accuracy of these clustering estimators in situations of overlapping clusters.

Results of simulation studies of pace regression and other procedures for fitting linear models are given in Chapter 6.

The final chapter briefly summarizes the whole work and provides some topics for future work and open questions.

The source code in S and FORTRAN, and some help files, are listed in the Appendix.
Chapter 2

Issues in fitting linear models

2.1 Introduction
The basic idea of regression analysis is to fit a linear model to a set of data. The classical ordinary least squares (OLS) estimator is simple, computationally cheap, and has well-established theoretical justification. Nevertheless, the models it produces are often less than satisfactory. For example, OLS does not detect redundancy in the set of independent variables that are supplied, and when a large number of variables are present, many of which are redundant, the model produced usually has worse predictive performance on future data than simpler models that take fewer variables into account.
Many researchers have investigated methods of subset selection in an attempt to neutralize this effect. The most common approach is OLS subset selection: from a set of OLS-fitted subset models, choose the one that optimizes some pre-determined modeling criterion. Almost all these procedures are based on the idea of thresholding variation reduction: calculating how much the variation of the model is increased if each variable in turn is taken away, setting a threshold on this amount, and discarding variables that contribute less than the threshold. The rationale is that a noisy variable—i.e., one that has no predictive power—usually reduces the variation only marginally, whereas the variation accounted for by a meaningful variable is larger and grows with the variable's significance.

Many well-known procedures, including FPE (Akaike, 1970), AIC (Akaike, 1973), $C_p$ (Mallows, 1973) and BIC (Schwarz, 1978) follow this approach. While these certainly work well for some datasets, extensive practical experience and many simulation studies have exposed shortcomings in them all. Often, for example, a certain proportion of redundant variables is included in the final model (Derksen and Keselman, 1992). Indeed, there are datasets for which a full regression model outperforms the selected subset model unless most of the variables are redundant (Hoerl et al., 1986; Roecker, 1991).
Shrinkage methods offer an alternative to OLS subset selection. Simulation studies show that the technique of biased ridge regression can outperform OLS subset selection, although it generates a more complex model (Frank and Friedman, 1993; Hoerl et al., 1986). The shrinkage idea was further explored by Breiman (1995) and Tibshirani (1996), who were able to generate models that are less complex than ridge regression models yet still enjoy higher predictive accuracy than OLS subset models. Empirical evidence presented in these papers suggests that shrinkage methods yield greater predictive accuracy than OLS subset selection when a model has many noisy variables, or at most a moderate number of variables with moderate-sized effects—whereas they perform worse when there are a few variables that have a dramatic effect on the outcome.
These problems are systematic: the performance of modeling procedures can be related to the effects of variables and the extent of these effects. Researchers have sought to understand these phenomena and use them to motivate new approaches. For example, Miller (1990) investigated the selection bias that is introduced when the same data is used both to estimate the coefficients and to choose the subsets. New procedures, including the little bootstrap (Breiman, 1992), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1997) have been proposed. While these undoubtedly produce good models for many data sets, we will see later that there is no single approach that solves these systematic problems in a general sense.
Our work in this thesis will show that these problems are the tip of an iceberg. They are manifestations of a much more general phenomenon that can be understood by examining the expected contributions that individual variables make in an orthogonal decomposition of the estimated model. This analysis leads to a new approach called "pace regression"; see Chapter 3.
We investigate model estimation in a general sense that subsumes subset selection. We do not confine our efforts to finding the best of the subset models; instead we address the whole space of linear models and regard subset models as a special case. But it is an important special case, because simplifying the model structure has wide applications in practice, and we will use it extensively to help sharpen our ideas.
In this chapter, we introduce and review some important issues in linear regression to provide a basis for the analysis that follows. We briefly review the major approaches to model selection, focusing on their failure to solve the systematic problems raised above. A common thread emerges: the key to solving the general problem of model selection in linear regression lies in the distribution of the effects of the variables that are involved.
2.2 Linear models and the distance measure

Given $k$ explanatory variables, the response variable, and $n$ independent observations, a linear model can be written in the following matrix form,

    $y = X\beta^* + \varepsilon$,    (2.1)

where $y$ is the $n$-dimensional response vector, $X$ the $n \times k$ design matrix, $\beta^*$ the $k$-dimensional parameter vector of the true, underlying model, and each element of the noise component $\varepsilon$ is independent and identically distributed according to $N(0, \sigma^2)$. We assume for the most part that the variance $\sigma^2$ is known; if not, it can be estimated using the OLS estimator $\hat{\sigma}^2$ of the full model.
With variables defined, a linear model is uniquely determined by a parameter vector $\beta$. Therefore, we use $M$ to denote any model, $M(\beta)$ the model with parameter vector $\beta$, and $M^*$ as shorthand for the underlying model $M(\beta^*)$. The entire space of linear models is $\mathcal{M}_k = \{M(\beta) : \beta \in \mathbb{R}^k\}$. Note that the models considered (and produced) by the OLS method, OLS subset selection methods, and shrinkage methods are all subclasses of $\mathcal{M}_k$. Any zero entry in $\beta^*$ corresponds to a truly redundant variable. In fact, variable selection is not a problem independent of parameter estimation, or more generally, modeling; it is just a special case in which the discarded variables correspond to zero entries in the estimated parameter vector.
Further, we need a general distance measure between any two models in the space, which supersedes the usually-used loss function in the sense that minimizing the expected distance between the estimated and true models also minimizes (perhaps approximately) the expected loss. We use a general distance measure rather than the loss function only, because other distances are also of interest. Defining a distance measure coincides too with the notion of a metric space—which is how we envisage the model space.

Now we define the distance measure by relating it to the commonly-used quadratic loss. Given a design matrix $X$, the prediction vector of the model $M(\beta)$ is $y_{M(\beta)} = X\beta$. In particular, the true model $M^*$ predicts the output vector $y^* = y_{M^*} = X\beta^*$. The distance between two models is defined as

    $D(M(\beta_1), M(\beta_2)) = \|y_{M(\beta_1)} - y_{M(\beta_2)}\|^2 / \sigma^2$,    (2.2)

where $\|\cdot\|$ denotes the $L_2$ norm. Note that our real interest, the quadratic loss $\|y_M - y^*\|^2$ of the model $M$, is directly related to the distance $D(M, M^*)$ by a scaling factor $\sigma^2$. Therefore, in the case that $\sigma^2$ is unknown and $\hat{\sigma}^2$ is used instead, any conclusion concerning the distance using $\sigma^2$ remains true asymptotically (Section 3.6).
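For concreteness, distance (2.2) is immediate to compute; a small helper of our own in R (not from the thesis's S code):

    # Distance (2.2) between M(beta1) and M(beta2) under design matrix X.
    model_distance <- function(X, beta1, beta2, sigma2) {
      diff <- X %*% (beta1 - beta2)    # difference of prediction vectors
      sum(diff^2) / sigma2             # squared L2 norm scaled by sigma^2
    }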
Of course, $\|y_M - y^*\|^2$ is only the loss under the $X$-fixed assumption—i.e., the design matrix $X$ remains unchanged for future data. Another possible assumption used in fitting linear models is $X$-random—i.e., each explanatory variable is a random variable in the sense that both training and future data are drawn independently from the same distribution. The implications of the $X$-fixed and $X$-random assumptions in subset selection are discussed by Thompson (1978); see also Miller (1990), Breiman (1992) and Breiman and Spector (1992) for different treatments of the two cases, in particular when $n$ is small. Although our work presented later will not directly investigate the $X$-random situation, it is well known that the two assumptions converge as $n \to \infty$, and hence procedures optimal in one situation are also optimal in the other, asymptotically.
The performance of models (produced by modeling procedures) is thus naturally ordered in terms of their expected distances from the true model. The modeling task is hence to find a model that is as close as possible, in expectation, to the true one. Two equivalent models, say $M(\beta_1)$ and $M(\beta_2)$, denoted by $M(\beta_1) = M(\beta_2)$, have equal expected values of this distance.
An OLS subsetmodel is one that usesa subsetof the # candidatevariablesand
whoseparametervectoris anOLS fit. Whendeterminingthebestsubsetto use,it is
commonpracticeto generateasequenceof # K 6 nestedmodelsW'S & _ with increas-
ing numbers� of variables.S�� is thenull modelwith no variablesand S > is the
full modelwith all variablesincluded.TheOLS estimateof model S & ’s parameter
vectoris $I `�� -���Gb�`�� G `�� �,� � Gb�`�� F , where G `�� is the E�N�� designmatrix for
model S & . Let � `�� -�G `�� ��G �`�� G `�� � � � G �`�� betheorthogonalprojectionma-
trix from the original # -dimensionalspaceonto the reduced� -dimensionalspace.
Then $F `�� -9� `�� F is theOLS estimateof F �`�� -9� `�� F � .Oneway of determiningsubsetmodelsis to includethevariablesin a pre-defined
orderusingprior knowledgeaboutthe modelingsituation. For example,in time
seriesanalysisit usuallymakesgoodsenseto givepreferenceto closerpointswhen
selectingautoregressive terms,while whenfitting polynomials,lower-degreeterms
19
areoften includedbeforehigher-degreeones.Whenthevariablesequenceis pre-
defined,a total of # K 6 subsetmodelsareconsidered.
In theabsenceof prior ordering,a data-drivenapproachmustbeusedto determine
appropriatesubsets.Thefinal modelcould involve any subsetof thevariables.Of
course,computingandevaluatingall ! > modelsrapidly becomescomputationally
infeasibleas # increases.Techniquesthat are usedin practiceinclude forward,
backward,andstepwiserankingof variablesbasedon partial-� ratios(Thompson,
1978).
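As an illustration of the data-driven case, the following R sketch (ours; not the thesis's implementation) ranks variables by forward selection using partial-$F$ ratios:

    # Forward ranking of variables by partial-F ratios (illustration only).
    forward_order <- function(X, y) {
      n <- nrow(X); k <- ncol(X)
      remaining <- seq_len(k); ordering <- integer(0)
      rss <- sum((y - mean(y))^2)                # null-model RSS
      for (step in seq_len(k)) {
        rss_new <- sapply(remaining, function(j) {
          fit <- lm.fit(cbind(1, X[, c(ordering, j), drop = FALSE]), y)
          sum(fit$residuals^2)
        })
        df2 <- n - (step + 1)                    # residual degrees of freedom
        fstats <- (rss - rss_new) / (rss_new / df2)
        pick <- which.max(fstats)
        ordering <- c(ordering, remaining[pick])
        rss <- rss_new[pick]
        remaining <- remaining[-pick]
      }
      ordering                                   # data-driven variable ordering
    }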
The difference between the prior-ordering and data-driven approaches affects the subset selection procedure. If the ordering of variables is pre-defined, subsets are determined independently of the data, which implies that the ratio between the residual sum of squares and the estimated variance can be assumed to be $\chi^2$ distributed. The subset selection criteria FPE, AIC, and $C_p$ all make this assumption. However, data-driven ordering complicates the situation. Candidate variables compete to enter and leave the model, causing competition bias (Miller, 1990). It is certainly possible to use FPE, AIC and $C_p$ in this situation, but they lack theoretical support, and in practice they perform worse than when the variable order is correctly pre-defined. For example, suppose underfitting is negligible and the number of redundant variables increases without bound. Then the predictive accuracy of the selected model and its expected number of redundant variables both tend to constant values when the variable order is pre-defined (Shibata, 1976), whereas in the data-driven scenario they both increase without bound.

Pre-defining the ordering makes use of prior knowledge of the underlying model. As is only to be expected, this will improve modeling if the information is basically correct, and hinder it otherwise. In practice, a combination of pre-defined and data-driven ordering is often used. For example, when certain variables are known to be relevant, they should definitely be kept in the model; also, it is common practice to always retain the constant term.
2.4 Asymptotics

We will be concerned with two asymptotic situations: $n$-asymptotics, where the number of observations increases without bound, and $k$-asymptotics, where the number of variables increases without bound. Usually $k$-asymptotics implies $n$-asymptotics; for example, our $k$-asymptotic conclusion implies, at least, $n - k \to \infty$ (see Section 3.6). In this section we review some $n$-asymptotic results. The remainder of the thesis is largely concerned with $k$-asymptotics, which justify the applicability of the proposed procedures in situations of large datasets with many possible variables—i.e., a situation where subset selection makes sense.
The model selection criteria FPE, AIC and $C_p$ are $n$-asymptotically equivalent (Shibata, 1981) in the sense that they depend on threshold values of the $F$-statistic that become the same—in this case, 2—as $n$ approaches infinity. With reasonably large sample sizes, the performance of different $n$-asymptotically equivalent criteria is hardly distinguishable, both theoretically and experimentally. When discussing asymptotic situations, we use AIC to represent all three criteria. Note that this definition of equivalence is less general than ours, as defined in Sections 1.2 and 2.2; ours will be used throughout the thesis.
Asymptotically speaking, the change in the residual sum of squares of a significant variable is $O(n)$, whereas that of a redundant variable has an upper bound $O(1)$, in probability, and an upper bound $O(\log\log n)$, almost surely; see, e.g., Shibata (1986), Zhao et al. (1986), and Rao and Wu (1989). The model estimator generated by a threshold function bounded between $O(1)$ and $O(n)$ is weakly consistent in terms of model dimensionality, whereas one whose threshold function is bounded between $O(\log\log n)$ and $O(n)$ is strongly consistent.
Some model selection criteria are $n$-asymptotically strongly consistent. Examples include BIC (Schwarz, 1978), $\phi$ (Hannan and Quinn, 1979), GIC (Zhao et al., 1986), and Rao and Wu (1989). These all replace AIC's threshold of 2 by an increasing function of $n$ bounded between $O(\log\log n)$ and $O(n)$. The function value usually exceeds 2 (unless $n$ is very small), giving a threshold that is larger than AIC's. However, employing the rate of convergence does not guarantee a satisfactory model in practice. For any finite dataset, a higher threshold runs a greater risk of discarding a non-redundant variable that is only barely contributive. Criteria such as AIC that are $n$-asymptotically inconsistent do not necessarily perform worse than consistent ones in finite samples.
These OLS subset selection criteria all minimize a quantity that becomes, in the sense of $n$-asymptotic equivalence,

    $\|y - \hat{y}_{M_j}\|^2 / \sigma^2 + \lambda j$    (2.3)

with respect to the dimensionality parameter $j$, where $\lambda$ is the threshold value. Denote the $n$-asymptotic equivalence of procedures or models by $=_n$ (and $k$-asymptotic equivalence by $=_k$). We write (2.3) in parameterized form as OLSC($\lambda$), where OLS $=_n$ OLSC(0), AIC $=_n$ OLSC(2) and BIC $=_n$ OLSC($\log n$). The model selected by criterion (2.3) is denoted by $M_{\mathrm{OLSC}(\lambda)}$. Since equivalent procedures imply equivalent estimated models (and vice versa), we can also write $M_{\mathrm{OLS}} =_n M_{\mathrm{OLSC}(0)}$, $M_{\mathrm{AIC}} =_n M_{\mathrm{OLSC}(2)}$ and $M_{\mathrm{BIC}} =_n M_{\mathrm{OLSC}(\log n)}$.
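Once a variable ordering is fixed, the parameterized criterion OLSC($\lambda$) takes only a few lines; a sketch of our own in R (forward_order is the hypothetical helper sketched in Section 2.3):

    # Choose the dimensionality j minimizing (2.3) over a nested sequence.
    olsc_select <- function(X, y, ordering, lambda, sigma2) {
      rss <- numeric(length(ordering) + 1)
      rss[1] <- sum((y - mean(y))^2)             # j = 0: null model
      for (j in seq_along(ordering)) {
        fit <- lm.fit(cbind(1, X[, ordering[1:j], drop = FALSE]), y)
        rss[j + 1] <- sum(fit$residuals^2)
      }
      crit <- rss / sigma2 + lambda * (0:length(ordering))
      which.min(crit) - 1                        # selected dimensionality
    }
    # lambda = 2 gives AIC, log(n) gives BIC, and 2*log(k) gives RIC (Section 2.7).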
2.5 Shrinkage methods

Shrinkage methods provide an alternative to OLS subset selection. Ridge regression gives a biased estimate of the model's parameter vector that depends on a ridge parameter. Increasing this quantity shrinks the OLS parameters toward zero. This may give better predictions by reducing the variance of predicted values, though at the cost of a slight increase in bias. It often improves the performance of the OLS estimate when some of the variables are (approximately) collinear. Experiments show that ridge regression can outperform OLS subset selection if most variables have small to moderate effects (Tibshirani, 1996). Although standard ridge regression does not reduce model dimensionality, its lesser-known variants do (Miller, 1990).
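For reference, the ridge estimator itself is a single linear solve; a minimal sketch of our own in R:

    # Ridge estimate (X'X + lambda I)^{-1} X'y; lambda = 0 recovers OLS.
    ridge_beta <- function(X, y, lambda) {
      solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
    }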
The nn-garrote (Breiman, 1995) and lasso (Tibshirani, 1996) procedures zero some parameters and shrink others by defining linear inequality constraints on the parameters. Experiments show that these methods outperform ridge regression and OLS subset selection when predictors have small to moderate numbers of moderate-sized effects, whereas OLS subset selection based on $C_p$ prevails over the others for small numbers of large effects (Tibshirani, 1996).

All shrinkage methods rely on a parameter: the ridge parameter for ridge regression, the garrote parameter for the nn-garrote, and the tuning parameter for the lasso. In each case the parameter value significantly influences the result. However, there is no consensus on how to determine suitable values, which may partly explain the unstable performance of these methods. In Section 3.4, we offer a new explanation of shrinkage methods.
2.6 Data resampling

Standard techniques of data resampling, such as cross-validation and the bootstrap, can be applied to the subset selection problem. Theoretical work has shown that, despite their computational expense, these methods perform no better than the OLS subset selection procedures. For example, under weak conditions, Shao (1993) shows that the model selected by leave-$d$-out cross-validation, denoted CV($d$), is $n$-asymptotically consistent only if $d/n \to 1$ and $n - d \to \infty$ as $n \to \infty$. This suggests that, perhaps surprisingly, the training set in each fold should be chosen to be as small as possible, if consistency is desired. Zhang (1993) further establishes that under similar conditions, CV($d$) $=_n$ OLSC($(2n - d)/(n - d)$). This means that AIC $=_n$ CV(1) and BIC $=_n$ CV($n(\log n - 2)/(\log n - 1)$) $n$-asymptotically.

The behavior of the bootstrap for subset selection is examined by Shao (1996), who proves that if the bootstrap using sample size $m$, BS($m$), satisfies $m/n \to 0$, it is $n$-asymptotically equivalent to CV($n - m$); in particular, BS($n$) $=_n$ CV(1). Therefore, BS($m$) $=_n$ OLSC($(n + m)/m$).
One problem with these data resampling methods is the difficulty of choosing an appropriate number of folds $d$ for cross-validation, or an appropriate sample size $m$ for the bootstrap—both affect the corresponding procedures. More fundamentally, their asymptotic equivalence to OLSC seems to imply that these data resampling techniques fail to take the competition phenomenon into account and hence are unable to solve the problems raised in Section 2.1.
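The equivalences above can be collected into two one-line helpers (ours) giving the OLSC threshold implied by each resampling scheme:

    # Equivalent OLSC thresholds implied by the n-asymptotic results above.
    lambda_cv <- function(n, d) (2 * n - d) / (n - d)   # leave-d-out CV
    lambda_bs <- function(n, m) (n + m) / m             # bootstrap, sample size m
    # e.g. lambda_cv(n, 1) and lambda_bs(n, n) both approach 2, matching AIC.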
2.7 The RIC and CIC

When there is no pre-defined ordering of variables, it is necessary to take account of the process by which a suitable ordering is determined. The expected value of the $i$-th largest squared $t$-statistic of $k$ noisy variables approaches $2\log(k/i)$ as $k$ increases indefinitely. This property can help with variable selection.

The soft thresholding procedure developed in the context of wavelets (Donoho and Johnstone, 1994) and the RIC for subset selection (Foster and George, 1994) both aim to eliminate all non-contributory variables, up to the largest, by replacing the threshold 2 in AIC with $2\log k$; that is, RIC $=$ OLSC($2\log k$). The more variables, the higher the threshold. When the true hypothesis is the null hypothesis (that is, there are no contributive variables), or the contributive variables all have large effects, RIC finds the correct model by eliminating all noisy variables up to the largest. However, when there are significant variables with small to moderate effects, these can be erroneously eliminated by the higher threshold value.
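The $2\log(k/i)$ approximation that both procedures build on is easy to check by simulation; a short R sketch (ours, with arbitrary simulation sizes):

    # The i-th largest squared z-statistic among k noise variables is
    # approximately 2*log(k/i).
    set.seed(3)
    k <- 1000
    z2 <- replicate(500, sort(rnorm(k)^2, decreasing = TRUE)[1:5])
    rowMeans(z2)         # observed means for i = 1, ..., 5
    2 * log(k / (1:5))   # the approximation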
The CIC procedure (Tibshirani and Knight, 1997) adjusts the training error by taking into account the average covariance of the predictions and responses, based on the permutation distribution of the dataset. In an orthogonally decomposed model space, the criterion simplifies to

    $\mathrm{CIC}(j) = \|y - \hat{y}_{M_j}\|^2 + 2\,\mathrm{E}^*\big[\textstyle\sum_{i=1}^{j} t^2_{(i,k)}\big]\,\hat{\sigma}^2$    (2.4)
    $\phantom{\mathrm{CIC}(j)} \approx \|y - \hat{y}_{M_j}\|^2 + 2\textstyle\sum_{i=1}^{j} 2\log(k/i)\,\hat{\sigma}^2$,    (2.5)

where $t^2_{(i,k)}$ is the $i$-th largest squared $t$-statistic out of $k$, and $\mathrm{E}^*$ is the expectation over the permutation distribution. As this equation shows, CIC uses a threshold value that is twice the expected sum of the squared $t$-statistics of the $j$ largest noisy variables out of $k$. Because $\lim_{k\to\infty} P\{t^2_{(1,k)} > 2\,\mathrm{E}^* t^2_{(1,k)}\} = 0$, this means that, for the null hypothesis, even the largest noisy variable is almost always eliminated from the model. Furthermore, this has the advantage over RIC that smaller contributive variables are more likely to be recognized and retained, due to the uneven, stairwise threshold of CIC for each individual variable.
Nevertheless, shortcomings exist. For example, if most variables have strong effects and will certainly not be discarded, the remaining noisy variables are treated by CIC as though they were the smallest out of $k$ noisy variables—whereas in reality, the number should be reduced to reflect the smaller number of noisy variables. An overfitted model will likely result. Analogously, underfitting will occur when there are just a few contributive variables (Chapter 6 gives an experimental illustration of this effect). CIC is based on an expected ordering of the squared $t$-statistics for noisy variables, and does not deal properly with situations where variables have mixed effects.
2.8 Empirical Bayes
Although our work startedfrom the viewpoint of competingmodels,the solution
discussedin the later chaptersturnsout to be very similar to the methodologyof
empiricalBayes.In this section,we briefly review this paradigm,in anattemptto
provideasimpleandsolidbasisfor introducingtheideasin ourapproach.However,
25
Sections4.6and7.3show thatthecompetingmodelsviewpointseemsto fall beyond
thescopeof theexistingempiricalBayesmethods.
2.8.1 Empirical Bayes
The empirical Bayesmethodologyis due to Robbins(1951,1955,1964,1983).
After its appearance,it wasquickly consideredto beabreakthroughin thetheoryof
statisticaldecisionmaking(cf. Neyman,1962).Thereis alargeliteraturedevotedto
it. Booksthatprovide introductionsanddiscussionsincludeBerger(1985),Maritz
andLwin (1989),CarlinandLouis(1996),andLehmannandCasella(1998).Carlin
andLouis (2000)givea recentreview.
The fundamentalideaof empiricalBayes,encompassingthe compounddecision
problem(Robbins,1951)1, is to usedatato estimatethe prior distribution (and/or
theposteriordistribution) in Bayesiananalysis,insteadof resortingto a subjective
prior in theconventionalway. Despitetheterm“empiricalBayes”it is a frequentist
approach,andLindley notesin hisdiscussionof Copas(1969)that“thereis noone
lessBayesianthananempiricalBayesian.”
A typicalempiricalBayesprobleminvolvesobservations{ � 1323232Á1�{ > , independently
sampledfrom distributions ���{ ± �,Â��± � , where Â��± may be completelydifferent for
eachindividual ª . DenoteÃg-���{ � 1323232i1�{ > �¢Ä and Å � -���Â��� 132�232�1,Ây�> �¨Ä . Let Å be an
estimatorof Å � . Givena lossfunction,say,Æ ��ÅT1�Å � ��-mlnlÇÅ p Å � lnl � - >° ±È² � �� ± p  �± � � 1 (2.6)
theproblemis to find theoptimalestimatorthatminimizestheexpectedloss.
As we know, the usualmaximumlikelihoodestimatoris Å ML -ÉÃ , which is also
the Bayesestimatorwith respectto the non-informative constantprior. The em-
pirical Bayesestimator, however, is different. Since { � 1�23232�1j{ > are independent,1Throughoutthethesis,weadoptthewidely-usedtermempiricalBayes, althoughthetermcom-
pounddecisionprobablymakesmoresensein ourcontext of competingmodels.
26
we canconsiderthat Ây���1323232Á1,Â��> arei.i.d. from a commonprior distribution ����Â��,� .Therefore,it is possibleto estimate���Ây��� from { � 1�23232�1j{ > basedontherelationship� > ��{¤� ¶\Ê ���{��, � �©Ë���� � ��2 (2.7)
Fromthis relationship,it is possibleto find anestimator ����� � � of ��� � � from the
known � > ��{¤� and ����{��,Â���� (we will discusshow to do this in Chapter5). Once���� � � hasbeenfound,theremainingstepsareexactly like Bayesiananalysiswith
the subjective prior replacedby the estimated ����� � � , resulting in the empirical
Bayesestimator  EB
± ��ª�1�Ã��k-ÍÌuÎ Ây�,Ϭ��{ ± �,Â��,�©ËÐ����Ây���Ì Î Ï¿��{ ± �� � �©Ë¼���� � � 2 (2.8)
Note that the dataareusedtwice: oncefor the estimationof �������� andagainfor
updatingeachindividual  ML
±(or { ± ).
TheempiricalBayesestimatorcanbetheoreticallyjustifiedin theasymptoticsense.
TheBayesrisk of theestimatorconvergesto thetrueBayesrisk as #�| } , i.e.,as
if thetruedistribution ��� � � wereknown. Sinceno estimatorcanreducetheBayes
risk below the true Bayesrisk, it is asymptoticallyoptimal (Robbins,1951,1955,
1964).Clearly Å ML is inferior to Å EB in this sense.
Empirical Bayescan be categorizedin differentways. The bestknown division
is betweenparametric empirical Bayesand nonparametricempirical Bayes, de-
pendingon whetheror not the prior distribution assumesa parametricform—for
example,anormaldistribution.
Applicationsof empirical Bayesare surveyed by Morris (1983) and Maritz and
Lwin (1989).Themethodis oftenusedeitherto generatea prior distributionbased
on pastexperience(see,e.g.,Berger, 1985),or to solve compounddecisionprob-
lemswhena lossfunction in theform of (2.6) canbeassumed.In mostcases,the
parametricempiricalBayesapproachis adopted(see,e.g.,Robbins,1983;Maritz
andLwin, 1989),wherethemixing distribution hasa parametricinterpretation.To
27
my knowledge,it is rarelyusedin thegeneralsettingof empiricalmodeling,which
is the main concernof the thesis. Further, the lessfrequentlyusednonparametric
empiricalBayesappearsmoreappropriatein our approach.
2.8.2 Steinestimation
Shrinkageestimationor Steinestimationis anotherfrequentisteffort, andhasclose
ties to parametricempiricalBayes. It wasoriginatedby Stein(1955),who found,
surprisingly, that theusualmaximumlikelihoodestimatorfor themeanof a multi-
variatenormaldistribution,i.e., Å ML -^Ã , is inadmissiblewhen #ÒÑg! . Heproposed
anew estimator Å JS -ÔÓ¨6 p ��# p !"��Qq�lnl ÃÕlnl � Ö Ã~1 (2.9)
which waslater shown by JamesandStein(1961) to have smallermeansquared
error thantheusualestimatorfor every Å � , when #8Ñ×! , althoughthenew estima-
tor is itself inadmissible.This new estimator, known astheJames-Steinestimator,
improvesovertheusualestimatorby shrinkingit towardzero.Subsequentdevelop-
mentof shrinkageestimationgoesfar beyondthis case,for example,positive-part
James-Steinestimator, shrinking toward the commonmean,toward a linear sub-
space,etc.(cf. LehmannandCasella,1998).
The connectionof the James-Steinestimatorto parametricempirical Bayeswas
establishedin the seminalpapersof Efron and Morris (1971,1972a,b,1973a,b,
1975,1977). Among other things, they showed that the James-Steinestimatoris
exactlytheparametricempiricalBayesestimatorif �������� is assumedto beanormal
distributionandtheunbiasedestimatorof theshrinkagefactoris used.
Althoughour work doesnot adopttheSteinestimationapproachandfocusesonly
onasymptoticresults,it is interestingto notethatwhile thegeneralempiricalBayes
estimatorachievesasymptoticoptimality, theJames-Steinestimator, andafew other
similarones,uniformallyoutperformtheusualML estimatorevenfor finite #��¨Ñ^!"� .28
Sofar theredoesnot seemto have beenfoundanestimatorwhich bothdominates
the usualML estimatorfor finite # and is # -asymptoticallyoptimal for arbitrary������� .The James-Steinestimatoris appliedto linear regressionby Sclove (1968) in the
form ØIb-Ù��6 p ��# p !"��Q �Ú�Û�Û � �I~1 (2.10)
where �I is theOLS estimateof theregressioncoefficientsandRSSis theregression
sumof squares.Note that whenRSSis large in comparisonwith the numerator��# p !"�¨QR� , shrinkagehasalmostno effect. Alternativederivationsof this estimator
aregivenby Copas(1983).
2.9 Summary
In this chapter, we reviewed the major issuesin fitting linear models,aswell as
proceduressuchasOLS, OLS subsetselection,shrinkage,asymptotics,x-fixedvs.x-
random,dataresampling,etc.Fromthediscussion,acommonthreadhasemerged:
eachprocedure’s performanceis closelyrelatedto thedistribution of theeffectsof
the variablesit involves. This considerationis analogousto the idea underlying
empiricalBayesmethodology, andwill beexploitedin laterchapters.
29
Chapter 3
Paceregression
3.1 Intr oduction
In Chapter2, we reviewedmajor issuesin fitting linearmodelsandintroducedthe
empiricalBayesmethodology. Fromthisbrief review, it becomesclearthatthema-
jor proceduresfor linearregressionall fail to solve thesystematicproblemsraised
in Section2.1 in any generalsense.It hasalsoemergedthateachprocedure’s per-
formancerelatescloselyto theproportionsof thedifferenteffectsof theindividual
variables. It seemsthat the essentialfeatureof any particularregressionproblem
is thedistribution of theeffectsof thevariablesit involves.This raisesthreeques-
tions:how to definethisdistribution;how to estimateit from thedata,if indeedthis
is possible;andhow to formulatesatisfactorygeneralregressionproceduresif the
distribution is known. Thischapteranswersthesequestions.
In Section3.2, we introducethe orthogonaldecompositionof models. In the re-
sultingmodelspace,theeffectsof variables,whichcorrespondto dimensionsin the
space,arestatisticallyindependent.Oncethe effectsof individual variableshave
beenteasedout in this way, thedistributionof theseeffectsis easilydefined.
Thesecondquestionaskswhetherwe canestimatethis distribution from thedata.
The answeris “yes.” Moreover, estimatorsexist which arestronglyconsistent,in
thesenseof # -asymptotics(andalso E -asymptotics).In fact,theestimationis sim-
31
ply a clusteringproblem—tobe more precise,it involvesestimatingthe mixing
distributionof amixture.Section3.6showshow to performthis.
We answerthe third questionby demonstratingthreesuccessively morepowerful
techniques,andestablishoptimality in thesenseof # -asymptotics.First, following
conventionalideasof modelselection,thedistributionof theeffectsof thevariables
canbe usedto derive an optimal thresholdfor OLS subsetmodelselection. The
resultingestimatoris provably superiorto all existing OLS subsetselectiontech-
niquesthatarebasedon a thresholdingprocedure.Second,by showing that there
are limitations to the useof thresholdingvariationreduction,we develop an im-
provedselectionprocedurethatdoesnot involve thresholding.Third, abandoning
theuseof selectionentirelyresultsin anew adjustmenttechniquethatsubstantially
outperformsall otherprocedures—outperformingOLS evenwhenall variableshave
largeeffects.Section3.5 formally introducestheseprocedures,which areall based
on analyzingthedimensionalcontributionsof the estimatedmodelsintroducedin
Section3.3, anddiscussestheir properties.Section3.4 illustratesthe procedures
throughexamples.
Twooptimalities,correspondingto theminimumprior andposteriorexpectedlosses
respectively, aredefinedin Section3.4. The optimality of eachproposedestima-
tor is theoreticallyestablishedin thesenseof # -asymptotics,which further impliesE -asymptotics.Whenthenoisevarianceis unknown andreplacedby the OLS un-
biasedestimator $Qq� , the realizationof the optimality requires ��E p #=�| } ; see
Corollary3.12.1in Section3.6.
Sometechnicaldetailsarepostponedto theappendixin Section3.8.
3.2 Orthogonal decompositionof models
In this section,we discussthe orthogonaldecompositionof linear models,and
definesomespecialdistancesthat areof particularinterestin our modelingtask.
We will alsobriefly discussin Section3.2.3 the advantagesof usingan orthogo-
32
nal modelspace,the OLS subsetmodels,andthe choiceof an orthonormalbasis.
We assumethatno variablesarecollinear—that is, a modelwith # variableshas #degreesof freedom.Wereturnto theproblemof collinearvariablesin Section4.3.
Following thenotationintroducedin Section2.2,givenamodel S ��IT� with param-
etervector I , its predictionvectoris F ` -HGJI whereG is the E�N�# designmatrix.
This vectoris locatedin thespacespannedby the # separateE -vectorsthat repre-
sentthevaluesof theindividual variables.For any orthonormalbasisof this spaceÜ � 1 Ü � 13232�2�1 Ü > (satisfying lnl Ü & ltl�-Ý6 ), let � � 1,� � 1323232Á1,� > bethecorrespondingprojec-
tion matricesfrom this spaceonto the axes. F ` decomposesinto # components� � F ` 1�� � F ` 13232�2Á1,� > F ` , eachbeinga projectionontoa differentaxis. Clearly the
wholeis thesumof theparts: F ` -ßÞ >& ² � � &iF ` .
3.2.1 Decomposingdistances
Thedistanceh ��S ��I � �i1�S ��I � ��� betweenmodelsS ��I � � and S ��I � � hasbeende-
finedin (2.2).AlthoughthismeasureinvolvesthenoisevarianceQ � for convenience
of bothanalysisandcomputation,it is lnl F `ba@c3o�d p F `ba@cirsd ltl � that is thecenterof in-
terest.
Givenanorthogonalbasis,thedistancebetweentwo modelscanbedecomposedas
follows: h ��S ��I � ��1�S ��I � �j��- >° & ² � h & ��S ��I � ��1�S ��I � �j�i1 (3.1)
where h & ��S ��I � �i1�S ��I � ���~-àlnl � &,F `ba@c3o�d¤p � &,F `ba@cir¢d lnl � yQ � (3.2)
is the � th dimensionaldistancebetweenthe models. The propertyof additivity
of distancein this orthogonalspacewill turn out to be crucial for our analysis:
the distancebetweenthe modelsis equalto the sumof the distancesbetweenthe
33
models’projections.
Denoteby S�� thenull model S ��/"� , whoseevery parameteris zero.Thedistance
betweenS andthenull modelis theabsolutedistanceof S , denotedby á¦��Sâ� ;thatis, á¦��Sâ�~- h ��Sz1�S���� . Decomposingtheabsolutedistanceyields
á¦��Sâ�~- >° & ² � á & ��Sâ�i1 (3.3)
whereá & ��Sâ�~-ãlnl � &�F ` lnl �� yQR� is the � th dimensionalabsolutedistanceof S .
3.2.2 Decomposingthe estimation task
Two modelsareof centralinterestin theprocessof estimation:thetruemodel SU�andtheestimatedmodel S . In Section2.2,we definedthedistancebetweenthem
asthelossof theestimatedmodel,denotedbyÆ ��Sâ� ; thatis,
Æ ��Sâ�~- h ��Sz1�S � � .We also relatedthis loss function to the quadraticloss incurredwhenpredicting
futureobservations,asrequiredby theminimumexpectedlossprinciple.
Beingadistance,thelosscanbedecomposedinto dimensionalcomponentsÆ ��Sâ�~- >° & ² � Æ & ��Sâ�i1 (3.4)
whereÆ & ��Sâ�~- h & ��Sz1�SU�,�~-ãltl � &,F ` p � &ÁF `fe lnl �� yQR� is the � th dimensionalloss
of model S .
Orthogonaldecompositionbreakstheestimationtaskdown into individualestima-
tion tasksfor eachof the # dimensions.á & ��SU��� is theunderlyingabsolutedistance
in the � th dimension,and á & ��Sâ� is anestimateof it. Thelossincurredby this es-
timateisÆ & ��Sâ� . Thesumof the lossesin eachdimensionis thetotal loss
Æ ��Sâ�of themodel S . This reducesthemodelingtaskto estimatingá & ��S � � for all � .Oncetheseestimateddistanceshave beenfound for eachdimension�¦-ä6"1�23232�1,# ,theestimatedmodelcanbereconstructedfrom them.
34
Our estimationprocesshastwo steps:an initial estimateof á & ��SU�,� followedby
a furtherrefinementstage.This implies that thereis someinformationnot usedin
the initial estimatethatcanbeexploited to improve it; we usethe lossfunction to
guiderefinement.Thefirst stepis to find a relationshipbetweentheinitial estimateá & ��Sâ� and á & ��S � � . The classicOLS estimate åS , which hasparametervector�IJ-Ý��G � Gæ� � � G � F , providesabasisfor sucharelationship,becauseit is well known
thatfor all � , á & � åS�� areindependently, noncentrally��� distributedwith onedegree
of freedomandnoncentralityparameterá & ��SU���� "! (Schott,1997,p.390).Wewrite
this as á & � åSU��ç9� � � ��á & ��S � �� "!�� independentlyfor all �2 (3.5)
WhenQR� is unknownandthereforereplacedby theunbiasedOLS estimate$QR� , the ���distribution in (3.5)becomesan � distribution: á & � åSU��çß���¨6"1�E p #è1�á & ��SU�,�j "!"�(asymptotically)independentlyfor all � . The � -distribution canbeaccuratelyap-
proximatedby (3.5)when E p #�é / .This relationshipformsacornerstoneof ourwork. In Section3.8.1wediscusshow
to re-generatethemodelfrom theupdatedabsolutedistances.
Updating signedprojections. An alternativeapproachto updatingabsolutedis-
tancesis to updatesignedprojections,wherethe signsof the projectionscanbe
determinedby the directionsof theaxis in an orthogonalspace.Both approaches
sharethesamemethodology, anduseaverysimilaranalysisandderivationto obtain
correspondingmodelingprocedures.
Our work throughoutthethesisfocuseson updatingabsolutedistances.It seemsto
us thatupdatingabsolutedistancesis generallymoreappropriatethanthealterna-
tive,althoughwe do not excludethepossibilityof usingthelatter in practice.Our
argumentsfor this preferencearegivenin Section4.7.
35
3.2.3 Remarks
Advantagesof an orthogonal model space. Thereareat leasttwo advantages
to usingorthogonallydecomposedmodelsfor estimation.Thefirst is additivity: the
distancebetweentwo modelsis thesumof their distancesin eachdimension.This
convenientpropertyis inheritedby two specialdistances:theabsolutedistanceof
themodelitself andthelossfunctionof theestimatedmodel.Of course,any quan-
tity that involvesadditionandsubtractionof distancesbetweenmodelsis additive
too.
Thesecondadvantageis theindependenceof themodel’scomponentsin thediffer-
entdimensions.Thismakesthedimensionaldistancesbetweenmodelsindependent
too. In particular, theabsolutedistancesin eachdimension,andthelossesincurred
by an estimatedmodel in eachdimension—aswell asany measurederived from
theseby additiveoperators—areindependentbetweenonedimensionandanother.
Thesetwo featuresallow the processof estimatingthe overall underlyingmodel
to be broken down into estimatingits componentsin eachdimensionseparately.
Furthermore,the distribution of the effectsof variables—preciselydimensions—
canbeaccuratelydefined.
A specialcase: OL S subsetmodels. It is well known that OLS subsetmodels
area specialcaseof orthogonallydecomposedmodelsin which eachvariableis
associatedwith anorthogonaldimension.Thismakesit easyto simplify themodel
structure:discardingvariablesin acertainorderis thesameasdeletingdimensions
in thesameorder. If thediscardedvariablesareactuallyredundant,deletingthem
makesthemodelmoreaccurate.
Denotethe nestedsubsetmodelsof S , from the null model up to the full one,
by S��31�S � , 232�231�S > respectively. Denoteby F `�� the predictionvector of the� -dimensionalmodel S & , and by � `�� -ÔG `�� ��G �`�� G `�� � � � G �`�� the orthogo-
nal projectionmatrix from the spaceof # dimensionsto the � -dimensionalsub-
36
spacecorrespondingto S & . Then F `�� - � `�� F ` and � & - � `�� p � `��¨ê o .Furthermore,the � th orthonormalbasecan be written as
Ü & - � &�F èltl � &�F lnlZ-� F `�� p F `��¨ê o �j èlnlÈ� F `�� p F `��¨ê o �'ltl .Choiceof an orthonormal basis. Thechoiceof anorthonormalbasisdepends
ontheparticularapplication.In somecases,theremayexist anobvious,reasonable
orthonormalbasis—forexample,orthogonalpolynomialsin polynomialregression,
or eigenvectorsin principal componentanalysis.In othersituationswhereno ob-
viousbasisexists, it is alwayspossibleto constructonefrom thegivendatausing,
say, a partial-� test(Section2.3). The latter, moregeneral,data-driven approach
is employed in our implementationandsimulationstudies(Chapter6). However,
wenotethatthepartial-� testneedsto observe F -valuesbeforemodelconstruction,
andhencecancausea problemof orthogonalizationselection(Section4.6),which
is ananalogousphenomenonto modelselection.
3.3 Contrib utions and expectedcontributions
Wenext exploreanew measurefor amodel:its contribution. An estimatedmodel’s
contribution is zerofor thenull modelandreachesa maximumwhenthemodelis
thesameastheunderlyingone.It canbedecomposedinto # independent,additive
componentsin # -dimensionalorthogonalspace—the“dimensionalcontributions.”
In practice,thesequantitiesarerandomvariables,andwe candefinebotha cumu-
lative expectedcontribution functionandits derivative of of the estimatedmodel.
Thesetwo functionscanbe estimatedfor any particularregressionproblem,and
will turn out to play key rolesin understandingthemodelingprocessandin build-
ing actualmodelsin practice.
37
3.3.1 ëíìjî8ï and ðñëíìjîJïDefinition 3.1 Thecontribution of an estimateS of theunderlyingmodel SU� is
definedto be ò ��Sâ�k- Æ ��S��,� p Æ ��Sâ��2 (3.6)
The contribution is actually the differencebetweenthe lossesof two models: the
null andtheestimated.Therefore,its signandvalueindicatethemeritsof theesti-
matedmodelagainstthenull one:a positivecontribution meansthat theestimated
modelis betterthanthenull onein termsof predictive accuracy; a negative value
meansit is worse;andzeromeansthey areequivalent.
Since��S � � is aconstant,maximizingthe(expected)contribution
ò ��Sâ� is equiv-
alentto minimizing the(expected)lossÆ ��Sâ� , wherethelatter is our goalof mod-
eling. Therefore,wenow convertthisgoalinto maximizingthe(expected)contribu-
tion. Usingthecontribution, ratherthandealingdirectly with losses,is particularly
usefulin understandingthetaskof modelselection,aswill soonbecomeclear.
Givena # -dimensionalorthogonalbasis,thecontribution functiondecomposesinto# componentsthatretainthepropertiesof additivity anddimensionalindependence:ò ��Sâ��- >° & ² �ò & ��Sâ� (3.7)
where ò & ��Sâ�~- Æ & ��S���� p Æ & ��Sâ� (3.8)
is the � th dimensionalcontributionof themodel S .
In thesubsetselectiontask,eachdimensionis eitherretainedor discarded.It is clear
that this decisionshouldbe basedon the sign of the correspondingdimensional
contribution. If a dimension’s contribution is positive, retainingit will give better
38
predictiveaccuracy thandiscardingit, andconversely, if thecontributionis negative
thendiscardingit will improveaccuracy. If thedimensionalcontribution is zero,it
makesno differenceto predictive accuracy whetherthat dimensionis retainedor
not.
In following we focus on a single dimension,the � th. The resultsof individual
dimensionscaneasilybe combinedbecausedimensionsare independentand the
contribution measureis additive. Focusingon the contribution in this dimension,
we write %'& � -óá & ��Sâ� and % �& � -ôá & ��S � � . Without lossof generality, assume
that % �& ¾ / . If theprojectionof F ` in the � th dimensionis in thesamedirection
as that of F ` e , then %'& is the positive squareroot of á & ��Sâ� ; otherwiseit is the
negativesquareroot. In eithercase,thecontributioncanbewrittenò & ��Sâ��- % �& � p � %'& p % �& � � 2 (3.9)
Clearly,
ò & ��Sâ� is zerowhen %'& is / or ! % �& . When %'& liesbetweenthesevalues,the
contribution is positive. For any othervaluesof %'& , it is negative. The maximum
contribution is achievedwhen %u& - % �& , andhasvalue % �& � , which occurswhen—in
this dimension—theestimatedmodelis the truemodel. This is obviously thebest
thattheestimatedmodelcando in this dimension.
In practice,however, only %'& � is available. Neither the valueof % �& � nor the di-
rectionalrelationshipbetweenthe two projectionsareknown. Denote
ò & ��Sâ� by�� %'& � � , alteringthenotionof thecontribution of S in this dimensionto thecon-
tribution of %'& � . ��� %'& � � is usedbelow asshorthandfor �� %'& � � % �& � 1�õ & ��- ò & ��Sâ� ,where õ & is thesignof %'& . In thefollowing we dropthesubscript� whenonly one
dimensionis underconsideration,giving % � and % � � for %'& � and % �& � respectively.
Wealsouse � for % � sinceit is this, ratherthan % , thatis available;likewiseweuse� � for % � � .We have arguedthat the performanceof an estimatedmodel can be analyzedin
termsof thevalueof its contribution in eachdimension.Unfortunately, this value
is unavailablein practice. What canbe computed,aswe will show, is the condi-
39
tional expectedvalue ö~÷¬½Ç��������+�31�õ'�'l �13v À wherethe expectationis taken over all
theuncertaintiesconcerning�+� and õ . The“ v ” hererepresentstheprior knowledge
that is availableabout ��� and õ (if any). In our work presentedlater, the“ v ” is re-
placedby threequantities:��� , ����+�,� (distributionof ��� ), and �� plus �������,� . The# -asymptotic“availability” of ���� � � (i.e., stronglyconsistentestimators)ensures
our # -asymptoticconclusions.
Note that theexpectationis alwaysconditionalon � . The valueof � is available
in practice,for example,thedimensionalabsolutedistanceof the OLS full model.
Therefore,theexpectedcontribution is a functionof � , andwe simply denoteit by��������� (equivalently, ö~÷q������ ) asa generalrepresentationfor all possiblesitua-
tions.
3.3.2 øÍìjîJï and ùúì�îJïIn practice,theestimate� is a randomvariable. For example,accordingto (3.5),
theOLS estimateis ��9ç\� � ����� � �!"� . Analogouslyto thedefinitionsof CDF andpdf
of a randomvariable,we definethecumulativeexpectedcontribution functionand
its derivative of � , denotedby ��� and ������� respectively. For convenience,we
call ������ and ������� the - and � -functionsof � .
Definition 3.2 Thecumulativeexpectedcontribution functionof � is definedas ñ������- Ê ÷� ����«j�jϬ��«j�=Ë�«�1 (3.10)
where Ͽ����� is thepdfof � .
Further, ��������- Ë ñ�����Ëw� 2 (3.11)
40
Immediately, wehave ����������- �������Ϭ����� 2 (3.12)
Note that thedefinitionsof ������ and ������� , aswell astherelationship(3.12),en-
capsulateconditionalsituations.Forexample,with known �+� , wehave ����������+���~-��������� � �j "Ϭ������ � � , with known ���� � � , ���������©-×�������,���� "Ϭ����,��� , andwith
known �� plus ����� � � , ��������q���1,����-Í������q���1,��j "Ϭ����¤��û1,��� . Further, and �aremutuallyderivableunderthesamecondition.
3.3.3 ùúìRüîþýuîbÿuïNow wederiveanexpressionfor ���(���� given ��� , where �� is theOLS estimateof the
dimensionalabsolutedistance.Let ��ã-�$% � and % �û- K � � � , where $% is a signed
randomvariabledistributedon therealline accordingto thepdf ����$% � % �,� . Notethat
from the property(3.5), we have a specialsituationthat $% ç P � % �3136'� , and thus��0-à$% � ç9� � � � % � � "!"� , where
����$% � % � �k- 6� !�� � � ���� ê � e� rr 2 (3.13)
It is this specialcasethatmotivatestheutilization of thepdf of $% , insteadof thatof�� directly. This is becausethepdf of thenoncentralchi-squareddistributionhasan
infinite seriesform, which is inconvenientfor bothanalysisandcomputation.
Throughoutthethesis,weonly use����$% � % � � in theform (3.13),althoughthefollow-
ing derivationworksgenerallyonce����$% � % �,� is known.
With known ���,$% � % �,� , theCDF of �� given ��� is
��*������ � �~- Ê � ÷� � ÷ ����«�� � � � �=Ë=«�1 (3.14)
41
hence Ï¿�*����� � ��- £*��*��������,�£ �� - ��� � ���� � � � � K ��� p � ���� � � � �! � �� 2 (3.15)
Using (3.9), rewrite the contribution of �� given �+� by a two-argumentfunction� ��$% � % �,� ���(�����- � ��$% � % � ��- % � � p ��$% p % � � � 2 (3.16)
Only the sign of $% can affect the value of the contribution, and so the expected
contributionof �� given ��� is
����� ����� � ��- � � � ��� � � � � ��� � ��)� � � � � K � � p � ���� � � � � ��� p � ��� � � � ���� � ��� � � � � K ��� p � ��� � � � � 2(3.17)
Using(3.12),
���*����� � ��- � � � ��� � � � � ��� � ��� � � � � K � � p � ���� � � � � ��� p � ���� � � � �! � �� 2 (3.18)
In particular, ����/=���+�,��-ä/ for every ��� (seeAppendix3.8.2). This givesthe fol-
lowing theorem.
Theorem 3.3 The ���*�����+�,� of theOLS estimate�� given �+� is determinedby(3.18),
while thepdf Ͽ�*�������,� is determinedby (3.15).
Because����������- �������� "Ͽ����� by (3.12),and Ϭ����� is alwayspositive, the value
of ������� hasthe samesign as ������� . Thereforethe sign of � canbe usedasa
criterion to determinewhethera dimensionshouldbe discardedor not. Within a
positiveinterval, where �������8ÑU/ , � is expectedto contribute positively to the
predictiveaccuracy, whereaswithin a negativeone, where ���������9/ , it will do the
opposite.At azeroof ������� theexpectedcontribution is zero.
42
-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
0 5 10 15 20
h(A ; 0.0)h(A ; 0.5)h(A ; 1.0)h(A ; 2.0)h(A ; 5.0)
-20
-15
-10
-5
0
5
0 5 10 15 20
EC(A ; 0.0)EC(A ; 0.5)EC(A ; 1.0)EC(A ; 2.0)EC(A ; 5.0)
(a) (b)
Figure3.1: ���(�����+��� and ������������� , for �+�©-H/=132;4�136"1�!(1�4 .Figure3.1(a)shows examplesof ���*����� � � where � � is /�1,/=2;4�136"1�! and 4 . All the
curvesstartfrom theorigin. When � � -9/ , thecurvefirst decreasesas �� increases,
and thengraduallyincreases,approachingthe horizontalaxis asymptoticallyand
never rising above it. As thevalueof ��� grows, the ���(�����+�,� curvegenerallyrises.
However, it always lies below the horizontalaxis until �+� becomes0.5. When� � Ñ^/=2;4 , thereis onepositiveinterval. A maximumis reachednot far from � � (the
maximumapproaches��� asthe latter increases),andthereaftereachcurve slowly
descendsto meettheaxisataround· ��� (theordinateapproachesthisvalueas�+� in-
creases).Thereaftertheinterval remainsnegative;within it, each� -functionreaches
aminimumandthenascendsto approachtheaxisasymptoticallyfrom below.
Most observationsin thelastparagraphareincludedin thefollowing theorem.For
convenience,denotethe zerosof � by� � , � � , ��� in increasingorder along the
horizontalaxis,andassumethat � is properlydefinedat } . Since �+� in practiceis
alwaysfinite, weonly considerthesituation� � �^} .
Theorem 3.4 Propertiesof ���(�����+��� .1. Every �����������,� hasthreezeros(twoof which maycoincide).
2. When ����� /=2;4 , � � - � � - / , ���§- } ; when �+�íÑ /�254 , � � - / ,/���� � �^} , ���Õ-ß} .
43
3. �t¸n¹)÷ e ºú» ��� � p · �����~-9/ .4. ���*����� � �©Ñí/ for ��×[O��/�1�� � � and ���(������ � ���í/ for ��\[§��� � 1�}^� .5. When ���ßÑ /=254 , ���(��)������� has a unique maximum,at ������� , say. Then�t¸n¹)÷ e ºú» ��������� p ������-H/ .6. ��� ����� � � is continuousfor every �� and � � .
Theproof canbeeasilyestablishedusingformulae(3.13)–(3.18).Almost all these
propertiesareevident in Figure3.1(a).Thecritical value /=2;4 —thelargestvalueof�+� for which ���*���������� hasno positive interval—canbeobtainedby settingto zero
the first derivative of ���*�����+�,� with respectto �� at point ��ä-ô/ . The derivatives
around ��0-9/ canbeobtainedusingtheTaylor expansionof ��� ������ � � with respect
to� �� (seeAppendix3.8.2).
As notedearlier, the sign of � canbe usedasa criterion for subsetselection. In
Section3.4,Figure3.1(a)is interpretedfrom asubsetselectionperspective.
Figure3.1(b)shows theexpectedcontribution curves �����(�����+��� . The locationof
themaximumconvergesto �+� as �+��| } , andit convergesvery quickly—when� � -0! , thedifferenceis almostunobservable.
3.3.4 ù ìjîñý � ïWhenderiving the - and � -functions,wehaveassumedthattheunderlyingdimen-
sionalabsolutedistance�+� is given. However, ��� , is, in practice,unknown—only
thevalueof � is known. To allow the functionsto be calculated,we consider�+�to bearandomvariablewith adistribution ����+��� , adistributionwhich is estimable
from asampleof � . This is exactly theproblemof estimatingamixing distribution.
In our situation,the pdf of the correspondingmixture distribution hasthe general
form Ï¿�������~- Ê"!"# Ϭ������� � �=Ë���� � �i1 (3.19)
44
where ����[g]%$ , the nonnegative half real line. �������,� is the mixing distribution
function, Ͽ��������,� thepdf of thecomponentdistribution,and Ϭ�����,�� thepdf of the
mixturedistribution. Section3.6discusseshow to estimatethemixing distribution
from a sampleof � .
If themixing distribution ����+�,� is given,the � -functionof � canbeobtainedfrom
thefollowing theorem.
Theorem 3.5 Let ��� bedistributedaccording to ����+��� and � bea randomvari-
ablesampledfromthemixturedistributiondefinedin (3.19).Then������,���~- Ê !"# ��������� � �=Ë����� � ��1 (3.20)
where �����������,� is the � -functiondeterminedby Ϭ���������,� .Theproof followseasilyfrom thedefinitionof � .No matterwhetherthe underlyingmixing distribution ����+��� is discreteor con-
tinuous,it canalwaysbe estimatedby a discreteonewhile reachingthe same# -asymptoticconclusions. Further, in practice,a discreteestimateof an arbitrary���� � � is often easierto obtainwithout sacrificingpredictionaccuracy, andeasier
to manipulatefor furthercomputation.Supposethecorrespondingpdf &q������� of a
discrete�������� is definedas
&q��7 �± �~-(' ± 1 where
)° ±³² � ' ± - 6"2 (3.21)
Then(3.19)canbere-writtenasϬ�����,���~- )° ±³² � ' ± Ϭ����,7 �± ��1 (3.22)
45
and(3.20)as �������,���- )° ±³² � ' ± �������7 �± ��2 (3.23)
Althoughthegeneralforms(3.19)and(3.20)areadoptedin thefollowing analysis,
it is (3.22)and(3.23)thatareusedin practicalcomputations.Notethat if �������,��and Ϭ�����,��� areknown, the expectedcontribution of � given ��������� is given by
(3.12)as ������,���� �Ϭ�����,�� .Sinceall �� & ’s of an OLS estimatedmodelwhich is decomposedin an orthogonal
spacearestatisticallyindependent(see(3.5)), they form a samplefrom the mix-
turedistribution with a discretemixing distribution ���� � � . Thepdf of themixture
distribution takesthe form (3.22), while the pdf of the componentdistribution is
provided by (3.15). Likewise the � -function of the mixture hasthe form (3.23),
while thecomponent� -functionis givenby (3.18).
Fromnow on,themixing distribution ���� � � becomesourmajorconcern.If thetwo
functions Ϭ������+�,� and ������������ arewell defined(asthey arein OLS estimation),����+�,� uniquelydeterminesϬ�����,��� and ��������� by (3.19)and(3.20)respectively.
The following sectionsanalyzethe modelingprocesswith known ����+��� , show
how to build thebestmodelwith known �������� , andfinally tacklethequestionof
estimating����+�,� .3.4 The role of * +-,/. and 01+2,/. in modeling
The - and � -functionsilluminateour understandingof themodelingprocess,and
helpwith building modelstoo. Hereweusethemto illustrateissuesassociatedwith
theOLS subsetselectionproceduresdescribedin Section2.1,andto elucidatenew,
hitherto unreported,phenomena.While we concentrateon OLS subsetselection
criteria,we touchonshrinkagemethodstoo.
We alsoillustratewith examplesthebasisfor thenew proceduresthatareformally
46
definedin thenext section.Not only doessubsetselectionby thresholdingvariation
reductionseverelyrestrictthemodelingspace,but thevery ideaof subsetselection
is a limited one—whena wider modelingspaceis considered,betterestimators
emerge. Consequentlywe expandour horizonfrom subsetselectionto thegeneral
modelingproblem,producinga final model that is not a least-squaresfit at all.
This improveson OLS modelingeven whenthe projectionson all dimensionsare
significant.
Themethodologywe adoptis suggestedby contrastingthemodel-basedselection
problemthat we have studiedso far with the “dimension-basedselection”that is
usedin principal componentanalysis. Dimension-basedselectiontestseachor-
thogonaldimensionindependentlyfor elimination,whereasmodel-basedselection
analyzesasetof orthogonalnestedmodelsin sequence(asdiscussedin Section2.3,
the sequencemay be defineda priori or computedfrom the data). In dimension-
basedselection,deletinga dimensionremovesthetransformedvariableassociated
with it, andalthoughthis reducesthenumberof dimensions,it doesnotnecessarily
reducethenumberof original variables.
Chapter2 showedthatall theOLS subsetselectioncriteriasharetheideaof thresh-
olding the amountby which the variationis reducedin eachdimension.CIC sets
differentthresholdvaluesfor eachindividualdimensiondependingonits rankin the
orderingof all dimensionsin the model,whereasothermethodsusefixed thresh-
olds.While straightforwardfor dimension-basedselection,this needssomeadjust-
mentin model-basedselectionbecausethevariationreductionsof thenestedmodels
maynotbein thedesireddecreasingorder. Thenecessaryadjustment,if a few vari-
ablesaretested,is to comparethereductionsof thevariationby thesevariableswith
thesumof theircorrespondingthresholdvalues.Thecentralidearemainsthesame.
Therefore,thekey issuein OLS subsetselectionis thechoiceof threshold.We de-
notetheseschemesby OLSC ��?�� , where? is thethreshold(see(2.3)). Theoptimum
valueof ? is denotedby ?w� . We now considerhow ?� canbedeterminedusingthe -function,assumingthatall dimensionshave thesameknown . We begin with
dimension-basedselectionandtacklethenestedmodelsituationlater.
47
As discussedearlier, our modelinggoal is to minimizetheexpectedloss. Herewe
considertwo expectedlosses,theprior expectedloss ö ` ö ` ½ Æ ��Sâ� À andthepos-
terior expectedloss ö ` ½ Æ ��Sâ� À —or equivalentlyin our context ö ÷ ö~÷~½ Æ ��Sâ� À andö~÷¿½ Æ ��Sâ� À respectively—whichcorrespondto the situationsbeforeandafter see-
ing the data. Note that by � herewe meanall � ’s of the model S . We call the
correspondingoptimalitiesprior optimality andposterioroptimality. In practice,
posterioroptimality is usuallymoreappropriateto useafter the datais observed.
However, prior optimality, without taking specificobservationsinto account,pro-
videsa generalpictureaboutthemodelingproblemunderinvestigation.More im-
portantly, both optimalitiesconverge to eachotherasthe numberof observations
approachesinfinity, i.e., # -asymptoticallyhere.In thefollowing theorem,we relate? � to prior optimality, while in the next sectionwe seehow the estimatorsof two
optimalitiesconvergeto eachother.
Theorem 3.6 Givenan orthogonaldecompositionof a modelspaceV > , let SU�ú[V > beanyunderlyingmodeland S [�V > beits estimate. Assumethat all dimen-
sionalabsolutedistancesof S havethesamedistribution function ������� andthus
thesame -function.Rankthe á & ��Sâ� ’s in decreasingorder as � increases.Then
the estimator S OLSCa@� e d
of SU� possessesprior optimality amongall S OLSCa@��d
if
andonly if ? � -4365��k¹¸87÷:9(� ������i2 (3.24)
Proof. For dimension� , weknow from (3.8) thatÆ & ��Sâ�~- Æ & ��S���� p ò & ��Sâ�i2OLSC ��?�� discardsdimension� if á & ��Sâ�;�í? , soò & ��S OLSC
a@��d ��- <=?> / if á & ��Sâ���í?ò & ��Sâ� if á & ��Sâ�ÕÑM? ,48
where�0-gá & ��Sâ� is a randomvariablewhich is distributedaccordingto theCDF������ . Thus Ê !@# ö~÷~½ Æ & ��S OLSCa;��d � À Ë�������- Ê ! # ö~÷¿½ Æ & ��S���� À Ëw������� p Ê »� ö~÷¬½ ò & ��Sâ� À Ë������- Æ & ��S���� p ���}^� K ���?���2
Takingadvantageof additivity anddimensionalindependence,sumtheaboveequa-
tion overall # dimensions:ö ÷ ö~÷¿½ Æ ��S OLSCa@��d � À - Æ ��S��,� p #= ñ��}^� K #= ���?�i2 (3.25) ���?�� is theonly termon theright-handsideof (3.25)thatvarieswith ? . Thusmin-
imizing the prior expectedlossis equivalentto minimizing ���?�� , andvice versa.
Thiscompletestheproof. ATheorem3.6 requiresthedimensionalabsolutedistancesof the initially estimated
model—e.g.,theOLS full model—tobesortedinto decreasingorder. This is easily
accomplishedin the dimension-basedsituation,but not in the model-basedsitu-
ation. However, the nestedmodelsareusually invariably generatedin a way that
attemptsto establishsuchanorder. If so,thisconditionis approximatelysatisfiedin
practiceandthus S OLSCa;� e d
is agoodapproximationto theminimumprior expected
lossestimatorevenin themodel-basedsituation.
FromTheorem3.6,wehave
Corollary 3.6.1 Propertiesof ?� .1. ?w�Õ-B3C5���¹¸87EDGFIH�D@J8K ��L��� , where W�� ± _ is thesetof zerosof � .2. If ? � Ñ^/ , ����? � p �©Ñ^/ ; if ? � �í} , ����? � K ���^/ .3. ���? � �;�g/ and ���}^� p ñ��? � � ¾ / .
49
4. If thereexists � such that ñ��}^� p ñ�����©Ñ / , then ?���í} .
Properties1 and2 show that theoptimum ?� mustbea zeroof � —moreover, one
that separatesa negative interval to the left from a positive interval to the right
(unless?w��-Ù/ or } ). Properties3 and4 narrow thesetof zerosthat includesthe
optimalvalue ?� , andthushelpto establishwhich oneis theoptimum.
Four examplesfollow. The first two illustrateTheorem3.6 in a dimension-based
situationin which eachdimensionis processedindividually. In the first example,
eachdimension’sunderlying��� is known—equivalently, its ������������� is known. In
the second,the underlyingvalueof eachdimensionalabsolutedistanceis chosen
from two possibilities,andonly themixing distributionof thesetwo valuesandthe
corresponding� -functionsareknown.
Thelast two examplesintroducetheideasfor building modelsthatwe will explore
in thenext section.
Throughouttheseexamples,noticethatthedimensionalcontributionsareonly ever
usedin expected-valueform, andthe component� -function is the OLS ��� ����� � � .Further, we alwaystake the # -asymptoticviewpoint, implying that thedistribution���� � � canbeassumedknown.
Example3.1 Subsetselectionfrom a single mixture. Considerthe function���*������ � � illustratedin Figure3.1(a).Wesupposethatall dimensionshavethesame��� ������ � � .Noisydimensions,andoneswhoseeffectsareundetectable. A noisydimension,for
which ��������/�� is alwaysnegative,will beeliminatedfrom themodelnomatterhow
large its absolutedistance �� . Since �n¸n¹÷ e º �è���(��)���+�,��-����*���,/"� , non-redundant
dimensionsbehave more like noisy onesas their underlyingeffect decreases—in
otherwords,their contribution eventuallybecomesundetectable.When � � �×/=2;4 ,any contribution is completelyoverwhelmedby thenoise,andno subsetselection
procedurecandetectit.
50
Dimensionswith small vs. large effects. Whenthe estimateresidesin a negative
interval of � , its contribution is negative. All � s, no matterhow large their �+� ,have at leastonenegative interval � · �+��1�}^� . This invalidatesall subsetselection
schemesthateliminatedimensionsbasedon thresholdingtheir variationreductions
with a fixed threshold,becausea large estimate �� doesnot necessarilymeanthat
the correspondingvariableis contributive—its contribution also dependson � � .The reasonthat threshold-typeselectionworks at all is that the estimate �� in a
dimensionwhoseeffect is largeis lesslikely to fall into anegativeinterval thanone
whoseeffect is small.
The OLSC ��?�� criterion. The OLS subsetselectioncriterion OLSC ��?�� eliminates
dimensionswhoseOLS estimatefalls below thethreshold? , where ?b-Ù! for AIC,�n���~E for BIC, !¬�n�"�k# for RIC, andtheoptimalvalueis ? � asdefinedin (3.24).Since
dimensionsshouldbediscardedbasedonthesignof theirexpectedcontribution,we
considerthreecases:dimensionswith zeroandsmalleffects,thosewith moderate
effects,andthosewith largeeffects.
Whena dimensionis redundant,i.e. ���b-®/ , it shouldalways be discardedno
matterhow largetheestimate �� . Thiscanonly bedoneby OLSC ��?w��� , with ?w�©-0}in this case. Whenever ?M��} , dimensionswhose �� exceeds? arekept inside
the model: thusa certainproportionof redundantvariablesareincludedin thefi-
nal model. Dimensionswith small effectsbehave similarly to noisyones,andthe
thresholdvalue ?w�Õ-ß} is still best—whichresultsin thenull model.
Supposethatdimensionshave moderateeffects.As thevalueof �+� increasesfrom
zero, the valueof the cumulative expectedcontribution ��� will at somepoint
changesign.At thispoint, themodelfoundby OLSC ��?�� , whichheretoforehasbeen
betterthan the full model,becomesworsethan it. Hencethereis a valueof � �for which the predictive ability of S OLSC
a;�,dis the sameasthat of the full model.
Furthermore,thereexists a valueof ��� at which the predictive ability of the null
model is the sameas that of the full model. In thesecases,model S OLSCa@� e d
is
either the null modelor the full one,since ?w� is either / or } dependingon the
51
valueof ��� .Wheneachdimensionhasa largeeffect—largeenoughthatthepositionof thesec-
ondzeroof ���(��)��� � � is at least ? —any OLSC ��?� with fixed ?§ÑÙ/ will inevitably
eliminatecontributive dimensions.This meansthat the full model is a betterone
than S OLSCa@��d
. Furthermore,OLSC ��?w��� with ?�¦-U/ will alwayschoosethe full
model,which is theoptimalmodelfor every S OLSCa@��d
.
Shrinkagemethodsin orthogonalspace. In orthogonalregression,when G � G is a
diagonalmatrix,contribution functionshelpexplain why shrinkagemethodswork.
Thesemethodsshrink theparametervaluesof OLS modelsandusesmallervalues
thantheOLS estimates.Thismayor maynotchangethesignsof theOLS estimated
parameters;however, for orthogonalregressions,the signsof the parametersare
left unchanged.In this situation,therefore,shrinkingparametersis tantamountto
shrinkingthe �� ’s. Ridgeregressionshrinksall theparameterswhile NN-GAROTTE
andLASSO shrinkthelargerparametersandzerothesmallerones.
When ��� is small, it is possibleto choosea shrinkageparameterthat will shrink�� ’s that lie between� � and · � � to around � � , andshrink the negative contribu-
tionsoutside· � � to becomepositively contributive—despitethefactthat �� around
the maximumpoint �+� are shrunkto smallervalues. This may give the result-
ing model lower predictive error thanany modelselectedby OLSC ��?�� , including?^-�?w� . Zeroing the smaller �� ’s by NN-GAROTTE and LASSO doesnot guaran-
teebetterpredictive accuracy thanridgeregression,for thesedimensionsmight be
contributive. When �+� is large,shrinkagemethodsperformbadlybecausethedis-
tributionof �� tendsto besharperaround��� . This is why OLS subsetselectionoften
doesbetterin this situation.
Example3.2 Subsetselectionfroma doublemixture. Suppose���������- # �# ���(��)�,7 � ��� K # �# ���*����,7 �� �i1 (3.26)
where #\- # � K # � . For # � dimensionsthe underlying �+� is 7T�� , while for the
52
-0.14
-0.12
-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0 5 10 15 20-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0 5 10 15 20
(a) (b)
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
5:95
50:50
95:5
-0.03
-0.02
-0.01
0
0.01
0.02
0.03
0.04
0.05
0 5 10 15 20
(c) (d)
Figure3.2: ���*����- > o> ���(���,7¿�� � K > r> ���*����7¿�� � . (a). 7¿�� -�/=1�7¿�� -ON=1�# � Y©# � -:"/æY¿!y/ . (b). 7¿���- /=1,7T�� -PN=1�# � Y¿# � -RQ"4æYT!"4 . (c). 7¿��- /�1,7¿�� -ó!�/ , # � YT# �arerespectively 4bYTS�4�1�4�/fY�4�/=1�S"4fY�4 . (d). 7 � � and 7 �� arecarefullysetwith fixed# � Y�# � -BS�4)Y(4 suchthat ��L���i�~-0/ .remaining # � dimensionsit is 7 �� . �� is an observation sampledfrom the mixture
distribution Ϭ�*�����- > o> Ϭ�*����,7¿���� K > r> Ϭ�(����,7¿�� � . Altering the valuesof # � , 7T�� , # �and 7 �� yieldsthedifferent � s illustratedin Figure3.2. Theoptimalthreshold? � of
OLSC ��?�� is discussed.
In Figure3.2(a),where 7¿��Ð- / , 7¿�� -/N and # � Y�# � - :"/ Y�!�/ , no positive interval
existsdespitethefact that thereare !�/ non-redundantdimensions.This is because
the effect of all the non-redundantdimensionsis overwhelmedby the noisyones.
In principle,no matterhow largeits effect, any dimensioncanbeoverwhelmedby
a sufficient numberof noisyones. In this case?�- } andOLSC ��?��� selectsthe
null model.
53
In Figure3.2(b),which is obtainedfrom theprevioussituationby altering # � Y�# �to Q"4bY�!"4 , thereis a positive interval. But ñ��}^� is theminimum of all zeros,so
that ?w� remains} and the modelchosenby OLSC ��?w��� is still the null one. The
contributivedimensionsarestill submergedby thenoisyones.
If a finite thresholdvalue ?� exists,it mustsatisfy ���}^� p ���?���©Ñ^/ (Property4
of Corollary 3.6.1). It can take on any nonnegative value by adjustingthe four
parameters# � , 7¿�� , # � and 7¿�� . Figure3.2(c)shows threefunctions � , obtainedby
setting 7 � � and 7 �� to / and !�/ respectively andmakingtheratio between# � and # �4)Y"S�4 , 4�/ZY(4�/ , and S�4ZY*4 . As thesecurvesshow, thecorrespondingvaluesof ?w� are
about ! , · and : .In Figure3.2(d), # � Y(# � -BS�4ZY�4 and 7 � � and 7 �� aresetto make ��L������-9/ , where�U� is the third zeroof � . In this case,therearetwo possibilitiesfor ?� : theorigin� � , and ��� . �U� givesa simplermodel. Notice that the numberof parametersof
thetwo modelsis in theapproximateratio 4¦Y�6 /�/ . However, thebalanceis easily
broken—forexample,if 7¿�� increasesslightly, thenthereis a singlevalue0 for ?w� .Althoughthelargermodelhasslightly smallerpredictiveerrorthanthesmallerone,
it is muchmorecomplex. Hereis a situationwherea far moresuccinctmodelcan
beobtainedwith asmallsacrificein predictiveaccuracy.
Example3.3 Subsetselectionbasedon thesignof � . AlthoughOLSC ��? � � is op-
timal amongevery OLSC ��?�� , it haslimitations. It alwaysdeletesdimensionswhose�� ’s fall below the threshold,andretainsthe remainingones. Thus it may delete
dimensionsthatlie in thepositive intervalsof � , andretainonesin thenegative in-
tervals.Weknow thatthesignof � is thesignof theexpectedcontribution,andthe
selectioncanbe improvedby usingthis fact: we simply retaindimensionswhose���*���� is positiveanddiscardtheremainder.
In the next sectionwe formalize this idea and prove that its predictive accuracy
always improvesupon OLSC ��? � � . Becausethe positive andnegative intervals of� canlie anywherealongthe half real line, this proceduremay retaindimensions
with smaller �� anddiscardoneswith larger �� . Or it maydeleteadimensionwhose
54
�� lies betweenthoseof the otherdimensions.For example,in Figure3.2(b), the
new procedurewill keepall thedimensionswhose �� ’s lie within thesmallpositive
interval, despitethefactthatOLSC ��?w��� choosesthenull model.
Example3.4 General modeling. Sofar we have only discussedsubsetselection,
wheretheestimate �� is eitheralteredto zeroor remainsthesame.A naturalques-
tion is whetherbetterresultsmight be obtainedby relaxing this constraint—and
indeedthey canbe. For example,in Example3.1,whereall dimensionsarenoisy,
OLSC ��? � � deletesthemall andchoosesthenull model. This in effect replacesthe
estimate �� by theunderlying ��� , which is / in this case.Similarly, whenall esti-
mates �� have thesameunderlying ��� (which is non-zero),andall �� ’s areupdated
to ��� , the estimatedmodel improvessignificantly—even thoughno dimensionis
redundant.
A similar thing happenswhen � is sampledfrom a mixture distribution. In Fig-
ure 3.2(a),the � -function of the mixture hasno positive interval, although20 out
of 100dimensionshave ���¼-VN . Thebestthatsubsetselectioncando is to discard
all dimensions.However, a dimensionwith ��\- !�/ —despite���*���� beinglessthan
0—is unlikely to be a redundantone; it is more likely to belongto the groupfor
which � � -WN . Altering its OLS estimate �� from 20 to 3 is likely to convert the
dimensioninto acontributiveone.
Thenext sectionformulatesa formalestimationprocedurebasedon this idea.
3.5 Modeling with known XY+2,/ZE.We now assumethat the underlying �������� is known and definea group of six
procedures,collectively calledpaceregression, which build modelsby adjusting
the orthogonalprojectionsbasedon estimationsof theexpecteddimensionalcon-
tributions.They aredenotedPACE � 1 PACE � 13232�2�1 PACE A , andthemodelproducedby
PACE
±is written S PACEJ . Wewill discussin thenext sectionhow to estimate����� � � .
55
����+�,� is the distribution function of a randomvariable ��� that representsthe di-
mensionalabsolutedistanceof theunderlyingmodel SU� . Thesedistances,denoted
by �+���1323232Á1����> , form asampleof size # from ��������� . Whenthemixing distribution����+�,� is known,(3.19)canbeusedto obtainthemixturepdf Ϭ����,��� fromthecom-
ponentdistribution pdf Ͽ������ � � . Fromthis, alongwith thecomponent� -function��������� � � , thefunctions ������,��� and �����,��� canbefoundfrom (3.20)and(3.12)
respectively. As Theorem3.3 shows, OLS estimationprovidestheexpressionsfor
both Ϭ����������� and ��������+��� .Thepaceproceduresconsistof two steps.Thefirst, which is thesamefor all pro-
cedures,generatesan initial model S . We alwaysusethe OLS full model åS for
this, althoughany initial modelcanbeusedso long asthecomponentdistribution
andthecomponent� -functionareavailable. DecomposingåS in a givenorthogo-
nalspaceyieldsthemodel’sdimensionalabsolutedistances,say �� � 1�23232�1q�� > . These
arein fact a samplefrom a mixture distribution ��*��û�,��� with known component
distribution �� ��û��� � � andmixing distribution ���� � � . In thesecondstep,thefinal
modelis generatedfrom either(3.37)or (3.38),where
Ø� � 1323232Á1 Ø� > areobtainedby
updating �� � 13232�2�1q�� > . Thenew proceduresdiffer in how theupdatingis done.
To characterizetheresultingperformance,wedefineaclassof estimatorsandshow
that paceestimatorsare optimal within this class,or a subclassof it. The class
of estimators[ > (where # is thenumberof orthogonaldimensions)is asfollows:
giventheinitial model åS andits absolutedistancesin anorthogonaldecomposed
modelspace,every memberof [ > is anupdatingof åS by (3.37)andviceversa,
wherethe updatingis entirely dependenton the set of absolutedistancesof åS .
Clearly, S OLSCa@��d [\[ > for any ? .
Variouscorollariesbelow establishthat thepaceestimatorsarebetterthanothers.
Eachestimator’s optimality is establishedby a theoremthat appliesto a specific
subclass,andproving eachcorollary reducesto exhibiting an examplewherethe
performanceis actuallybetter. Theseexamplesareeasilyobtainablefrom the il-
lustrationsin Section3.4. When we say that one estimatoris “better than” (or
“equivalentto”) another, wemeanin thesenseof posteriorexpectedloss.With one
56
exception,betterestimatorshave lowerprior expectedloss(theexception,asnoted
below, is Corollary3.8.1,which comparesPACE � andPACE � ). Referto Section3.4
for definitionsof prior andposteriorexpectedlosses.
Of thesix procedures,PACE � andPACE � performmodel-basedselection,thatis, they
selecta subsetmodelfrom a sequence.ProceduresPACE � andPACE C addressthe
dimension-basedsituation,whereeachorthogonaldimensionis testedandselected
(or not) individually. If theseproceduresare usedfor a sequenceof orthogonal
nestedmodels,the resultingmodelmaynot belongto thesequence.The last two
procedures,PACE B and PACE A , arenot selectionprocedures.Instead,they update
theabsolutedistancesof theestimatedmodelto valueschosenappropriatelyfrom
thenonnegativehalf realline.
Theorem3.6showsthatthereis anoptimalthresholdfor threshold-typeOLS subset
selectionthatminimizestheprior expectedloss.Weusethis ideafor nestedmodels.
Procedure 1 � PACE � � . Given a sequenceof orthogonal nestedmodels,let ? -365j�k¹�¸�7 DGF]H�D"J8K ������ , where W�� ± _ is thesetof zerosof � . Outputthemodelin the
sequenceselectedby OLSC ��?�� .Accordingto Corollary3.6.1,Procedure1 findstheoptimalthreshold?� . Therefore
PACE � - OLSC ��?�,� (and S PACEo -0S OLSC
a@� e d). Sincethesequenceof dimensional
absolutedistancesof themodel åS is notnecessarilyalwaysdecreasing,S OLSCa@� e d
is not guaranteedto be the minimum prior expectedlossestimator. However, in
practicethesedistancesdo generallydecrease,andsoin thenestedmodelsituationS OLSCa@� e d
is anexcellentapproximationto theminimumprior expectedlossestima-
tor. In thissense,PACE � is superiorto otherOLS subsetselectionprocedures—OLS,
AIC, BIC, RIC, andCIC—for thesedonotusetheoptimalthreshold?� . In particular,
theprocedureCIC usesa thresholdthatdependson thenumberof variablesin the
subsetmodelaswell as the total numberof variables. The relative performance
of theseproceduresdependson the particularexperimentsusedfor comparison,
sincedatasetsexist for which any selectioncriterion’s thresholdcoincideswith the
optimalvalue ? � .57
Insteadof approximatingthe optimum as PACE � does,PACE � always selectsthe
optimalmodelfrom a sequenceof nestedmodels,whereoptimality is in thesense
of posteriorexpectedloss.
Procedure 2 � PACE � � . Amonga sequenceof orthogonalnestedmodels,outputthe
onewhich hasthelargestvalueof Þ & ±³² � ���(�� ± �j "Ϭ�*�� ± � for �)-�6"1�!(1323232i1�# .Theorem 3.7 Given �������,� , S PACE
rhasthe smallestposteriorexpectedlossin a
subclassof [ > in which each estimatorcan only selectfrom the sequenceof or-
thogonalnestedmodelsthat is provided.
Proof. Let S & betheselectedmodel,then
Ø� ± - / if ªúÑg� and
Ø� ± - �� ± if ª��0� .Fromthedefinitionof dimensionalcontribution (3.8),wehaveö H ÷EJ�K ½ Æ ��S & � À - Æ ��S���� p &° ±È² � �����*�� ± ��2 (3.27)
Because��� �� ± ��-ä��� �� ± �� �Ϭ� �� ± � by (3.12),minimizing ö H ÷EJ�K ½ Æ ��S & � À is equiva-
lent to maximizing Þ & ±³² � ���*�� ± �� �Ϭ�*�� ± � with respectto � . This completestheproof.ACorollary 3.7.1 Given����� � � andasequenceof orthogonalnestedmodels,S PACE
ris a betterestimatorthan S OLSC
a@��dfor any ? . This includesS PACE
o, S OLS, S AIC,S BIC, S RIC and S CIC.
Since, accordingto Section2.6, the cross-validation subsetselectionprocedure
CV ��£���-+� OLSC ����!uE p £��j w��E p £��j� andthe bootstrapsubsetselectionprocedure
BS ��¥¦��-+� OLSC �j��E K ¥¦�j y¥¦� , wehave
Corollary 3.7.2 Given ���� � � and a sequenceof orthogonal nestedmodels, E -
asymptoticallyS PACEr
is a better estimatorthan S CVa_^�d
for any £ and S BSa ) d
for any ¥ .
58
In fact,thedifferencebetweenthemodelsgeneratedby PACE � andPACE � is small,
becausewehave
Corollary 3.7.3 Given �������� , if theelementsof W��� & �¢�Z-×6"13232�2�1�# _ arein decreas-
ing order as � increases,S PACEo - > S PACE
ra.s.
Proof. Since W~�� & �s�^- 6"132�232�1�# _ are in decreasingorderas � increases,S PACEr
is in effect the model that minimizes Ì !@# ö ÷ ½ Æ ��S & � À Ë� > �*���� with respectto � ,where � > �*���� is the Kolmogorov empirical CDF of �� , and S PACE
ois the model
that minimizes Ì !@# ö ÷ ½ Æ ��S & � À Ë��*��+� with respectto � . Since, almost surely,� > � ���� | �� ��+� uniformally as #g| } (Glivenko-Cantelli theorem),it implies
that,almostsurely, Ì ! # ö ÷ ½ Æ ��S & � À Ë� > �(����~| Ì ! # ö ÷ ½ Æ ��S & � À Ë���*���� as #�| } ,
dueto the Helly-Bray theorem(see,e.g.,Galambos,1995). Therefore,minimiz-
ing theprior expectedlossis equivalentto minimizing theposteriorexpectedloss,# -asymptotically. Wecompletetheproof with S PACEo - > S PACE
ra.s. A
Thesetwo proceduresshow how to selectthe bestmodel in a sequenceof # K 6nestedmodels.However, no modelsequencecanguaranteethattheoptimalmodel
is one of the nestedmodels. Thuswe now considerdimension-basedmodeling,
wherethe final model canbe a combinationof any dimensions.With selection,
the numberof potentialmodelsgiven the projectionson # dimensionsis as large
as ! > . Whenthe orthogonalbasisis providedby a sequenceof orthogonalnested
models,this kind of selectionmeansthat the final modelmay not be oneof the
nestedmodels,and its parametervector may not be the OLS fit in termsof the
original variables.
Procedure 3 � PACE �i� . Let ?�-(3C5��k¹¸87EDGF]H�D"J8K ������ , where W�� ± _ is thesetof zeros
of � . Set
Ø� & -Ù/ if �� & ��? ; otherwise
Ø� & - �� & . OutputthemodeldeterminedbyW Ø� � 1323232Á1 Ø� > _ .Theorem 3.8 Given ���� � � , S PACE is theminimumprior expectedlossestimator
of SU� in an estimatorclasswhich is thesubclassof [ > in which everyestimator
is determinedby W � � 1323232i1 � > _ where
Ø� & [þWu/=1q�� & _ for all � .59
The proof is omitted; it is similar to that of Theorem 3.6.

Since $\mathcal{M}^{\mathrm{PACE}_1}$ is the model determined by $\{\tilde A_1, \ldots, \tilde A_n\}$ where $\tilde A_j$ is either $\hat A_j$ or $0$ depending on whether the associated variable is included or not, this estimator belongs to the class described in Theorem 3.8. This gives the following corollary.

Corollary 3.8.1 Given $G(A^*)$, $\mathcal{M}^{\mathrm{PACE}_3}$ is a better estimator (in the sense of prior expected loss) than $\mathcal{M}^{\mathrm{PACE}_1}$.
The difference between the models generated by PACE₁ and PACE₃ is also small, because

Corollary 3.8.2 Given $G(A^*)$, if the elements of $\{\hat A_j : j = 1, \ldots, n\}$ are in decreasing order as $j$ increases, $\mathcal{M}^{\mathrm{PACE}_1} = \mathcal{M}^{\mathrm{PACE}_3}$.
As we have seen, whether or not a dimension is contributive is indicated by the sign of the corresponding $H$-function. This leads to the next procedure.

Procedure 4 (PACE₄). Set $\tilde A_j = 0$ if $H(\hat A_j) \le 0$; otherwise $\tilde A_j = \hat A_j$. Output the model determined by $\{\tilde A_1, \ldots, \tilde A_n\}$.

PACE₄ does not rank dimensions in order of absolute distance and eliminate those with smaller distances, as conventional subset selection procedures and the preceding pace procedures do. Instead, it eliminates dimensions that are not contributive in the estimated model, irrespective of the magnitude of their dimensional absolute distance. It may eliminate a dimension with a larger absolute distance than another dimension that is retained. (In fact the other procedures may end up doing this occasionally, but they do so only because of incorrect ranking of variables.)
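As a sketch of this rule, reusing the hypothetical `H_mix` from the earlier fragment, PACE₄ simply zeroes every dimension whose expected contribution is non-positive, wherever it falls in the magnitude ranking:

    import numpy as np

    def pace4(A_hat, support, weights):
        # Zero each dimension with non-positive expected contribution H.
        # Unlike thresholding, a large A_hat[j] may be dropped while a
        # smaller one is kept, if their H values differ in sign.
        A_hat = np.asarray(A_hat, dtype=float)
        H = np.array([H_mix(a, support, weights) for a in A_hat])
        return np.where(H > 0.0, A_hat, 0.0)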
Theorem 3.9 Given $G(A^*)$, $\mathcal{M}^{\mathrm{PACE}_4}$ has the smallest posterior expected loss in the subclass of $\mathcal{D}^n$ in which every estimator is determined by $\{\tilde A_1, \ldots, \tilde A_n\}$ with $\tilde A_j \in \{0, \hat A_j\}$ for all $j$.

The proof is omitted; it is similar to that of Theorem 3.7.

Because the estimator class defined in Theorem 3.9 covers the classes defined in Theorems 3.7 and 3.8,

Corollary 3.9.1 Given $G(A^*)$, $\mathcal{M}^{\mathrm{PACE}_4}$ is a better estimator than $\mathcal{M}^{\mathrm{PACE}_2}$ and $\mathcal{M}^{\mathrm{PACE}_3}$.
PACE₁, PACE₂, PACE₃ and PACE₄ are all selection procedures: each updated dimensional absolute distance of the estimated model must be either $0$ or $\hat A_j$. The optimal value of $\tilde A_j$ is often neither of these. If the possible values are chosen from $\mathbb{R}^+$ instead, the best updated estimate $\tilde A_j$ is the one that maximizes the expected contribution of the $j$th dimension given $\hat A_j$ and $G(A^*)$. The optimality is then achieved over an uncountably infinite set of potential models. This relaxation can improve performance dramatically even when there are no noisy dimensions.
Procedure 5 (PACE₅). Output the model determined by $\{\tilde A_1, \ldots, \tilde A_n\}$, where

$\tilde A_j = \arg\max_{A \in \mathbb{R}^+} \int_{\mathbb{R}^+} \frac{\nu(A \mid A^*)}{f(A \mid A^*)}\, f(\hat A_j \mid A^*)\, dG(A^*).$  (3.28)
Theorem 3.10 Given $G(A^*)$, $\mathcal{M}^{\mathrm{PACE}_5}$ has the smallest posterior expected loss of all estimators in $\mathcal{D}^n$.

Proof. Each OLS estimate $\hat A_j$ is an observation sampled from the mixture pdf $f(\hat A)$ determined by the component pdf $f(\hat A \mid A^*)$ and the mixing distribution $G(A^*)$. If $\hat A_j$ is replaced by any $A \in \mathbb{R}^+$, the expected contribution of $A$ given $\hat A_j$ and $G(A^*)$ is

$H(A;\, \hat A_j, G) = \frac{\int_{\mathbb{R}^+} h(A \mid A^*)\, f(\hat A_j \mid A^*)\, dG(A^*)}{\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)}.$  (3.29)

Since $h(A \mid A^*) = \nu(A \mid A^*)/f(A \mid A^*)$ from (3.12), and $\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)$ is constant for every $A$, the integration on the right-hand side of (3.28) actually maximizes the expected contribution of $A$. From (3.8), this is equivalent to minimizing the expected loss in the $j$th dimension. Because all dimensions are independent, the posterior expected loss of the updated model given the set $\{\hat A_j\}$ and $G(A^*)$ is minimized. □

Corollary 3.10.1 Given $G(A^*)$, $\mathcal{M}^{\mathrm{PACE}_5}$ is a better estimator than $\mathcal{M}^{\mathrm{PACE}_4}$.

This motivates a new shrinkage method: shrink the magnitude (or the sum of the magnitudes) of the orthogonal projections of the OLS model $\hat{\mathcal{M}}$. This is equivalent to updating $\hat{\mathcal{M}}$ to a model that satisfies $\tilde A_j \le \hat A_j$ for every $j$. Since all shrinkage estimators of this type obviously yield a member of $\mathcal{D}^n$, $\mathcal{M}^{\mathrm{PACE}_5}$ is a better estimator than any of them.
Although we do not pursue this direction, because of the optimality of $\mathcal{M}^{\mathrm{PACE}_5}$, it helps us to understand other shrinkage estimators. Models produced by shrinkage methods in the literature (ridge regression, including ridge regression for subset selection, NN-GAROTTE and LASSO) do not necessarily require an orthogonal space and hence do not always belong to this subclass, and so we cannot show that the new estimator is superior to them in general. However, in the important special case of orthogonal regression, when the column vectors of $X$ are taken as the orthogonal axes, the models produced by all these shrinkage methods do belong to this subclass (see Example 3.1). Therefore,

Corollary 3.10.2 Given $G(A^*)$, $\mathcal{M}^{\mathrm{PACE}_5}$ is a better estimator for orthogonal regression than $\mathcal{M}^{\mathrm{RIDGE}}$, $\mathcal{M}^{\text{NN-GAROTTE}}$ and $\mathcal{M}^{\mathrm{LASSO}}$.
A general explicit solution to (3.28) does not seem to exist. Rather than resorting to numerical techniques, however, a good approximate solution is available. Considering that

$\frac{\nu(A \mid A^*)}{f(A \mid A^*)} = \frac{\big[A^* - (\sqrt{A} - \sqrt{A^*})^2\big]\,\phi(\sqrt{A} - \sqrt{A^*}) + \big[A^* - (\sqrt{A} + \sqrt{A^*})^2\big]\,\phi(\sqrt{A} + \sqrt{A^*})}{\phi(\sqrt{A} - \sqrt{A^*}) + \phi(\sqrt{A} + \sqrt{A^*})},$  (3.30)

the dominant part on the right-hand side is $[A^* - (\sqrt{A} - \sqrt{A^*})^2]\,\phi(\sqrt{A} - \sqrt{A^*})$, and its dominance increases dramatically as $A^*$ increases. Replacing $\nu(A \mid A^*)/f(A \mid A^*)$ in (3.28) by $\bar h(A \mid A^*) \equiv A^* - (\sqrt{A} - \sqrt{A^*})^2$, we obtain the following approximation to PACE₅.

Procedure 6 (PACE₆). Output the model determined by $\{\tilde A_1, \ldots, \tilde A_n\}$, where

$\tilde A_j = \arg\max_{A \in \mathbb{R}^+} \int_{\mathbb{R}^+} \big[A^* - (\sqrt{A} - \sqrt{A^*})^2\big]\, f(\hat A_j \mid A^*)\, dG(A^*).$  (3.31)
Equation (3.31) can be solved by setting the first derivative of the right-hand side to zero, resulting in

$\tilde A_j = \left(\frac{\int_{\mathbb{R}^+} \sqrt{A^*}\, f(\hat A_j \mid A^*)\, dG(A^*)}{\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)}\right)^{2}.$  (3.32)
In (3.28), (3.31) and (3.32) the true distribution $G(A^*)$ is discrete (as in (3.21)), say with support points $a_1^*, \ldots, a_q^*$ and weights $w_1, \ldots, w_q$, so they become respectively

$\tilde A_j = \arg\max_{A \in \mathbb{R}^+} \sum_{l=1}^{q} \frac{\nu(A \mid a_l^*)}{f(A \mid a_l^*)}\, f(\hat A_j \mid a_l^*)\, w_l,$  (3.33)

$\tilde A_j = \arg\max_{A \in \mathbb{R}^+} \sum_{l=1}^{q} \big[a_l^* - (\sqrt{A} - \sqrt{a_l^*})^2\big]\, f(\hat A_j \mid a_l^*)\, w_l,$  (3.34)

and

$\tilde A_j = \left(\frac{\sum_{l=1}^{q} \sqrt{a_l^*}\, f(\hat A_j \mid a_l^*)\, w_l}{\sum_{l=1}^{q} f(\hat A_j \mid a_l^*)\, w_l}\right)^{2}.$  (3.35)
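For a discrete estimate of $G(A^*)$, (3.35) is a one-line posterior-mean computation. The sketch below (hypothetical inputs as before, component pdf written via normal ordinates as in (3.30)) updates every dimension at once; note that it can shrink, retain, or expand each $\hat A_j$:

    import numpy as np

    def pace6(A_hat, support, weights):
        """PACE_6 update (3.35): tilde A_j is the squared posterior mean
        of sqrt(A*) given A_hat_j under the discrete mixing distribution."""
        A_hat = np.asarray(A_hat, dtype=float)
        u = np.sqrt(A_hat)[:, None]                  # sqrt of observed distances
        v = np.sqrt(np.asarray(support))[None, :]    # sqrt of support points
        # noncentral chi-square_1 pdf f(A_hat_j | a*_l), up to the common
        # factor 1/(2 sqrt A_hat_j), which cancels in the ratio below
        f = np.exp(-0.5 * (u - v) ** 2) + np.exp(-0.5 * (u + v) ** 2)
        w = np.asarray(weights)[None, :]
        num = np.sum(w * v * f, axis=1)              # sum_l sqrt(a*_l) f w_l
        den = np.sum(w * f, axis=1)                  # sum_l f w_l
        return (num / den) ** 2

    support, weights = [0.0, 25.0], [0.8, 0.2]
    print(pace6([30.0, 2.5, 0.4], support, weights))
    # strong dimensions are pulled toward 25, weak ones toward 0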
The following loose bound can be obtained for the increased posterior expected loss suffered by the PACE₆ approximation.

Theorem 3.11

$0 \le E\big[L_j(\mathcal{M}^{\mathrm{PACE}_6}) \mid \hat A_j, G(A^*)\big] - E\big[L_j(\mathcal{M}^{\mathrm{PACE}_5}) \mid \hat A_j, G(A^*)\big] \le 2e^{-1}.$  (3.36)

Proof. See Appendix 3.8.3.
Appendix 3.8.3 actually obtains the tighter bound $4\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*}\; e^{-2\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*}}$, where $a_0^*$ is the support point of the distribution $G(A^*)$ that maximizes $\bar h(\tilde A_j^{\mathrm{PACE}_6} \mid a^*) - h(\tilde A_j^{\mathrm{PACE}_6} \mid a^*)$. This bound rises from zero at the origin in terms of $\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*}$, achieves the maximum $2e^{-1}$ at the point $\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*} = 0.5$, and thereafter drops exponentially to zero as $\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*}$ increases. It follows that the increased posterior expected loss caused by the PACE₆ approximation is usually close to zero.
Remarks

All these pace procedures adjust the magnitude of the orthogonal projections of the OLS estimate $\hat{\mathcal{M}}$, based on an estimate of the expected dimensional contributions. Among them, PACE₅ and PACE₆ go the furthest: each projection of $\hat{\mathcal{M}}$ onto the orthogonal axes can be adjusted to any nonnegative value, and the adjusted value achieves (or approximately achieves) the greatest expected contribution, corresponding to the minimum posterior expected loss. These two procedures can shrink, retain or even expand the values of the absolute dimensional distances. Surprising though it may sound, increasing a zero distance to a much higher value can improve predictive accuracy.
Of the six pace procedures, PACE₂, PACE₄ and PACE₆ are the most appropriate for practical applications. PACE₆ generates a very good approximation to the model from PACE₅, which is the best of the six procedures. Procedure PACE₂ chooses the best member of a sequence of subset models that is provided to it, which is useful if prior information dictates the sequence of subset models. PACE₁ and PACE₃ involve numerical integration and have higher (posterior) expected loss than the other procedures. PACE₄, which is a lower expected loss procedure than PACE₃, is useful for dimension-based subset selection.
3.6 The estimation of $G(A^*)$

Now it is time to consider how to estimate $G(A^*)$ from $\hat A_1, \hat A_2, \ldots, \hat A_n$, which are the dimensional absolute distances of the OLS estimate $\hat{\mathcal{M}}$. Once this is accomplished, the procedures described in the last section become fully defined by replacing the true $G(A^*)$, which we assumed in the last section was known, with the estimate. The estimation of $G(A^*)$ is an independent step in these modeling procedures, and can be investigated independently. It critically influences the quality of the final model: better estimates of $G(A^*)$ give better estimators for the underlying model.

$\hat A_1, \hat A_2, \ldots, \hat A_n$ are actually a sample from a mixture distribution whose component distribution $F(\hat A \mid A^*)$ is known and whose mixing distribution is $G(A^*)$. (Strictly speaking, this sample is taken without replacement. However, this is asymptotically the same as sampling with replacement, and so does not affect our theoretical results, since they are established in the asymptotic sense.) Estimating $G(A^*)$ from the data points $\hat A_1, \hat A_2, \ldots, \hat A_n$ is tantamount to estimating the mixing distribution. Note that the mixture here is a countable one: the underlying $G(A^*)$ has support at $A_1^*, A_2^*, \ldots, A_n^*$, and the number of support points is unlimited as $n \to \infty$. Chapter 5 tackles this problem in a general context.
The following theorem guarantees that if the mixing distribution is estimated sufficiently well, the pace regression procedures continue to enjoy the various properties proved above in the limit of large $n$.

Theorem 3.12 Let $\{G_n(A^*)\}$ be a sequence of CDF estimators. If $G_n(A^*) \to_w G(A^*)$ a.s. as $n \to \infty$, and the known $G(A^*)$ is replaced by the estimator $G_n(A^*)$, Theorems 3.6-3.11 (and all their corollaries) hold $n$-asymptotically a.s.

Proof. According to the Helly-Bray theorem, $G_n(A^*) \to_w G(A^*)$ a.s. as $n \to \infty$ implies the almost sure pointwise convergence of all the objective functions used in these theorems (and their corollaries) to the underlying corresponding functions, because these functions are continuous. This further implies the almost sure convergence of the optimal values, and of the locations where these optima are achieved, as $n \to \infty$. This completes the proof. □

All of the above results use the loss function $\|\hat{\mathbf y} - \mathbf y^*\|^2/\sigma^2$. However, our real interest is $\|\hat{\mathbf y} - \mathbf y^*\|^2$. Therefore we need the following corollary.

Corollary 3.12.1 If the loss function $\|\hat{\mathbf y} - \mathbf y^*\|^2/\sigma^2$ is replaced by $\|\hat{\mathbf y} - \mathbf y^*\|^2$ in Theorems 3.6-3.11 (and all their corollaries),

1. Theorem 3.12 continues to hold, if $\sigma^2$ is known;

2. Theorem 3.12 holds almost surely as $m \to \infty$ if $\sigma^2$ is replaced with an $m$-asymptotically strongly consistent estimator.

It is well known that both the unbiased OLS estimator $\hat\sigma^2$ (for $(m - n) \to \infty$ as $m \to \infty$) and the biased maximum likelihood estimator (for $n/m \to 0$) are $m$-asymptotically strongly consistent.
In view of Theorem 3.12, any estimator of the mixing distribution is able to provide the desired theoretical results in the limit if it is strongly consistent in the sense that, almost surely, it converges weakly to the underlying mixing distribution as $n \to \infty$. From Chapter 5, the maximum likelihood estimator and a few minimum distance estimators are known to be strongly consistent. If any of these estimators is used to obtain $G_n(A^*)$, Theorem 3.12 is secured. This, finally, closes our circle of analysis.
However, it is pointed out in Chapter 5 that all the minimum distance methods except the nonnegative-measure-based one suffer from a serious defect in a finite-sample situation: they may completely ignore small numbers of data points in the estimated mixture, no matter how distant these are from the dominant data points. This severely impacts their use in our modeling procedures, because the value of one dimensional absolute distance is frequently quite different from all the others, and this implies that the underlying absolute distance has a very high probability of being different too. The maximum likelihood approach, which does not seem to have this minority cluster problem, can also be used for pace regression if computational cost is not an issue.
For all these estimators, the following three conditions, due to Robbins (1964), must be satisfied in order to ensure strong consistency (see also Section 5.3.1).

(C1) $F(x; \theta)$ is continuous on $\mathcal{X} \times \Theta$.

(C2) Define $\mathbb{G}$ to be the class of CDFs on $\Theta$. If $F_{G_1} = F_{G_2}$ for $G_1, G_2 \in \mathbb{G}$, then $G_1 = G_2$.

(C3) Either $\Theta$ is a compact subset of $\mathbb{R}$, or $\lim_{|\theta| \to \infty} F(x; \theta)$ exists for each $x \in \mathcal{X}$ and is not a distribution function on $\mathcal{X}$.

Pace regression involves mixtures of $\chi^2_1(A^*)$ distributions, where $A^*$ is the mixing parameter. Conditions C1 and C3 are clearly satisfied. C2, the identifiability condition, is verified in Appendix 3.8.4.
3.7 Summary

This chapter explores and formally presents a new approach to linear regression. Not only does this approach yield accurate prediction models, it also reduces model dimensionality. It outperforms other modeling procedures in the literature in the sense of $n$-asymptotics and, as we will see in Chapter 6, it produces satisfactory results in simulation studies for finite $n$.

We have limited our investigation to linear models with normally distributed noise, but the ideas are so fundamental that we believe they will soon find application in other realms of empirical modeling.
3.8 Appendix
3.8.1 Reconstructing the model from updated absolute distances

Once final estimates for the absolute distances in each dimension have been found, the model needs to be reconstructed from them. Consider how to build a model from a set of absolute distances, denoted by $A_1, \ldots, A_n$. Let $t = (s_1\sqrt{A_1}, \ldots, s_n\sqrt{A_n})^T$, where $s_j$ is either $+1$ or $-1$ depending on whether or not the $j$th projection of the prediction vector $X\hat\beta$ has the same direction as the orthogonal basis vector $u_j$. This choice of $s_j$'s value is based on the fact that the projections of the estimated and the true prediction vectors are most likely in the same direction; any other choice would degrade the estimate.

Our estimate of the parameter vector is

$\beta = (X^TX)^{-1}X^T U t\,\sigma,$  (3.37)

where $U$ is a column matrix formed from the basis vectors $u_1, \ldots, u_n$. For example, if $\{A_1, \ldots, A_n\}$ are the OLS estimates of the absolute distances in the corresponding dimensions, (3.37) gives $\beta$ the value of the OLS estimate $\hat\beta$.

It may be that not all the $A_j$'s are available, but a prediction vector is known that corresponds to all missing $A_j$'s. This situation will occur if some dimensions are forced to be in the final model, for example the constant term, or dimensions that give very great reductions in variation (very large $A_j$'s). Suppose the number of dimensions with known $A_j$'s is $n_1$, and call the overall prediction vector for the remaining $n - n_1$ dimensions $\mathbf y_{(2)}$. Then the estimated parameter vector is

$\beta = (X^TX)^{-1}X^T\big(\mathbf y_{(2)} + U_1 t\,\sigma\big),$  (3.38)

where $U_1$ is an $m \times n_1$ matrix and $t$ an $n_1$-vector.

The estimation of $\beta$ from the $A_j$'s is fully described by (3.37) and (3.38). However, in practice the computation takes a different, more efficient, route. Once the $m \times n$ approximation equation of the original least-squares problem has been orthogonally transformed, finding the least squares solution reduces to solving a matrix equation

$R\,\beta = z,$  (3.39)

where $R$ is an $n \times n$ upper-triangular matrix and $z$ is an $n$-vector (Lawson and Hanson, 1974, 1995). As a matter of fact, the square of the $j$th element of $z$ is exactly the OLS estimate $\hat A_j \sigma^2$. When a new set of estimates, say $\tilde A_j$ ($j = 1, \ldots, n$), is obtained, the corresponding estimate of the parameter vector is the solution of (3.39) with the $j$th element of $z$ replaced by $\sqrt{\tilde A_j}\,\sigma$ without changing its sign. If not all the $\tilde A_j$'s are known, so that (3.38) is used instead of (3.37), only dimensions with known $\tilde A_j$'s are replaced.
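The computational route through (3.39) is easy to state in code. A minimal sketch, assuming a hypothetical design matrix `X` whose QR factorization supplies the triangular system, with `A_new` holding the updated absolute distances while the signs of the original right-hand side are preserved:

    import numpy as np
    from scipy.linalg import solve_triangular

    def reconstruct_beta(X, y, A_new, sigma):
        """Solve R beta = z of (3.39), with |z_j| replaced by
        sqrt(A_new_j) * sigma and the original signs kept."""
        Q, R = np.linalg.qr(X)      # orthogonal transformation of the
        z = Q.T @ y                 # least-squares problem
        z_new = np.sign(z) * np.sqrt(np.asarray(A_new)) * sigma
        return solve_triangular(R, z_new)

    # z_j**2 / sigma**2 recovers the OLS distances A_hat_j, so passing
    # them back unchanged reproduces the OLS fit:
    rng = np.random.default_rng(0)
    X, y, sigma = rng.normal(size=(50, 3)), rng.normal(size=50), 1.0
    Q, R = np.linalg.qr(X)
    A_hat = (Q.T @ y) ** 2 / sigma ** 2
    print(np.allclose(reconstruct_beta(X, y, A_hat, sigma),
                      np.linalg.lstsq(X, y, rcond=None)[0]))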
3.8.2 The Taylor expansion of $\nu(\hat A \mid A^*)$ with respect to $A^*$

Let $u = \sqrt{\hat A} > 0$ and $u^* = \sqrt{A^*} \ge 0$, and denote $c = 1/\sqrt{2\pi}$. Because

$\bar h(u - u^*) = u^{*2} - (u - u^*)^2 = 2uu^* - u^2$

and

$\phi(u - u^*) = c\, e^{-(u - u^*)^2/2} = c\, e^{-(u^2 + u^{*2})/2}\, e^{uu^*},$

hence

$\bar h(u - u^*)\,\phi(u - u^*) \big/ \big(c\, e^{-(u^2 + u^{*2})/2}\big) = (2uu^* - u^2)\, e^{uu^*}.$

Similarly

$\bar h(-u - u^*) = u^{*2} - (u + u^*)^2 = -2uu^* - u^2$

and

$\phi(u + u^*) = c\, e^{-(u^2 + u^{*2})/2}\, e^{-uu^*},$

hence

$\bar h(-u - u^*)\,\phi(u + u^*) \big/ \big(c\, e^{-(u^2 + u^{*2})/2}\big) = -(2uu^* + u^2)\, e^{-uu^*}.$

Adding the two terms gives

$2\big[2uu^*\sinh(uu^*) - u^2\cosh(uu^*)\big] = 2\big[-u^2 + (2 - u^2/2)\, u^2 u^{*2} + O(u^{*4})\big].$

Since $\nu(\hat A \mid A^*)$ equals $1/(2u)$ times $c\, e^{-(u^2 + u^{*2})/2}$ times this sum,

$\nu(\hat A \mid A^*) = \sqrt{\frac{\hat A}{2\pi}}\; e^{-(\hat A + A^*)/2}\Big[-1 + \big(2 - \hat A/2\big) A^* + O(A^{*2})\Big].$  (3.40)
3.8.3 Proof of Theorem 3.11

To prove Theorem 3.11, we need a simple lemma. Throughout, $\bar h(A \mid A^*) = A^* - (\sqrt{A} - \sqrt{A^*})^2$ denotes the approximation to $h(A \mid A^*) = \nu(A \mid A^*)/f(A \mid A^*)$ used by PACE₆.

Lemma 1 For any $A > 0$ and $A^* > 0$,

$0 \le \bar h(A \mid A^*) - \frac{\nu(A \mid A^*)}{f(A \mid A^*)} \le 4\sqrt{AA^*}\, e^{-2\sqrt{AA^*}} \le 2e^{-1}.$  (3.41)

Proof. For every $A > 0$ and $A^* > 0$, $\bar h(A \mid A^*) - \big[A^* - (\sqrt{A} + \sqrt{A^*})^2\big] = 4\sqrt{AA^*} \ge 0$, hence

$\bar h(A \mid A^*) = \frac{\bar h(A \mid A^*)\,\phi(\sqrt{A} - \sqrt{A^*}) + \bar h(A \mid A^*)\,\phi(\sqrt{A} + \sqrt{A^*})}{\phi(\sqrt{A} - \sqrt{A^*}) + \phi(\sqrt{A} + \sqrt{A^*})} \ge \frac{\bar h(A \mid A^*)\,\phi(\sqrt{A} - \sqrt{A^*}) + \big[A^* - (\sqrt{A} + \sqrt{A^*})^2\big]\,\phi(\sqrt{A} + \sqrt{A^*})}{\phi(\sqrt{A} - \sqrt{A^*}) + \phi(\sqrt{A} + \sqrt{A^*})} = \frac{\nu(A \mid A^*)}{f(A \mid A^*)},$

and

$\bar h(A \mid A^*) - \frac{\nu(A \mid A^*)}{f(A \mid A^*)} = \frac{4\sqrt{AA^*}\,\phi(\sqrt{A} + \sqrt{A^*})}{\phi(\sqrt{A} - \sqrt{A^*}) + \phi(\sqrt{A} + \sqrt{A^*})} \le 4\sqrt{AA^*}\,\frac{\phi(\sqrt{A} + \sqrt{A^*})}{\phi(\sqrt{A} - \sqrt{A^*})} = 4\sqrt{AA^*}\, e^{-2\sqrt{AA^*}} \le 2e^{-1},$

thus completing the proof of the lemma. □

Proof of Theorem 3.11. According to the definition of dimensional contribution, to prove Theorem 3.11 we need to show that

$0 \le H(\tilde A_j^{\mathrm{PACE}_5};\, \hat A_j, G(A^*)) - H(\tilde A_j^{\mathrm{PACE}_6};\, \hat A_j, G(A^*)) \le 2e^{-1},$  (3.42)

where $G(A^*)$, no matter whether continuous or discrete, is the mixing distribution function used in both procedures. The first inequality in (3.42) is obvious, because $\tilde A_j^{\mathrm{PACE}_5}$ is the optimal solution. For the second inequality, from the lemma above and the fact that $\tilde A_j^{\mathrm{PACE}_6}$ is the optimal solution of (3.34), we have

$H(\tilde A_j^{\mathrm{PACE}_5};\, \hat A_j, G(A^*)) = \frac{\int_{\mathbb{R}^+} h(\tilde A_j^{\mathrm{PACE}_5} \mid A^*)\, f(\hat A_j \mid A^*)\, dG(A^*)}{\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)} \le \frac{\int_{\mathbb{R}^+} \bar h(\tilde A_j^{\mathrm{PACE}_5} \mid A^*)\, f(\hat A_j \mid A^*)\, dG(A^*)}{\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)} \le \frac{\int_{\mathbb{R}^+} \bar h(\tilde A_j^{\mathrm{PACE}_6} \mid A^*)\, f(\hat A_j \mid A^*)\, dG(A^*)}{\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)}.$

Hence

$H(\tilde A_j^{\mathrm{PACE}_5};\, \hat A_j, G(A^*)) - H(\tilde A_j^{\mathrm{PACE}_6};\, \hat A_j, G(A^*)) \le \frac{\int_{\mathbb{R}^+} \big[\bar h(\tilde A_j^{\mathrm{PACE}_6} \mid A^*) - h(\tilde A_j^{\mathrm{PACE}_6} \mid A^*)\big]\, f(\hat A_j \mid A^*)\, dG(A^*)}{\int_{\mathbb{R}^+} f(\hat A_j \mid A^*)\, dG(A^*)}.$

According to the lemma, $\bar h(\tilde A_j^{\mathrm{PACE}_6} \mid A^*) - h(\tilde A_j^{\mathrm{PACE}_6} \mid A^*) \ge 0$ for every $A^*$, so let

$a_0^* = \arg\max_{a^* \in \{a_l^*\}}\Big[\bar h(\tilde A_j^{\mathrm{PACE}_6} \mid a^*) - h(\tilde A_j^{\mathrm{PACE}_6} \mid a^*)\Big],$

where $\{a_l^*\}$ is the set of support points of $G(A^*)$. Therefore,

$H(\tilde A_j^{\mathrm{PACE}_5};\, \hat A_j, G(A^*)) - H(\tilde A_j^{\mathrm{PACE}_6};\, \hat A_j, G(A^*)) \le \bar h(\tilde A_j^{\mathrm{PACE}_6} \mid a_0^*) - h(\tilde A_j^{\mathrm{PACE}_6} \mid a_0^*) \le 4\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*}\; e^{-2\sqrt{\tilde A_j^{\mathrm{PACE}_6} a_0^*}} \le 2e^{-1},$

which finishes the proof of Theorem 3.11. □
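The inequalities of Lemma 1 are easy to check numerically. A small sketch over a grid of $(A, A^*)$ values, using the same $\phi$-based expressions as above:

    import numpy as np

    phi = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2 * np.pi)

    A, As = np.meshgrid(np.linspace(0.01, 40, 400), np.linspace(0.01, 40, 400))
    u, v = np.sqrt(A), np.sqrt(As)
    pm, pp = phi(u - v), phi(u + v)
    h_bar = As - (u - v) ** 2                    # approximation used by PACE_6
    ratio = (h_bar * pm + (As - (u + v) ** 2) * pp) / (pm + pp)   # nu/f
    gap = h_bar - ratio

    assert np.all(gap >= -1e-12)                               # first inequality
    assert np.all(gap <= 4 * u * v * np.exp(-2 * u * v) + 1e-12)  # second
    print(gap.max(), 2 / np.e)                  # the gap stays below 2/e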
3.8.4 Identifiability for mixtures of $\chi^2_1(A^*)$ distributions

Although we have been unable to locate the following theorem in the literature, it seems unlikely to be original. Note that here, identifiability applies only to the situation where the mixing function is limited to being a CDF. Note also that the identifiability of mixtures using CDFs as mixing functions implies the identifiability of mixtures using any finite nonnegative functions as mixing functions (Lemma 4 in Section 5.4.3), as required by (5.19).

Theorem 3.13 The mixture of $\chi^2_1(A^*)$ distributions, where $A^*$ is the mixing parameter, is identifiable.

Proof. Let $\Phi(x - \mu)$ be the CDF of the distribution $N(\mu, 1)$, and $F(\hat A \mid A^*)$ be the CDF of the distribution $\chi^2_1(A^*)$. From the definition of the $\chi^2_1(A^*)$ distribution and the symmetry of the normal distribution,

$F(\hat A \mid A^*) = \Phi(\sqrt{\hat A} - \sqrt{A^*}) - \Phi(-\sqrt{\hat A} - \sqrt{A^*}) = \Phi(\sqrt{\hat A} - \sqrt{A^*}) + \Phi(\sqrt{\hat A} + \sqrt{A^*}) - 1,$  (3.43)

where $\Phi(\sqrt{\hat A} - \sqrt{A^*})$ is the CDF of $N(\sqrt{A^*}, 1)$ evaluated at $\sqrt{\hat A}$ and $\Phi(\sqrt{\hat A} + \sqrt{A^*})$ is the CDF of $N(-\sqrt{A^*}, 1)$ evaluated at $\sqrt{\hat A}$. Therefore, if the mixture of $\chi^2_1(A^*)$ were unidentifiable, the mixture of normal distributions where the mean $\mu$ is the mixing parameter would be unidentifiable. Clearly this contradicts the well-known identifiability result for mixtures of normal distributions (Teicher, 1960), thus completing the proof. □
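Equation (3.43) can be confirmed against a library implementation of the noncentral chi-square CDF; a quick check (scipy's parameterization `ncx2(df=1, nc=A*)` matches the $\chi^2_1(A^*)$ used here):

    import numpy as np
    from scipy.stats import ncx2, norm

    A = np.linspace(0.1, 30, 200)     # points at which to evaluate the CDF
    for A_star in [0.5, 1.0, 9.0, 25.0]:
        lhs = ncx2.cdf(A, df=1, nc=A_star)
        rhs = (norm.cdf(np.sqrt(A) - np.sqrt(A_star))
               + norm.cdf(np.sqrt(A) + np.sqrt(A_star)) - 1.0)
        assert np.allclose(lhs, rhs, atol=1e-10)   # (3.43) holds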
Chapter 4
Discussion
4.1 Introduction

Several issues related to pace regression deserve further discussion. Some are necessary to complete the definition of the pace regression procedures in special situations. Others expand their implications into a broader arena, while yet others raise interesting open questions.
4.2 Finite $n$ vs. $n$-asymptotics

We have seen that the pace regression procedures are optimal in an $n$-asymptotic sense. Larger numbers of variables tend to produce estimators that are closer to optimal. If there are only a few candidate variables, pace regression will not necessarily outperform other methods. Since $n$ is inevitably finite in practice, it is worth expanding on this.

The central idea of pace regression is to decompose the prediction vector of an estimated model into $n$ orthogonal components, and then adjust each according to aggregated magnitude information from all components. The more diverse the magnitudes of the different components, the less they can inform the adjustment of any particular one. If one component's magnitude differs greatly from that of all others, there is little basis on which to alter its value.
Pace regression shines when many variables have similar effects; a common special situation is when many variables have zero effect. As the effects of the variables disperse, pace regression's superiority over other procedures fades. When the effect of each variable is isolated from that of all others, the pace estimator is exactly the OLS one. In principle, the worst case is when no improvement over OLS is possible.

Although pace regression is $n$-asymptotically optimal, this does not mean that increasing the number of candidate variables necessarily improves prediction. New contributive variables should increase predictive accuracy, but new redundant variables will decrease it. Pre-selection of variables based on background knowledge will always help modeling, if suitable variables are selected.
4.3 Collinearity

Pace regression, like almost any other linear regression procedure, fails when presented with (approximately) collinear variables. Hence collinearity should be detected and handled before applying the procedure. We suggest eliminating collinearity by discarding variables. The number of candidate variables $n$ should be reduced accordingly, because collinearity does not provide new independent dimensions, a prerequisite of pace regression. In other words, the model has the same degrees of freedom without collinearity as with it. Appropriate variables can be identified by examining the matrix $X^TX$, or by QR-transforming $X$ (see, e.g., Lawson and Hanson, 1974, 1995; Dongarra et al., 1979; Anderson et al., 1999); a minimal sketch follows below.

Note that OLS subset selection procedures are sometimes described as a protection against collinearity. However, the fact is that none of these automatic procedures can reliably eliminate collinearity, for collinearity does not necessarily imply that the projections of the vector $\mathbf y$ are small.
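As a concrete illustration of the suggestion above, the following sketch uses a rank-revealing (pivoted) QR factorization to flag columns that add almost no new independent dimension; the tolerance is an assumption for illustration, not a prescription from this thesis:

    import numpy as np
    from scipy.linalg import qr

    def independent_columns(X, tol=1e-8):
        """Return indices of columns of X that survive a pivoted-QR
        collinearity check; the remaining columns are (nearly) collinear
        with those kept and should be discarded, reducing n."""
        _, R, piv = qr(X, mode='economic', pivoting=True)
        diag = np.abs(np.diag(R))
        keep = diag > tol * diag[0]      # small |R_jj| => dependent column
        return np.sort(piv[: np.count_nonzero(keep)])

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 4))
    X = np.column_stack([X, X[:, 0] + X[:, 1]])   # add a collinear column
    print(independent_columns(X))                 # one of the five is dropped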
4.4 Regression for partial models

The full OLS model forms the basis for pace regression. In some situations, however, the full model may be unavailable. For example, there may be too many candidate variables, perhaps more than the number of observations. Although the largest available partial model can be substituted for the full model, this causes practical difficulties.

The clustering step in pace regression must take all candidate variables into account. This is possible so long as the statistical test used to determine the initial partial model can supply an approximate distribution for the dimensional absolute distances of the variables that do not participate in it. For example, the partial-$F$ test may be used to discard variables based on forward selection; typically, dimensional absolute distances are smaller for the discarded variables than for those used in the model. It seems likely that a sufficiently accurate approximate distribution for the dimensional absolute distances of the discarded variables can be derived for this test, though further investigation is necessary to confirm this.
Instead of providing an approximate distribution for the discarded variables, it is also possible to estimate the mixing distribution directly by assuming that the effects of these variables are located in an interval of small values. Estimation of the mixing distribution can then proceed as usual, provided the boundaries of the fitting intervals are all chosen to be outside this interval. Our implementation adopts this approach.

In many practical applications of data mining, some kind of feature selection (see, e.g., Liu and Motoda, 1998; Hall, 1999) is performed before a formal modeling procedure is invoked, which is helpful when there are too many features (or variables) available for the procedure to handle computationally. However, it is generally not acknowledged that bias is introduced by discarding variables without passing relevant information on to the modeling procedure, though admittedly most modeling procedures cannot make use of this kind of information.

Estimating the noise variance, should it be unknown a priori, is another issue that is affected when only a partial initial model is available. The noise component will contain the effects of all variables that are not included in the partial model. Moreover, because of competition between variables, the OLS estimate of $\sigma^2$ from the partial model is biased downwards. How to compensate for this is an interesting topic worthy of further investigation.
4.5 Remarks on modeling principles

As noted in Section 1.2, pace regression measures the success of modeling using two separate principles. The primary one is accuracy, or the minimization of expected loss. The secondary one is parsimony, or a preference for the smallest model, and is only used when it does not conflict with the first, that is, to decide between several models that have about the same accuracy. The secondary principle is essential if dimensionality reduction is of interest. For example, if the support points found by the clustering step are not all zero, none of the $\tilde A_j$'s updated by PACE₅ and PACE₆ will be zero. But many may be tiny, and eliminating tiny $\tilde A_j$'s has negligible influence on predictive ability.
In the following, we discuss four widely-accepted general principles of modeling by fitting linear models. When the goal is to minimize the expected loss, the ($n$-asymptotic) superiority of pace regression casts doubt on these principles. In fact, so does the existing empirical Bayes methodology, including Stein estimation; however, that methodology does not seem to have been applied to fitting linear models in a way that is directly evaluated on the loss function; see the review in Section 2.8.

First, pace regression challenges the general least squares principle, or perhaps more precisely the unbiasedness principle. All six pace procedures outperform OLS estimation: PACE₁ and PACE₃ in the sense of prior expected loss and the remainder in the sense of posterior expected loss. According to the Gauss-Markov theorem, OLS yields a "best linear unbiased estimator" (or BLUE). OLS's inferiority, however, is due precisely to the unbiasedness constraint, which fails to utilize all the information implicit in the data. It is well known that biased estimators such as subset selection and shrinkage can outperform the OLS estimator in particular situations. We have shown that pace regression estimators, which are also biased, outperform OLS in all situations. In fact, in modern statistical decision theory, the requirement of unbiased estimation is usually discarded.

Second, pace regression challenges the maximum likelihood principle. In the linear regression situation, the maximum likelihood estimator is exactly the same as the OLS one.
Third, some uses of Bayes' rule come under attack. The Bayesian estimator BIC is threshold-based OLS subset selection, where the a priori density is used to determine the threshold. Yet the six pace regression procedures equal or outperform the best OLS subset selection estimator. This improvement is not based on prior information, the general validity of which has long been questioned, but on hitherto unexploited information that is implicit in the very same data. Furthermore, when the non-informative prior is used (and the use of the non-informative prior is widely accepted, even by many non-Bayesians), the Bayes estimator is the same as the maximum likelihood estimator, i.e., the OLS estimator, and therefore inferior to the pace regression estimators. Although hierarchical Bayes (also known as Bayes empirical Bayes), which involves some higher-level, subjective prior, can produce similar results to empirical Bayes (see, e.g., Berger, 1985; Lehmann and Casella, 1998), the two differ essentially in whether they utilize prior information or data information for the distribution of true parameters (or models).
Fourth, questions arise concerning complexity-based modeling. According to the minimum description length (MDL) principle (Rissanen, 1978), the best model is the one that minimizes the sum of the model complexity and the data complexity given the model. In practice the first part is an increasing function of the number of parameters required to define the model, while the second is the resubstitution error. Our analysis and experiments do not support this principle. We have found (see the examples and analysis in Sections 3.4 and 3.5 and the experiments in Chapter 6) that pace regression may choose models of the same or even larger size, and with larger resubstitution errors, than those of other procedures, yet give much smaller prediction errors on independent test sets. In addition, the MDL estimator derived by Rissanen (1978) is the same as the BIC estimator, which has already been shown to be inferior to the pace regression estimators in terms of our prediction criterion.
4.6 Orthogonalization selection

Our analysis assumes that the dimensional models associated with the axes in an orthogonal space are independent. This statement, however, is worth further examination. When a suitable orthogonalization is given before seeing the data (or more precisely, the $\mathbf y$ values), this is clearly true. The problem is that when the orthogonalization is generated from the data, say by a partial-$F$ test, it corresponds to a selection from a group of candidate orthogonalizations. Such selection will definitely affect the pace estimators (and many other types of estimator that employ orthogonalization).

The effect of selecting an orthogonalization can be shown by extreme examples. After eliminating collinearity (see Section 4.3), we could always find an orthogonalization such that $\hat A_1 = \hat A_2 = \cdots = \hat A_n$, in which case the updating of the PACE₅ and PACE₆ estimators will choose the same values as the OLS estimates. Or, possibly, these $\hat A_j$ values could be adjusted, by selecting an appropriate orthogonal basis, to be "distributed" like a uni-component $\chi^2_1$ sample, in which case PACE₅ and PACE₆ will update them to the center of this distribution. Generally speaking, such updating will not achieve the goal of better estimation, because it manipulates the data too much.
Despite these phenomena, we do not suggest that the procedures are only useful when an orthogonalization is provided in advance. Rather, employing the data to set up an orthogonal basis can help to eliminate redundant variables. This phenomenon resembles the issue of model selection (here it is just orthogonalization selection), because both involve manipulating the data and thus affect the estimation. The selection of an orthogonalization is a competition with uncertainty factors in which the winner is the one that best satisfies the selection criterion, e.g., the partial-$F$ test.

Orthogonalization selection influences the proposed estimators, yet its effect remains unclear to us. Although experiments with pace regression produce satisfactory results (see Chapter 6) without taking this effect into account, the theoretical analysis of pace regression is incomplete when the orthogonalization is based upon the data. We leave this problem open: it is a future extension of pace estimation.
Despite the optimality of pace regression given an orthogonal basis, different orthogonal systems result in different estimated models. This obviously contradicts the invariance principle. We consider this question as follows.

An orthogonal system, like any other assumption used in empirical modeling, can be pre-determined by employing prior information (background knowledge) or determined from the data. Which of these is appropriate depends on knowing (or assuming) which helps the modeling most. An orthogonal system that is natural in an application is not necessarily the best to employ. In situations where there are many redundant variables, the data-driven approach based on the partial-$F$ test seems to be a good choice.
4.7 Updating signed $\hat a_j$?

All the proposed pace procedures concern the mixture of the $\hat A_j$'s, or the estimation of $G(A^*)$. An interesting question arises: is it possible, or indeed better, to estimate and use the mixing distribution of the mixture of the signed $\hat a_j$, where the sign of each $\hat a_j$ is determined by the direction of each orthogonal axis? Each $\hat a_j$ is then independent and normally distributed with mean $a_j^*$. Once the mixing distribution, say $G(a^*)$, is obtained from the given $\hat a_j$'s through an estimation procedure, each $\hat a_j$ could be updated in a similar way to the proposed pace estimators. Note that the sign of $\hat a_j$ here is different from that defined in Section 3.3.1. Both $\hat a_j$ and $a_j^*$ are signed values here, whereas previously $a_j^*$ was always nonnegative.
[Figure 4.1: Uni-component mixture]
Employing only the unsigned values implies that any potential influence of the signs is shielded out of the estimation, no matter whether it improves or worsens the estimate. Therefore, it is worth examining situations where signs help and where they hinder. At first glance, using $G(a^*)$ rather than $G(A^*)$ seems to produce better estimates, since it employs more information: information about the sign as well as the magnitude. Nevertheless, our answer to this question is negative: it is possible, but generally not as good. Consider the following examples.

Start from a simple, ideal situation in which the magnitudes of all projections are equal. Further, assume that the directions of the orthogonal axes are chosen so that the signs of $a_1^*, a_2^*, \ldots, a_n^*$ are the same, i.e., $a_1^* = a_2^* = \cdots = a_n^* = a^*$ (taking $a^* > 0$ without loss of generality). Then the mixture is a uni-component normal distribution with mean $a^*$, as shown in Figure 4.1. The best estimator using $G(a^*)$, which is $n$-asymptotically equivalent to using a strongly consistent estimator of $G(a^*)$, can ideally update all $\hat a_j$ to $a^*$, resulting in zero expected loss. In contrast, however, the estimator based on $G(A^*)$ will only update the magnitude, but not the sign; i.e., the data on the negative half real line will keep the same negative sign with the updated magnitude. Obviously in this case the estimator based on $G(a^*)$ is better than the other one. Note this is an ideal situation. Also, for a finite sample, the improvement only takes effect for the data points around the origin.
The second example involves asymmetric mixtures. Since in practice we are unlikely to be so fortunate that all $a_j^*$ have the same sign, consider a less ideal situation: there are more positive $a_j^*$ than negative $a_j^*$. In Figure 4.2, a mixture of two components is displayed, where two-thirds of the $a_j^*$'s are set to $a^*$ and the rest to $-a^*$. In this asymmetric case, the advantage of using $G(a^*)$ over using $G(A^*)$ decreases, because the ratio of data points that change sign after updating decreases. Further, in the finite sample situation, the value of $a^*$ estimated from the set $\{\hat a_j\}$ is less accurate than from the set $\{\hat A_j\}$, since the same number of data points are more sparsely distributed when using $\{\hat a_j\}$ than when using $\{\hat A_j\}$; see also Section 4.2.

[Figure 4.2: Asymmetric mixture]
More importantly, the assumption of an asymmetric mixture implies the existence of correlation between the determined axial directions and the projections of the true response vector. This is equivalent to employing prior information, whose effect, as always, may improve or deteriorate the estimate, depending on how correct it is. For example, if we know beforehand that all the true parameters are positive, we might be able to determine an orthogonal basis that takes this information into account, thus improving estimation. However, more generally, we have no idea whether the directions of the chosen basis are somehow correlated with the directions of the true projections. Without this kind of prior information, we can often expect to obtain a symmetric mixture, as shown in Figure 4.3. Choosing axial directions without any implication about the projection directions is equivalent to choosing the axial directions randomly.

[Figure 4.3: Symmetric mixture]

In the case of symmetric $G(a^*)$, updating involves only updating the magnitude, not the sign, just as when using $G(A^*)$. However, a better estimate of $G(A^*)$ than of $G(a^*)$ is obtained from the same number of data points, because they are more closely packed together. Furthermore, random assignment of signs implies extra, imposed uncertainty, thus deteriorating estimation.

In the above, only one single magnitude value is considered, but the idea extends to situations with multiple or even continuous magnitude values. Without useful prior information to suggest the choice of axial directions, one would expect a symmetric mixture to be obtained. In this case, it is better to employ $G(A^*)$ than $G(a^*)$.
Chapter 5
Efficient, reliable and consistent estimation of an arbitrary mixing distribution

5.1 Introduction

A common practical problem is to fit an underlying statistical distribution to a sample. In some applications, this involves estimating the parameters of a single distribution function, e.g., the mean and variance of a normal distribution. In others, an appropriate mixture of elementary distributions must be found, e.g., a set of normal distributions, each with its own mean and variance. Pace regression, as pointed out in Section 3.6, concerns mixtures of noncentral $\chi^2_1$ distributions.
In many situations, the cumulative distribution function (CDF) of a mixture distribution (or model) has the form

$F_G(x) = \int_{\Theta} F(x; \theta)\, dG(\theta),$  (5.1)

where $\theta \in \Theta$, the parameter space, and $x \in \mathcal{X}$, the sample space. This gives the CDF of the mixture distribution $F_G(x)$ in terms of two more elementary distributions: the component distribution $F(x; \theta)$ and the mixing distribution $G(\theta)$. The former has a single unknown parameter $\theta$,¹ while the latter gives a CDF for $\theta$. For example, $F(x; \theta)$ might be the normal distribution with mean $\theta$ and unit variance, where $\theta$ is a random variable distributed according to $G(\theta)$.

The mixing distribution $G(\theta)$ can be either continuous or discrete. In the latter case, $G(\theta)$ is composed of a number of mass points, say $\theta_1, \ldots, \theta_n$ with masses $w_1, \ldots, w_n$ respectively, satisfying $\sum_{j=1}^{n} w_j = 1$. Then (5.1) can be re-written as

$F_G(x) = \sum_{j=1}^{n} w_j\, F(x; \theta_j),$  (5.2)

each mass point providing a component, or cluster, in the mixture with the corresponding weight; these points are also known as the support points of the mixture. If the number of components $n$ is finite and known a priori, the mixture distribution is called finite; otherwise it is treated as countably infinite. The qualifier "countably" is necessary to distinguish this case from the situation with continuous $G(\theta)$, which is also infinite.
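For instance, evaluating (5.2) for a discrete mixing distribution takes one line. A minimal sketch with $N(\theta_j, 1)$ components, which is the example used above; the inputs are hypothetical:

    import numpy as np
    from scipy.stats import norm

    def mixture_cdf(x, support, weights):
        # F_G(x) = sum_j w_j F(x; theta_j) with N(theta_j, 1) components
        x = np.asarray(x, dtype=float)
        return sum(w * norm.cdf(x - t) for t, w in zip(support, weights))

    print(mixture_cdf([-1.0, 0.0, 2.5], support=[0.0, 3.0], weights=[0.7, 0.3]))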
The main topic in mixture models, which is also the problem that we will address in this chapter, is the estimation of $G(\theta)$ from sampled data that are independent and identically distributed according to the unknown distribution $F_G(x)$. We will focus on the estimation of an arbitrary mixing distribution, i.e., the true $G(\theta)$ "is treated completely unspecified as to whether it is discrete, continuous or in any particular family of distributions; this will be called nonparametric mixture model" (Lindsay, 1995, p.8).
For tackling this problem, two main approaches exist in the literature (see Section 5.2): the maximum likelihood and minimum distance approaches, although the maximum likelihood approach can also be viewed as a special minimum distance approach that uses the Kullback-Leibler distance (Titterington et al., 1985). It is generally believed that the maximum likelihood approach should produce (slightly) better estimation, at the cost of solving complicated nonlinear equations, while the (usual) minimum distance approach, which only requires solving linear equations with linear constraints, offers the advantage of computational efficiency. This advantage makes the minimum distance approach the main interest of our work, because fitting linear models can be embedded in a large modeling system and required repeatedly.

¹ In fact, $\theta$ could be a vector, but we only focus on the simplest, univariate situation, which is all that is needed for pace regression.
The minimum distance approach can be further categorized based on the distance measure used: for example, CDF-based, pdf-based, etc. As we show in more detail in Section 5.5, existing minimum distance methods all suffer from a serious drawback in finite-data situations (the minority cluster problem): small outlying groups of data points can be completely ignored in the clusters that are produced. To rectify this, a new minimum distance method using a nonnegative measure and the idea of local fitting is proposed, which solves the problem while remaining as computationally efficient as other minimum distance methods. Before proposing this method, we generalize the CDF-based method and introduce the probability-measure-based method for reasons of completeness. Theoretical results on the strong consistency of the proposed estimators will also be established. Strong consistency here means that, almost surely, the estimator sequence converges weakly to any given $G(\theta)$ as the sample size approaches infinity, i.e.,

$P\Big(\lim_{m \to \infty} \hat G_m(\theta) = G(\theta) \text{ at every continuity point } \theta \text{ of } G\Big) = 1.$  (5.3)
The minority cluster phenomenon seems to have been overlooked, presumably for three reasons: small amounts of data may be assumed to represent a small loss; a few data points can easily be dismissed as outliers; and in the limit the problem evaporates because the estimators are strongly consistent. However, these reasons are often inappropriate: the loss function may be sensitive to the distance between clusters; the small number of outlying data points may actually represent small, but important, clusters; and any practical clustering situation will necessarily involve finite data. Pace regression is such an example (see Chapter 3).

Note that the maximum likelihood approach (reviewed in Subsection 5.2.1) does not seem to have this problem. The overall likelihood function is the product of the individual likelihood functions, hence any single point that is not in the "neighborhood" of the resulting clusters will make the product zero, which is obviously not the maximum.
Finally, it is worth mentioning that although our application involves estimating the mixing distribution in an empirical modeling setting, there are many other applications. Lindsay (1995) provides an extensive discussion, including known component densities, the linear inverse problem, random effects models, repeated measures models, latent class and latent trait models, missing covariates and data, random coefficient regression models, empirical and hierarchical Bayes, nuisance parameter models, measurement error models, de-convolution problems, robustness and contamination models, over-dispersion and heterogeneity, clustering, etc. Many of these subjects "have [a] vast literature of their own." Books by Titterington et al. (1985) and McLachlan and Basford (1988) also provide extensive treatment, in considerable depth, of general and specific topics in mixture models.

In this chapter we frequently use the term "clustering" for the estimation of a mixing distribution. In our context, both terms have a similar meaning, although in the literature they have slightly different connotations.
The structure of this chapter is as follows. Existing estimation methods for mixing distributions are briefly reviewed in Section 5.2. Section 5.3 generalizes the CDF-based approach, and develops a general proof of the strong consistency of the mixing distribution estimator. This proof is then adapted for measure-based estimators, including the probability-measure-based and nonnegative-measure-based ones, in Section 5.4. Section 5.5 describes the minority cluster problem and illustrates with experiments how the new method overcomes it. Simulation studies on the predictive accuracy of these minimum distance procedures in the case of overlapping clusters are given in Section 5.6.
5.2 Existing estimation methods

In this section, we briefly review the existing methods of the two main approaches to the estimation of a mixing distribution: the maximum likelihood and minimum distance approaches. These methods work generally for all types of component distribution.

The Bayesian approach and its variations are not included, since this methodology is usually used to control the number of resulting support points, or to mitigate the effect of outliers, which in our context does not appear to be a problem.
5.2.1 Maximum likelihood methods

The likelihood function for an independent sample from the mixture distribution has the form

$\mathcal{L}(G) = \prod_{i=1}^{m} \mathcal{L}_i(G),$  (5.4)

where $\mathcal{L}_i(G)$ has the integral form $\int \mathcal{L}_i(\theta)\, dG(\theta)$. Lindsay's (1995) monograph provides excellent coverage of the theory, geometry, computation and applications of maximum likelihood (ML) estimation; see also Bohning (1995) and Lindsay and Lesperance (1995).

The idea of finding the nonparametric maximum likelihood estimator (NPMLE) of a mixing distribution originated with Robbins (1950) in an abstract. It was later substantially developed by Kiefer and Wolfowitz (1956), who provided conditions that ensure consistency of the ML estimator.

Algorithmic computation of the NPMLE, however, only began twenty years later. Laird (1978) proved that the NPMLE is a step function with no more than $m$ steps, where $m$ is the sample size. She also suggested using the EM algorithm (Dempster et al., 1977) to find the NPMLE. An implementation is provided by DerSimonian (1986, 1990). Since the EM algorithm often converges very slowly, this is not a popular method for finding the NPMLE (although it has applications in estimating finite mixtures).
Gradient-based algorithms are more efficient than EM-based ones. Well-known examples include the vertex direction method (VDM) and its variations (e.g., Fedorov, 1972; Wu, 1978a,b; Bohning, 1982; Lindsay, 1983), the vertex exchange method (VEM) (Bohning, 1985, 1986; Bohning et al., 1992), the intra-simplex direction method (ISDM) (Lesperance and Kalbfleisch, 1992), and the semi-infinite programming method (SIP) (Coope and Watson, 1985; Lesperance and Kalbfleisch, 1992). Lesperance and Kalbfleisch (1992) point out that ISDM and SIP are stable and very fast: in one of their experiments, both need only 11 iterations to converge, in contrast to 2177 for VDM and 143 for VEM.

The computation of the NPMLE has been significantly improved within the last twenty-five years. Nevertheless, because it is in essence a nonlinear optimization, the computational cost of all these ML methods increases dramatically as the number of final support points increases, and in practice this number can be as large as the sample size. The minimum distance approach is more efficient, since often it only involves a linear or quadratic mathematical programming problem.
5.2.2 Minimum distance methods

The idea of the minimum distance method is to define some measure of the goodness of the clustering and optimize this by suitable choice of a mixing distribution $G_m(\theta)$ for a sample of size $m$. We generally want the estimator to be strongly consistent as $m \to \infty$, in the sense defined in Section 5.1, for an arbitrary mixing distribution. We also want to take advantage of any special structure of mixtures to come up with an efficient algorithmic solution.

We begin with some notation. Let $x_1, \ldots, x_m$ be a sample chosen according to the mixture distribution, and suppose (without loss of generality) that the sequence is ordered so that $x_1 \le x_2 \le \cdots \le x_m$. Let $G_m(\theta)$ be a discrete estimator of the underlying mixing distribution with a set of support points at $\{\theta_{mj} : j = 1, \ldots, n_m\}$. Each $\theta_{mj}$ provides a component of the final clustering with weight $w_{mj} \ge 0$, where $\sum_{j=1}^{n_m} w_{mj} = 1$. Given the support points, obtaining $G_m(\theta)$ is equivalent to computing the weight vector $\mathbf w_m = (w_{m1}, w_{m2}, \ldots, w_{mn_m})^T$. Denote by $F_{G_m}(x)$ the estimated mixture CDF with respect to $G_m(\theta)$.

Two minimum distance estimators were proposed in the late 1960s. Choi and Bulgren (1968) used

$\frac{1}{m}\sum_{i=1}^{m}\big[F_{G_m}(x_i) - i/m\big]^2$  (5.5)

as the distance measure. Minimizing this quantity with respect to $G_m$ yields a strongly consistent estimator. This estimator can be slightly improved upon by using the Cramer-von Mises statistic

$\frac{1}{m}\sum_{i=1}^{m}\big[F_{G_m}(x_i) - (i - \tfrac{1}{2})/m\big]^2 + \frac{1}{12m^2},$  (5.6)

which essentially replaces $i/m$ in (5.5) with $(i - \tfrac{1}{2})/m$ without affecting the asymptotic result. As might be expected, this reduces the bias in small-sample cases, as was demonstrated empirically by Macdonald (1971) in a note on Choi and Bulgren's paper.
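With the support points fixed, minimizing (5.5) or (5.6) is a least squares problem in the weights under the constraints $w_{mj} \ge 0$ and $\sum_j w_{mj} = 1$. A minimal sketch, assuming $N(\theta, 1)$ components and handling the sum constraint with a heavily weighted penalty row passed to a nonnegative least squares solver (the penalty device is an illustrative choice, not the thesis's algorithm):

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import nnls

    def min_distance_weights(x, support, penalty=1e4):
        """Cramer-von Mises style fit (5.6): choose w >= 0, sum w = 1, so
        that F_Gm(x_i) = sum_j w_j F(x_i; theta_j) tracks (i - 1/2)/m."""
        x = np.sort(np.asarray(x, dtype=float))
        m = x.size
        A = norm.cdf(x[:, None] - np.asarray(support)[None, :])  # F(x_i; theta_j)
        b = (np.arange(1, m + 1) - 0.5) / m
        # append a penalty row enforcing sum_j w_j = 1 approximately
        A_aug = np.vstack([A, penalty * np.ones(len(support))])
        b_aug = np.append(b, penalty)
        w, _ = nnls(A_aug, b_aug)
        return w / w.sum()          # renormalize exactly

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1, 50)])
    support = np.linspace(-2, 6, 17)
    print(np.round(min_distance_weights(x, support), 3))
    # mass concentrates near theta = 0 and theta = 4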
At about the same time, Deely and Kruse (1968) used the sup-norm associated with the Kolmogorov-Smirnov test. The minimization is over

$\max_{1 \le i \le m}\;\max\big\{\,|F_{G_m}(x_i) - (i-1)/m|,\; |F_{G_m}(x_i) - i/m|\,\big\},$  (5.7)

and this leads to a linear programming problem. Deely and Kruse also established the strong consistency of their estimator $G_m$.

Ten years later, the above approach was extended by Blum and Susarla (1977) to approximate density functions, using any function sequence $\{f_m\}$ that satisfies $\|f_m - f_G\|_\infty \to 0$ a.s. as $m \to \infty$, where $f_G$ is the underlying pdf of the mixture. Each $f_m$ can be obtained, for example, by a kernel-based density estimator. Blum and Susarla approximated the function $f_m$ by the overall mixture pdf $f_{G_m}$, and established the strong consistency of the estimator $G_m$ under weak conditions.
For reasons of simplicity and generality, we will denote the approximation between two mathematical entities of the same type by $\approx$, which implies the minimization, with respect to an estimator, of a distance measure between the entities on either side. The types of entity involved in this thesis include vectors, functions and measures, and we use the same symbol $\approx$ for each.

In the work reviewed above, two kinds of estimator are used: CDF-based (Choi and Bulgren; Macdonald; Deely and Kruse) and pdf-based (Blum and Susarla). CDF-based estimators involve approximating an empirical distribution with an estimated one $F_{G_m}$. We write this as

$F_{G_m} \approx F_m,$  (5.8)

where $F_m$ is the Kolmogorov empirical CDF,

$F_m(x) = \frac{1}{m}\sum_{i=1}^{m} \mathbf 1_{[x_i, \infty)}(x),$  (5.9)

or indeed any empirical CDF that converges to it. Pdf-based estimators involve the approximation between probability density functions:

$f_{G_m} \approx f_m,$  (5.10)

where $f_{G_m}$ is the estimated mixture pdf and $f_m$ is the empirical pdf described above.

The entities in (5.8) and (5.10) are functions. When the approximation is computed, however, it is computed between vectors that represent the functions. These vectors contain the function values at a particular set of points, which we call "fitting points." In the work reviewed above, the fitting points are chosen to be the data points themselves.
5.3 Generalizing the CDF-based approach

In this section we generalize the CDF-based approach using the form (5.8). A well-defined CDF-based estimator needs to specify (a) the set of support points, (b) the set of fitting points, (c) the empirical function, and (d) the distance measure. The following generalization, motivated by the considerations given in Subsection 5.3.3, covers all four of these aspects. In Subsection 5.3.1, conditions for establishing the estimator's strong consistency are given. The proof of strong consistency is given in Subsection 5.3.2.

5.3.1 Conditions

Conditions for defining the estimators and ensuring strong consistency are given in this section. They are divided into two groups. The first includes the continuity condition (Condition 5.1), the identifiability condition (Condition 5.2), and Condition 5.3. Their satisfaction needs to be checked in practice. The second group, which can always be satisfied, is used to precisely define the estimator. These are Condition 5.4 for the selection of (potential) support points, Conditions 5.5-5.6 for the distance measure, Condition 5.7 for the empirical function, and Conditions 5.8-5.9 for the fitting points.
Condition 5.1 $F(x; \theta)$ is a continuous function over $\mathcal{X} \times \Theta$.

Condition 5.2 Define $\mathbb{G}$ to be the class of CDFs on $\Theta$. If $F_{G_1} = F_{G_2}$ for $G_1, G_2 \in \mathbb{G}$, then $G_1 = G_2$.

Condition 5.2 is known as the identifiability condition. The identifiability problem of mixture distributions has attracted much research attention since the seminal papers by Teicher (1960, 1961, 1963). References and summaries can be found in Prakasa Rao (1992).

Condition 5.3 Either $\Theta$ is a compact subset of $\mathbb{R}$, or $\lim_{|\theta| \to \infty} F(x; \theta)$ exists for each $x \in \mathcal{X}$ and is not a distribution function on $\mathcal{X}$.
Conditions 5.1-5.3 were initially considered by Robbins (1964), and are essential to establish the strong consistency of the estimator of the mixing distribution given the uniform convergence of the mixture estimator.

Condition 5.4 Define $\mathbb{G}_m$ to be the class of discrete distributions on $\Theta$ with support at $\theta_{m1}, \ldots, \theta_{mn_m}$, where the $\{\theta_{mj}\}$ are chosen so that for any $G \in \mathbb{G}$ there is a sequence $\{G'_m\}$ with $G'_m \in \mathbb{G}_m$ that converges weakly to $G$ a.s.

Condition 5.4 concerns the selection of support points. In fact, a weakly convergent sequence $\{G'_m\}$, instead of an almost surely weakly convergent sequence, is always obtainable. This is because if $\Theta$ is compact, a weakly convergent sequence $\{G'_m\}$ for any $G$ can be obtained by, say, spacing the $\theta_{mj}$ equally throughout $\Theta$; if $\Theta$ is not compact, some kind of mapping onto $\Theta$ of equally spaced points in a compact space can provide such a sequence.

Despite this fact, we adopt a weaker condition here that only requires a sequence which, almost surely, converges weakly to any given $G$. We will discover that this relaxation does not change the conclusion about the strong consistency of the resulting estimator. The advantage of using a weaker condition is that the set of support points $\{\theta_{mj}\}$ can be adapted to the given sample, often resulting in a data-oriented and hence probably smaller set of support points, which implies less computation and possibly higher accuracy. For example, each data point could provide a support point, as suggested by the unicomponent maximum likelihood estimators. This set always contains a sequence $\{G'_m\}$ which converges weakly to any given $G$ a.s.
Let $d(\mathbf a, \mathbf b)$ be a distance measure between two $k$-vectors $\mathbf a$ and $\mathbf b$. This distance is not necessarily a metric; that is, the triangle inequality may not be satisfied. Denote the $L_p$-norm by $\|\cdot\|_p$. Conditions 5.5 and 5.6 provide a wide class of distance measures.

Condition 5.5 Either

(i) $d(\mathbf a, \mathbf b) = \|\mathbf a - \mathbf b\|_\infty$, or

(ii) if $\lim_{k \to \infty} \frac{1}{k}\|\mathbf a - \mathbf b\|_1 \ne 0$, then $\lim_{k \to \infty} d(\mathbf a, \mathbf b) \ne 0$.

Condition 5.6 $d(\mathbf a, \mathbf b) \le \psi(\|\mathbf a - \mathbf b\|_\infty)$, where $\psi(t)$ is a univariate, non-decreasing function and $\lim_{t \downarrow 0}\psi(t) = \psi(0) = 0$.

Examples of distances between two $k$-vectors satisfying Conditions 5.5 and 5.6 are $\|\mathbf a - \mathbf b\|_p / k^{1/p}$ ($0 < p < \infty$), $\|\mathbf a - \mathbf b\|_p^p / k$ ($0 < p < \infty$), and $\|\mathbf a - \mathbf b\|_\infty$.
Let $F_m$ be any empirical function obtained from the observations (not necessarily a CDF) satisfying

Condition 5.7 $\lim_{m \to \infty} \|F_G - F_m\|_\infty = 0$ a.s.

According to the Glivenko-Cantelli theorem, the Kolmogorov empirical CDF satisfies this condition. (In (5.9), $\mathbf 1_B(x)$ is the indicator function: $\mathbf 1_B(x) = 1$ if $x \in B$, and $0$ otherwise.) Other empirical functions which (a.s.) converge uniformly to the Kolmogorov empirical CDF as $m \to \infty$ satisfy this condition as well. Empirical functions other than the Kolmogorov empirical CDF result in the same strong consistency conclusion, but can provide flexibility for better estimation in the finite-data situation. For example, an empirical CDF obtained from the Cramer-von Mises statistic can easily be constructed; it is different from the Kolmogorov empirical CDF but converges uniformly to it.
Denote the chosen set of fitting points on $\mathcal{X}$ by a vector $\mathbf t_m = (t_{m1}, t_{m2}, \ldots, t_{mk_m})^T$. The number of points $k_m$ needs to satisfy

Condition 5.8 $\lim_{m \to \infty} k_m = \infty$.

Let $\mathcal{B}$ be the set of all Borel subsets of $\mathcal{X}$. Denote by $k_m(B)$ the number of fitting points that are located within the subset $B \in \mathcal{B}$, and by $P_G(B)$ the probability measure determined by $F_G$ over $B$.

Condition 5.9 There exists a positive constant $c$ such that, for any $B \in \mathcal{B}$,

$\lim_{m \to \infty} \frac{k_m(B)}{k_m} \ge c\, P_G(B) \quad a.s.$  (5.11)

Conditions 5.8-5.9 are satisfied when the set of fitting points is an exact copy of the set of data points, as in the minimum distance methods reviewed in Subsection 5.2.2. Using a larger set of fitting points when the number of data points is small makes the resulting estimator more accurate within tolerable computational cost. Using fewer fitting points when there are many data points over a small interval decreases the computational burden without sacrificing much accuracy. In the finite-sample situation, Condition 5.9 gives great freedom in the choice of fitting points. For example, more points can be placed around specific areas for better estimation. All this can be done without sacrificing strong consistency.
5.3.2 Estimators

For any discrete distribution $G_m(\theta) = \sum_{j=1}^{n_m} w_{mj}\,\mathbf 1_{[\theta_{mj}, \infty)}(\theta)$ on $\Theta$, the corresponding mixture distribution is

$F_{G_m}(x) = \int F(x; \theta)\, dG_m(\theta) = \sum_{j=1}^{n_m} w_{mj}\, F(x; \theta_{mj}).$  (5.12)

Given a vector of elements from $\mathcal{X}$, for example the set of fitting points $\mathbf t_m$, denote the vectors of the function values $F_{G_m}(x)$ and $F_m(x)$ corresponding to the elements of $\mathbf t_m$ by $F_{G_m}(\mathbf t_m)$ and $F_m(\mathbf t_m)$. The distance between the two vectors $F_{G_m}(\mathbf t_m)$ and $F_m(\mathbf t_m)$ is written

$D_m(G_m) = d\big(F_{G_m}(\mathbf t_m),\, F_m(\mathbf t_m)\big).$  (5.13)

The estimator $\hat G_m$ is defined as the one minimizing $D_m(G_m)$. We have

Theorem 5.1 Under Conditions 5.1-5.9, $\hat G_m \to_w G$ a.s. as $m \to \infty$.
Proof. The proof of the theorem consists of three steps.

(1) $\lim_{m \to \infty} D_m(\hat G_m) = 0$ a.s.

Clearly, for any (discrete or continuous) distribution $H$ on $\Theta$, the distribution $F_H$ on $\mathcal{X}$ is pointwise continuous under Condition 5.1. $G'_m \to_w G$ a.s. (Condition 5.4) implies $F_{G'_m} \to F_G$ pointwise a.s. according to the Helly-Bray theorem. This further implies uniform convergence (Polya, 1920); that is, $\lim_{m \to \infty}\|F_{G'_m} - F_G\|_\infty = 0$ a.s. Because $\lim_{m \to \infty}\|F_G - F_m\|_\infty = 0$ a.s. by Condition 5.7, and

$\|F_{G'_m} - F_m\|_\infty \le \|F_{G'_m} - F_G\|_\infty + \|F_G - F_m\|_\infty,$  (5.14)

then $\lim_{m \to \infty}\|F_{G'_m} - F_m\|_\infty = 0$ a.s. Since $D_m(G'_m) \le \psi(\|F_{G'_m}(\mathbf t_m) - F_m(\mathbf t_m)\|_\infty) \le \psi(\|F_{G'_m} - F_m\|_\infty)$, this means that $\lim_{m \to \infty} D_m(G'_m) = 0$ a.s. by Condition 5.6. Because the estimator $\hat G_m$ is obtained by a minimization procedure over $D_m(\cdot)$, for every $m \ge 1$,

$D_m(\hat G_m) \le D_m(G'_m).$  (5.15)

Therefore, $\lim_{m \to \infty} D_m(\hat G_m) = 0$ a.s.
(2) $\lim_{n\to\infty} \|F_{\hat G_n} - F_G\|_\infty = 0$ a.s.

Let $\omega$ be an arbitrary element of the set $\{\omega : \lim_{n\to\infty} \|F_{\hat G_n} - F_G\|_\infty \ne 0\}$, i.e., $\|F_H - F_G\|_\infty \ne 0$, where $H$ denotes a corresponding limit of $\{\hat G_n\}$. Then there exists $x_0$ with $F_H(x_0) \ne F_G(x_0)$. Because $F_H$ and $F_G$ are both continuous and non-decreasing, and $F_H(-\infty) = F_G(-\infty) = 0$ and $F_H(\infty) = F_G(\infty) = 1$, there must exist an interval $A_0$ in the range of $F_G$ near $F_G(x_0)$ such that the length $\delta_0$ of $A_0$ is positive and $\inf_x |F_H(x) - F_G(x)| \ge \epsilon > 0$ for all $x \in F_G^{-1}(A_0)$, where $F_G^{-1}(A_0)$ denotes the inverse mapping of $F_G$ over $A_0$. (Note that $F_G^{-1}(A_0)$ may not be unique, but the conclusion remains the same.) Because $\|F_G - F_n\|_\infty \to 0$ a.s., when $n$ is sufficiently large $\|F_G - F_n\|_\infty$ can be made arbitrarily small a.s. Therefore, within $F_G^{-1}(A_0)$, $\inf_x |F_H(x) - F_n(x)| \ge \epsilon$ a.s. as $n \to \infty$. By Condition 5.9, $\lim_{n\to\infty} k_n(F_G^{-1}(A_0))/k_n \ge c\, P_G(F_G^{-1}(A_0)) \ge c\,\delta_0$ a.s. By Condition 5.5, either $d_n(H) \ge \epsilon$ a.s. if $d(u, v) = \|u - v\|_\infty$, or $\lim_{n\to\infty} \frac{1}{k_n} \sum_{j=1}^{k_n} |F_H(x_{nj}) - F_n(x_{nj})| \ge c\,\delta_0\,\epsilon$ a.s. if $d(u, v)$ is defined otherwise. In either situation, $d_n(H) \not\to 0$ a.s.

From the last paragraph, the arbitrariness of $\omega$, and the continuity of $F_H$ in a metric space with respect to $\omega$, we have
\[ \Pr\Bigl\{ \lim_{n\to\infty} d_n(\hat G_n) \ne 0 \,\Big|\, \lim_{n\to\infty} \|F_{\hat G_n} - F_G\|_\infty \ne 0 \Bigr\} = 1. \tag{5.16} \]
By step (1), immediately, $\|F_{\hat G_n} - F_G\|_\infty \to 0$ a.s.
(3) $\hat G_n \to G$ a.s.

Robbins (1964) proved that $\|F_{\hat G_n} - F_G\|_\infty \to 0$ a.s. implies that $\hat G_n \to G$ a.s. under Conditions 5.1-5.3. $\Box$

Clearly, the generalized estimator obtained by minimizing (5.13) covers Choi and Bulgren (1968) and the Cramer-von Mises statistic (Macdonald, 1971). It can be further generalized as follows to cover Deely and Kruse (1968), who use two empirical CDFs.
Let $\{F_{nj},\ j = 1, 2, \dots, J\}$ be a set of empirical CDFs, where each $F_{nj}$ satisfies Condition 5.7. Denote
\[ d_{nj}(G_n) = d(F_{G_n}(x_n), F_{nj}(x_n)). \tag{5.17} \]

Theorem 5.2 Let $\hat G_n$ be the estimator obtained by minimizing $\max_{1 \le j \le J} d_{nj}(G_n)$. Then under Conditions 5.1-5.9, $\hat G_n \to G$ a.s.

The proof is similar to that of Theorem 5.1 and thus omitted.
The proof of Theorem 5.1 is inspired by the work of Choi and Bulgren (1968) and Deely and Kruse (1968). The main difference in our proof is the second step. We claim this proof is more general: it covers their results, while their proofs do not cover ours. A similar proof will be used in the next section for measure-based estimators.
5.3.3 Remarks

In the above, we have generalized the estimation of a mixing distribution based on approximating mixture CDFs. As will be shown in Section 5.5, the CDF-based approach does not solve the minority cluster problem. However, this generalization is included because of the following considerations:

1. It is of theoretical interest to show that there actually exists a large class of minimum distance estimators, not just a few individual ones, that can provide strongly consistent estimates of an arbitrary mixing distribution.

2. We would like to formalize the definition of the estimators of a mixing distribution, including the selection of support points and fitting points. For example, Choi and Bulgren (1968) did not discuss how the set of support points $\{\theta_{nj}\}$ should be chosen for their estimator. Clearly, the consistency of the estimator relies on how this set is chosen.

3. A more general class of estimators may provide flexibility and adaptability in practice without losing strong consistency. Condition 5.4 is such an example.

4. The CDF-based approach is a special and simpler case of the measure-based approach. It provides a starting point for the more complicated situation.
5.4 The measure-based approach

This section extends the CDF-based approach to the measure-based one. The idea is to approximate the empirical measure with the estimated measure over intervals, which we call fitting intervals. The two measures are represented by two vectors that contain the values that the measures take over the fitting intervals. Then the distance in the vector space is minimized with respect to the candidate mixing distributions.

Two approaches are considered further: the probability-measure-based (or PM-based) approach and the nonnegative-measure-based (NNM-based) approach. For the former, the empirical measure $P_n$ is approximated by the estimated probability measure $P_{G_n}$, which can be written
\[ F_{G_n} \approx F_n, \tag{5.18} \]
where $G_n$ is a discrete CDF on $\Theta$. For the latter, we abandon the normalization constraint for $G_n$, so that the estimate is only a nonnegative measure and thus not necessarily a probability measure. We write this as
\[ F_{\tilde G_n} \approx F_n, \tag{5.19} \]
where $\tilde G_n$ is a discrete function with nonnegative mass at the support points. $\tilde G_n$ can be normalized afterwards to become a distribution function, if necessary.
As with the CDF-based estimators, we need to specify (a) the set of support points, (b) the set of fitting intervals, (c) the empirical measure, and (d) the distance measure. Only (b) and (c) differ from CDF-based estimation. They are given in Subsection 5.4.1 for the PM-based approach, and modified in Subsection 5.4.3 to fit the NNM-based approach. Strong consistency results for both estimators are established in Subsections 5.4.2 and 5.4.3.
5.4.1 Conditions

For the PM-based approach, we need only specify conditions for the fitting intervals and the empirical measure; the other conditions remain identical to those in Subsection 5.3.1. We first discuss how to choose the empirical measure, then give conditions (Conditions 5.11-5.14) for selecting fitting intervals.

Denote the value of a measure $P$ over an interval $A$ by $P(A)$. Let $P_n$ be an empirical measure (i.e., a measure obtained from observations, but not necessarily a probability measure). Define the corresponding nondecreasing empirical function $F_n$ (hence not necessarily a CDF) as
\[ F_n(x) = P_n((-\infty, x]). \tag{5.20} \]
Clearly $F_n$ and $P_n$ are uniquely determined by each other on Borel sets. We have the following condition for $P_n$.

Condition 5.10 The $F_n$ corresponding to $P_n$ satisfies Condition 5.7.

If $F_n$ is the Kolmogorov empirical CDF as defined in (5.9), the corresponding $P_n$ over a subset $A \in \mathcal{B}$ is
\[ P_n(A) = \frac{1}{n} \sum_{i=1}^{n} 1_A(x_i). \tag{5.21} \]
An immediate result following Condition 5.10 is $\lim_{n\to\infty} |P_G(A) - P_n(A)| = 0$ a.s. for all $A \in \mathcal{B}$.
We determine the set of fitting intervals by the set of right endpoints and a function $\lambda(x)$ which, given a right endpoint, generates the left endpoint of the fitting interval. Given the set of right endpoints $\{x_{n1}, \dots, x_{nk_n}\}$, the set of fitting intervals is determined as $I_n = \{A_{n1}, \dots, A_{nk_n}\} = \{(\lambda(x_{n1}), x_{n1}], \dots, (\lambda(x_{nk_n}), x_{nk_n}]\}$. Setting each interval to be open, closed or semi-open will not change the asymptotic conclusion, because of the continuity condition (Condition 5.1).

In order to establish strong consistency, we require $\lambda(x)$ to satisfy the following.

Condition 5.11 For some constant $c_0 > 0$, $x - \lambda(x) \ge c_0$ for all $x \in \mathcal{X}$, except for those $x$'s such that $\lambda(x) \notin \mathcal{X}$; for the exceptions, let $\lambda(x) = \inf \mathcal{X}$.

An example of such a function is $\lambda(x) = x - c_0$ for $x - c_0 \in \mathcal{X}$, and $\lambda(x) = \inf \mathcal{X}$ otherwise. Further, define
\[ \lambda^k(x) = \begin{cases} x, & \text{if } k = 0; \\ \lambda(\lambda^{k-1}(x)), & \text{if } k > 0. \end{cases} \tag{5.22} \]
The following conditions govern the selection of fitting intervals; in particular, they determine the number and locations of the right endpoints.

Condition 5.12 $\lim_{n\to\infty} k_n = \infty$.

We use an extra CDF $F_0(x)$ to determine the distribution of the right endpoints of the fitting intervals. If possible, we always replace it with $F_G(x)$, so that the fitting intervals can be determined dynamically based on the sample; otherwise, some additional techniques are needed. We will discuss this issue in Subsection 5.4.4.

Condition 5.13 $F_0(x)$ is a strictly increasing CDF throughout $\mathcal{X}$.

Denote by $P_0(A)$ its corresponding probability measure over a subset $A \in \mathcal{B}$. Clearly $P_0(A) \ne 0$ for every $A \in \mathcal{B}$ unless the Lebesgue measure of $A$ is zero. Denote by $k_n(A)$ the number of points in the intersection between the set of right endpoints $\{x_{nj} : j = 1, \dots, k_n\}$ and a subset $A$ of $\mathcal{X}$.

Condition 5.14 There exists a constant $c > 0$ such that, for all $A \in \mathcal{B}$,
\[ \lim_{n\to\infty} \frac{k_n(A)}{k_n} \ge c\, P_0(A) \quad \text{a.s.} \tag{5.23} \]
If $F(x; \theta)$ is strictly increasing throughout $\mathcal{X}$ for every $\theta \in \Theta$, then $F_G(x)$ is strictly increasing throughout $\mathcal{X}$ as well. In this situation we can use $F_G(x)$ to substitute for $F_0(x)$. Using $F_G(x)$ instead of the vague $F_0(x)$ to guide the selection of fitting intervals simplifies the selection and makes it data-oriented. For example, set $k_n = n$ and $x_{nj} = x_j$, $j = 1, \dots, n$. Then we have $\lim_{n\to\infty} k_n(A)/n = P_G(A)$ a.s., thus satisfying Condition 5.14, where $c$ is any value satisfying $0 < c \le 1$.
satisfyingCondition5.14,where � is a valuesatisfying/¿� � �\6 .5.4.2 Approximation with probability measures��LÃ�� denotesthevalueof a probabilitymeasure� over Ãb[ÔÆ . Further, let ��ÕÎq���denotethe vectorof the valuesof the probability measureover the set of fitting
intervals.Our taskis to find anestimator���Z[�v¤� (definedin Condition5.4)of the
mixing distributionCDFwhich minimizesÛ ������"��-0£è���xw ¦ ��ÎR�"�i1,���w��ÎR�"���i2 (5.24)
The proof of the following theorem follows that of Theorem 5.1. The difference is that here we deal with probability measures and a set of fitting intervals of (possibly) finite length as defined by Conditions 5.11-5.14, while Theorem 5.1 tackles CDFs, which use intervals all starting from $-\infty$.

Theorem 5.3 Under Conditions 5.1-5.6 and 5.10-5.14, $\hat G_n \to G$ a.s. as $n \to \infty$.
The proof of this theorem needs two lemmas.

Lemma 2 Under Conditions 5.1 and 5.11, if $\|F_H - F_G\|_\infty \ne 0$ for $H, G \in \mathcal{G}$, there exists a point $x_0$ such that $P_H((\lambda(x_0), x_0]) \ne P_G((\lambda(x_0), x_0])$.

Proof of Lemma 2. Assume $P_H((\lambda(x), x]) = P_G((\lambda(x), x])$ for every $x \in \mathcal{X}$. Then $F_H(x) = \sum_{k=0}^{\infty} P_H((\lambda^{k+1}(x), \lambda^k(x)])$ and $F_G(x) = \sum_{k=0}^{\infty} P_G((\lambda^{k+1}(x), \lambda^k(x)])$ for every $x \in \mathcal{X}$ (note that since $F_H$ and $F_G$ are continuous, the probability measure over a countable set of singletons is zero). Thus $F_H(x) = F_G(x)$ for all $x \in \mathcal{X}$. This contradicts $\|F_H - F_G\|_\infty \ne 0$, thus completing the proof. $\Box$

Lemma 3 If there is a point $x_0 \in \mathcal{X}$ such that $P_H((\lambda(x_0), x_0]) \ne P_G((\lambda(x_0), x_0])$ for $H, G \in \mathcal{G}$, then given Conditions 5.1, 5.5, 5.10-5.14, $\lim_{n\to\infty} d_n(H) \ne 0$ a.s.

Proof of Lemma 3. Since $F_H(x)$ and $F_G(x)$ are both continuous, $P_H((\lambda(x), x])$ and $P_G((\lambda(x), x])$ are continuous functions of $x$. There must be an interval $A$ of nonzero length near $x_0$ such that for every point $x \in A$, $P_H((\lambda(x), x]) - P_G((\lambda(x), x]) \ne 0$. According to the requirement for $F_0(x)$, we can assume $P_0(A) = \delta > 0$. By Conditions 5.12 and 5.14, no fewer than $c\,\delta\,k_n$ points are chosen as right endpoints of fitting intervals from $A$ a.s. as $n \to \infty$. Since
\[ \|P_H(I_n) - P_n(I_n)\|_\infty \ge \|P_H(I_n) - P_G(I_n)\|_\infty - \|P_G(I_n) - P_n(I_n)\|_\infty \tag{5.25} \]
and $\lim_{n\to\infty} \|P_G(I_n) - P_n(I_n)\|_\infty = 0$, we have $\lim_{n\to\infty} \|P_H(I_n) - P_n(I_n)\|_\infty \ge \lim_{n\to\infty} \|P_H(I_n) - P_G(I_n)\|_\infty > 0$ a.s. Also, $\lim_{n\to\infty} \frac{1}{k_n} \|P_H(I_n) - P_n(I_n)\|_1 > 0$ a.s. as $n \to \infty$. This implies $\lim_{n\to\infty} d_n(H) \ne 0$ a.s. by Condition 5.5. $\Box$
(1). �n¸t¹� º¼» Û ������"�~-9/ a.s.
Following the first stepin the proof of Theorem5.1, we obtain �n¸t¹)� º¼» lnl ��w Ȧ p���¤ltl » -â/ a.s. Immediately, �n¸n¹)� º¼» lnl �xw©È¦ p ���èlnl » -U/ a.s.for all Borel sets
on s . SinceÛ �w��� q� �Ô� ���,lnl ��w©È¦ ��ÎR�"� p ���w�ÕÎR�"�'ltl » �Ç� ����ltl �xw©È¦ p ���èlnl » � , this
meansthat �n¸t¹� º¼» Û ���� q� ��-9/ a.s.by Condition5.6.Becausetheestimator��� is
obtainedby aminimizationprocedureofÛ �w��v5� , for every E ¾ 6Û ��������;� Û �w��� q� ��2 (5.26)
Hence�t¸n¹� º¼» Û �w�����"�k-H/ a.s.
(2). �n¸t¹� º¼» lnl ��w ¦ p �xw.lnl » -H/ a.s.
104
Let beanarbitraryelementin theset Wu���(��Ef| } Y"�n¸t¹)� º¼» lnl �xw ¦ p �xwÕlnl » ¸-0/ _ ,i.e. lnl ��¯ p ��wÕltl » ¸- / . According to Lemmas2 and 3, �n¸t¹)� º¼» Û ��� J� ¸- /a.s. Therefore,dueto the arbitrarinessof , the continuity of ��¯ , andstep(1),lnl ��w ¦ p ��w.ltl » | / a.s.,as Ef| } .
(3). ���û|rq � a.s.
Theproof is thesameasthethird stepin theproof of Theorem5.1. AIn Subsection5.4.1,eachfitting interval à ± is uniquelydeterminedby theright end-
point % � ± —theleft endpointis givenby a function Í�� % � ± � . If we insteaddetermine
eachfitting interval in thereverseorder—i.e., giventhe left endpoint,computethe
right endpoint—thefollowing theoremis obtained.The function Í���{¤� for comput-
ing theright endpointcanbedefinedanalogouslyto Subsection5.4.1.Theproof is
similar andomitted.
Theorem 5.4 Let $\hat G_n$ be the estimator described in the last paragraph. Then under the same conditions as in Theorem 5.3, $\hat G_n \to G$ a.s. as $n \to \infty$.

Further, let the ratio between the number of intervals in a subset of the set of fitting intervals and $k_n$ be greater than a positive constant as $n \to \infty$, and let the intervals in this subset be either all of the form $(\lambda(x_{nj}), x_{nj}]$ or all of the form $[x_{nj}, \lambda(x_{nj}))$, and satisfy Conditions 5.13 and 5.14. We have the following theorem, whose proof is similar and omitted too.

Theorem 5.5 Let $\hat G_n$ be the estimator described in the last paragraph. Then under the same conditions as in Theorem 5.3, $\hat G_n \to G$ a.s. as $n \to \infty$.

Theorem 5.5 provides flexibility for practical applications with finite samples. This issue will be discussed in Subsection 5.4.4.
In fact, $F_0(x)$ can always be replaced with $F_G(x)$ in Condition 5.14 for establishing the theorems in this subsection, even if $F(x; \theta)$ is not strictly increasing throughout $\mathcal{X}$. This is because the normalization constraint of the estimator $G_n$ ensures that $F_H$ in Lemma 3 is a distribution function on $\mathcal{X}$, thus $F_H(-\infty) = F_G(-\infty) = 0$ and $F_H(\infty) = F_G(\infty) = 1$. This implies that if at any point $x_0$, $F_H(x_0) \ne F_G(x_0)$, there must exist an interval $A$ near $x_0$, satisfying $P_G(A) \ne 0$, such that $F_H(x) - F_G(x) \ne 0$ for all $x \in A$; this can be used to establish Lemma 3.

We will see that the estimator investigated in the next subsection cannot always replace $F_0(x)$ with $F_G(x)$.

Barbe (1998) investigated the properties of mixing distribution estimators obtained by approximating the empirical probability measure with the estimated probability measure. He established theoretical results based on certain properties of the estimator. However, he did not develop an operational estimation scheme. The estimators discussed in this subsection also form a wider class than the one he used. For example, $P_n$ may not be a probability measure.
5.4.3 Approximation with nonnegative measures

In the last subsection, we discussed the approximation of empirical measures with estimated probability measures. Now we consider throwing away the normalization constraint. As in (5.19), $\tilde G_n$ is a discrete function with nonnegative mass at the support points. Let $\tilde G_n(\theta) = \sum_{j=1}^{m_n} \tilde w_{nj} 1_{[\theta_{nj}, \infty)}(\theta)$ with every $\tilde w_{nj} \ge 0$. The mixture function becomes
\[ F_{\tilde G_n}(x) = \int F(x; \theta)\, d\tilde G_n(\theta) = \sum_{j=1}^{m_n} \tilde w_{nj} F(x; \theta_{nj}). \tag{5.27} \]
Note that the $\tilde w_{nj}$'s are not assumed to sum to one, hence neither $\tilde G_n$ nor $F_{\tilde G_n}$ is a CDF. Accordingly, the corresponding nonnegative measure is not a probability measure. Denote this measure by $P_{\tilde G_n}(A)$ for any $A \in \mathcal{B}$. The estimator $\tilde G_n$ is the one that minimizes
\[ d_n(\tilde G_n) = d(P_{\tilde G_n}(I_n), P_n(I_n)). \tag{5.28} \]
Of course, $\tilde G_n$ can be normalized afterwards.
We will establish the strong consistency of $\tilde G_n$. Based on the work in the last subsection, the result is almost straightforward. We need to use the general meaning of weak convergence for functions, instead of CDFs. In Condition 5.2, $\mathcal{G}$ is changed to $\tilde{\mathcal{G}}$, the class of finite, nonnegative, nondecreasing functions $\tilde G$ defined on $\Theta$ such that $\tilde G(-\infty) = 0$ for every $\tilde G \in \tilde{\mathcal{G}}$. Also, in Condition 5.4, $\mathcal{G}_n$ is replaced with $\tilde{\mathcal{G}}_n$. Identifiability is now about general functions. In fact, we have

Lemma 4 The identifiability of the mixture $F_G$, where $G \in \mathcal{G}$, implies the identifiability of the mixture $F_{\tilde G}$, where $\tilde G \in \tilde{\mathcal{G}}$.

Proof. Assume $F_{\tilde G} = F_{\tilde H}$, where $\tilde G, \tilde H \in \tilde{\mathcal{G}}$. Then $F_{\tilde G}(\infty) = F_{\tilde H}(\infty)$, i.e., $\int_\Theta d\tilde G = \int_\Theta d\tilde H$. Let $G = \tilde G / \int_\Theta d\tilde G$ and $H = \tilde H / \int_\Theta d\tilde H$. Hence $F_G = F_H$. Since both $G$ and $H$ are CDFs on $\Theta$, $G = H$, or equivalently, $\tilde G = \tilde H$, thus completing the proof. $\Box$

Also, we need a lemma which is similar to Theorem 2 in Robbins (1964).

Lemma 5 Under Conditions 5.1-5.3, with the changes described in the paragraph before Lemma 4, if $\lim_{n\to\infty} \|F_{\tilde G_n} - F_G\|_\infty = 0$ a.s., then $\tilde G_n \to G$ a.s.

The proof is exactly the same as that of Theorem 2 in Robbins (1964) and is thus omitted.
Lemma 6 Let $\hat G_n$ be the CDF obtained by normalizing $\tilde G_n$, and let $\tilde G_n \to G$ a.s. as $n \to \infty$, where $G \in \mathcal{G}$. Then $\hat G_n \to G$ a.s. as $n \to \infty$.

Let $\tilde G_n$ be the solution obtained by minimizing $d_n(\tilde G_n)$ in (5.28). We have

Theorem 5.6 Under similar conditions with the changes mentioned above, the conclusions of Theorems 5.3-5.5 remain true if $\hat G_n$ is replaced with $\tilde G_n$ (or with the estimator obtained by normalizing $\tilde G_n$).

The proof of Theorem 5.6 follows exactly that of Theorem 5.3. For step (3), Lemma 5 should be used instead.
It was shown in the last subsection that $F_0(x)$ can always be replaced with $F_G(x)$ in Condition 5.14 to establish strong consistency for $\hat G_n$ obtained from (5.18). This is, however, not so for $\tilde G_n$ obtained from (5.19). For example, if $F(x; \theta)$ is bounded on $\mathcal{X}$ (e.g., a uniform distribution), then $F_G(x)$ may have intervals with zero $F_G$-measure. Mixing distributions with or without components bounded within zero $F_G$-measure intervals have the same $d_n(\cdot)$ value as $n \to \infty$.
5.4.4 Considerations for finite samples

From the previous subsections, it becomes clear that there is a large class of strongly consistent estimators. The behavior of these estimators, however, differs radically in the finite sample situation. In the discussion of a few finite sample issues that follows, we will see that, without careful design, some estimators can perform badly. We take an algorithmic viewpoint and ensure that the estimation is workable, fast and accurate in practice. The conditions carefully defined in Subsections 5.3.1 and 5.4.1 provide such flexibility and adaptability.

Of all the minimum distance estimators proposed so far, only the nonnegative-measure-based one is able to solve the minority cluster problem (see Section 5.5), in the sense that it can always reliably detect small clusters whose locations are far away from the dominant data points. The experiments conducted in Sections 5.5 and 5.6 also seem to suggest that these estimators are generally more accurate and stable than other minimum distance estimators. We focus on the NNM-based estimator.
The conditions in Subsection 5.3.1 show that, without losing strong consistency, the set of support points and the set of fitting intervals can be determined without taking the data points into account. Nevertheless, unless sufficient prior knowledge is available about the support points and the distribution of the data points, such an independent procedure is unlikely to provide satisfactory results. An algorithm that determines the two sets before seeing the data would not work in many general cases, because the underlying mixing distribution changes from application to application. The sample needs to be considered when determining the two sets.

The support points can be suggested by the sample and the component distribution. As mentioned earlier, each data point can generate a support point using the unicomponent maximum likelihood estimator. For example, for the mixture of normal distributions where the mean is the mixing parameter, data points can be taken directly as support points.

Each data point can be taken as an endpoint of a fitting interval, while the function $\lambda(x)$ can be decided based on the component distribution. In the above example of a mixture of normal distributions, we can set, say, $\lambda(x) = x - 3\sigma$. Note that fitting intervals should not be chosen so large that data points may exert an influence on the approximation at a remote place, nor so small that the empirical (probability) measures are not accurate enough.

The number of support points and the number of fitting intervals can be adjusted according to Conditions 5.4 and 5.12-5.14. For example, when there are few data points, more support points and fitting intervals can be used to get a more accurate estimate within a tolerable computational cost; when there are a large number of data points in a small region, fewer support points and fitting intervals can be used, which substantially decreases the computational cost while keeping the estimate accurate.
Basing the estimate on solving (5.19) has computational advantages in some cases. The solution of an equation like
\[ \begin{pmatrix} A' & 0 \\ 0 & A'' \end{pmatrix} \begin{pmatrix} w' \\ w'' \end{pmatrix} \approx \begin{pmatrix} b' \\ b'' \end{pmatrix}, \tag{5.29} \]
subject to $w' \ge 0$ and $w'' \ge 0$, is the same as combining the solutions of the two sub-equations $A' w' \approx b'$ subject to $w' \ge 0$, and $A'' w'' \approx b''$ subject to $w'' \ge 0$. Each sub-equation can be further partitioned. This implies that clusters separated from one another can be estimated separately. Further, normalizing the separated clusters' weight vectors before recombining them can produce a better estimate. The computational burden decreases significantly as the extent of separation increases. The best case is when each data point is isolated from every other one, suggesting that each point represents a cluster comprising a single point. In this case, the computational complexity is only linear.
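As a concrete illustration of this decomposition, the following R sketch (our own illustration with hypothetical names, not code from the thesis) splits a sorted sample into independent blocks wherever a gap exceeds the component's sphere of influence; each block then yields its own, smaller nonnegative least squares problem as in (5.29), whose solutions can be normalized before recombination.

    ## Split a sample into blocks separated by gaps wider than the
    ## component distribution's sphere of influence (here taken to be
    ## 6 standard deviations for a normal component). Each block can be
    ## estimated independently and the weight vectors normalized before
    ## recombining, as discussed above.
    nnm.split <- function(x, sd = 1, gap = 6 * sd) {
      x <- sort(x)
      block <- cumsum(c(1, diff(x) > gap))  # new block index after each wide gap
      split(x, block)                       # list of sub-samples, one per block
    }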
We wish to preserve strong consistency, and to determine the support points and fitting intervals from the data. Under our conditions, however, an extra CDF $F_0(x)$, which should be strictly increasing, is needed to determine the fitting intervals. It can be replaced with $F_G(x)$ when $F(x; \theta)$ is strictly increasing, making the selection data-oriented. But when $F(x; \theta)$ is not strictly increasing, $F_G(x)$ cannot substitute for $F_0(x)$. Since the selection of fitting intervals should make good use of the sample, this is a dilemma.

However, according to Theorem 5.5 (and its counterpart for $\tilde G_n$, Theorem 5.6), $F_0(x)$ can be a CDF used to determine only part of all the fitting intervals, no matter how small the ratio is. This gives complete flexibility for almost anything we want to do in the finite sample situation. The only problem is that if $F_0(x)$ is used as the selection guide for fitting intervals, unwanted components may appear at locations where no data points are nearby. This can be overcome by providing enough fitting intervals at locations that may produce components. This, in turn, suggests that support points should avoid locations where there are no data points around. Also, this reduces the number of fitting intervals and hence the computational cost.
In fact, for finite samples, using the strictly increasing $F_0(x)$ as the selection guide for fitting intervals suffers from the problem described in the last paragraph. Almost all distributions have dominant parts and dwindle quickly away from them, hence they are effectively bounded in finite sample situations. Therefore the strategy discussed in the last paragraph for bounded distributions should be used for all distributions: the determination of fitting intervals needs to take the support points into account, so that all regions within these support points' sphere of influence are covered by a large enough number of fitting intervals. The information about the support points' sphere of influence is available from the component distribution and the number of data points. For regions on which the support points have tiny influence in terms of probability value, it is not necessary to provide fitting intervals, because the equation
\[ \begin{pmatrix} A \\ 0 \end{pmatrix} w \approx \begin{pmatrix} b \\ 0 \end{pmatrix} \tag{5.30} \]
has the same solution as $A w \approx b$.
hasthesamesolutionas é ê ë ì å á í â á .Finally, in termsof distancemeasures,notall of thosedeterminedby Conditions5.5
and5.6 canprovide straightforward algebraicsolutionsto (5.18)and(5.19). The
chosendistancemeasuredeterminesthe type of the mathematicalprogramming
problem. If a quadraticdistancemeasureis chosen,a leastsquaressolutionun-
der constraintswill be pursued.This approachcanbe viewed asan extensionof
Choi andBulgren(1968)from CDF-basedto measure-based.Elegantandefficient
methodsLSE andNNLS providedin LawsonandHanson(1974,1995)canbeem-
ployed. If the sup-normdistancemeasureis used,solutionscanbe found by the
efficient simplex methodin linear programming.This approachcanbe viewedas
anextensionof DeelyandKruse(1968).
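To make the quadratic-distance case concrete, here is a minimal R sketch of the NNM estimator for a normal location mixture with unit-variance components. It is our own illustration, not the thesis implementation: the support points and the fitting intervals $(x_i - 3\sigma, x_i]$ follow the crude choices discussed above, the function name nnm.estimate is ours, and the add-on nnls package (an R implementation of the Lawson-Hanson algorithm) is assumed to be available.

    library(nnls)   # Lawson-Hanson nonnegative least squares

    ## NNM estimate of the mixing distribution for a normal location
    ## mixture with known unit-variance components: solve (5.28) with a
    ## quadratic distance, i.e. minimize ||A w - b||^2 subject to w >= 0.
    nnm.estimate <- function(x, sd = 1, width = 3 * sd) {
      theta <- sort(x)              # support points: one per data point
      right <- theta                # right endpoints of fitting intervals
      left  <- right - width        # left endpoints: lambda(x) = x - 3*sigma

      ## b[i]: empirical measure of fitting interval i, P_n((left_i, right_i])
      b <- sapply(seq_along(right),
                  function(i) mean(x > left[i] & x <= right[i]))

      ## A[i, j]: measure that component j assigns to fitting interval i
      A <- outer(seq_along(right), seq_along(theta),
                 function(i, j) pnorm(right[i], theta[j], sd) -
                                pnorm(left[i],  theta[j], sd))

      w <- nnls(A, b)$x             # nonnegative weights, no sum-to-one constraint
      list(support = theta, weight = w / sum(w))   # normalized afterwards
    }

Any nonnegative least squares solver could replace nnls here; the essential point is that the normalization constraint is absent from the minimization and imposed, if at all, only after the weights have been found.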
5.5 The minority cluster problem

This section discusses the minority cluster problem, and illustrates it with simple experiments. The minimum distance approach based on nonnegative measures provides a solution, while the other minimum distance approaches fail. This problem is vital for pace regression, since ignoring even one isolated data point in the estimated mixing distribution can cause unbounded loss. The discussion below is intended to be intuitive and practical.
5.5.1 The problem

Although they perform well asymptotically, the minimum distance methods described above suffer from the finite-sample problem discussed earlier: they can neglect small groups of outlying data points, no matter how far these lie from the dominant data points. The underlying reason is that the objective function to be minimized is defined globally rather than locally. A global approach means that the value of the estimated probability density function at a particular place will be influenced by all data points, no matter how far away they are. This can cause small groups of data points to be ignored even if they are a long way from the dominant part of the data sample. From a probabilistic point of view, however, there is no reason to subsume distant groups within the major clusters just because they are relatively small.
The ultimate effect of suppressing distant minority clusters depends on how the clustering is applied. If the application's loss function depends on the distance between clusters, the result may prove disastrous, because there is no limit to how far away these outlying groups may be. One might argue that small groups of points can easily be explained away as outliers, because the effect will become less important as the number of data points increases, and will disappear in the limit of infinite data. However, in a finite-data situation (and all practical applications necessarily involve finite data), the "outliers" may equally well represent small minority clusters. Furthermore, outlying data points are not really treated as outliers by these methods: whether or not they are discarded is merely an artifact of the global fitting calculation. When clustering, the final mixture distribution should take all data points into account, including outlying clusters if any exist. If practical applications demand that small outlying clusters be suppressed, this should be done in a separate stage.
In distance-based clustering, each data point has a far-reaching effect because of two global constraints. One is the use of the cumulative distribution function; the other is the normalization constraint $\sum_{j=1}^{m_n} w_{nj} = 1$. These constraints may sacrifice a small number of data points, at any distance, for a better overall fit to the data as a whole. Choi and Bulgren (1968), the Cramer-von Mises statistic (Macdonald, 1971), and Deely and Kruse (1968) all enforce both the CDF and the normalization constraints. Blum and Susarla (1977) drop the CDF, but still enforce the normalization constraint. The result is that these clustering methods are only appropriate for finite mixtures without small clusters, where the risk of suppressing clusters is low. Here we address the general problem for arbitrary mixtures. Of course, the minority cluster problem exists for all types of mixture, including finite mixtures.
5.5.2 The solution

Now that the source of the problem has been identified, the solution is clear, at least in principle: drop both the approximation of CDFs, as Blum and Susarla (1977) do, and the normalization constraint, no matter how seductive it may seem. This consideration leads to the nonnegative-measure-based approach described in Subsection 5.4.3.

To define the estimation procedure fully, we need to determine (a) the set of support points, (b) the set of fitting intervals, (c) the empirical measure, and (d) the distance measure. The theoretical analysis in the previous sections guarantees a strongly consistent estimator; here we discuss these choices in an intuitive manner.
Support points. The support points are usually suggested by the data points in the sample. For example, if the component distribution $F(x; \theta)$ is the normal distribution with mean $\theta$ and unit variance, each data point can be taken as a support point. In fact, the support points are more accurately described as potential support points, because their associated weights may become zero after solving (5.19); in practice, many often do.

Fitting intervals. The fitting intervals are also suggested by the data points. In the normal distribution example with known standard deviation $\sigma$, each data point $x_i$ can provide one interval, such as $[x_i - 3\sigma, x_i]$, or two, such as $[x_i - 3\sigma, x_i]$ and $[x_i, x_i + 3\sigma]$, or more. There is no problem if the fitting intervals overlap. Their length should not be so large that points can exert an influence on the clustering at an unduly remote place, nor so small that the empirical measure is inaccurate. The experiments reported below use intervals of a few standard deviations around each data point, and, as we will see, this works well.
Empirical measure. The empirical measure can be the probability measure determined by the Kolmogorov empirical CDF, or any measure that converges to it. The fitting intervals discussed above can be open, closed, or semi-open. This will affect the empirical measure if data points are used as interval boundaries, although it does not change the values of the estimated measure, because the corresponding distribution is continuous. In small-sample situations, bias can be reduced by careful attention to this detail, as Macdonald (1971) discusses with respect to Choi and Bulgren's (1968) method.

Distance measure. The choice of distance measure determines what kind of mathematical programming problem must be solved. For example, a quadratic distance gives rise to a least squares problem under linear constraints, whereas the sup-norm gives rise to a linear programming problem that can be solved using the simplex method. These two measures have efficient solutions that are globally optimal.
5.5.3 Experimental illustration

Experiments are conducted in the following to illustrate the failure of existing minimum distance methods to detect small outlying clusters, and the improvement achieved by the new scheme. The results also suggest that the new method is more accurate and stable than the others.

When comparing clustering methods, it is not always easy to evaluate the clusters obtained. To finesse this problem we consider simple artificial situations in which the proper outcome is clear. Some practical applications of clustering do provide objective evaluation functions; one example is our setting of empirical modeling (see Section 5.6).
The methods used are Choi and Bulgren (1968) (denoted CHOI), Macdonald's application of the Cramer-von Mises statistic (CRAMER), the method with the probability measure (PM), and the method with the nonnegative measure (NNM). In each case, equations involving non-negativity and/or linear equality constraints are solved as quadratic programming problems using the elegant and efficient procedures NNLS and LSEI provided by Lawson and Hanson (1974, 1995). All four methods have the same computational time complexity.
Experiment 5.1 Reliability of clustering algorithms. We set the sample size $n$ to 100 throughout the experiments. The data points are artificially generated from a mixture of two clusters: $n_1$ points from $N(0, 1)$ and $n_2$ points from $N(100, 1)$. The values of $n_1$ and $n_2$ are in the ratios 99:1, 97:3, 95:5, 80:20 and 50:50.

Every data point is taken as a potential support point in all four methods: thus the number of potential components in the clustering is 100. For PM and NNM, fitting intervals need to be determined. In the experiments, each data point $x_i$ provides two fitting intervals, $[x_i - 3, x_i]$ and $[x_i, x_i + 3]$. Any data point located on the boundary of an interval is counted as half a point when determining the empirical measure over that interval.
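A single run of this experiment can be replicated along the following lines; this is an illustrative sketch building on the nnm.estimate function sketched in Subsection 5.4.4 (our own names, not the actual experimental code).

    ## One run of Experiment 5.1 with n1 : n2 = 99 : 1.
    n1 <- 99; n2 <- 100 - n1
    x  <- c(rnorm(n1, mean = 0), rnorm(n2, mean = 100))

    fit    <- nnm.estimate(x)                     # NNM clustering
    w2.hat <- sum(fit$weight[fit$support > 50])   # weight of the minority cluster
    failed <- (w2.hat == 0)                       # did we miss the small cluster?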
These choices are admittedly crude, and further improvements in the accuracy and speed of PM and NNM are possible that take advantage of the flexibility provided by (5.18) and (5.19). For example, accuracy will likely increase with more, and more carefully chosen, support points and fitting intervals. The fact that the method performs well even with crudely chosen support points and fitting intervals testifies to its robustness.
Our primary interest in this experiment is the weights of the clusters that are found. To cast the results in terms of the underlying models, we use the cluster weights to estimate values for $w_1 = n_1/n$ and $w_2 = n_2/n$. Of course, the results often do not contain exactly two clusters; but because the underlying cluster centers, 0 and 100, are well separated compared to their standard deviation of 1, it is highly unlikely that any data points from one cluster will fall anywhere near the other. Thus we use a threshold of 50 to divide the clusters into two groups: those near 0 and those near 100. The final cluster weights are normalized, and the weights for the first group are summed to obtain an estimate $\hat w_1$ of $w_1$, while those for the second group are summed to give an estimate $\hat w_2$ of $w_2$.

                                true n1 : n2
                      99:1      97:3      95:5      80:20     50:50

  CHOI    Failures    86        42        4         0         0
          w1/w2 (%)   99.9/0.1  99.2/0.8  95.8/4.2  82.0/18.0 50.6/49.4
          SD(w1)      0.36      0.98      1.71      1.77      1.30

  CRAMER  Failures    80        31        1         0         0
          w1/w2 (%)   99.8/0.2  98.6/1.4  95.1/4.9  81.6/18.4 49.7/50.3
          SD(w1)      0.50      1.13      1.89      1.80      1.31

  PM      Failures    52        5         0         0         0
          w1/w2 (%)   99.8/0.2  98.2/1.8  94.1/5.9  80.8/19.2 50.1/49.9
          SD(w1)      0.32      0.83      0.87      0.78      0.55

  NNM     Failures    0         0         0         0         0
          w1/w2 (%)   99.0/1.0  96.9/3.1  92.8/7.2  79.9/20.1 50.1/49.9
          SD(w1)      0.01      0.16      0.19      0.34      0.41

Table 5.1: Experimental results for detecting small clusters
Table 5.1 shows the results for each of the four methods. Each cell represents one hundred separate experimental runs. Three figures are recorded. At the top is the number of times the method failed to detect the smaller cluster, that is, the number of times $\hat w_2 = 0$. In the middle are the average values of $\hat w_1$ and $\hat w_2$. At the bottom is the standard deviation of $\hat w_1$ and $\hat w_2$ (which are equal). These three figures can be thought of as measures of reliability, accuracy and stability, respectively.
The top figures in Table 5.1 show clearly that only NNM is always reliable, in the sense that it never fails to detect the smaller cluster. The other methods fail mostly when $n_2 = 1$; their failure rates gradually decrease as $n_2$ grows. The center figures show that, under all conditions, NNM gives a more accurate estimate of the correct values of $w_1$ and $w_2$ than the other methods. As expected, CRAMER shows a noticeable improvement over CHOI, but it is very minor. The PM method has lower failure rates and produces estimates that are more accurate and far more stable (indicated by the bottom figures) than those for CHOI and CRAMER, presumably because it is less constrained. Of the four methods, NNM is clearly and consistently the winner in terms of all three measures: reliability, accuracy and stability.
The results of the NNM method can be further improved. If the decomposed form (5.29) is used instead of (5.19), and the solutions of the sub-equations are normalized before being combined (which is feasible because the two underlying clusters are so distant from each other), the correct values are obtained for $\hat w_1$ and $\hat w_2$ in virtually every trial.
5.6 Simulation studies of accuracy

In this section, we study the predictive accuracy of clustering procedures through simulations. Since the component distribution has a limited sphere of influence in finite sample situations, data points should have little effect on the estimation of the probability density function at far away points. Because NNM uses local fitting, it may also have some advantage in accuracy over procedures that adopt global fitting, even in the case of overlapping clusters. We describe the results of two experiments.

Unlike supervised learning, there seems to be no widely accepted criterion for evaluating clustering results (Hartigan, 1996). However, for the special case of estimating a mixing distribution, it is reasonable to employ the quadratic loss function (2.6) used in empirical Bayes or, equivalently, the loss function (3.4) in pace regression (when the variance is a known constant), and this is what we adopt. More details concerning their use are given later.
Experiment 5.2 investigates a mixture of normal distributions with the mean as the mixing parameter. The underlying mixing distribution is normal as well; this information is not used by the clustering procedures we tested. This type of mixture has many practical applications in empirical Bayes analysis, as well as in Bayesian analysis; see, for example, Berger (1985), Maritz and Lwin (1989), and Carlin and Louis (1996). The second experiment considers a mixture of noncentral $\chi^2_1$ distributions with the noncentrality parameter as the mixing one. The underlying mixing distribution follows Breiman (1995), who uses it for subset selection. This experiment closely relates to pace regression.
Unlike Experiment 5.1, the data points in Experiments 5.2 and 5.3 are mixed so closely that there is no clear grouping. Therefore, reliability is not a concern for them. It is worth mentioning, however, that ignoring even one isolated point in other situations, as in Experiment 5.1, may dramatically increase the loss.
Experiment 5.2 Accuracy of clustering procedures for mixtures of normal distributions. We consider the mixture distribution
\[ x_i \mid \mu_i \sim N(\mu_i, 1), \qquad \mu_i \sim N(0, \sigma_\mu^2), \]
where $i = 1, 2, \dots, 100$. The simulation results given in Table 5.2 are losses averaged over twenty runs. For each run, 100 data points are generated from the mixture distribution. Given these data points, the mixing distribution is estimated by the four procedures CHOI, CRAMER, PM and NNM, and then each $x_i$ is updated to $\hat x_i$ using (2.8). The overall loss for all updated estimates is calculated by (2.6), i.e., $\sum_{i=1}^{100} \|\hat x_i - \mu_i\|^2$.

Eleven cases with different values of $\sigma_\mu^2$ are studied. Generally, as $\sigma_\mu^2$ increases, the prediction losses increase. This is due to the decreasing number of neighboring points available to support the updating; see also Section 4.2. In almost every case, PM and NNM perform similarly, both outperforming CHOI and CRAMER by a clear margin.
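For the quadratic loss (2.6), the updating step is the posterior mean of $\mu_i$ given $x_i$ under the estimated mixing distribution; assuming that this is what (2.8) specifies (the formula itself lies outside this chapter), an R sketch of the update, in the style of the earlier sketches, is:

    ## Empirical Bayes updating of each x_i under an estimated discrete
    ## mixing distribution (support, weight): the posterior mean
    ## E[mu | x_i], the Bayes rule for the quadratic loss (2.6).
    eb.update <- function(x, support, weight, sd = 1) {
      sapply(x, function(xi) {
        p <- weight * dnorm(xi, mean = support, sd = sd)
        sum(p * support) / sum(p)
      })
    }

    ## Loss of one run, given the true means mu:
    ##   sum((eb.update(x, fit$support, fit$weight) - mu)^2)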
Experiment 5.3 Accuracy of clustering procedures for mixtures of $\chi^2_1$ distributions. In this experiment, we consider a mixture of noncentral $\chi^2_1$ distributions using the noncentrality parameter as the mixing one, i.e., the mixture (3.19) or (3.22). The underlying mixing distribution is a modified version of a design by Breiman (1995); see also Experiment 6.6. Since this experiment relates to linear regression, we adopt the notation used for pace regression in Chapter 3.
  sigma_mu^2    CHOI     CRAMER    PM       NNM

      0          4.00     3.93     3.27     3.95
      1         59.2     59.1     57.4     57.8
      2         75.6     75.5     73.4     73.8
      3         86.2     86.2     84.2     84.2
      4         89.1     88.8     87.0     87.0
      5         96.5     96.5     93.6     93.5
      6        100.0    100.0     97.7     97.8
      7        100.4    100.5     97.8     97.8
      8        102.3    101.9    100.3    100.3
      9        104.9    104.8    102.1    102.1
     10        104.9    104.8    102.0    102.0

Table 5.2: Experiment 5.2. Average loss of clustering procedures over mixtures of normal distributions.
Breiman considers two clustering situations in terms of the $\beta_j$ ($j = 1, \dots, k$) values. One, with $k = 20$, has two clusters around 5 and 15 and a cluster at zero. The other, with $k = 40$, has three clusters around 10, 20 and 30 and a cluster at zero. We determine the $\beta_j$ in the same way as Breiman. Then we set $A_j^* = c\,\beta_j$ and $\theta_j^* = A_j^{*2}$, where $c$ is always $1/3$ for both cases. The value of $c$ is so chosen that the clusters are neither too far away from, nor too close to, each other.

The sample of the mixture distribution is the set $\{\hat\theta_j\}$, where each $\hat\theta_j = \hat A_j^2$ and $\hat A_j \sim N(A_j^*, 1)$. Each clustering procedure is applied to the given sample, $\hat\theta_1, \hat\theta_2, \dots, \hat\theta_k$, to produce an estimated mixing distribution $\hat G_k(\theta^*)$. Then each $\hat\theta_j$ is updated by PACE$_6$ to obtain $\tilde\theta_j$. The loss function (3.4) in pace regression, that is, the quadratic loss function (2.6), is used to obtain the true prediction loss for each estimate. The signs of $\tilde A_j$ and $A_j^*$ are taken into account when calculating the loss, where $A_j^*$ is always positive and $\tilde A_j$ has the same sign as $\hat A_j$. Therefore, the overall loss is $\sum_{j=1}^{k} \|\tilde A_j - A_j^*\|^2$.
runs. The variable V�W (sameasusedby Breiman)is the radiusof eachclusterin
the true mixing distribution, i.e., larger V�W implies moredatapoints in eachclus-
ter. It canbe observed from theseresultsthat NNM (andprobablyPM aswell) is
alwaysa safemethodto apply. For small V�W , bothCHOI andCRAMER produceless
119
V�W CHOI CRAMER PM NNMD � & �1 15.9 14.1 9.4 11.12 13.3 13.1 13.2 13.03 15.3 15.7 16.1 16.14 17.5 18.0 18.5 18.55 19.6 19.9 20.5 20.4D �XE �1 55.0 48.8 17.4 17.42 29.7 29.9 26.5 26.73 32.3 32.5 33.9 33.84 36.7 37.0 38.4 38.45 40.7 40.8 41.4 41.4
Table5.3: Experiment5.3. Averagelossesof clusteringproceduresover mixturesof 3 � þ distributions.
accurateor sometimesvery bad results—notethat in this experimentdatapoints
are not easily separable.For large V�W , however, CHOI and CRAMER do seemto
produceslightly moreaccurateresultsthantheothertwo. This suggeststhat CHOI
andCRAMER may have a small advantagein predictive accuracy whenclustering
is within a relatively small areaandthereareno small clustersin this area. This
slightly betterperformanceis probablydueto CHOI and CRAMER’s useof larger
intervals,whenthereis no risk of suppressingsmallamountsof datapoints.
5.7 Summary

The idea exploited in this chapter is simple: a new minimum distance method which is able to reliably detect small groups of data points, including isolated ones. Reliable detection in this case is important in empirical modeling. Existing minimum distance methods do not cope well with this situation.

The maximum likelihood approach does not seem to suffer from this problem. Nevertheless, it often incurs a higher computational cost than the minimum distance approach. For many applications, fast estimation is important. When pace regression is embedded in a more sophisticated estimation task, it may be necessary to fit linear models repeatedly.

The predictive accuracy of minimum distance estimators is also empirically investigated in Section 5.6 for situations with overlapping clusters. Although further investigation is needed, NNM seems generally preferable.

Despite the simplicity of the underlying idea, most of this chapter is devoted to a theoretical aspect of the new estimators: strong consistency. This is a necessary condition for establishing the asymptotic properties of pace regression, and gives a guarantee of large sample behaviour.
Chapter 6

Simulation studies

6.1 Introduction

It is time for some experimental results to illustrate the idea of pace regression and give some indication of how it performs in practice. The results are very much in accordance with the theoretical analysis in Chapter 3. In addition, they illustrate some effects that have been discussed in previous chapters.
We will test AIC, BIC, RIC, CIC (in the form (2.5)), PACE$_2$, PACE$_4$, and PACE$_6$. The OLS full model and the null model, which are actually generated by the procedures OLS (being OLSC(0)) and OLSC($\infty$), are included for comparison. Procedures PACE$_1$, PACE$_3$, and PACE$_5$ are not included because they involve numerical solution and can be approximated by PACE$_2$, PACE$_4$, and PACE$_6$, respectively.
Two shrinkage methods, NN-GAROTTE and LASSO,1 are used in Experiments 6.1, 6.2 and 6.7. Because the choice of the shrinkage parameter in each case is somewhat controversial, we use the best estimate of this parameter obtained by data resampling; more precisely, five-fold cross-validation over an equally-spaced grid of twenty-one discrete values of the shrinkage ratio. This choice is computationally intensive, which is why results from these methods are not included in all experiments. The Stein estimation version of LASSO (Tibshirani, 1996) is used in Experiments 6.1 and 6.2, due to the heavy time consumption of the cross-validation version.2 It is acknowledged that in these two experiments the cross-validation version may yield better results, as suggested in Experiment 6.7.

1 The procedure LASSO is downloaded from StatLib (http://lib.stat.cmu.edu/S/) and NN-GAROTTE from Breiman's ftp site at UCB (ftp://ftp.stat.berkeley.edu/pub/users/breiman).

2 In our simulations, NN-GAROTTE takes about two weeks on our computer to obtain the results for Experiments 6.1 and 6.2, while LASSO(CV) appears to be several times slower.
We use the partial-F test and backward elimination of variables to set up the orthogonalization, and employ the unbiased $\sigma^2$ estimate of the OLS full model. The non-central $\chi^2$ distribution is used instead of the $F$-distribution for computational reasons. To estimate the mixing distribution $G(\theta^*)$, we adopt the minimum distance procedure based on the nonnegative measure (NNM) given by (5.19), or equivalently (5.28), in Chapter 5, along with the quadratic programming algorithm NNLS provided by Lawson and Hanson (1974, 1995). Support points are the set $\{\hat\theta_j\}$ plus zero (if there is any $\hat\theta_j$ near zero), except that points lying in the gap between zero and three are discarded in order to simplify the model. This gap helps the selection criteria PACE$_2$ and PACE$_4$ to properly eliminate dimensions in situations with redundant variables, because when the $h$-function's sign near the origin is negative (which is crucial for both to eliminate redundant dimensions) it could be incorrectly estimated to be positive (note that $h(0; \theta^*) > 0$ when $\theta^* \ge 0.5$). This choice is, of course, slightly biased towards situations with redundant variables, but does not sacrifice much prediction accuracy in any situation, and thus can be considered an application of the simplicity principle (Section 1.2). For PACE$_6$, $\hat\theta_j$'s smaller than 0.5 are discarded to simplify the model as well (see also Section 4.5). These are rough and ready decisions, taken for practical expediency. For more detail, refer to the source code of the S-PLUS implementation in the Appendix.
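The support-point screening just described might be sketched as follows; this is our reading of the description, with the "near zero" cutoff a hypothetical choice, and the Appendix code is authoritative.

    ## Screen candidate support points as described above: drop points in
    ## the gap (0, 3), and add a support point at zero if any estimate
    ## falls near zero (the 0.5 cutoff below is a hypothetical choice).
    screen.support <- function(theta.hat, gap.hi = 3, near.zero = 0.5) {
      kept <- theta.hat[theta.hat >= gap.hi]
      if (any(theta.hat < near.zero)) kept <- c(0, kept)
      sort(kept)
    }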
Recalling the modeling principles that we discussed in Section 1.2, our interest focuses mainly on predictive accuracy; the estimated model complexity appears as a natural by-product.
Both artificial and practical datasets are included in our experiments. One main advantage of using artificial datasets is that the experiments are completely under control. Since the true model is known, the prediction errors can be obtained accurately and the estimated model complexity can be compared with the true one. Furthermore, we can design experiments that cover any situations of interest.
Section 6.2 considers artificial datasets that contain only statistically independent variables. Hence the column vectors of the design matrix form a nearly, yet not completely, orthogonal basis.

Section 6.3 investigates two more realistic situations using artificial datasets. In one, studied by Miller (1990), the effects of the variables are geometrically distributed. The other considers clusters of coefficients, following an experiment of Breiman (1995).

As any modeling procedure should finally be applied to solve real problems, Section 6.4 extends the investigation to real world datasets. Twelve practical datasets are used to compare the modeling procedures using cross-validation.
6.2 Artificial datasets: An illustration

To illustrate the ideas of pace regression, in this section we give three experimental examples that compare pace regression with other procedures in terms of both prediction accuracy and model complexity, using artificial datasets with statistically independent variables. In the first, the contributive variables have small effects; in the second they have large effects. The third example tests the influence of the number of candidate variables on pace regression. Results are given in the form of both tables and graphs.
Experiment 6.1 Variables with small effects. The underlying model is $y = \beta_0 + \sum_{j=1}^{k} \beta_j x_j + N(0, \sigma^2)$, where $\beta_0 = 0$, $x_j \sim N(0, 1)$, $\mathrm{Cov}(x_i, x_j) = 0$ for $i \ne j$, and $\sigma^2 = 200$. Each parameter $\beta_j$ is either 1 or 0, depending on whether it has an effect or is redundant. For each sample, $n = 1000$ and $k = 100$ (excluding $\beta_0$, which is always included in the model). The number of non-redundant variables $k^*$ is set to $0, 10, 20, \dots, 100$. For each value, the result is the average of twenty runs over twenty independent training samples, tested on a large independent set. This test set is generated from the same model structure with 10,000 observations and zero noise variance, and hence can be taken as the true model. The results are shown in Figure 6.1 and recorded in Table 6.1.

  k*            0     10    20    30    40    50    60    70    80    90    100

 Dim
  NN-GAR.       0.6   12.8  27.2  40.9  50.5  64.7  75.2  87.6  96.7  98.5  99.6
  LASSO(Stein)  47.4  47.2  51.0  56.0  60.0  64.2  67.6  72.8  77.4  78.6  81.5
  AIC           14.8  21.4  28.1  34.5  40.6  46.9  53.4  59.9  65.2  71.2  77.6
  BIC           1.1   4.2   7.4   11.1  14.5  16.7  20.1  22.9  26.6  30.1  32.5
  RIC           0.2   1.6   3.7   6.0   7.9   10.4  12.2  14.6  17.9  19.6  21.8
  CIC           0.0   0.2   1.3   2.6   28.4  72.0  100   100   100   100   100
  PACE2         0.0   4.6   18.6  37.8  59.6  84.0  95.7  99.4  100   100   100
  PACE4         0.0   4.6   18.8  37.9  59.6  84.2  95.8  99.4  100   100   100
  PACE6         0.6   8.9   22.1  34.0  47.6  62.0  72.8  87.2  94.5  99.1  100

 PE
  NN-GAR.       0.4   7.3   11.3  14.4  17.6  20.1  21.9  23.0  23.1  23.0  22.9
  LASSO(Stein)  5.69  7.46  10.1  12.5  15.7  18.9  22.6  25.6  29.0  35.8  42.6
  full          21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6  21.6
  AIC           11.2  13.1  15.6  18.2  20.9  23.8  26.4  28.8  31.1  33.5  35.7
  BIC           1.7   9.1   16.9  24.5  31.8  41.0  49.4  57.3  65.6  73.3  82.1
  RIC           0.5   9.3   18.2  27.5  36.6  45.6  55.4  64.6  72.6  83.5  92.5
  CIC           0.0   10.1  19.7  29.3  33.9  29.3  21.6  21.6  21.6  21.6  21.6
  PACE2         0.0   9.8   15.6  19.3  21.1  22.1  21.8  21.6  21.6  21.6  21.6
  PACE4         0.0   9.8   15.6  19.3  21.3  22.2  22.0  22.0  21.8  21.8  21.7
  PACE6         0.1   7.0   10.6  13.2  15.4  16.9  17.1  16.5  15.5  14.5  10.7

Table 6.1: Experiment 6.1. Average dimensionality and prediction error of the estimated models.
Figure 6.1(a) and Figure 6.1(b) show the dimensionality (Dim) of the estimated models and their prediction errors (PE), respectively; the horizontal axis is $k^*$ in both cases. In Figure 6.1(a), the solid line gives the size of the underlying models. Since prediction accuracy rather than model complexity is used as the standard for modeling, the best estimated model does not necessarily have the same complexity as the underlying one; dimensionality reduction is merely a by-product of parameter estimation. Models generated by PACE$_2$, PACE$_4$ and PACE$_6$ find the underlying null hypothesis $H_0$ and the underlying full hypothesis $H_k$ correctly; their seeming inconsistency between these extremes is in fact necessary for the estimated models to produce optimal predictions. AIC overfits $H_0$ and underfits $H_k$. BIC and RIC both fit $H_0$ well, but dramatically underfit all the other hypotheses, including $H_k$. CIC successfully selects both $H_0$ and $H_k$ but either underfits or overfits the models in between. NN-GAROTTE generally chooses slightly larger models than PACE$_6$, but LASSO(Stein) significantly overfits for small $k^*$ and underfits for large $k^*$.

In Figure 6.1(b), the vertical axis represents the average mean squared error in predicting independent test sets. The models built by BIC and RIC have errors nearly as great as the null model. AIC is slightly better for $H_k$ than BIC and RIC, but fails on $H_0$. CIC eventually coincides with the full model as $k^*$ increases, and produces large errors for some model structures between $H_0$ and $H_k$. It is interesting to note that PACE$_2$ always performs as well as the best of OLSC($\infty$) (the null model), RIC, BIC, AIC, CIC, and OLS (the full model): no OLS subset selection procedure produces models that are sensibly better. Recall that PACE$_2$ selects the optimal subset model from the sequence, and in this respect resembles PACE$_1$, which is the optimal threshold-based OLS subset selection procedure OLSC($\tau^*$). PACE$_4$ performs similarly to PACE$_2$. Remarkably, PACE$_6$ outperforms PACE$_2$ and PACE$_4$ by a large margin, even when there are no redundant variables. The shrinkage methods NN-GAROTTE and LASSO(Stein) are competitive with PACE$_2$ and PACE$_4$ (better for small $k^*$ and worse for large $k^*$), but can never outperform PACE$_6$ in any practical sense, which is consistent with Corollary 3.10.2 and Theorem 3.11.

[Figure 6.1 appears here: panel (a) plots the dimensionality, and panel (b) the prediction error, of the models chosen by each procedure against $k^*$.]

Figure 6.1: Experiment 6.1, $k = 100$, $\sigma^2 = 200$. (a) Dimensionality. (b) Prediction error.
  k*            0     10    20    30    40    50    60    70    80    90    100

 Dim
  NN-GAR.       0.6   21.0  32.6  45.7  57.1  67.2  78.0  87.0  94.3  98.5  100
  LASSO(Stein)  47.4  46.8  52.8  59.0  64.2  71.0  76.6  82.6  88.3  93.7  92.3
  AIC           14.8  23.4  32.0  40.5  49.2  58.2  66.4  75.3  83.0  91.4  99.7
  BIC           1.1   10.6  20.1  29.4  39.0  48.4  57.8  67.2  75.9  85.4  94.0
  RIC           0.2   9.6   18.8  27.4  36.6  45.3  54.0  62.5  70.3  78.0  86.0
  CIC           0.0   9.6   21.1  72.4  100   100   100   100   100   100   100
  PACE2         0.0   10.3  21.4  31.5  42.0  53.0  63.2  74.0  83.4  93.3  100
  PACE4         0.0   10.3  21.4  31.5  42.1  53.0  63.4  74.0  83.5  93.3  100
  PACE6         0.6   11.5  22.1  32.6  43.0  54.2  63.9  74.3  83.4  92.5  100

 PE
  NN-GAR.       0.10  1.35  2.12  2.83  3.61  4.17  4.72  5.27  5.56  5.70  5.71
  LASSO(Stein)  1.42  1.91  2.65  3.43  4.43  5.27  6.15  7.00  7.94  9.61  35.60
  full          5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40  5.40
  AIC           2.80  3.00  3.27  3.53  3.85  4.16  4.40  4.70  4.95  5.32  5.56
  BIC           0.43  1.00  1.59  2.71  3.44  4.50  5.62  6.35  7.88  8.76  10.5
  RIC           0.12  1.00  2.03  3.82  5.04  6.81  8.59  10.3  13.1  15.8  18.3
  CIC           0.00  0.99  1.83  4.31  5.40  5.40  5.40  5.40  5.40  5.40  5.40
  PACE2         0.00  1.06  1.71  2.40  2.95  3.56  3.96  4.57  5.03  5.29  5.40
  PACE4         0.00  1.06  1.71  2.40  2.92  3.58  3.98  4.57  5.04  5.29  5.40
  PACE6         0.03  0.80  1.19  1.82  2.31  2.66  2.92  3.24  3.28  3.22  2.48

Table 6.2: Experiment 6.2. Average dimensionality and prediction error of the estimated models.
Experiment 6.2 Variables with large effects. In this experiment the underlying model is the same as in the last experiment, except that $\sigma^2 = 50$; results are shown in Figure 6.2 and Table 6.2.
The results are similar to those in the previous example. As Figure 6.2(a) shows, the models generated by PACE$_2$, PACE$_4$ and PACE$_6$ lie closer to the line of underlying models than in the last example. AIC generally overfits by an amount that decreases as $k^*$ increases. RIC and BIC generally underfit by an amount that increases as $k^*$ increases. CIC settles on the full model earlier than in the last example. NN-GAROTTE and LASSO(Stein) generally choose models of larger dimensionality than PACE$_2$, PACE$_4$ and PACE$_6$.

In terms of prediction error (Figure 6.2(b)), PACE$_2$ and PACE$_4$ are still the best of all the OLS subset selection procedures, and are almost always better and never meaningfully worse than NN-GAROTTE and LASSO(Stein) (see also the discussion of shrinkage estimation in situations with different variable effects, Example 3.1), while PACE$_6$ is significantly superior. When there are no redundant variables, PACE$_6$ chooses a full-sized model but uses different coefficients from OLS, yielding a model with much smaller prediction error than the OLS full model. This defies conventional wisdom, which views the OLS full model as the best possible choice when all variables have large effects.

[Figure 6.2 appears here: panel (a) plots the dimensionality, and panel (b) the prediction error, of the models chosen by each procedure against $k^*$.]

Figure 6.2: Experiment 6.2, $k = 100$, $\sigma^2 = 50$. (a) Dimensionality. (b) Prediction error.
Experiment 6.3 Rate of convergence. Our third example explores the influence of the number of candidate variables. The value of $k$ is chosen to be $10, 20, \dots, 100$ respectively, and for each value $k^*$ is chosen as $0$, $k/2$ and $k$; otherwise the experimental conditions are as in Experiment 6.1. Note that variables with small effects are harder to distinguish in the presence of noisy variables. Results are shown in Figure 6.3 and recorded in Table 6.3.
The pace regression procedures are always among the best in terms of prediction error, a property enjoyed by none of the conventional procedures. None of the pace procedures suffers noticeably as the number of candidate variables $k$ decreases. Apparently, they remain stable even when $k$ is as small as ten.
  k              10    20    30    40    50    60    70    80    90    100

 Dim (k* = 0)
  AIC            1.7   2.5   5.0   5.8   7.8   8.2   12.3  10.9  13.2  14.8
  BIC            0.0   0.2   0.3   0.2   0.2   0.5   0.8   0.5   0.9   1.1
  RIC            0.2   0.3   0.3   0.2   0.2   0.1   0.2   0.2   0.3   0.2
  CIC            0.0   0.1   0.0   0.0   0.0   0.0   0.1   0.0   0.0   0.0
  PACE2          0.9   0.3   0.3   0.2   0.1   0.1   0.1   0.1   0.1   0.0
  PACE4          0.9   0.3   0.3   0.2   0.1   0.1   0.1   0.1   0.1   0.0
  PACE6          0.8   0.7   1.1   0.6   0.3   0.6   1.4   0.5   0.6   0.6

 PE (k* = 0)
  AIC            1.1   2.0   3.5   4.4   5.8   6.1   9.5   8.5   10.7  11.2
  BIC            0.0   0.5   0.5   0.5   0.5   0.7   1.4   0.8   1.6   1.7
  RIC            0.3   0.6   0.5   0.5   0.4   0.3   0.6   0.4   0.7   0.5
  CIC            0.0   0.1   0.0   0.0   0.0   0.0   0.2   0.0   0.0   0.0
  PACE2          0.5   0.5   0.4   0.4   0.1   0.2   0.2   0.1   0.1   0.0
  PACE4          0.5   0.5   0.4   0.4   0.1   0.2   0.2   0.1   0.1   0.0
  PACE6          0.2   0.2   0.3   0.2   0.2   0.1   0.4   0.1   0.2   0.1

 Dim (k* = k/2)
  AIC            4.8   8.9   14.6  18.8  23.4  27.2  33.2  37.0  40.2  46.9
  BIC            2.0   3.5   6.2   6.5   7.9   10.6  12.0  13.3  13.9  16.7
  RIC            3.0   4.2   6.2   5.8   6.7   8.0   8.3   9.7   8.7   10.4
  CIC            0.0   0.1   0.0   0.0   0.0   0.0   0.1   0.0   0.0   0.0
  PACE2          7.2   14.6  22.3  32.8  41.0  45.9  57.9  59.1  65.6  84.0
  PACE4          7.2   14.6  22.3  32.8  41.0  46.0  58.0  59.1  65.8  84.2
  PACE6          5.9   11.1  18.2  24.8  31.4  34.0  44.8  46.6  51.0  62.0

 PE (k* = k/2)
  AIC            2.3   4.7   7.2   9.0   11.4  13.8  17.1  18.1  22.0  23.8
  BIC            3.5   8.0   11.0  15.8  20.2  22.9  27.8  32.2  37.3  41.0
  RIC            2.8   7.5   10.9  16.5  20.9  24.7  30.4  35.3  41.8  45.6
  CIC            2.8   5.4   8.0   11.3  12.5  16.9  19.7  21.6  29.4  29.3
  PACE2          2.1   4.5   6.8   8.3   10.7  12.6  16.0  16.9  19.4  22.1
  PACE4          2.1   4.5   6.8   8.3   10.8  12.7  16.2  16.9  19.5  22.2
  PACE6          1.8   3.4   5.3   6.8   8.3   9.6   12.2  12.9  15.9  16.9

 Dim (k* = k)
  AIC            7.7   15.6  23.6  31.0  38.6  45.9  52.0  60.9  67.5  77.6
  BIC            3.3   7.2   11.1  13.2  15.2  19.9  22.1  24.4  27.4  32.5
  RIC            5.3   8.5   11.3  12.3  13.1  15.6  16.6  18.1  18.8  21.8
  CIC            10.0  20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100
  PACE2          9.4   20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100
  PACE4          9.4   20.0  30.0  40.0  50.0  60.0  70.0  80.0  90.0  100
  PACE6          9.2   20.0  29.9  39.6  49.9  60.0  70.0  80.0  90.0  100

 PE (k* = k)
  full           2.0   4.0   6.4   8.2   10.4  12.1  16.2  16.6  19.1  21.6
  AIC            3.5   7.0   10.3  13.8  17.1  21.2  26.7  28.3  33.1  35.7
  BIC            7.7   15.3  22.8  31.3  40.5  47.2  56.4  65.7  74.0  82.1
  RIC            5.6   13.9  22.6  32.1  42.5  51.3  61.9  71.8  82.4  92.5
  CIC            2.0   4.0   6.4   8.2   10.4  12.1  16.2  16.6  19.1  21.6
  PACE2          2.3   4.0   6.4   8.2   10.4  12.1  16.2  16.6  19.1  21.6
  PACE4          2.3   4.0   6.4   8.3   10.6  12.3  16.4  16.9  19.4  21.7
  PACE6          1.2   1.0   2.6   2.7   4.5   5.2   7.8   8.0   10.4  10.7

Table 6.3: Experiment 6.3, for the true hypotheses $k^* = 0$, $k/2$, and $k$ respectively.
[Figure: six panels plotting, against k*, the dimensionalities of the estimated models (left column) and the corresponding prediction errors (right column), for the true/full model, AIC, BIC, RIC, CIC, and PACE2, PACE4, PACE6.]

Figure 6.3: Experiment 6.3. (a), (c) and (e) show the dimensionalities of the estimated models when the underlying models are the null, half, and full hypotheses. (b), (d) and (f) show corresponding prediction errors over large independent test sets.
6.3 Artificial datasets: Towards reality

Miller (1990) considers an artificial situation for subset selection that assumes a "geometric progression of values of the true projections", i.e., the geometric distribution of A*. Despite the fact that, as he points out (p.181), "there is no evidence that this pattern is realistic", it is probably closer to reality than the models in the last section, as evidenced by the distribution of eigenvalues of a few practical datasets given in Table 6.2 of Miller (1990). Hence this section tests how pace regression compares with other methods in this artificial situation. Real-world datasets will be tested in the next section.

We know that the extent of upgrading each Â_j by pace regression basically depends on the number of Â's around it. In particular, PACE2 and PACE4, both essentially selection procedures, rely more on the number around the true threshold value τ*. Experiment 6.4 uses a situation in which there are relatively more A*_j around τ*, so that PACE2 and PACE4 should perform similarly to how they would if τ* were known. Experiment 6.5 roughly follows Miller's idea, in which case, however, few A*_j are around τ*, and hence pace regression will produce less satisfactory results.

Breiman (1995) investigates another artificial situation in which the nonzero true coefficient values are distributed in two or three groups, and the column vectors of X are normally, but not independently, distributed. Experiment 6.6 follows this idea.

For the experiments in this section, we include results for both the x-fixed and x-random situations. From these results, it can be seen that all procedures perform almost identically in both situations. We should not conclude, of course, that this will remain true in other cases, in particular for small n.

Experiment 6.4  Geometrically distributed effects. We now investigate a simple geometric distribution of effects. Except where described below, all other options in this experiment are the same as used previously.
[Figure: the mixture h-function of Experiment 6.4, plotted over Â in [0, 35]; its values range roughly from -0.06 to 0.08.]

Figure 6.4: h(Â; G) in Experiment 6.4.
The linear model with 100 explanatory variables is used, i.e.,

    y = Σ_{j=1}^{100} β_j x_j + ε,    (6.1)

where the noise component ε ~ N(0, σ²). The probability density function of Â is determined by the following mixture model:

    f(Â; G) = .50 f(Â; 0) + .25 f(Â; 1) + .13 f(Â; 4) + .06 f(Â; 9)
              + .03 f(Â; 16) + .02 f(Â; 25) + .01 f(Â; 100),    (6.2)

i.e., A* = 0, 1, 4, 9, 16, 25 and 100 with probabilities .50, .25, .13, .06, .03, .02 and .01 respectively.
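To make the mixture concrete, here is a minimal R sketch of (6.2), under the assumption that each component f(·; A*) is the noncentral chi-squared density with one degree of freedom and noncentrality A*; the names fmix and rmix are ours, not the thesis code:

    Astar <- c(0, 1, 4, 9, 16, 25, 100)
    prob  <- c(.50, .25, .13, .06, .03, .02, .01)
    fmix <- function(a)                  # density of A-hat at the points a
        sapply(a, function(ai) sum(prob * dchisq(ai, 1, ncp = Astar)))
    rmix <- function(m)                  # m random draws from the mixture
        rchisq(m, 1, ncp = sample(Astar, m, replace = TRUE, prob = prob))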
The corresponding mixture h-function is shown in Figure 6.4. It can be seen that τ* ≈ 2.5 for OLSC(τ). Hence this situation seems favorable to AIC, since (k-asymptotically) AIC = OLSC(2), which is nearly the optimal selection procedure here. Since here BIC = OLSC(log n) = OLSC(6.9) and RIC = OLSC(2 log k) = OLSC(9.2), both should be inferior to AIC.
The effects A* = 0, 1, 4, 9, 16, 25, 100 are converted to parameters through the expression

    β_j = sqrt(σ² A*_j / n).    (6.3)

Procedure      x-fixed model     x-random model
               Dim      PE       Dim      PE
full (OLS)    100.0     1.03    100.0     1.11
null            0.0     2.73      0.0     2.68
AIC            37.8     0.96     37.2     1.03
BIC            15.4     1.09     14.6     1.16
RIC            11.3     1.21     10.9     1.30
CIC            13.7     1.15     12.4     1.26
PACE2          45.2     1.00     41.2     1.05
PACE4          45.3     1.03     41.4     1.08
PACE6          39.2     0.80     36.7     0.84

Table 6.4: Results of Experiment 6.4.
We let x_j ~ N(0, 1) independently for all j. For each training set, let σ² = 10 and the number of observations n = 1000. Then the true model becomes

    y = 0 (x₁ + ... + x₅₀) + 0.1 (x₅₁ + ... + x₇₅) + 0.2 (x₇₆ + ... + x₈₈)
        + 0.3 (x₈₉ + ... + x₉₄) + 0.4 (x₉₅ + x₉₆ + x₉₇) + 0.5 (x₉₈ + x₉₉)
        + 1.0 x₁₀₀ + ε.    (6.4)

To test the estimated model, two situations are considered: x-fixed and x-random. In the x-fixed situation, the training and test sets share the same X matrix, but only the training set contains the noise component ε. In the x-random situation, a large, independent, noise-free test set (with a different X from the training set) is generated, and the coefficients in (6.4) are used to obtain the response vector.
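For concreteness, one such training set can be sketched in R as follows; the sizes follow the description above, and placing the zero-effect variables first is merely for readability:

    n <- 1000; k <- 100; s2 <- 10
    Astar <- rep(c(0, 1, 4, 9, 16, 25, 100), c(50, 25, 13, 6, 3, 2, 1))
    beta  <- sqrt(s2 * Astar / n)     # (6.3): 0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0
    X <- matrix(rnorm(n * k), n, k)   # x_j ~ N(0,1), independently
    y <- X %*% beta + rnorm(n, sd = sqrt(s2))     # the true model (6.4)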
Table 6.4 presents the experimental results. Each figure in the table is the average of 100 runs. The model dimensionality Dim is the number of remaining variables, excluding the intercept, in the trained model. PE is the mean squared prediction error of the trained model over the test set.

As shown in Table 6.4, the null model is clearly the worst in terms of prediction
[Figure: four curves, one for each of k = 10, 20, 30, 40, plotted over [0, 10].]

Figure 6.5: The mixture h-function for k = 10, 20, 30, 40 predictors, and A*_j = (10 ρ^{j-1})² with ρ = 0.8.
error. The full model is relatively good, even better than BIC, RIC and CIC; this is due to the high threshold values employed by the latter. AIC and the three pace estimators are better than the full model. AIC, PACE2 and PACE4 perform similarly: AIC is slightly better, because the other two have to estimate the threshold value while it uses a nearly optimal one. PACE6 beats all others by a clear margin, although it chooses models of similar dimensionality to AIC, PACE2 and PACE4.

Experiment 6.5  Geometrically distributed effects, Miller's case. We now look at Miller's (1990, pp.177-181) case where effects are distributed geometrically. In our experiment, however, we examine the situation with independent explanatory variables, whereas Miller (pp.186-187) performs similarity transformations on the variance-covariance matrix. Correlated explanatory variables will be inspected in Experiments 6.6 and 6.7.

Miller's assignment of the effects is equivalent to this: for a model with k predictors (excluding the constant term), the underlying A*_j = (10 ρ^{j-1})² for j = 1, ..., k. In particular, he considers the situations k = 10, 20, 30 or 40 and ρ = 0.8 or 0.6; i.e., for ρ = 0.8, A*_j = 100, 64, 41, 26, 17, 11, 6.9, 4.4, 2.8, 1.8, ..., and for ρ = 0.6,
[Figure: four curves, one for each of k = 10, 20, 30, 40, plotted over [0, 10].]

Figure 6.6: The mixture h-function for k = 10, 20, 30, 40 predictors, and A*_j = (10 ρ^{j-1})² with ρ = 0.6.
            ρ = 0.8                        ρ = 0.6
       k = 10  k = 20  k = 30  k = 40   k = 10  k = 20  k = 30  k = 40
τ*        0      0.8     3.5     5        2       5.5     6.5     7.5

Table 6.5: τ* for ρ = 0.8 and 0.6 respectively in Experiment 6.5.
A*_j = 100, 36, 13, 4.7, 1.7, 0.60, 0.22, .... Figures 6.5 and 6.6 show the mixture h-functions for the ρ = 0.8 and ρ = 0.6 situations respectively. From both figures, we can roughly estimate the values of τ* for each situation, given in Table 6.5. When the threshold value of a model selection criterion is near τ*, that criterion should resemble the optimal selection criterion in performance. This observation helps to explain the experimental results.

As in Experiment 6.4, each β_j is chosen according to (6.3), where n = 1000 and σ² = 10. Therefore the model for ρ = 0.8 is of the form

    y = x₁ + 0.8 x₂ + 0.64 x₃ + 0.512 x₄ + 0.41 x₅ + 0.33 x₆ + 0.26 x₇
        + 0.21 x₈ + 0.17 x₉ + 0.13 x₁₀ + ... + 0.8^{k-1} x_k + ε,    (6.5)

and the model for ρ = 0.6 is of the form

    y = x₁ + 0.6 x₂ + 0.36 x₃ + 0.22 x₄ + 0.13 x₅ + 0.078 x₆ + 0.047 x₇
        + 0.028 x₈ + 0.017 x₉ + 0.010 x₁₀ + ... + 0.6^{k-1} x_k + ε.    (6.6)

Training and test sets are generated as in Experiment 6.4.
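In R, Miller's assignment amounts to the following; with n = 1000 and σ² = 10 as above, (6.3) reduces the coefficients to β_j = ρ^{j-1}:

    k <- 40; rho <- 0.8                  # or rho <- 0.6
    Astar <- (10 * rho^(0:(k - 1)))^2    # 100, 64, 41, 26, 17, 11, ...
    beta  <- sqrt(10 * Astar / 1000)     # = rho^(j-1): 1, 0.8, 0.64, 0.512, ...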
The experimental results are given in Tables 6.6 and 6.7 for ρ = 0.8 and 0.6 respectively. Among all these results, PACE6 is clearly the best: it never (sensibly) loses to any other procedure, and often wins by a clear margin. In some cases, procedures PACE2 and PACE4 are inferior to the best selection criterion, despite the fact that, theoretically, they are k-asymptotically the best selection criteria. This is the price that pace regression pays for estimating G(A*) in the finite-k situation. They can be slightly inferior to other selection procedures whose threshold values are near τ*; nevertheless, no individual one of the OLS, null, AIC, BIC = OLSC(log n), RIC = OLSC(2 log k) and CIC selection criteria can always be optimal.

The accuracy of the pace estimators relies on the accuracy of the estimated G(A*), which in turn depends on the number of predictors (i.e., the number of A_j's); more precisely, the number of A_j's around a particular point of interest, for example τ* for PACE2 and PACE4. In this experiment, there are often few data points close enough to τ* to provide enough information for accurately estimating G(A*) around τ*. In the cases ρ = 0.6 and k = 30 and 40, for example, τ* ≈ 6.5 and 7.5, but we only have A*_j = 100, 36, 13, 4.7, 1.7, 0.60, 0.22, .... For samples distributed like this, the clustering result around τ* is unlikely to be accurate, implying higher expected loss for the pace estimators.

Experiment 6.6  Breiman's case. This experiment explores a situation considered by Breiman (1995, pp.379-382).

According to Breiman (1995, p.379), the "values of the coefficients were renormalized so that in the x-controlled case, the average R² was around .85, in the x-random, about .75." Since details of the "re-normalization" are not given, the
Procedure      x-fixed model     x-random model
               Dim      PE       Dim      PE
k = 10
 full (OLS)    10       0.11     10       0.11
 null           0       2.79      0       2.72
 AIC           8.61     0.13     8.6      0.14
 BIC           6.53     0.24     6.61     0.24
 RIC           7.37     0.18     7.53     0.19
 CIC           9.87     0.11     9.81     0.12
 PACE2         9.45     0.12     9.44     0.12
 PACE4         10       0.11     10       0.11
 PACE6         10       0.11     10       0.11
k = 20
 full (OLS)    20       0.20     20       0.22
 null           0       2.80      0       2.74
 AIC           10.8     0.20     11.0     0.22
 BIC           6.83     0.27     6.71     0.29
 RIC           7.10     0.25     7.23     0.26
 CIC           10.6     0.21     10.6     0.23
 PACE2         12.2     0.21     12.1     0.23
 PACE4         12.2     0.21     12.1     0.23
 PACE6         11.5     0.20     11.3     0.22
k = 30
 full (OLS)    30       0.31     30       0.30
 null           0       2.75      0       2.74
 AIC           12.7     0.27     12.1     0.25
 BIC           7.06     0.27     7.15     0.27
 RIC           7.11     0.27     7.21     0.27
 CIC           8.40     0.26     8.42     0.26
 PACE2         13.4     0.28     11.6     0.26
 PACE4         13.4     0.28     11.6     0.26
 PACE6         12.4     0.23     11.3     0.23
k = 40
 full (OLS)    40       0.41     40       0.43
 null           0       2.78      0       2.79
 AIC           14.1     0.33     13.8     0.34
 BIC           7.17     0.28     6.96     0.31
 RIC           6.99     0.29     6.73     0.31
 CIC           7.52     0.29     7.26     0.31
 PACE2         11.2     0.32     11.1     0.33
 PACE4         11.3     0.31     11.2     0.33
 PACE6         11.5     0.26     11.3     0.28

Table 6.6: Experiment 6.5. Results for k = 10, 20, 30, 40 predictors, and A*_j = (10 ρ^{j-1})² with ρ = 0.8.
Procedure      x-fixed model     x-random model
               Dim      PE       Dim      PE
k = 10
 full (OLS)    10       0.11     10       0.11
 null           0       1.57      0       1.61
 AIC           4.92     0.10     5.13     0.11
 BIC           3.23     0.13     3.44     0.13
 RIC           3.77     0.12     3.96     0.11
 CIC           5.12     0.11     5.3      0.11
 PACE2         5.28     0.12     5.75     0.11
 PACE4         10       0.11     10       0.11
 PACE6         10       0.11     10       0.11
k = 20
 full (OLS)    20       0.20     20       0.22
 null           0       1.57      0       1.60
 AIC           6.35     0.16     6.56     0.17
 BIC           3.24     0.13     3.48     0.14
 RIC           3.5      0.13     3.76     0.13
 CIC           3.4      0.14     3.76     0.14
 PACE2         4.54     0.15     5.37     0.16
 PACE4         4.66     0.14     5.53     0.15
 PACE6         4.72     0.13     5.56     0.13
k = 30
 full (OLS)    30       0.31     30       0.30
 null           0       1.55      0       1.54
 AIC           8.5      0.22     7.92     0.20
 BIC           3.59     0.14     3.37     0.15
 RIC           3.61     0.14     3.41     0.15
 CIC           3.28     0.15     3.12     0.16
 PACE2         4.59     0.18     3.88     0.18
 PACE4         4.73     0.16     4.05     0.17
 PACE6         5.48     0.14     4.72     0.14
k = 40
 full (OLS)    40       0.41     40       0.43
 null           0       1.58      0       1.59
 AIC           10.2     0.29     9.71     0.29
 BIC           3.63     0.16     3.58     0.16
 RIC           3.52     0.16     3.44     0.16
 CIC           3.12     0.16     3.03     0.17
 PACE2         4.33     0.23     3.84     0.23
 PACE4         4.73     0.19     4.24     0.19
 PACE6         5.47     0.16     5.09     0.16

Table 6.7: Experiment 6.5. Results for k = 10, 20, 30, 40 predictors, and A*_j = (10 ρ^{j-1})² with ρ = 0.6.
               Case 1      Case 2      Case 3      Case 4      Case 5
Procedure      Dim   PE    Dim   PE    Dim   PE    Dim   PE    Dim   PE
x-fixed (k = 20, n = 40)
 full (OLS)    20   0.54   20   0.54   20   0.54   20   0.54   20   0.54
 null           0   3.65    0   4.00    0   4.88    0   5.76    0   6.58
 AIC           4.77 0.32   6.4  0.42   7.45 0.48   8.00 0.51   8.69 0.53
 BIC           3.01 0.21   3.97 0.36   5.07 0.46   5.86 0.53   6.38 0.58
 RIC           2.32 0.14   2.98 0.34   3.83 0.49   4.41 0.59   4.84 0.67
 CIC           2.25 0.14   3.04 0.37   3.92 0.53   5.09 0.62   6.01 0.64
 PACE2         3.47 0.20   5.46 0.41   6.89 0.51   8.19 0.57   9.41 0.58
 PACE4         3.62 0.20   5.74 0.40   7.22 0.51   8.61 0.56   9.72 0.58
 PACE6         3.76 0.16   5.46 0.34   6.99 0.46   8.17 0.51   8.99 0.54
x-random (k = 40, n = 80)
 full (OLS)    40   1.06   40   1.06   40   1.06   40   1.06   40   1.06
 null           0   0.41    0   2.60    0   3.22    0   3.85    0   4.43
 AIC           8.46 0.48   10.2 0.55   12.1 0.67   13.1 0.75   13.9 0.80
 BIC           3.94 0.30   5.51 0.38   6.74 0.52   7.55 0.64   8.06 0.71
 RIC           2.27 0.26   3.64 0.31   4.85 0.50   5.25 0.64   5.60 0.76
 CIC           1.22 0.30   3.38 0.31   4.47 0.53   5.17 0.67   6.03 0.76
 PACE2         3.50 0.36   5.83 0.41   8.31 0.60   10.8 0.75   12.7 0.80
 PACE4         4.55 0.34   6.45 0.40   9.08 0.59   11.7 0.74   13.4 0.80
 PACE6         5.04 0.27   7.37 0.33   9.72 0.47   11.5 0.60   13.0 0.66

Table 6.8: Experiment 6.6, both the x-fixed and x-random situations.
normalization factor r used here sets β*_j = r β_j, for j = 1, ..., k, so that

    R² ≈ r² Σ_{j=1}^k β_j² / (r² Σ_{j=1}^k β_j² + k),    (6.7)

that is,

    r²(n, k, R²) ≈ R² k / ((1 − R²) Σ_{j=1}^k β_j²).    (6.8)

Therefore, in the x-fixed case

    r²(40, 20, 0.85) ≈ 113 / Σ_{j=1}^k β_j²,    (6.9)

and in the x-random case

    r²(80, 40, 0.75) ≈ 120 / Σ_{j=1}^k β_j².    (6.10)
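The closed form (6.8) is a one-liner in R; the argument n is carried only to mirror the signature used above, since it cancels out of the formula. The two commented calls reproduce (6.9) and (6.10):

    r2 <- function(n, k, R2, beta) R2 * k / ((1 - R2) * sum(beta^2))
    # r2(40, 20, 0.85, beta)  is approximately  113 / sum(beta^2)
    # r2(80, 40, 0.75, beta)  is approximately  120 / sum(beta^2)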
Given this normalization, subset selection procedures seem to produce models of dimensionalities similar to those given in Breiman's paper. Each β_j is obtained by Breiman's method (p.379).

Experimental results are given in Table 6.8. The performance of the pace estimators is as in Experiment 6.5; PACE6 is the best of all. However, PACE2 and PACE4 are not optimal selection criteria in all situations. They are intermediate in performance to the other selection criteria: not the best, not the worst. While further investigation is needed to fully understand the results, a plausible explanation is that, as in Experiment 6.5, few data points are available to accurately estimate G(A*) around τ*.

6.4 Practical datasets
This section presents simulation studies on practical datasets.

Experiment 6.7  Practical datasets. The datasets used are chosen from several that are available. They are chosen for their relatively large number of continuous or binary variables, to highlight the need for subset selection and also in accordance with our k-asymptotic conclusions. To simplify the situation and focus on the basic idea, non-binary nominal variables originally contained in some datasets are discarded, and so are observations with missing values; future extensions of pace regression may take these issues into account.
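A plausible base-R rendering of this preprocessing step (the thesis's actual cleaning code is not shown; here a "binary" variable is taken to be a column with exactly two distinct values, and recoding such columns to 0/1 is omitted):

    prep <- function(d) {    # d: a data.frame
        keep <- sapply(d, function(v) is.numeric(v) || length(unique(v)) == 2)
        na.omit(d[, keep, drop = FALSE])   # drop nominal columns, then rows with NAs
    }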
Table 6.9 lists the names of these datasets, together with the number of observations n and the number of variables k in each dataset. The datasets Autos (Automobile), Cpu (Computer Hardware), and Cleveland (Heart Disease, Processed Cleveland)
         Autos    Bankbill  Bodyfat  Cholesterol  Cleveland  Cpu
n / k    159/16   71/16     252/15   297/14       297/14     209/7

         Horses   Housing   Ozone    Pollution    Prim9      Uscrime
n / k    102/14   506/14    203/10   60/16        500/9      47/16

Table 6.9: Practical datasets used in Experiment 6.7.
are obtained from the Machine Learning Repository at UCI [3] (Blake et al., 1998). Bankbill, Horses and Uscrime are from OzDASL [4] (Australasian Data and Story Library). Bodyfat, Housing (Boston) and Pollution are from StatLib [5]. Ozone is from Leo Breiman's Web site [6]. Prim9 is available from the S-PLUS package. Cholesterol is used by Kilpatrick and Cameron-Jones (1998). Most of these datasets are widely used for testing numeric predictors.
Pace regression procedures, together with others, are applied to these datasets without much scrutiny. This is less than optimal, and often inappropriate. For example, some datasets are intrinsically nonlinear, and/or the noise component is not normally distributed.

Although the results obtained are generally supportive, careful interpretation is necessary. First, the finite situation may not be large enough to support our k-asymptotic conclusions, or the elements in the set {Â_j} are too separated to support meaningful updating (Section 4.2). In fact, it is well known that no estimator is uniformly optimal in all finite situations. Second, the choice of these datasets may not be extensive enough, or they could be biased in some unknown way. Third, since the true model is unknown and thus cross-validation results are used instead, there is no consensus on the validity of data resampling techniques (Section 2.6).
The experimental results are given in Table 6.10. Each figure is the average of twenty runs of 10-fold cross-validation, where the "Dim" column gives the average estimated dimensionality (excluding the intercept) and the "PE(%)" column the average squared prediction error relative to the sample variance of the response variable (i.e., the null model should have roughly 100% relative error).

[3] http://www.ics.uci.edu/~mlearn/MLRepository.html
[4] http://www.maths.uq.edu.au/~gks/data/index.html
[5] http://lib.stat.cmu.edu/datasets/
[6] ftp://ftp.stat.berkeley.edu/pub/users/breiman
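The evaluation protocol can be sketched as follows, using the pace and predict.pace functions documented in the Appendix; this is an illustrative rendering, not the exact experimental harness:

    rel.pe <- function(x, y, runs = 20, folds = 10) {
        pe <- 0
        for (r in 1:runs) {
            fold <- sample(rep(1:folds, length.out = length(y)))
            sse <- 0
            for (f in 1:folds) {          # one run of 10-fold cross-validation
                fit <- pace(x[fold != f, ], y[fold != f])
                sse <- sse +
                    sum(predict(fit, x[fold == f, ], y[fold == f])$resid^2)
            }
            pe <- pe + sse / length(y)
        }
        100 * (pe / runs) / var(y)        # PE relative to the sample variance
    }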
             Autos          Bankbill       Bodyfat        Cholesterol
Procedure    Dim   PE(%)    Dim    PE(%)   Dim    PE(%)   Dim    PE(%)
NN-GAR.      10.60  21.5    10.39   7.45    8.10   2.76   10.38  101.1
LASSO         9.72  21.5     9.71   8.65    2.99   2.73    5.75   97.4
full (OLS)   15.00  21.4    15.00   7.86   14.00   2.75   13.00   97.1
null          0.00 100.5     0.00 101.01    0.00 100.48    0.00  100.4
AIC           7.82  23.6     9.64   8.76    3.38   2.71    4.31   97.4
BIC           3.87  23.9     8.73   8.99    2.24   2.67    2.54   97.6
RIC           3.57  23.6     8.02   9.78    2.24   2.67    2.65   97.0
CIC           5.00  24.5    12.54   8.56    2.23   2.68    2.23   99.9
PACE2        12.95  21.9    13.77   8.20    2.24   2.68    4.42   97.4
PACE4        12.99  21.8    13.80   8.20    2.25   2.68    4.50   97.3
PACE6        10.51  21.1    12.67   8.34    2.32   2.66    4.35   96.2

             Cleveland      Cpu            Horses         Housing
Procedure    Dim   PE(%)    Dim    PE(%)   Dim    PE(%)   Dim    PE(%)
NN-GAR.       9.60  47.6     5.10  25.5     4.90  109      7.50   28.0
LASSO         9.10  47.8     5.15  23.4     1.23   97      6.53   28.1
full (OLS)   13.00  48.2     6.00  18.4    13.00  106     13.0    28.3
null          0.00 100.3     0.00 100.7     0.00  102      0.0   100.3
AIC           8.24  50.2     5.06  18.0     3.87  108     11.0    28.0
BIC           5.31  51.3     4.94  17.9     2.50  106     10.7    28.8
RIC           5.41  51.3     5.00  17.7     2.14  108     10.8    28.5
CIC          11.46  49.1     6.00  18.4     0.29  107     11.9    28.2
PACE2        12.35  48.5     5.21  18.2     2.65  107     11.6    28.2
PACE4        12.35  48.6     5.21  18.2     2.68  108     11.6    28.2
PACE6        11.17  49.3     5.14  18.2     2.98  102     11.4    28.2

             Ozone          Pollution      Prim9          Uscrime
Procedure    Dim   PE(%)    Dim    PE(%)   Dim    PE(%)   Dim    PE(%)
NN-GAR.       5.78  32.2     7.69  69.0     6.90  58.3     8.81   60.9
LASSO         5.62  32.3     5.32  47.9     6.43  54.1     9.67   49.8
full (OLS)    9.00  31.6    15.00  60.8     7.00  53.5    15.00   52.1
null          0.00 100.6     0.00 102.1     0.00 100.2     0.00  102.9
AIC           5.54  31.6     7.24  58.3     7.00  53.5     6.91   49.3
BIC           3.60  34.3     5.04  65.3     4.13  54.6     5.26   49.3
RIC           4.24  33.5     4.51  66.3     5.87  54.8     3.81   52.7
CIC           5.92  31.7     5.64  67.1     7.00  53.5     5.00   53.4
PACE2         7.17  31.7     8.96  62.7     6.78  53.8     9.16   54.4
PACE4         7.17  31.7     9.06  62.9     6.85  53.8     9.21   54.9
PACE6         6.08  31.5     7.86  56.0     6.97  53.8     7.78   50.7

Table 6.10: Results for practical datasets in Experiment 6.7.
It can be seen from Table 6.10 that the pace estimators, especially PACE6, generally produce the best cross-validation results in terms of prediction error. This is consistent with the results on artificial datasets in the previous sections. LASSO also generally gives good results on these datasets, in particular Horses and Pollution, while NN-GAROTTE does not seem to perform so well.

6.5 Remarks

In this chapter, experiments have been carried out over many datasets, ranging from artificial to practical. Although these are all finite-k situations, the pace estimators perform reasonably well, especially PACE6, in terms of squared error, in accordance with the k-asymptotic conclusions.

We also notice that the pace estimators, mainly PACE2 and PACE4, perform slightly worse than the best of the other selection estimators in some situations. This is the price that pace regression has to pay for estimating G(A*) in finite situations. In almost all cases, given the true G(A*), the experimental results are predictable.

Unlike the other modeling procedures used in the experiments, the pace estimators may have room for further improvement. This is because an estimator of the mixing distribution that is better than the currently used NNM may be available, e.g., the ML estimator. Even for the NNM estimator itself, the choices of potential support points, fitting intervals, empirical measure, and distance measure can all affect the results. The current choices are not necessarily optimal.
Chapter 7

Summary and final remarks

7.1 Summary

This thesis presents a new approach to fitting linear models, called "pace regression", which also overcomes the dimensionality determination problem introduced in Section 1.1. It endeavors to minimize the expected prediction loss, which leads to the k-asymptotic optimality of pace regression as established theoretically. In orthogonal spaces, it outperforms, in the sense of expected loss, many existing procedures for fitting linear models when the number of free parameters is infinitely large. Dimensionality determination turns out to be a natural by-product of the goal of minimizing expected loss. These theoretical conclusions are supported by the simulation studies in Chapter 6.

Pace regression consists of six procedures, denoted by PACE1, ..., PACE6 respectively, which are optimal in their corresponding model spaces (Section 3.5). Among them, PACE5 is the best, but it requires numeric integration and thus is computationally expensive. However, it can be approximated very well by PACE6, which provides consistently satisfactory results in our simulation studies.

The main idea of these procedures, explained in Chapter 3, is to decompose the initially estimated model (we always use the OLS full model) in an orthogonal space into dimensional models, which are statistically independent. These dimensional models are then employed to obtain an estimate of the distribution of the true effects in each individual dimension. This allows the initially estimated dimensional models to be upgraded by filtering out the estimated expected uncertainty, and hence overcomes the competition phenomena in model selection based on data explanation. The concept of "contribution", which is introduced in Section 3.3, can be used to indicate whether and by how much a model outperforms the null one. It is employed throughout the remainder of the thesis to explain the task of model selection and to obtain the pace regression procedures.
obtainpaceregressionprocedures.
Chapter4 discussessomerelatedissues.Severalmajormodelingprinciplesareex-
aminedretrospectively in Section4.5. Orthogonalizationselection—acompetition
phenomenonanalogousto modelselection—iscoveredin Section4.6. Thepossi-
bility of updatingsignedprojectionsis discussedin Section4.7.
Although startingfrom the viewpoint of competingmodels,the ideaof this work
turns out to be similar to the empirical Bayesmethodologypioneeredby Rob-
bins (1951,1955,1964),andhasa closerelationwith Steinor shrinkageestima-
tion (Stein,1955;JamesandStein,1961).Theshrinkageideawasexploredin linear
regressionby Sclove(1968),Breiman(1995)andTibshirani(1996),for example.
Throughthis thesis,we have gaineda deepunderstandingof theproblemof fitting
linear modelsby minimizing expectedquadraticloss. Key issuesin fitting linear
modelsareintroducedin Chapter2, andthroughoutthe thesismajor existing pro-
ceduresfor fitting linearmodelsarereviewedandcompared,boththeoreticallyand
empirically, with theproposedones.They includeOLS, AIC, BIC, RIC, CIC, CV �?½Ðò ,BS �?¾ ò , RIDGE, NN-GAROTTE andLASSO.
Chapter5 investigatestheestimationof anarbitrarymixing distribution. Although
this is anindependenttopicwith its own value,it is anessentialpartof paceregres-
sion. Two minimumdistanceapproachesbasedon theprobabilitymeasureandon
a nonnegative measureareintroduced.While all the estimatorsarestronglycon-
sistent,the minimum distanceapproachbasedon a nonnegative measureis both
computationallyefficient andreliable. The maximumlikelihoodestimatorcanbe
146
reliablebut notefficient;otherminimumdistanceestimatorscanbeefficientbut not
reliable.By “reliable” herewe meanthatno isolateddatapointsareignoredin the
estimatedmixing distribution. This is vital for paceregression,becauseonesingle
ignoredpoint cancauseunboundedloss. The predictive accuracy of thesemini-
mumdistanceproceduresin thecaseof overlappingclustersis studiedempirically
in Section5.6,giving resultsthatgenerallyfavor thenew methods.
Finally, it is worth pointingout thattheoptimality of paceregressionis only in the
senseofD-asymptotics,andhencedoesnot excludethe useof OLS, andperhaps
many otherproceduresfor fitting linearmodels,in situationsof finiteD, especially
whenD
is small or therearefew neighboringdatapoints to supporta reasonable
updating.
7.2 Unsolved problems and possible extensions

In this section, we list some unsolved problems and possible extensions of pace regression; most have appeared earlier in the thesis. Since the issues covered in the thesis are general and have broadly-based potential, we have not attempted to generate an extensive list of extensions.

Orthogonalization selection  What is the effect of orthogonalization selection, and of orthogonalization selection methods, on pace regression (Section 4.6)?

Other noise distributions  Only normally distributed noise components are considered in the current implementation of pace regression. Extension to other types of distribution, even an empirical distribution, seems fairly straightforward.

Full model unavailable  How should the situation be handled in which the full model is unavailable and the noise variance is estimated inaccurately (Section 4.4)?

The x-random model  It is known that the x-random model converges to the x-fixed model for large samples (Section 2.2). However, for small samples, the x-random assumption should in principle result in a different model, because it involves more uncertainty, namely sampling uncertainty.
Enumerated variables  Enumerated variables can be incorporated into linear regression, as well as into pace regression, using dummy variables (see the sketch after this list). However, if the transformation into dummy variables relies on the response vector (so as to reduce dimensionality), it involves competition too. This time the competition occurs among the multiple labels of each enumerated variable. Note that binary variables do not have this problem and thus can be used straightforwardly in pace regression as numeric variables.
Missing values  Missing values are further unknowns: they could be filled in so as to minimize the expected loss.

Other loss functions  In many applications concerning gains in monetary units, minimizing the expected absolute deviation, rather than the quadratic loss considered here, is more appropriate.

Other model structures  The ideas of pace regression are applicable to many model structures, such as those mentioned in Section 1.1, although they may require some adaptation.

Classification  Classification differs from regression in that it uses a discrete, often unordered, prediction space, rather than a continuous one. Both share similar methodology: a method that is applicable in one paradigm is often applicable in the other too. We would like to extend the ideas of pace regression to solve classification problems too.

Statistically dependent competing models  Future research may extend into the broader arena concerning general issues of statistically dependent competing models (Section 7.3).
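As an aside on the "Enumerated variables" item above, the dummy-variable transformation that does not rely on the response is the standard one; for instance, in R:

    f <- factor(c("a", "b", "c", "a"))
    model.matrix(~ f)    # an intercept plus 0/1 dummy columns for levels b and c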
7.3 Modeling methodologies

In spite of rigorous mathematical analysis, empirical Bayes (including Stein estimation) is often criticized from a philosophical viewpoint. One seemingly obvious problem, which is still controversial today (see, e.g., the discussions following Brown, 1990), is that completely irrelevant (or statistically unrelatedly distributed) events, given in arbitrary measurement units, can be employed to improve the estimation of each other. Efron and Morris (1973a, p.379) comment on this phenomenon as follows:

    Do you mean that if I want to estimate tea consumption in Taiwan I will do better to estimate simultaneously the speed of light and the weight of hogs in Montana?

The empirical Bayes methodology contradicts some well-known statistical principles, such as conditionality, sufficiency and invariance.

In a non-technical introduction to Stein estimation, Efron and Morris (1977, pp.125-126), in the context of an example concerning rates of an endemic disease in cities, concluded that "the James-Stein method gives better estimates for a majority of cities, and it reduces the total error of estimation for the sum of all cities. It cannot be demonstrated, however, that Stein's method is superior for any particular city; in fact, the James-Stein prediction can be substantially worse." This clearly indicates that estimation should depend on the loss function employed in the specific application, a notion consistent with statistical decision theory.

Two different loss functions are of interest here. One is the loss concerning a particular player, and the other the overall loss concerning a group of players. Successful application of empirical Bayes implies that it is reasonable to employ the overall loss in this application. The further assumption of statistical independence of the players' performance is needed in order to apply the mathematical tools that are available.
The second type of loss concerns the performance of a team of players. Although employing information about all players' past performance through empirical Bayes estimation may or may not improve the estimation of a particular player's true ability, it is always (technically, almost surely) the correct approach to estimating a team's true ability, and hence to predicting its future performance; in other words, it is asymptotically optimal using general empirical Bayes estimation and finitely better using Stein estimation.

A linear model, or any model, with multiple free parameters is such a team, each parameter being a player. Note that the parameters could be completely irrelevant, but they are bound together into a team by the prediction task. This may seem slightly beyond the existing empirical Bayes methods, since the performance of these players could be, and often is, statistically dependent. The general question should be how to estimate the team's true ability, given statistically dependent performance of the players.

Pace regression handles statistical dependence using dummy players, namely the absolute dimensional distances of the estimated model in an orthogonal space. This approach provides a general way of eliminating statistical dependence. There are other possible ways of defining dummy players, and from the empirical Bayes viewpoint it is immaterial how they are defined as long as they are statistically independent of each other. The choice of A_1, ..., A_k in pace regression has these advantages: (1) they share the same scale of signal-to-noise ratio, and thus the estimation is invariant of the choice of measurement units; (2) the mixture distribution is well defined, with mathematical solutions available; (3) they are simply related to the loss function of prediction (in the x-fixed situation); (4) many applications have some zero, or almost zero, A*_j's, and updating each corresponding A_j often improves estimation and reduces dimensionality (see Section 4.2).

Nevertheless, using dummy players in this way is not a perfect solution. When the determination of the orthogonal basis relies on the observed response, it suffers from the orthogonalization selection problem (Section 4.6). Information about the statistical dependence of the original players might be available, or simply estimable from data. If so, it might be employable to improve the estimate of the team's true ability.

Estimation of a particular player's true ability using the loss function concerning this single player is actually estimation given the player's name, which is different from estimation given the player's performance when the performance of all players in a team is mixed together with their names lost or ignored. This is because the latter addresses the team's ability given that performance, thus sharing the same methodology as estimating team ability. For example, pre-selecting parameters based on performance (i.e., data explanation) affects the estimation of the team ability; see Section 4.4.
This point generalizes to situations with many competing models. When the final model is determined on the basis of its performance, rather than of some prior belief, after seeing all the models' performance, the estimation of its true ability should take into account all candidates' performance. In fact, the final model, assumed optimal, does not have to be one of the candidate models; it can be a modification of one model or a weighted combination of all candidates.

Modern computing technology allows the investigation of larger and larger datasets and the exploration of increasingly wide model spaces. Data mining (see, e.g., Fayyad et al., 1996; Friedman, 1997; Witten and Frank, 1999), as a burgeoning new technology, often produces many candidate models from computationally intensive algorithms. While this can help to find more appropriate models, the chance effects increase. How to determine the optimal model given a number of competing ones is precisely our concern above.

The subjective prior used in Bayesian analysis concerns the specification of the model space in terms of probabilistic weights, thus relating to Fisher's first problem (see Section 1.1). Without a prior, the model space would otherwise be specified implicitly with (for example) a uniform distribution. Both empirical Bayes and pace regression fully exploit the data after all prior information, including the specification of the model space, is given. Thus both belong to Fisher's second problem type. The story told by empirical Bayes and pace regression is that the data contain all the information necessary for the convergence of optimal predictive modeling, while the tips given by any prior information, the correctness of which is always subject to practical examination, play at most the role of speeding up this convergence.

"Torturing" data is deemed to be unacceptable modeling practice (see Section 1.3). In fact, however, empirical modeling always tortures the data to some extent: the larger the model space, the more the data is tortured. Since some model space, large or small, is indispensable, the correct approach is to take the effect of the necessary data torturing into account within the modeling procedure. This is what pace regression does. And does so successfully.
Appendix
Help files and source code
The ideas presented in the thesis are formally implemented in the computer languages of S-PLUS/R and FORTRAN 77. The whole program runs under versions 3.4 and 5.1 of S-PLUS (MathSoft, Inc. [1]), and under version 1.1.1 of R [2]. Help files for the most important S-PLUS functions are included. The program is free software; it can be redistributed and/or modified under the terms of the GNU General Public License (version 2) as published by the Free Software Foundation [3].

One extra software package is utilized in the implementation, namely the elegant NNLS algorithm by Lawson and Hanson (1974, 1995), which is public domain and obtainable from NetLib [4].

[1] http://www.mathsoft.com/
[2] http://lib.stat.cmu.edu/R/CRAN/
[3] Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. http://www.gnu.org/
[4] http://www.netlib.org/lawson-hanson/all

The source code is organized in six files (five in S-PLUS/R and one in FORTRAN 77):

disctfun.q   contains functions that handle discrete functions. They are mainly employed in our work to represent discrete pdfs; in particular, the discrete, arbitrary mixing distributions.
ls.q   is basically an S-PLUS/R interface to functions in FORTRAN for solving linear system problems, including QR transformation, non-negative linear regression, non-negative linear regression with equality constraints, and upper-triangular equation solution.

mixing.q   includes functions for the estimation of arbitrary mixing distributions.

pace.q   has functions for pace regression and for handling objects of class "pace".

util.q   gives a few small functions.

ls.f   has functions in FORTRAN for solving some linear systems problems.
A.1 Help files
bmixing:
     See mixing.
disctfun:

                Constructor of a Discrete Function

DESCRIPTION:
     Returns a univariate discrete function which is always zero
     except over a discrete set of data points.

USAGE:
     disctfun(data=NULL, values=NULL, classname, normalize=F,
              sorting=F)

REQUIRED ARGUMENTS:
     None.

OPTIONAL ARGUMENTS:
     data: The set of data points over which the function takes
          non-zero values.
     values: The set of the function values over the specified data
          points.
     classname: Extra class names; e.g., c("sorted", "normed"), if
          data are sorted and values are L1-normalized on input.
     normalize: Will L1-normalize values (so as to make a discrete
          pdf function).
     sorting: Will sort data.

VALUE:
     Returns an object of class "disctfun".

DETAILS:
     Each discrete function is stored in a (n x 2) matrix, where
     the first column stores the data points over which the
     function takes non-zero values, and the second column stores
     the corresponding function values at these points.

SEE ALSO:
     disctfun.object, mixing.

EXAMPLES:
     disctfun()
     disctfun(c(1:5,10:6), 10:1, T, T)
disctfun.object:

                Discrete Function Object

DESCRIPTION:
     These are objects of class "disctfun". They represent the
     discrete functions which take non-zero values over a discrete
     set of data points.

GENERATION:
     This class of objects can be generated by the constructor
     disctfun.

METHODS:
     The "disctfun" class of objects has methods for these generic
     functions: sort, print, is, normalize, unique, plot, "+", "*",
     etc.

STRUCTURE:
     Each discrete function is stored in a (n x 2) matrix, where
     the first column stores the data points over which the
     function takes non-zero values, and the second column stores
     the corresponding function values at these points.

     Since it is a special matrix, methods for matrices can be
     used.

SEE ALSO:
     disctfun, mixing.
fintervals:

                Fitting Intervals

DESCRIPTION:
     Returns a set of fitting intervals given the set of fitting
     points.

USAGE:
     fintervals(fp, rb=0, cn=3, dist=c("chisq", "norm"),
                method=c("nnm", "pm", "choi", "cramer"), minb=0.5)

REQUIRED ARGUMENTS:
     fp: fitting points. Allowed to be unsorted.

OPTIONAL ARGUMENTS:
     rb: Only used for the Chi-squared distribution. No fitting
          intervals' endpoints are allowed to fall into the
          interval (0,rb). If the left endpoint does, it is
          replaced by 0; if the right endpoint does, it is replaced
          by rb. This is useful for the function mixing() when
          ne != 0.
     cn: Only for the normal distribution. The length of each
          fitting interval is 2*cn.
     dist: type of distribution; only implemented for the
          non-central Chi-squared ("chisq") with one degree of
          freedom and the normal distribution with variance 1
          ("norm").
     method: One of "nnm", "pm", "choi" and "cramer"; see mixing.
     minb: Only for the Chi-squared distribution and method "nnm".
          No right endpoint is allowed between (0, minb); delete
          such intervals if generated.

VALUE:
     fitting intervals (stored in a two-column matrix).

SEE ALSO:
     mixing, spoints.

EXAMPLES:
     fintervals(1:10)
     fintervals(1:10, rb=2)
lsqr:

          QR Transformation of the Least Squares Problem

DESCRIPTION:
     QR-transform the least squares problem

          A x = b

     into

          R x = t(Q) b

     where A = Q R, R is an upper-triangular matrix (could be
     rank-deficient), and Q is an orthogonal matrix.

USAGE:
     lsqr(a, b, pvt, ks=0, tol=1e-6, type=1)

REQUIRED ARGUMENTS:
     a: matrix A
     b: vector b

OPTIONAL ARGUMENTS:
     pvt: column pivoting vector; only columns specified in pvt are
          QR-transformed. If not provided, all columns are
          considered.
     ks: the first ks columns specified in pvt are QR-transformed
          in the same order. (But they may be deleted due to rank
          deficiency.)
     tol: tolerance for collinearity.
     type: used for finding the most suitable pivoting column.
          1  based on explanation ratio.
          2  based on largest diagonal element.

VALUE:
     a: R (including the zeroed elements)
     b: t(Q) b
     pvt: pivoting index
     ks: pseudorank of R (determined by the value of tol)
     w: explanation matrix.
          w[1,]  sum of squares in a[,j] and b[,j]
          w[2,]  sum of unexplained squares

DETAILS:
     This is an S-PLUS/R interface to the function LSQR in FORTRAN.
     Refer to the source code of this function for more detail.

REFERENCES:
     Lawson, C. L. and Hanson, R. J. (1974, 1995). "Solving Least
     Squares Problems", Prentice-Hall.

     Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W.
     (1979). "LINPACK Users' Guide." Philadelphia, PA: SIAM
     Publications.

     Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J.,
     Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S.,
     McKenney, A., Sorensen, D. (1999). "LAPACK Users' Guide."
     Third edition. Philadelphia, PA: SIAM Publications.

EXAMPLES:
     a <- matrix(rnorm(500), ncol=10)
     b <- 1:50 / 50
     lsqr(a, b)
mixing, bmixing:

     Minimum Distance Estimator of an Arbitrary Mixing Distribution

DESCRIPTION:
     The minimum distance estimation of an arbitrary mixing
     distribution based on a sample from the mixture distribution,
     given the component distribution.

USAGE:
     mixing(data, ne=0, t=3, dist=c("chisq", "norm"),
            method=c("nnm", "pm", "choi", "cramer"))
     bmixing(data, ne=0, t=3, dist=c("chisq", "norm"),
             method=c("nnm", "pm", "choi", "cramer"), minb=0.5)

REQUIRED ARGUMENTS:
     data: A sample from the mixture distribution.

OPTIONAL ARGUMENTS:
     ne: number of extra points with small values; only used for
          the Chi-squared distribution.
     t: threshold for support points; only used for the Chi-squared
          distribution.
     dist: type of distribution; only implemented for the
          non-central Chi-squared ("chisq") with one degree of
          freedom and the normal distribution with variance 1
          ("norm").
     method: one of the four methods:
          "nnm":    based on nonnegative measures (Wang, 2000)
          "pm":     based on probability measures (Wang, 2000)
          "choi":   based on CDFs (Choi and Bulgren, 1968)
          "cramer": Cramer-von Mises statistic, also CDF-based
                    (MacDonald, 1971).

          The default is "nnm", which, unlike the other three, does
          not have the minority cluster problem.
     minb: Only for the Chi-squared distribution and method "nnm".
          No right endpoint is allowed between (0, minb); delete
          such intervals if any.

VALUE:
     the estimated mixing distribution as an object of class
     "disctfun".

DETAILS:
     The separability of data points is tested by the function
     mixing() using the function separable(). Each block is handled
     by the function bmixing().

REFERENCES:
     Choi, K. and Bulgren, W. B. (1968). An estimation procedure
     for mixtures of distributions. J. R. Statist. Soc. B, 30,
     444-460.

     MacDonald, P. D. M. (1971). Comment on a paper by Choi and
     Bulgren. J. R. Statist. Soc. B, 33, 326-329.

     Wang, Y. (2000). A new approach to fitting linear models in
     high dimensional spaces. PhD thesis, Department of Computer
     Science, University of Waikato, New Zealand.

SEE ALSO:
     pace, disctfun, separable, spoints, fintervals.

EXAMPLES:
     mixing(1:10)                   # mixing distribution of nchisq
     bmixing(1:10)
     mixing(c(5:15, 50:60))         # two main blocks
     bmixing(c(5:15, 50:60))
     mixing(c(5:15, 50:60), ne=20)  # ne affects the estimation
     bmixing(c(5:15, 50:60), ne=20)
nnls, nnlse:

                Problems NNLS and NNLSE

DESCRIPTION:
     Problem NNLS (nonnegative least squares):

          Minimize   || A x - b ||
          subject to x >= 0

     Problem NNLSE (nonnegative least squares with equality
     constraints):

          Minimize   || A x - b ||
          subject to E x = f
          and        x >= 0

USAGE:
     nnls(a, b)
     nnlse(a, b, e, f)

REQUIRED ARGUMENTS:
     a: matrix A
     b: vector b
     e: matrix E
     f: vector f

VALUE:
     x: the solution vector
     rnorm: the Euclidean norm of the residual vector.
     index: defines the sets P and Z as follows:
          P: index[1:nsetp]
          Z: index[(nsetp+1):n]
     mode: the success-failure flag with the following meaning:
          1  The solution has been computed successfully.
          2  The dimensions of the problem are bad, either m <= 0
             or n <= 0.
          3  Iteration count exceeded; more than 3*n iterations.

DETAILS:
     This is an S-PLUS/R interface to the algorithm NNLS proposed
     by Lawson and Hanson (1974, 1995), and to the algorithm NNLSE
     by Haskell and Hanson (1981) and Hanson and Haskell (1982).

     Problem NNLSE converges to Problem NNLS by considering

          Minimize || / e A \     / e b \ ||
                   || |     | x - |     | ||
                   || \  E  /     \  f  / ||

          subject to x >= 0

     as e --> 0+.

     Here the implemented nnlse only considers the situation that
     xj >= 0 for all j, while the original algorithm allows
     restriction on some xj only.

     The two functions nnls and nnlse are tested based on the
     source code of NNLS (in FORTRAN 77), which is public domain
     and obtained from http://www.netlib.org/lawson-hanson/all.

REFERENCES:
     Lawson, C. L. and Hanson, R. J. (1974, 1995). "Solving Least
     Squares Problems", Prentice-Hall.

     Haskell, K. H. and Hanson, R. J. (1981). An algorithm for
     linear least squares problems with equality and nonnegativity
     constraints. Math. Prog. 21, pp. 98-118.

     Hanson, R. J. and Haskell, K. H. (1982). Algorithm 587. Two
     algorithms for the linearly constrained least squares problem.
     ACM Trans. on Math. Software, Sept. 1982.

EXAMPLES:
     a <- matrix(rnorm(500), ncol=10)
     b <- 1:50 / 50
     nnls(a, b)
     e <- rep(1, 10)
     f <- 1
     nnlse(a, b, e, f)
nnlse:

     See nnls.
pace:

          Pace Estimator of a Linear Regression Model

DESCRIPTION:
     Returns an object of class "pace" that represents a fit of a
     linear model.

USAGE:
     pace(x, y, va, yname=NULL, ycol=0, pvt=NULL, kp=0,
          intercept=T, ne=0, method=<<see below>>, tau=2, tol=1e-6)

REQUIRED ARGUMENTS:
     x: regression matrix. It is a matrix, or a data.frame, or
          anything else that can be coerced into a matrix through
          the function as.matrix().

OPTIONAL ARGUMENTS:
     y: response vector. If y is missing, the last column in x or
          the column specified by ycol is the response vector.
     va: variance of the noise component.
     yname: name of the response vector.
     ycol: column number in x for the response vector, if y is not
          provided.
     pvt: pivoting vector which stores the column numbers that will
          be considered in the estimation. When NULL (default), all
          columns are taken into account.
     kp: the ordering of the first kp (=0, by default) columns
          specified in pvt is pre-chosen. The estimation will not
          change their ordering, but they are subject to
          rank-deficiency examination.
     intercept: whether an intercept should be added to the
          regression matrix.
     ne: number of extra variables which are pre-excluded before
          calling this function due to small effects (i.e., A's).
          Only used for the pace methods and cic.
     method: one of the methods "pace6" (default), "pace2",
          "pace4", "olsc", "ols", "full", "null", "aic", "bic",
          "ric", "cic". See DETAILS and REFERENCES below.
     tau: threshold value for the method "olsc(tau)".
     tol: tolerance threshold for collinearity; = 1e-6 by default.

VALUE:
     An object of class "pace" is returned. See pace.object for
     details.

DETAILS:
     For methods pace2, pace4, pace6 and olsc, refer to Wang (2000)
     and Wang et al. (2000). Further, n-asymptotically, aic =
     olsc(2), bic = olsc(log(n)), and ric = olsc(2 log(k)), where n
     is the number of observations and k is the number of free
     parameters (or candidate variables).

     Method aic is proposed by Akaike (1973).

     Method bic is by Schwarz (1978).

     For method ric, refer to Donoho and Johnstone (1994) and
     Foster and George (1994).

     Method cic is due to Tibshirani and Knight (1997).

     Method ols is the ordinary least squares estimation including
     all parameters after eliminating (pseudo-)collinearity, while
     method full is the ols without eliminating collinearity.
     Method null returns the mean of the response, included for
     reasons of completeness.

REFERENCES:
     Akaike, H. (1973). Information theory and an extension of the
     maximum likelihood principle. Proc. 2nd Int. Symp. Inform.
     Theory, suppl. Problems of Control and Information Theory,
     pp. 267-281.

     Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial
     adaptation via wavelet shrinkage. Biometrika 81, 425-455.

     Foster, D. and George, E. (1994). The risk inflation criterion
     for multiple regression. Annals of Statistics, 22, 1947-1975.

     Schwarz, G. (1978). Estimating the dimension of a model.
     Annals of Statistics, Vol. 6, No. 2, 461-464.

     Tibshirani, R. and Knight, K. (1997). The covariance inflation
     criterion for model selection. Technical report, November
     1997, Department of Statistics, University of Stanford.

     Wang, Y. (2000). A new approach to fitting linear models in
     high dimensional spaces. PhD thesis, Department of Computer
     Science, University of Waikato, New Zealand.

     Wang, Y., Witten, I. H. and Scott, A. (2000). Pace regression.
     (Submitted.)

SEE ALSO:
     pace.object, predict.pace

EXAMPLES:
     x <- matrix(rnorm(500), ncol=10)     # completely random
     pace(x)            # equivalently, pace(x, method="pace6")
     for(i in c("ols","null","aic","bic","ric","cic","pace2",
                "pace4","pace6")) {
         cat("---- METHOD", i, "----\n\n")
         print(pace(x, method=i))
     }
     y <- x %*% c(rep(0,5), rep(.5,5)) + rnorm(50)
                        # mixed-effect response
     for(i in c("ols","null","aic","bic","ric","cic","pace2",
                "pace4","pace6")) {
         cat("---- METHOD", i, "----\n\n")
         print(pace(x, y, method=i))
     }
     pace(x, y, method="olsc", tau=5)
     pace(x, y, ne=20)  # influence of extra variables
pace.object:

                Pace Regression Model Object

DESCRIPTION:
     These are objects of class "pace". They represent the fit of a
     linear regression model by a pace estimator.

GENERATION:
     This class of objects is returned from the function pace to
     represent a fitted linear model.

METHODS:
     The "pace" class of objects (at the moment) has methods for
     these generic functions: print, predict.

STRUCTURE:
     coef: fitted coefficients of the linear model.
     pvt: pivoting columns in the training set.
     ycol: the column in the training set considered as the
          response vector.
     intercept: whether an intercept is used.
     va: given or estimated noise variance (i.e., the unbiased OLS
          estimate).
     ks: the final model dimensionality.
     kc: pseudo-rank of the design matrix (determined by the value
          of tol).
     ndims: total number of dimensions.
     nobs: number of observations.
     ne: number of extra variables with small effects.
     call: function call.
     A: observed absolute distances of the OLS full model.
     At: updated absolute distances.
     mixing: estimated mixing distribution from the observed
          absolute distances.

SEE ALSO:
     pace, predict.pace, mixing
predict.pace:

     Predicts New Observations Using a Pace Linear Model Estimate

DESCRIPTION:
     Returns the predicted values for the response variable.

USAGE:
     predict.pace(p, x, y)

REQUIRED ARGUMENTS:
     p: an object of class "pace".
     x: test set (may or may not contain the response vector).

OPTIONAL ARGUMENTS:
     y: response vector (if not stored in x).

VALUE:
     pred: predicted response vector.
     resid: residual vector (if applicable).

SEE ALSO:
     pace, pace.object.

EXAMPLES:
     x <- matrix(rnorm(500), ncol=10)
     p <- pace(x)
     xtest <- matrix(rnorm(500), ncol=10)
     predict(p, xtest)
rsolve:

          Solving an Upper-Triangular Equation

DESCRIPTION:
     Solve the upper-triangular equation

          X * b = y

     where X is an upper-triangular matrix (could be
     rank-deficient). Here the columns of X may be pre-indexed.

USAGE:
     rsolve(x, y, pvt, ks)

REQUIRED ARGUMENTS:
     x: matrix X
     y: vector y

OPTIONAL ARGUMENTS:
     pvt: column pivoting vector; only columns specified in pvt are
          used in solving the equation. If not provided, all
          columns are considered.
     ks: the first ks columns specified in pvt are QR-transformed
          in the same order.

VALUE:
     the solution vector b

DETAILS:
     This is an S-PLUS/R interface to the function RSOLVE in
     FORTRAN. Refer to the source code of this function for more
     detail.

REFERENCES:
     Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W.
     (1979). "LINPACK Users' Guide." Philadelphia, PA: SIAM
     Publications.

     Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling,
     W. T. (1994). "Numerical Recipes in C." Cambridge University
     Press.

     Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J.,
     Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S.,
     McKenney, A., Sorensen, D. (1999). "LAPACK Users' Guide."
     Third edition. Philadelphia, PA: SIAM Publications.

EXAMPLES:
     a <- matrix(rnorm(500), ncol=10)
     b <- 1:50 / 50
     fit <- lsqr(a, b)
     rsolve(fit$a, fit$b, fit$pvt)
separable:

          Separability of a Point from a Set of Points

DESCRIPTION:
     Tests if a point could be separable from a set of data points,
     based on probability, for the purpose of estimating a mixing
     distribution.

USAGE:
     separable(data, x, ne=0, thresh=0.1, dist=c("chisq", "norm"))

REQUIRED ARGUMENTS:
     data: incrementally sorted data points.
     x: a single point, satisfying either x <= min(data) or
          x >= max(data).

OPTIONAL ARGUMENTS:
     ne: number of extra data points of small values (= 0, by
          default). Only used for the chisq distribution.
     thresh: a threshold value based on probability (= 0.1, by
          default). A larger value implies easier separability.
     dist: type of distribution; only implemented for the
          non-central Chi-squared ("chisq") with one degree of
          freedom and the normal distribution with variance 1
          ("norm").

VALUE:
     T or F, implying whether x is separable from data.

DETAILS:
     Used by the estimation of a mixing distribution so that the
     whole set of data points can be separated into blocks to speed
     up estimation.

REFERENCES:
     Wang, Y. (2000). A new approach to fitting linear models in
     high dimensional spaces. PhD thesis, Department of Computer
     Science, University of Waikato, New Zealand.

SEE ALSO:
     mixing, pace

EXAMPLES:
     separable(5:10, 25)        # Returns T
     separable(5:10, 25, ne=20) # Returns F, for extra small
                                # values not included in data.
spoints:

                Support Points

DESCRIPTION:
     Returns the set of candidate support points for the minimum
     distance estimation of an arbitrary mixing distribution.

USAGE:
     spoints(data, ne=0, dist=c("chisq", "norm"), ns=100, t=3)

REQUIRED ARGUMENTS:
     data: A sample of the mixture distribution.

OPTIONAL ARGUMENTS:
     ne: number of extra points with small values. Only used for
          the Chi-squared distribution.
     dist: type of distribution; only implemented for the
          non-central Chi-squared ("chisq") with one degree of
          freedom and the normal distribution with variance 1
          ("norm").
     ns: Maximum number of support points. Only for the normal
          distribution.
     t: No support points inside (0, t).

VALUE:
     Support points (stored as a vector).

SEE ALSO:
     mixing, fintervals.

EXAMPLES:
     spoints(1:10)
A.2 Source code
disctfun.q:
# --------------------------------------------------------------
#
#   Functions for manipulating discrete functions
#
# --------------------------------------------------------------
disctfun <- function(data=NULL, values=NULL, classname,
                     normalize = F, sorting = F)
{
    if( missing(data) ) n <- 0
    else n <- length(data)
    if( n == 0 ) {                          # empty disctfun
        if( is.null( version$language ) ) { # S language
            d <- NULL
            class(d) <- "disctfun"
            return (d)
        }
        else {                              # R language
            d <- matrix(nrow=0, ncol=2, dimnames =
                        list(NULL, c("Points", "Values")))
            class(d) <- "disctfun"
            return (d)
        }
    }
    else {
        if( missing(values) ) values <- rep(1/n, n)
        else if (length(values) != n)
            stop("Lengths of data and values must match.")
        else if( normalize ) values <- values / sum(values)
        d <- array(c(data, values), dim=c(n,2), dimnames =
                   list(paste("[",1:n,",]",sep=""),
                        c("Points", "Values")))
        class(d) <- "disctfun"
    }
    if( ! missing(classname) )
        class(d) <- unique( c(class(d), classname) )
    if( missing(values) || normalize )
        class(d) <- unique( c(class(d), "normed") )
    if( is.sorted(data) )
        class(d) <- unique( c(class(d), "sorted") )
    if( sorting && !is.sorted(d) ) sort.disctfun(d)
    else d
}
sort.disctfun <- function(d)
{
    if (length(d) == 0) return (d)
    if( is.sorted(d) ) return (d)
    n <- dim(d)[1]
    if( n != 1 ) {
        i <- sort.list(d[,1])
        cl <- class(d)
        d <- array(d[i,], dim=c(n,2), dimnames =
                   list(paste("[",1:n,",]",sep=""),
                        c("Points", "Values")))
        class(d) <- cl
    }
    class(d) <- unique( c(class(d), "sorted") )
    d
}
is.disctfun <- function(d)
{
    if( is.na( match("disctfun", class(d)) ) ) F
    else T
}
normalize.disctfun <- function(d)
{
    d[,2] <- d[,2] / sum(d[,2])
    class(d) <- unique( c(class(d), "normed") )
    d
}
print.disctfun <- function(d)
{
    if( length(d) == 0 ) cat("disctfun()\n")
    else if( dim(d)[1] == 0 ) cat("disctfun()\n")
    else print(unclass(d))
    invisible(d)
}
unique.disctfun <- function(d)
# On input, d is sorted.
{
    count <- 1
    if( dim(d)[1] >= 2 ) {
        for ( i in 2:dim(d)[1] )
            if( d[count,1] != d[i,1] ) {
                count <- count + 1
                d[count,] <- d[i,]
            }
            else d[count,2] <- d[count,2] + d[i,2]
        disctfun(d[1:count,1], d[1:count,2], "sorted")
    }
    else {
        class(d) <- c(class(d), "sorted")
        d
    }
}
"+.disctfun" <- function(d1,d2){
if(! is.disctfun(d1) | ! is.disctfun(d2) )stop("the argument(s) not disctfun.")
169
if( length(d1) == 0) d2else if (length(d2) == 0) d1else {
d <- disctfun( c(d1[,1],d2[,1]), c(d1[,2],d2[,2]), sort = T )# In R, unique is not a generic functionunique.disctfun(d)
}}
"*.disctfun" <- function(x,d){
if( is.disctfun(d) ) {if( length(x) == 1 )
{ if( length(d) != 0 ) d[,2] <- x * d[,2] }else stop("Inappropriate type of argument.")# this seems necessary for Rclass(d) <- unique(c("disctfun", class(d)))d
}else if( is.disctfun(x) ) {
if( length(d) == 1 ){ if( length(x) != 0 ) x[,2] <- d * x[,2] }
else stop("Inappropriate type of argument.")# this seems necessary for Rclass(x) <- unique(c("disctfun", class(x)))x
}}
plot.disctfun <- function(d, xlab="x", ylab="y", lty=1, ...)
{
    if( length(d) != 0 )
        plot(d[,1], d[,2], xlab=xlab, ylab=ylab, type="h", ...)
    else cat("NULL (disctfun); nothing ploted.\n")
}
ls.q:
# --------------------------------------------------------------
#
#   Functions for solving linear system problems.
#
# --------------------------------------------------------------
nnls <- function(a,b)
{
    m <- as.integer(dim(a)[1])
    n <- as.integer(dim(a)[2])
    storage.mode(a) <- "double"
    storage.mode(b) <- "double"
    x <- as.double(rep(0, n))        # only for output
    rnorm <- 0
    storage.mode(rnorm) <- "double"  # only for output
    w <- x                           # n-vector of working space
    zz <- b                          # m-vector of working space
    index <- as.integer(rep(0,n))
    mode <- as.integer(0)            # = 1, success
    .Fortran("nnls", a, m, m, n, b, x=x, rnorm=rnorm, w, zz,
             index=index, mode=mode)[c("x","rnorm","index","mode")]
}
nnls.disctfun <- function(s, sp)
{
    index <- s$index[s$x[s$index] != 0]
    disctfun(sp[index], s$x[index], normalize=T, sorting=T)
}
nnlse <- function(a, b, e, f)
{
    eps <- 1e-3    # eps is adjustable
    nnls(rbind(e, a * eps), c(f, b * eps))
}
lsqr <- function(a, b, pvt, ks=0, tol=1e-6, type=1)
{
    if ( ! is.matrix(b) ) dim(b) <- c(length(b),1)
    if ( ! is.matrix(a) ) dim(a) <- c(1,length(a))
    m <- as.integer(dim(a)[1])
    n <- as.integer(dim(a)[2])
    if ( m != dim(b)[1] )
        stop("dim(a)[1] != dim(b)[1] in lsqr()")
    nb <- as.integer(dim(b)[2])
    if( missing(pvt) || is.null(pvt) ) pvt <- 1:n
    else if( ! correct.pvt(pvt,n) ) stop("mis-specified pvt.")
    kp <- length(pvt)
    kn <- 0
    storage.mode(kn) <- "integer"
    storage.mode(kp) <- "integer"
    storage.mode(a) <- "double"
    storage.mode(b) <- "double"
    storage.mode(pvt) <- "integer"
    storage.mode(ks) <- "integer"
    storage.mode(tol) <- "double"
    storage.mode(type) <- "integer"
    if( ks < 0 || ks > kp ) stop("ks out of range in lsqr()")
    w <- matrix(0, nrow = 2, ncol = n+nb)   # working space
    storage.mode(w) <- "double"
    .Fortran("lsqr", a=a, m, m, n, b=b, m, nb, pvt=pvt, kp, ks=ks,
             w=w, tol, type)[c("a","b","pvt","ks","w")]
}
correct.pvt <- function(pvt,n) {
    ! ( max(pvt) > n ||           # out of range
        min(pvt) < 1 ||           # out of range
        any(duplicated(pvt))      # has duplicates
      )
}
rsolve <- function(x,y,pvt,ks)
{
    if ( ! is.matrix(y) ) dim(y) <- c(length(y),1)
    if ( ! is.matrix(x) ) dim(x) <- c(1,length(x))
    m <- as.integer(dim(x)[1])
    n <- as.integer(dim(x)[2])
    mn <- min(m,n)
    ny <- as.integer(dim(y)[2])
    if(missing(pvt)) {
        pvt <- 1:mn
    }
    if ( missing(ks) ) ks <- length(pvt)
    if(ks < 0 || ks > mn) stop("ks out of range in rsolve()")
    storage.mode(x) <- "double"
    storage.mode(y) <- "double"
    storage.mode(pvt) <- "integer"
    storage.mode(ks) <- "integer"
    .Fortran("rsolve", x,m,y=y,m,ny,pvt,ks)$y[1:ks,]
}
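A small usage sketch (again assuming the Fortran routines in ls.f, listed later, have been compiled and loaded): lsqr() performs the pivoted QR transformation and rsolve() back-substitutes for the coefficients.

x <- matrix(rnorm(200), nrow = 100)        # 100 x 2 design
y <- x %*% c(1, -2) + rnorm(100, sd = 0.1)
fit <- lsqr(x, y)                          # QR transformation with pivoting
rsolve(fit$a, fit$b, fit$pvt, fit$ks)      # coefficients, in fit$pvt order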
mixing.q:
# ---------------------------------------------------------------
#
#  Functions for the estimation of a mixing distribution
#
# ---------------------------------------------------------------
spoints <- function(data, ne = 0, dist = c("chisq", "norm"),
                    ns = 100, t = 3)
{
    if( ! is.sorted(data) ) data <- sort(data)
    dist <- match.arg(dist)
    n <- length(data)
    if(n < 2) stop("tiny size data.")
    if(dist == "chisq" && (data[1] < t || ne != 0) ) {
        sp <- 0
        ns <- ns - 1
        if(data[n] == t) ns <- 1
    }
    else sp <- NULL
    switch (dist,
        "norm" = data,    # seq(data[1], data[n], length=ns),
        c(sp, data[data >= t]) )
}
fintervals <- function( fp, rb = 0, cn = 3,
                        dist = c("chisq", "norm"),
                        method = c("nnm", "pm", "choi", "cramer"),
                        minb = 0.5)
{
    dist <- match.arg(dist)
    method <- match.arg(method)
    switch( method,
        "pm" = ,
        "nnm" = switch( dist,
            "chisq" = {
                f <- cbind( c(fp, invlfun(fp[(i <- fp >= minb)])),
                            c(lfun(fp), fp[i]) )
                if( rb >= 0 ) {
                    f[f[,1] > 0 & f[,1] < rb, 1] <- 0
                    f[f[,2] > 0 & f[,2] < rb, 2] <- rb
                }
                f
            },
            "norm" = cbind( c(fp, fp-cn), c(fp+cn, fp) )),
        "choi" = ,
        "cramer" = switch( dist,
            "chisq" = cbind(pmax(min(fp) - 50, 0), fp),
            "norm" = cbind(min(fp) - 50, fp) )
    )
}
lfun <- function( x, c1 = 5, c2 = 20 )
# ---------------------------------------------------------------
#  function l(x)
#
#  Returns the right boundary points of fitting intervals
#  given the left ones
# ---------------------------------------------------------------
{
    if(length(x) == 0) NULL
    else {
        x[x<0] <- 0
        c1 + x + c2 * sqrt(x)
    }
}
invlfun <- function( x, c1 = 5, c2 = 20 )
# ---------------------------------------------------------------
#  inverse l-function
# ---------------------------------------------------------------
{
    if(length(x) == 0) NULL
    else {
        x[x<=c1] <- c1
        (sqrt( x - c1 + c2 * c2 / 4 ) - c2 / 2 )^2
    }
}
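A quick sanity check (illustrative only) that invlfun() inverts lfun() above the threshold c1:

lfun(4)             # 5 + 4 + 20*sqrt(4) = 49
invlfun(lfun(4))    # recovers 4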
epm <- function(data, intervals, ne = 0, rb = 0, bw = 0.5)
# ---------------------------------------------------------------
#  empirical probability measure over intervals
# ---------------------------------------------------------------
{
    n <- length(data)
    pm <- rep(0, nrow(intervals))
    for ( i in 1:n )
        pm <- pm + (data[i] > intervals[,1] & data[i] < intervals[,2]) +
              ( (data[i] == intervals[,1]) +
                (data[i] == intervals[,2]) ) * bw
    i <- intervals[,1] < rb & intervals[,2] >= rb
    pm[i] <- pm[i] + ne
    pm / (n + ne)
}
probmatrix <- function( data, bins, dist = c("chisq", "norm") )
# ---------------------------------------------------------------
#  Probability matrix
# ---------------------------------------------------------------
{
    dist <- match.arg(dist)
    n <- length(data)
    e <- matrix(nrow=nrow(bins), ncol=n)
    for (j in 1:n)
        e[,j] <- switch(dist,
            "chisq" = pchisq1(bins[,2], data[j]) -
                      pchisq1(bins[,1], data[j]),
            "norm" = pnorm(bins[,2], data[j]) -
                     pnorm(bins[,1], data[j]) )
    e
}
bmixing <- function(data, ne = 0, t = 3, dist = c("chisq","norm"),
                    method = c("nnm","pm","choi","cramer"), minb = 0.5)
{
    dist <- match.arg(dist)
    method <- match.arg(method)
    if( ! is.sorted(data) ) {
        data <- sort(data)
        class(data) <- unique( c(class(data),"sorted") )
    }

    n <- length(data)

    if( n == 0 ) return (disctfun())
    if( n == 1 ) {
        if( dist == "chisq" && data < t ) return ( disctfun(0) )
        else return ( disctfun(max( data )) )
    }

    # support points
    sp <- spoints(data, ne = ne, t = t, dist = dist)
    rb <- 0
    # fitting intervals
    if(ne != 0) {
        rb <- max(minb, data[min(3,n)])
        fi <- fintervals(data, rb = rb, dist = dist, method = method,
                         minb = minb)
    }
    else fi <- fintervals(data, dist = dist, method = method,
                          minb = minb)

    bw <- switch(method,
        "cramer" = ,
        "nnm" = ,
        "pm" = 0.5,
        "choi" = 1 )

    # empirical probability measure
    b <- epm(data, fi, ne = ne, rb = rb, bw = bw)
    a <- probmatrix(sp, fi, dist = dist)
    ns <- length(sp)
    nnls.disctfun(switch( method,
        "nnm" = nnls(a,b),
        nnlse(a, b, matrix(1,nrow=1,ncol=ns), 1) ), sp)
}
mixing <- function(data, ne = 0, t = 3, dist = c("chisq","norm"),
                   method = c("nnm","pm","choi","cramer"))
{
    dist <- match.arg(dist)
    method <- match.arg(method)
    n <- length(data)
    if( n <= 1 )
        return ( bmixing(data, ne, t = t, dist = dist, method = method) )

    if( ! is.sorted(data) ) data <- sort(data)
    d <- disctfun()
    start <- 1
    for( i in 1:(n-1) ) {
        if( separable(data[start:i], data[i+1], ne, dist = dist) &&
            separable(data[(i+1):n], data[i], dist = dist) ) {
            x <- data[start:i]
            class(x) <- "sorted"
            d <- d + ( i - start + 1 + ne ) *
                 bmixing(x, ne, t = t, dist = dist, method = method)
            start <- i + 1
            ne <- 0
        }
    }
    x <- data[start:n]
    class(x) <- "sorted"
    normalize( d + ( n - start + 1 + ne ) *
               bmixing(x, ne, t = t, dist = dist, method = method) )
}
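A hedged usage sketch of mixing(): estimate a mixing distribution from noncentral chi-square (df = 1) observations whose noncentrality parameters are either 0 or 9, using rchisq1() from util.q below. The estimate should place most of its mass near 0 and 9.

set.seed(1)                               # assumed, for reproducibility
x <- rchisq1(100, ncp = rep(c(0, 9), 50))
d <- mixing(x)                            # default: dist="chisq", method="nnm"
print(d)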
separable <- function(data, x, ne = 0, thresh = 0.10,
                      dist = c("chisq","norm") )
{
    if (length(x) != 1) stop("length(x) must be one.")
    dist <- match.arg(dist)
    n <- length(data)
    mi <- data[1]
    if (dist == "norm")
        if( sum( 1 - pnorm( abs( x - data ) ) ) <= thresh ) T
        else F
    else {
        rx <- sqrt(x)
        p1 <- pnorm( abs(rx - sqrt(data)) )
        p2 <- pnorm( abs(rx - seq(0, sqrt(mi), len=3)) )
        if( sum( 1 - p1, ne/3 * (1 - p2) ) <= thresh ) T
        else F
    }
}
pmixture <- function(x, d, dist="chisq")
# CDF of the mixture at the points x. (Only the chi-square case is
# implemented here; the dist argument is currently unused.)
{
    n <- length(x)
    p <- rep(0, len=n)
    for( i in 1:length(d[,1]) )
        p <- p + pchisq1(x, d[i,1]) * d[i,2]
    p
}
plotmix <- function(d, xlim, ylim=c(0,1), dist="chisq", ...)
{
    if( missing(xlim) ) {
        mi <- min(d[,1])
        ma <- max(d[,1])
        low <- max(0, sqrt(mi)-3)^2
        high <- (sqrt(ma) + 3)^2
        xlim <- c(low, high)
    }
    x <- seq(xlim[1], xlim[2], len=100)
    y <- pmixture(x, d, dist=dist)
    plot(x, y, type="l", xlim=xlim, ylim=ylim, ...)
}
pace.q:
# ---------------------------------------------------------------
#
#  Functions for pace regression
#
# ---------------------------------------------------------------
contrib <- function(a, as)
{
    as^2 - (as - a)^2
}
bhfun <- function(A, As)
{
    bhrfun(sqrt(A), sqrt(As))
}

bhrfun <- function(a, as)
# hrfun without being divided by (2*a)
{
    contrib(a,as) * dnorm(a,as) + contrib(-a,as) * dnorm(-a,as)
}
hfun <- function(A, As)
{
    hrfun(sqrt(A), sqrt(As))
}

hrfun <- function(a, as)
{
    ( contrib(a,as)*dnorm(a,as) + contrib(-a,as)*dnorm(-a,as) ) / (2*a)
}
hmixfun <- function(A, d)
{
    n <- length(A)
    h <- rep(0,n)
    rc <- sqrt(d[,1])
    for( i in 1:n )
        if( A[i] != 0 ) h[i] <- h[i] + hrfun(sqrt(A[i]),rc) %*% d[,2]
    h
}
bffun <- function(A, As)
{
    bfrfun(sqrt(A), sqrt(As))
}

bfrfun <- function(a, as)
# frfun() without being divided by (2*a)
{
    dnorm(a, as) + dnorm(-a, as)
}
ffun <- function(A, As)
{
    frfun(sqrt(A), sqrt(As))
}

frfun <- function(a, as)
{
    ( dnorm(a, as) + dnorm(-a, as) ) / (2*a)
}
fmixfun <- function(A, d)
{
    n <- length(A)
    f <- rep(0,n)
    rc <- sqrt(d[,1])
    for( i in 1:n )
        f[i] <- f[i] + frfun(sqrt(A[i]),rc) %*% d[,2]
    f
}
pace2 <- function(A, d)
{
    k <- length(A)
    con <- cumsum( hmixfun(A, d) / fmixfun(A, d) )
    ma <- max(0,con)
    ima <- match(ma, con, nomatch=0)    # number of positive a's
    if(ima < k) A[(ima+1):k] <- 0       # if not the last one
    A
}
pace4 <- function(A, d)
{
    con <- hmixfun(A, d) / fmixfun(A, d)
    A[con <= 0] <- 0
    A
}
pace6 <- function(A, d, t = 0.5)
{
    for( i in 1:length(A) ) {
        de <- sum( bffun(A[i], d[,1]) * d[,2] )
        if(de > 0)
            A[i] <- (sum(sqrt(d[,1]) * bffun(A[i], d[,1])*d[,2])/de)^2
    }
    A[A<=t] <- 0
    A
}
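To illustrate pace6() in isolation (an assumed example, not from the thesis experiments): with a two-point mixing distribution placing equal mass at 0 and 25, a small dimensional contribution is driven to zero while a large one is shrunk toward the cluster at 25.

d <- disctfun(c(0, 25), c(0.5, 0.5))   # equal mass at 0 and 25
pace6(c(0.5, 30), d)                   # approximately c(0, 25)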
pace <- function(x, y, va, yname=NULL, ycol=0, pvt=NULL, kp=0,
                 intercept=T, ne=0,
                 method=c("pace6","pace2","pace4","olsc","ols",
                          "full","null","aic","bic","ric","cic"),
                 tau=2, tol=1e-6)
{
    method <- match.arg(method)
    if ( is.data.frame(x) ) x <- as.matrix(x)
    ncol <- dim(x)[2]
    if(is.null(dimnames(x)[[2]]))
        dimnames(x) <- list(dimnames(x)[[1]], paste("X",1:ncol,sep=""))
    namelist <- dimnames(x)[[2]]
    if ( missing(y) ) {
        if(ycol == 0 || ycol == "last") ycol <- ncol
        y <- x[,ycol]
        if( is.null(yname) ) yname <- dimnames(x)[[2]][ycol]
        x <- x[,-ycol]
        if ( ! is.matrix(x) ) x <- as.matrix(x)
        dimnames(x) <- list(NULL, namelist[-ycol])
    }
    if ( ! is.matrix(y) ) y <- as.matrix(y)
    if (intercept) {
        x <- cbind(1,x)
        ki <- 1
        kp <- kp + 1
        if(! is.null(pvt)) pvt <- c(1, pvt+1)
        dimnames(x)[[2]][1] <- "(Intercept)"
    }
    else ki <- 0
    m <- as.integer(dim(x)[1])
    n <- as.integer(dim(x)[2])
    mn <- min(m,n)
    if ( m != dim(y)[1] )
        stop("dim(x)[1] != dim(y)[1]")

    # first kp columns in pvt[] will maintain the same order
    kp <- max(ki,kp)

    # QR transformation
    if(method == "full" ) ans <- lsqr(x,y,pvt,tol=0,ks=kp)
    else ans <- lsqr(x,y,pvt,ks=kp,tol=tol)
    kc <- ks <- ans$ks    # kc is the pseudo-rank (unchangeable);
                          # ks is used as the model dimensionality
    b2 <- ans$b^2
    if(missing(va)) {
        if ( m - ks <= 0 ) va <- 0    # data wholly explained
        else va <- sum(b2[(ks+1):m])/(m-ks)    # OLS variance estimator
    }
    At <- A <- d <- NULL
    mi <- min(b2[ki:ks])
    if(va > max(1e-10, mi * 1e-10) ) {    # Is va tiny?
        A <- b2[1:ks] / va
        names(A) <- dimnames(x)[[2]][ans$pvt[1:ks]]
    }
    ne <- ne + (mn - ks)

    # Check (asymptotically) olsc-equivalent methods.
    switch( method,
        "full" = ,    # full is slightly different from ols in
                      # that collinearity is not excluded
        "ols" = {     # ols = olsc(0)
            method <- "olsc"
            tau <- 0
        },
        "aic" = {     # aic = olsc(2)
            method <- "olsc"
            tau <- 2
        },
        "bic" = {     # bic = olsc(log(n))
            method <- "olsc"
            tau <- log(m)
        },
        "ric" = {     # ric = olsc(2log(k))
            method <- "olsc"
            tau <- 2 * log(ks)
        },
        "null" = {    # null = olsc(+inf)
            method <- "olsc"
            tau <- 2*max(b2) + 1e20
            ks <- kp
        })

    switch( method,    # default method is pace2,4,6
        "olsc" = {     # olsc-equivalent methods
            if( ! is.null(A) ) {
                At <- A
                names(At) <- names(A)
                if(kp < ks)
                    ksp <- match(max(0, cum <- cumsum(At[(kp+1):ks]-tau)),
                                 cum, nomatch=0)
                else ksp <- 0
                if(kp+ksp+1 <= ks) At[(kp+ksp+1):ks] <- 0
                ks <- kp + ksp
                ans$b[1:ks] <- sign(ans$b[1:ks]) * sqrt(At[1:ks]*va)
            }
        },
        "cic" = {      # cic method
            if( ! is.null(A) ) {
                At <- A
                names(At) <- names(A)
                kl <- ks - kp
                if(kp < ks)
                    ksp <- match(max(0, cum <- cumsum(At[(kp+1):ks] -
                                                      4*log((ne+kl)/1:kl))),
                                 cum, nomatch=0)
                else ksp <- 0
                if(kp+ksp+1 <= ks) At[(kp+ksp+1):ks] <- 0
                ks <- kp + ksp
                ans$b[1:ks] <- sign(ans$b[1:ks]) * sqrt(At[1:ks] * va)
            }
        },
        if( ! is.null(A) && ks > 1) {    # runs when va is not tiny
            At <- A
            names(At) <- names(A)
            d <- mixing(A[2:ks], ne=ne)    # mixing distribution
            At[2:ks] <- get(method)(A[2:ks], d)
            # get rid of zero coefficients
            ks <- ks - match(F, rev(At[2:ks] < 0.001),
                             nomatch = ks) + 1
            ans$b[1:ks] <- sign(ans$b[1:ks]) * sqrt(At[1:ks] * va)
        }
        else {    # delete variables that explain little data
            ave <- mean(b2[1:ks])
            ks <- ks - match(T, rev( b2[1:ks] > 1e-5 * ave),
                             nomatch=0) + 1
        })

    coef <- rsolve(ans$a[1:ks,], ans$b[1:ks,], ans$pvt[1:ks], ks)
    i <- sort.list(ans$pvt[1:ks])
    pvt <- ans$pvt[i]
    coef <- coef[i]
    names(coef) <- dimnames(x)[[2]][pvt[1:ks]]
    if( ! is.null(yname) ) {
        coef <- as.matrix(coef)
        dimnames(coef)[[2]] <- yname
    }
    fit <- list(coef = coef, pvt = pvt, ycol = ycol, intercept =
                intercept, va = va, ks = ks, kc = kc, ndims = n,
                nobs = m, ne = ne, call = match.call() )
    if( !is.null(A)) fit$A <- A
    if( !is.null(At)) fit$At <- At
    if( !is.null(d)) fit$mixing <- unclass(d)
    class(fit) <- "pace"
    fit
}
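A hedged end-to-end sketch (assuming the compiled Fortran code is loaded): pace regression on a simulated design where only five of twenty candidate variables carry signal; the default method is "pace6".

set.seed(2)                                       # assumed seed
x <- matrix(rnorm(100 * 20), 100, 20)
y <- x %*% c(rep(2, 5), rep(0, 15)) + rnorm(100)
fit <- pace(x, y)                                 # most noise variables dropped
print(fit)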
print.pace <- function (p)
{
    cat("Call:\n")
    print(p$call)
    cat("\n")
    cat("Coefficients:\n")
    print(p$coef)
    cat("\n")
    cat("Number of observations:", p$nobs, "\n")
    cat("Number of dimensions: ", length(p$coef), " (out of ", p$kc,
        ", plus ", p$ndims - p$kc, " abandoned for collinearity)\n",
        sep="")
}
predict.pace <- function(p, x, y)
{
    if( ! is.matrix(x) ) x <- as.matrix(x)
    n <- dim(x)[1]
    k <- length(p$pvt)
    if( p$intercept )
        if(k == 1) pred <- rep(p$coef[1], n)
        else {
            pred <- p$coef[1] + x[,p$pvt[2:k]-1,drop=FALSE] %*%
                    p$coef[2:k]
        }
    else pred <- x[,p$pvt] %*% p$coef
    fit <- list(pred = pred)
    if(! missing(y)) fit$resid <- y - pred
    else if(p$ycol >= 1 && p$ycol <= ncol(x))
        fit$resid <- x[,p$ycol] - pred
    if( length(fit$resid) != 0 ) names(fit$resid) <- NULL
    fit
}
util.q:
# ---------------------------------------------------------------
#
#  Utility functions
#
# ---------------------------------------------------------------
is.sorted <- function(data)
{
    if( is.na( match("sorted", class(data)) ) ) F
    else T
}
is.normed <- function(data)
{
    if( is.na( match("normed", class(data)) ) ) F
    else T
}
normalize <- function(x,...) UseMethod("normalize")
rchisq1 <- function(n, ncp = 0)
# rchisq() in S-PLUS (up to version 5.1) does not work for the
# non-central chi-square distribution.
{
    rnorm( n, mean = sqrt(ncp) )^2
}
pchisq1 <- function(x, ncp = 0)
# The S+ language (both versions 3.4 and 5.1) fails to return 0
# for, say, pchisq(0,1,1). The same function in R seems to work
# fine. In the following, the probability value is set to zero
# when the location is at zero.
# Here we only consider the situation df = 1, using a new
# function name pchisq1. Also, efficiency is taken into account,
# within the application's accuracy requirement.
{
    i <- x == 0 | sqrt(x) - sqrt(ncp) < -10
    j <- sqrt(x) - sqrt(ncp) > 10
    p <- rep(0, length(x))
    p[j] <- 1
    p[!i & !j] <- pchisq(x[!i & !j], 1, ncp)
    p
}
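For example (illustrative only), pchisq1() returns exact 0 and 1 in the far tails and defers to pchisq() in between:

pchisq1(c(0, 1, 400), ncp = 1)    # 0, pchisq(1,1,1), 1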
ls.f:
c ----------------------------------------------------------
      subroutine upw2(a,mda,j,k,m,s2)
      double precision a(mda,*), s2, e2
      integer j,k,m

      e2 = a(k,j) * a(k,j)
      if( e2 > .1 * s2 ) then
         s2 = sum2(a,mda,j,k+1,m)
      else
         s2 = s2 - e2
      endif
      end
c ---------------------------------- ----- ------ ----- ----- -c Constructs single Householder orthogonal transformationc Input: a(mda,*) ; matrix Ac mda ; the first dimension of Ac j ; the j-th columnc k ; the k-th elementc s2 ; s2 = dˆ2c Output: a(k,j) ; = a(k,j) - dc d ; diagonal element of Rc q ; = - u’u/2, where u is the transformationc ; vector. Hence should not be zero.c ---------------------------------- ----- ------ ----- ----- -
subroutine hh1(a,mda,j,k,d,q,s2)integer mda,j,kdouble precision a(mda,*),s2,d,q
if (a(k,j) .ge. 0) thend = - dsqrt(s2)
elsed = dsqrt(s2)
endifa(k,j) = a(k,j) - dq = a(k,j) * dend
c ---------------------------------- ----- ------ ----- ----- -c Performs single Householder transformation on one columnc of a matrixcc Input: a(mda,)c m ; a(k..m,j) stores the transformationc ; vectur uc j ; the j-th columnc k ; the k-th elementc q ; q = -u’u/2; must be negativec b(mdb,) ; stores the to-be-transformed vectorc l ; b(k..m,l) to be transformedc Output: b(k..m,l) ; transformed part of bc ---------------------------------- ----- ------ ----- ----- -
subroutine hh2(a,mda,m,j,k,q,b,mdb,l)
183
integer mda,m,k,j,mdb,ldouble precision a(mda,*),q,b(mdb,*),s,alpha
s = 0.0do 10 i=k,m
s = s + a(i,j) * b(i,l)10 continue
alpha = s / qdo 20 i=k,m
b(i,l) = b(i,l) + alpha * a(i,j)20 continue
end
c ----------------------------------------------------------
c Constructs Givens orthogonal rotation matrix
c
c Algorithm refers to P58, Chapter 10 of
c C. L. Lawson and R. J. Hanson (1974).
c "Solving Least Squares Problems". Prentice-Hall, Inc.
c ----------------------------------------------------------
      subroutine gg1(v1, v2, c, s)
      double precision v1, v2, c, s
      double precision ma, r

      ma = max(dabs(v1), dabs(v2))
      if (ma .eq. 0.0) then
         c = 1.0
         s = 0.0
      else
         r = ma * dsqrt((v1/ma)**2 + (v2/ma)**2)
         c = v1/r
         s = v2/r
      endif
      end
c ----------------------------------------------------------
c Performs Givens orthogonal transformation
c
c Algorithm referring to P59, Chapter 10 of
c C. L. Lawson and R. J. Hanson (1974).
c "Solving Least Squares Problems". Prentice-Hall, Inc.
c ----------------------------------------------------------
      subroutine gg2(c, s, z1, z2)
      double precision c, s, z1, z2
      double precision w

      w = c * z1 + s * z2
      z2 = -s * z1 + c * z2
      z1 = w
      end
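An R illustration (not part of ls.f) of what gg1/gg2 compute: a plane rotation that zeroes the second component of a 2-vector.

v1 <- 3; v2 <- 4
r <- sqrt(v1^2 + v2^2)                # gg1: c = v1/r, s = v2/r
c(v1/r * v1 + v2/r * v2,              # gg2: rotated z1 = r = 5
  -v2/r * v1 + v1/r * v2)             #      rotated z2 = 0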
c ----------------------------------------------------------
      function sum(a,mda,j,l,h)
      integer mda,j,l,h,i
      double precision a(mda,*)

      sum = 0.0
      do 10 i=l,h
         sum = sum + a(i,j)
10    continue
      end
c ----------------------------------------------------------
      function sum2(a,mda,j,l,h)
      integer mda,j,l,h,i
      double precision a(mda,*)

      sum2 = 0.0
      do 10 i=l,h
         sum2 = sum2 + a(i,j) * a(i,j)
10    continue
      end
c ----------------------------------------------------------
c stepwise least squares QR decomposition for equation
c                  a x = b
c i.e., a is QR-decomposed and b is QR-transformed
c accordingly. Stepwise here means each calling of this
c function either adds or deletes a column vector, into or
c from the vector pvt(0..ks-1).
c
c It is the caller's responsibility that the added column is
c nonsingular
c
c Input:  a(mda,*)  ; matrix a
c         m,n       ; (m x n) elements are occupied
c                   ; it is likely mda = m
c         b(mdb,*)  ; vector or matrix b
c                   ; if is matrix, only the first column
c                   ; may help the transformation
c         nb        ; (m x nb) elements are occupied
c         pvt(kp)   ; pivoting column indices
c         kp        ; kp elements in pvt() are considered
c         ks        ; on input, the first ks columns indexed
c                   ; in pvt() are already QR-transformed.
c         j         ; the pvt(j)-th column of a will be added
c                   ; (if add = 1) or deleted (if add = -1)
c         w(2,*)    ; w(1,pvt) saves the total sum of squares
c                   ; of the used column vectors.
c                   ; w(2,pvt) saves the unexplained sum of
c                   ; squares.
c         add       ; = 1, adds a new column and does
c                   ; QR transformation.
c                   ; = -1, deletes the column
c Output: a(,)      ; QR transformed
c         b(,)      ; QR transformed
c         pvt()     ;
c         ks        ; ks = ks + 1, if add = 1
c                   ; ks = ks - 1, if add = -1
c         w(2,*)    ; changed accordingly
c ----------------------------------------------------------
      subroutine steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,j,w,add)

      integer mda,m,n,mdb,nb,pvt(*),j,ks,kp
      double precision a(mda,*),b(mdb,*),w(2,*),q,d,c,s
      integer pj,k,add,pk

      if( add .eq. 1 ) then
c     -- add pvt(j)-th column
         ks = ks + 1
         pj = pvt(j)
         pvt(j) = pvt(ks)
         pvt(ks) = pj

         call hh1(a,mda,pj,ks,d,q,w(2,pj))

         do 20 k = ks+1,kp
            pk = pvt(k)
            call hh2( a,mda,m,pj,ks,q,a,mda,pk )
            call upw2( a,mda,pk,ks,m,w(2,pk) )
20       continue
         do 30 k = 1,nb
            call hh2( a,mda,m,pj,ks,q,b,mdb,k )
            call upw2( b,mdb,k,ks,m,w(2,n+k) )
30       continue
         w(2, pj) = 0.0
         a(ks, pj) = d
         do 40 k = ks+1, m
            a(k,pj) = 0.0
40       continue
50       continue
      else
c     -- delete pvt(j)-th column (swap with pvt(ks) and
c     -- ks decrements)
         ks = ks - 1
         pj = pvt(j)
         do 110 i = j, ks
            pvt(i) = pvt(i+1)
110      continue
         pvt(ks+1) = pj
c     -- givens rotation --
         do 140 i=j, ks
            call gg1( a(i,pvt(i)), a(i+1,pvt(i)), c, s )
            do 120 l=i, kp
               call gg2( c, s, a(i,pvt(l)), a(i+1,pvt(l)) )
120         continue
            do 130 l = 1, nb
               call gg2( c, s, b(i,l), b(i+1,l))
130         continue
140      continue
         do 145 j = ks+1, kp
            pj = pvt(j)
            w(2,pj) = w(2,pj) + a(ks+1,pj) * a(ks+1,pj)
145      continue
         do 150 l = 1, nb
            w(2,n+l) = w(2,n+l) + b(ks+1,l) * b(ks+1,l)
150      continue
      endif
      end
c ----------------------------------------------------------
c The pvt(j)-th column is chosen
c
c choose the column with the largest diagonal element
c
c type = 1  VIF-based, otherwise based on diagonal element
c
c Input:  pvt ; column vector
c         ks  ; number of columns already chosen
c         j   ; on output, the chosen column (0 if none)
c ----------------------------------------------------------
      subroutine colj(pvt,ks,j,ln,w,tol,type)
      integer pvt(*),ks,j,l,ln,type
      double precision w(2,*),ma,now,tol

      ma = tol * 0.1
      do 10 l=ks+1,ln
         if (ks .eq. 0) type = 2
         now = colmax( w, pvt(l), type )
         if (now .le. ma) goto 10
         ma = now
         j = l
10    continue
      if(ma .lt. tol) j = 0
      end
c ----------------------------------------------------------
c returns the pivoting value, being subject to the
c examination < tol
c Input:  w(2,*)  ; w(1,j), the total sum of squares
c                 ; w(2,j), the unexplained sum of
c                 ; squares
c         j       ; the j-th column vector
c         type    ; = 1, VIF-based
c                 ; = 2, based on the diagonal element
c Output: colmax  ; value
c ----------------------------------------------------------
      function colmax(w,j,type)
      integer j, type
      double precision w(2,*)

      if (type .eq. 1) then
         if(w(1,j) .le. 0.0) then
            colmax = 0.0
         else
            colmax = w(2,j) / w(1,j)
         endif
      else
         if(w(2,j) .le. 0.0) then
            colmax = 0.0
         else
            colmax = dsqrt( w(2,j) )
         endif
      endif
      end
c ----------------------------------------------------------
c QR decomposition for the linear regression problem  a x = b
c
c Important note: on entry, the vector pvt[1:np] is given (and
c supposedly correct); the first ks (could be zero) elements
c in pvt(*) will always be kept inside the final selection.
c This is useful when, for example, the constant term is
c always included in the final model.
c ----------------------------------------------------------
      subroutine lsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,w,tol,type)
      integer pvt(*),tmp,ld
      integer mda,m,n,mdb,nb,lp,kp,ks,type,pj
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),tol,ma

      if(tol .lt. 1e-20) tol = 1e-20
      ld = min(m,kp)

c     computes the squared sum for each column of a and b
      do 10 j=1,kp
         pj = pvt(j)
         w(1,pj) = sum2(a,mda,pj,1,m)
         w(2,pj) = w(1,pj)
10    continue
      do 15 j=1,nb
         w(1,n+j) = sum2(b,mdb,j,1,m)
         w(2,n+j) = w(1,n+j)
15    continue

c     The first ks columns defined in pvt(*) are pre-chosen.
c     However, they are subject to rank examination.
      lp = ks
      ks = 0
      do 17 j=1,lp
         pj = pvt(j)
         ma = colmax(w,pj,type)
         if(ma > tol) then
            call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,j,w,1)
            if ( ks .ge. ld) go to 18
         endif
17    continue
18    continue

c     lp is the number of pre-chosen column variables.
      lp = ks
      ld = min(ld,kp)

c     the remaining columns defined in pvt(*) are subject to
c     rank examination
      do 50 k=ks+1,ld
c     -- choose the pivoting column j --
         call colj(pvt,ks,j,kp,w,tol,type)
         if(j .eq. 0) then
            kp = k - 1
            go to 60
         endif
         call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,j,w,1)
50    continue
60    continue

c     if ks is small, or less than n, forward is unnecessary.
c     Forward selection is only necessary when backward is
c     insufficient, such as when there are too many (noncollinear)
c     variables.
      if( kp .gt. n .or. kp .gt. 200) then
         call forward(a,mda,ks,n,b,mdb,nb,pvt,kp,ks,lp,w)
      endif

      call backward(a,mda,ks,n,b,mdb,nb,pvt,kp,ks,lp,w)

      end
c ----------------------------------------------------------
c forward : forward ordering based on data explanation
c                  a x = b
c
c Input:  a(mda,*)  ; matrix to be QR-transformed
c         m,n       ; (m x n) elements are occupied
c         b(mdb,nb) ; (m x nb) elements are occupied
c         pvt(kp)   ; indices of pivoting columns
c         kp        ; the first kp elements in pvt()
c                   ; are to be used
c         ks        ; pseudo-rank of a
c         lp        ; the first lp elements indexed in pvt()
c                   ; are already QR-transformed and will not
c                   ; change.
c         w(2,n+nb) ; w(1,) stores the total sum of each
c                   ; column of a and b.
c                   ; w(2,) stores the unexplained sum of
c                   ; squares
c Output: a(,)      ; Upper triangular matrix by
c                   ; QR-transformation
c         b(,)      ; QR-transformed
c         pvt()     ; re-ordered
c ----------------------------------------------------------
      subroutine forward(a,mda,m,n,b,mdb,nb,pvt,kp,ks,lp,w)
      integer pvt(*),tmp,ld
      integer mda,m,n,mdb,nb,kp,ks,pj,jma,jth,li
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),ma,sum,r

      do 1 j=lp+1,kp
         pj = pvt(j)
         w(2,pj) = sum2(a,mda,pj,lp+1,m)
1     continue

      do 2 j=1,nb
         w(2,n+j) = sum2(b,mdb,j,lp+1,m)
2     continue

      do 130 i=lp+1,ks
         ma = -1e20
         jma = 0
         do 20 j=i,kp
c     -- choose the pivoting column j based on data explanation --
            pj = pvt(j)
            sum = 0.0
            do 10 k = i, n
               sum = sum + a(k,pj) * b(k,1)
10          continue
c     w(2,pj) should not be zero
            r = sum * sum / w(2,pj)
            if(r > ma) then
               ma = r
               jma = j
            endif
20       continue
         if(jma .eq. 0) then
            ks = i
            goto 200
         else
            call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,i-1,jma,w,1)
         endif
130   continue
200   continue

      end
c ----------------------------------------------------------
c backward variable elimination
c ----------------------------------------------------------
      subroutine backward(a,mda,m,n,b,mdb,nb,pvt,kp,ks,lp,w)
      integer pvt(*),tmp,lp
      integer mda,m,n,mdb,nb,kp,ks,jmi,jth,kscopy,pj
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),ma,absd,sum
      double precision xxx(ks+nb),s2

c     backward elimination; the rank ks is not supposed to change
      kscopy = ks
      do 50 j = ks, lp+1, -1
         call coljd(a,mda,m,n,b,mdb,nb,pvt,kp,ks,lp+1,w,jmi)
         call steplsqr(a,mda,m,n,b,mdb,nb,pvt,kp,ks,jmi,w,-1)
50    continue

c     restores the rank and data explanation
      ks = kscopy
      do 100 j=lp+1,kp
         pj = pvt(j)
         s2 = sum2(a,mda,pj,lp+1,ks)
         w(2,pj) = w(2,pj) - s2
100   continue

      do 110 j=1,nb
         s2 = sum2(b,mdb,j,lp+1,ks)
         w(2,n+j) = w(2,n+j) - s2
110   continue

      end
c ----------------------------------------------------------
      subroutine coljd(a,mda,m,n,b,mdb,nb,pvt,kp,ks,jth,w,jmi)
      integer pvt(*),tmp,ld
      integer mda,m,n,mdb,nb,kp,ks,pj,jmi,jth
      double precision a(mda,*), b(mdb,*)
      double precision w(2,*),mi,val

      mi = dabs( b(ks,1) )
      jmi = ks
      do 50 j = jth, ks-1
         call coljdval(a,mda,b,mdb,pvt,ks,j,val)
         if (val .le. mi) then
            mi = val
            jmi = j
         endif
50    continue
      end
c ----------------------------------------------------------
      subroutine coljdval(a,mda,b,mdb,pvt,ks,jth,val)
      integer pvt(*)
      integer mda,mdb,ks,jth
      double precision a(mda,*), b(mdb,*)
      double precision val,c,s
      double precision xxx(ks)

      do 10 j = jth+1, ks
         xxx(j) = a(jth,pvt(j))
10    continue
      val = b(jth,1)
      do 50 j = jth+1, ks
         call gg1( xxx(j), a(j,pvt(j)), c, s )
         do 30 l=j+1, ks
            xxx(l) = -s * xxx(l) + c * a(j,pvt(l))
30       continue
         val = -s * val + c * b(j,1)
50    continue
      val = dabs( val )
      end
c ----------------------------------------------------------
      subroutine rsolve(a,mda,b,mdb,nb,pvt,ks)
      integer mda,mdb,nb,ks
      integer pvt(*)
      double precision a(mda,*),b(mdb,*)

      if(ks .le. 0) return
      do 50 k=1,nb
         b(ks,k) = b(ks,k) / a(ks,pvt(ks))
         do 40 i=ks-1,1,-1
            sum = 0.0
            do 30 j=i+1,ks
               sum = sum + a(i,pvt(j)) * b(j,k)
30          continue
            b(i,k) = (b(i,k) - sum) / a(i,pvt(i))
40       continue
50    continue
      end
References
Aha, D. W., Kibler, D. and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Akaike, H. (1969). Fitting autoregressive models for prediction. Ann. Inst. Statist. Math., 21, 243–247.
Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22, 203–217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (pp. 267–281). Akademiai Kiado, Budapest. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 610–624. Springer-Verlag.)
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Croz, J. D., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D. (1999). LAPACK Users' Guide (3rd Ed.). Philadelphia, PA: SIAM Publications.
Atkeson, C., Moore, A. and Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11–73.
Barbe, P. (1998). Statistical analysis of mixtures and the empirical probability measure. Acta Applicandae Mathematicae, 50(3), 253–340.
Barnard, G. A. (1949). Statistical inference (with discussion). J. Roy. Statist. Soc., Ser. B, 11, 115–139.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 53, 370–418. Reprinted in (1958) Biometrika, 45, 293–315.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd Ed.). New York: Springer-Verlag.
Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). J. Amer. Statist. Assoc., 57, 269–326. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 478–518. Springer-Verlag.)
Blake, C., Keogh, E. and Merz, C. J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine.
Blum, J. R. and Susarla, V. (1977). Estimation of a mixing distribution function. Ann. Probab., 5, 200–209.
Bohning, D. (1982). Convergence of Simar's algorithm for finding the maximum likelihood estimate of a compound Poisson process. Ann. Statist., 10, 1006–1008.
Bohning, D. (1985). Numerical estimation of a probability measure. Journal of Statistical Planning and Inference, 11, 57–69.
Bohning, D. (1986). A vertex-exchange-method in D-optimal design theory. Metrika, 33, 337–347.
Bohning, D. (1995). A review of reliable algorithms for the semi-parametric maximum likelihood estimator of a mixture distribution. Journal of Statistical Planning and Inference, 47, 5–28.
Bohning, D., Schlattmann, P. and Lindsay, B. (1992). C.A.MAN (computer assisted analysis of mixtures): Statistical algorithms. Biometrics, 48, 283–303.
Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association, 87, 738–754.
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression. The x-random case. International Statistical Review, 60, 291–319.
Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression (with discussion). Ann. Statist., 18, 471–538.
Burnham, K. P. and Anderson, D. R. (1998). Model selection and inference: a practical information-theoretic approach. New York: Springer.
Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. London: Chapman & Hall.
Carlin, B. P. and Louis, T. A. (2000). Empirical Bayes: Past, present and future. J. Am. Statist. Assoc. (To appear.)
Chatfield, C. (1995). Model uncertainty, data mining and statistical inference. J. Roy. Statist. Soc. Ser. A, 158, 419–466.
Choi, K. and Bulgren, W. B. (1968). An estimation procedure for mixtures of distributions. J. R. Statist. Soc. B, 30, 444–460.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.
Coope, I. D. and Watson, G. A. (1985). A projected Lagrangian algorithm for semi-infinite programming. Mathematical Programming, 32, 337–293.
Copas, J. B. (1969). Compound decisions and empirical Bayes. J. Roy. Statist. Soc. B, 31, 397–425.
Copas, J. B. (1983). Regression, prediction and shrinkage (with discussion). J. Roy. Statist. Soc. B, 45, 311–354.
Deely, J. J. and Kruse, R. L. (1968). Construction of sequences estimating the mixing distribution. Ann. Math. Statist., 39, 286–288.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., Ser. B, 39, 1–22.
Derksen, S. and Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282.
DerSimonian, R. (1986). Maximum likelihood estimation of a mixing distribution. J. Roy. Statist. Soc. Ser. C, 35, 302–309.
DerSimonian, R. (1990). Correction to algorithm AS 221: Maximum likelihood estimation of a mixing distribution. J. Roy. Statist. Soc. Ser. C, 39, 176.
Dongarra, J. J., Bunch, J., Moler, C. and Stewart, G. (1979). LINPACK Users' Guide. Philadelphia, PA: SIAM Publications.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81, 425–455.
Efron, B. (1998). R. A. Fisher in the 21st century. Statistical Science, 13, 95–126.
Efron, B. and Morris, C. N. (1971). Limiting the risk of Bayes and empirical Bayes estimators - Part I: The Bayes case. J. Amer. Statist. Assoc., 66, 807–815.
Efron, B. and Morris, C. N. (1972a). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika, 59, 335–347.
Efron, B. and Morris, C. N. (1972b). Limiting the risk of Bayes and empirical Bayes estimators - Part II: The empirical Bayes case. J. Amer. Statist. Assoc., 67, 130–139.
Efron, B. and Morris, C. N. (1973a). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc., Ser. B, 35, 379–421.
Efron, B. and Morris, C. N. (1973b). Stein's estimation rule and its competitors - An empirical Bayes approach. J. Amer. Statist. Assoc., 68, 117–130.
Efron, B. and Morris, C. N. (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc., 70, 311–319.
Efron, B. and Morris, C. N. (1977). Stein's paradox in statistics. Scientific American, 236 (May), 119–127.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. Chapman & Hall.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (Eds.). (1996). Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA.
Fedorov, V. V. (1972). Theory of Optimal Experiments. New York: Academic Press.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London, Ser. A, 222, 309–368. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 11–44. Springer-Verlag.)
Fisher, R. A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc., 22, 700–725.
Foster, D. and George, E. (1994). The risk inflation criterion for multiple regression. Annals of Statistics, 22, 1947–1975.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35, 109–148.
Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist., 19, 1–141.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. JASA, 76, 817–823.
Friedman, J. H. (1997). Data mining and statistics: What's the connection? In Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics. The Interface Foundation of North America. May, Houston, Texas.
Galambos, J. (1995). Advanced Probability Theory. A Series of Textbooks and Reference Books/10. Marcel Dekker, Inc.
Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke, 7. (English transl.: C. H. Davis. Dover, New York, 1963.)
Hall, M. A. (1999). Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato.
Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. R. Statist. Soc. B, 41, 190–195.
Hartigan, J. A. (1996). Introduction. In P. Arabie, L. J. Hubert and G. D. Soete (Eds.), Clustering and Classification (pp. 1–3). World Scientific Publ., River Edge, NJ.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall.
Hoerl, R. W., Schuenemeyer, J. H. and Hoerl, A. E. (1986). A simulation of biased estimation and subset selection regression techniques. Technometrics, 9, 269–380.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. Fourth Berkeley Symposium on Math. Statist. Prob., 1 (pp. 311–319). University of California Press. (Reprinted in S. Kotz and N. L. Johnson (Eds.) (1992). Breakthroughs in Statistics. Volume 1, 443–460. Springer-Verlag.)
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27, 886–906.
Kilpatrick, D. and Cameron-Jones, M. (1998). Numeric prediction using instance-based learning with encoding length selection. In N. Kasabov, R. Kozma, K. Ko, R. O'Shea, G. Coghill and T. Gedeon (Eds.), Progress in Connectionist-Based Information Systems, Volume 2 (pp. 984–987). Springer-Verlag.
Kotz, S. and Johnson, N. L. (Eds.). (1992). Breakthroughs in Statistics, Volume 1. Springer-Verlag.
Kullback, S. (1968). Information Theory and Statistics (2nd Ed.). New York: Dover. Reprinted in 1978, Gloucester, MA: Peter Smith.
Kullback, S. and Leibler, R. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79–86.
Laird, N. M. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. J. Amer. Statist. Assoc., 73, 805–811.
Lawson, C. L. and Hanson, R. J. (1974, 1995). Solving Least Squares Problems. Prentice-Hall, Inc.
Legendre, A. M. (1805). Nouvelles methodes pour la determination des orbites des cometes. (Appendix: Sur la methode des moindres carres.)
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation (2nd Ed.). New York: Springer-Verlag.
Lesperance, M. L. and Kalbfleisch, J. D. (1992). An algorithm for computing the nonparametric MLE of a mixing distribution. J. Amer. Statist. Assoc., 87, 120–126.
Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15, 958–975.
Lindsay, B. G. (1983). The geometry of mixture likelihoods: A general theory. Ann. Statist., 11, 86–94.
Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications, Volume 5 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics: Hayward, CA.
Lindsay, B. G. and Lesperance, M. L. (1995). A review of semiparametric mixture models. Journal of Statistical Planning & Inference, 47, 29–39.
Linhart, H. and Zucchini, W. (1989). Model Selection. New York: Wiley.
Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers.
Loader, C. (1999). Local Regression and Likelihood. Statistics and Computing Series. Springer.
Macdonald, P. D. M. (1971). Comment on a paper by Choi and Bulgren. J. R. Statist. Soc. B, 33, 326–329.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.
Maritz, J. S. and Lwin, T. (1989). Empirical Bayes Methods (2nd Ed.). Chapman and Hall.
McLachlan, G. and Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McQuarrie, A. D. and Tsai, C.-L. (1998). Regression and Time Series Model Selection. Singapore; River Edge, N.J.: World Scientific.
Miller, A. J. (1990). Subset Selection in Regression, Volume 40 of Monographs on Statistics and Applied Probability. Chapman & Hall.
Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications. J. Amer. Statist. Assoc., 78, 47–59.
Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making. Rev. Inst. Internat. Statist., 30, 11–27.
Polya, G. (1920). Ueber den zentralen Grenzwertsatz der Wahrscheinlichkeitstheorie und das Momentenproblem. Math. Zeit., 8, 171.
Prakasa Rao, B. L. (1992). Identifiability in Stochastic Models: Characterization of Probability Distributions. Academic Press, New York.
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, Calif.: Morgan Kaufmann Publishers.
Rao, C. R. and Wu, Y. (1989). A strongly consistent procedure for model selection in a regression problem. Biometrika, 76(2), 369–374.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist., 11, 416–431.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, Volume 15 of Series in Computer Science. World Scientific Publishing Co.
Robbins, H. (1950). A generalization of the method of maximum likelihood: Estimating a mixing distribution (abstract). Ann. Math. Statist., 21, 314–315.
Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Proc. Second Berkeley Symposium on Math. Statist. and Prob., 1 (pp. 131–148). University of California Press.
Robbins, H. (1955). An empirical Bayes approach to statistics. In Proc. Third Berkeley Symposium on Math. Statist. and Prob., 1 (pp. 157–164). University of California Press.
Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist., 35, 1–20.
Robbins, H. (1983). Some thoughts on empirical Bayes estimation. Ann. Statist., 11, 713–723.
Roecker, E. B. (1991). Prediction error and its estimation for subset-selected models. Technometrics, 33, 459–468.
Schott, J. R. (1997). Matrix Analysis for Statistics. New York: Wiley.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Sclove, S. L. (1968). Improved estimators for coefficients in linear regression. J. Amer. Statist. Assoc., 63, 596–606.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 398–403.
Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88, 486–494.
Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc., 91, 655–665.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag.
Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika, 63, 117–126.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68, 45–54.
Shibata, R. (1986). Consistency of model selection and parameter estimation. In J. Gani and M. Priestley (Eds.), Essays in Time Series and Applied Processes (pp. 27–41). J. Appl. Prob.
Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. Third Berkeley Symp. on Math. Statist. and Prob., 1 (pp. 197–206). University of California Press.
Teicher, H. (1960). On the mixture of distributions. Ann. Math. Statist., 31, 55–57.
Teicher, H. (1961). Identifiability of mixtures. Ann. Math. Statist., 32, 244–248.
Teicher, H. (1963). Identifiability of finite mixtures. Ann. Math. Statist., 34, 1265–1269.
Thompson, M. L. (1978). Selection of variables in multiple regression, Part 1 and Part 2. International Statistical Review, 46, 1–21 and 129–146.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.
Tibshirani, R. and Knight, K. (1997). The covariance inflation criterion for model selection. Technical report, Department of Statistics, Stanford University.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons.
von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Math. Annalen, 100, 295–320.
von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ.
Wald, A. (1939). Contributions to the theory of statistical estimation and hypothesis testing. Ann. Math. Statist., 10, 299–326.
Wald, A. (1950). Statistical Decision Functions. Wiley, New York.
Witten, I. H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
Wu, C. F. (1978a). Some algorithmic aspects of the theory of optimal designs. Ann. Statist., 6, 1286–1301.
Wu, C. F. (1978b). Some iterative procedures for generating nonsingular optimal designs. Comm. Statist., 7, 1399–1412.
Zhang, P. (1993). Model selection via multifold cross-validation. Ann. Statist., 21, 299–313.
Zhao, L. C., Krishnaiah, P. R. and Bai, Z. D. (1986). On the detection of the number of signals in the presence of white noise. Journal of Multivariate Analysis, 20, 1–25.