A reprint from American Scientist, Volume 98 (May-June 2010), the magazine of Sigma Xi, The Scientific Research Society.

The Bootstrap

Cosma Shalizi

Statistics is the branch of applied mathematics that studies ways of drawing inferences from limited and imperfect data. We may want to know how a neuron in a rat's brain responds when one of its whiskers gets tweaked, or how many rats live in Manhattan, or how high the water will get under the Brooklyn Bridge, or the typical course of daily temperatures in the city over the year. We have some data on all of these things, but we know that our data are incomplete, and experience tells us that repeating our experiments or observations, even taking great care to replicate the conditions, gives more or less different answers every time. It is foolish to treat any inference from only the data in hand as certain.

If all data sources were totally capricious, there'd be nothing to do beyond piously qualifying every conclusion with "but we could be wrong about this." A mathematical science of statistics is possible because, although repeating an experiment gives different results, some types of results are more common than others; their relative frequencies are reasonably stable. We can thus model the data-generating mechanism through probability distributions and stochastic processes: random series with some indeterminacy about how the events might evolve over time, although some paths may be more likely than others. When and why we can use stochastic models are very deep questions, but ones for another time. But if we can use them in a problem, quantities such as these are represented as parameters of the stochastic models. In other words, they are functions of the underlying probability distribution. Parameters can be single numbers, such as the total rat population; vectors; or even whole curves, such as the expected time-course of temperature over the year. Statistical inference comes down to estimating those parameters, or testing hypotheses about them.

These estimates and other inferences are functions of the data values, which means that they inherit variability from the underlying stochastic process. If we "reran the tape" (as Stephen Jay Gould used to say) of an event that happened, we would get different data with a certain characteristic distribution, and applying a fixed procedure would yield different inferences, again with a certain distribution. Statisticians want to use this distribution to quantify the uncertainty of the inferences. For instance, by how much would our estimate of a parameter typically vary from one replication of the experiment to another? Say, to be precise, what is the root-mean-square (the square root of the mean of the squares) deviation of the estimate from its average value, the standard error? Or we could ask, what are all the parameter values that could have produced this data with at least some specified probability? In other words, what are all the parameter values under which our data are not low-probability outliers? This gives us the confidence region for the parameter rather than a point estimate, a promise that either the true parameter point lies in that region, or something very unlikely under any circumstances happened, or that our stochastic model is wrong.
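
In symbols (my notation, not the article's), writing E[.] for the average over replications of the data-generating process, the standard error of an estimate is its root-mean-square deviation from its own expected value:

```latex
\operatorname{se}(\hat{\theta}) \;=\; \sqrt{\,\mathbb{E}\!\left[\bigl(\hat{\theta} - \mathbb{E}[\hat{\theta}]\bigr)^{2}\right]} .
```

A confidence region at level 1 - alpha then collects every parameter value under which the observed data would not be a low-probability (less than alpha) outlier.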

To get standard errors or confidence intervals, we need to know the distribution of our estimates around the true parameters. These sampling distributions follow from the distribution of the data, because our estimates are functions of the data. Mathematically the problem is well defined, but actually computing anything is another story. Estimates are typically complicated functions of the data, and mathematically convenient distributions all may be poor approximations of the data source. Saying anything in closed form about the distribution of estimates can be simply hopeless. The two classical responses of statisticians have been to focus on tractable special cases, and to appeal to asymptotic analysis, a method that approximates the limits of functions.

Origin Myths

If you've taken an elementary statistics course, you were probably drilled in the special cases. From one end of the possible set of solutions, we can limit the kinds of estimator we use to those with a simple mathematical form, say, mean averages and other linear functions of the data. From the other, we can assume that the probability distributions featured in the stochastic model take one of a few forms for which exact calculation is possible, either analytically or via tables of special functions. Most such distributions have origin myths: The Gaussian bell curve arises from averaging many independent variables of equal size (say, the many genes that contribute to height in humans); the Poisson distribution comes from counting how many of a large number of independent and individually improbable events have occurred (say, radium nuclei decaying in a given second), and so on. Squeezed from both ends, the sampling distribution of estimators and other functions of the data becomes exactly calculable in terms of the aforementioned special functions.

Cosma Shalizi received his Ph.D. in physics from the University of Wisconsin-Madison in 2001. He is an assistant professor of statistics at Carnegie Mellon University and an external professor at the Santa Fe Institute. Address: 132 Baker Hall, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213. Internet: http://www.bactra.org

Statisticians can reuse their data to quantify the uncertainty of complex models

That these origin myths invoke various limits is no accident. The great results of probability theory (the laws of large numbers, the ergodic theorem, the central limit theorem and so on) describe limits in which all stochastic processes in broad classes of models display the same asymptotic behavior. The central limit theorem (CLT), for instance, says that if we average more and more independent random quantities with a common distribution, and if that common distribution is not too pathological, then the distribution of their means approaches a Gaussian. (The non-Gaussian parts of the distribution wash away under averaging, but the average of two Gaussians is another Gaussian.) Typically, as in the CLT, the limits involve taking more and more data from the source, so statisticians use the theorems to find the asymptotic, large-sample distributions of their estimates. We have been especially devoted to rewriting our estimates as averages of independent quantities, so that we can use the CLT to get Gaussian asymptotics. Refinements to such results would consider, say, the rate at which the error of the asymptotic Gaussian approximation shrinks as the sample sizes grow.
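
Stated compactly (the standard formulation, not a quotation from the article): if X1, X2, ... are independent draws from a common distribution with mean mu and finite variance sigma^2, then the standardized sample mean converges in distribution to a Gaussian,

```latex
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0, \sigma^{2}\bigr),
\qquad \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i .
```

The refinements mentioned above, such as the Berry-Esseen bound, show the error of this Gaussian approximation shrinking on the order of 1/sqrt(n) when third moments are finite.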

To illustrate the classical approach and the modern alternatives, I'll introduce some data: the daily closing prices of the Standard and Poor's 500 stock index from October 1, 1999, to October 20, 2009. (I use these data because they happen to be publicly available and familiar to many readers, not to impart any kind of financial advice.) Professional investors care more about changes in prices than their level, specifically the log returns, the log of the price today divided by the price yesterday. For this time period of 2,529 trading days, there are 2,528 such values (see Figure 1). The "efficient market hypothesis" from financial theory says the returns can't be predicted from any public information, including their own past values. In fact, many financial models assume such series are sequences of independent, identically distributed (IID) Gaussian random variables. Fitting such a model yields the distribution function in the center graph of Figure 1.
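
As a minimal sketch of that setup (the file name and column below are hypothetical stand-ins for wherever the price series lives, not anything from the article):

```python
# Compute daily log returns and fit the IID-Gaussian model.
import numpy as np
import pandas as pd

prices = pd.read_csv("sp500_1999_2009.csv")["Close"].to_numpy()

# Log returns: the log of today's price divided by yesterday's price.
# 2,529 trading days give 2,528 of them.
log_returns = np.diff(np.log(prices))

# The IID-Gaussian model has two parameters, fit here by maximum likelihood
# (the sample mean and standard deviation of the log returns).
mu_hat = log_returns.mean()
sigma_hat = log_returns.std(ddof=0)
print(log_returns.size, mu_hat, sigma_hat)
```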

Figure 1. A series of log returns from the Standard and Poor's 500 stock index from October 1, 1999, to October 20, 2009 (left), can be used to illustrate a classical approach to probability. A financial model that assumes the series are sequences of independent, identically distributed Gaussian random variables yields the distribution function shown at center. A theoretical sampling distribution that models the smallest 1 percent of daily returns (denoted as q0.01) shows a value of -0.0326 ± 0.00104 (right), but we need a way to determine the uncertainty of this estimate.

Figure 2. A schematic for model-based bootstrapping (left) shows that simulated values are generated from the fitted model, and then they are treated like the original data, yielding a new parameter estimate. Alternately, in nonparametric bootstrapping, a schematic (right) shows that new data are simulated by resampling from the original data (allowing repeated values), then parameters are calculated directly from the empirical distribution.

An investor might want to know, for instance, how bad the returns could be. The lowest conceivable log return is negative infinity (with all the stocks in the index losing all value), but most investors worry less about an apocalyptic end of American capitalism than about large-but-still-typical losses, say, how bad are the smallest 1 percent of daily returns? Call this number q0.01; if we know it, we know that we will do better about 99 percent of the time, and we can see whether we can handle occasional losses of that magnitude. (There are about 250 trading days in a year, so we should expect two or three days at least that bad in a year.) From the fitted distribution, we can calculate that q0.01 = -0.0326, or, undoing the logarithm, a 3.21 percent loss. How uncertain is this point estimate? The Gaussian assumption lets us calculate the asymptotic sampling distribution of q0.01, which turns out to be another Gaussian (see the right graph in Figure 1), implying a standard error of 0.00104. The 95 percent confidence interval is (-0.0347, -0.0306): Either the real q0.01 is in that range, or our data set is one big fluke (at 1-in-20 odds), or the IID-Gaussian model is wrong.
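
A sketch of that calculation, reusing the fitted values from the previous snippet. The standard error below comes from the classical large-sample variance of a quantile estimate, p(1 - p)/(n f(q)^2), with f the fitted Gaussian density; the article does not spell out its formula, but this is one standard route to the asymptotic distribution it describes:

```python
# The 1 percent point of the fitted Gaussian, the corresponding loss, and an
# asymptotic standard error via the quantile-variance formula
# p(1 - p) / (n * f(q)^2).  Reuses log_returns, mu_hat, sigma_hat from above.
import numpy as np
from scipy import stats

p = 0.01
n = log_returns.size
q_hat = stats.norm.ppf(p, loc=mu_hat, scale=sigma_hat)  # article: about -0.0326
loss_pct = 100 * (1 - np.exp(q_hat))                    # about a 3.2 percent loss

f_q = stats.norm.pdf(q_hat, loc=mu_hat, scale=sigma_hat)
se = np.sqrt(p * (1 - p) / n) / f_q                     # article: about 0.00104
ci = (q_hat - 1.96 * se, q_hat + 1.96 * se)             # 95 percent interval
print(q_hat, loss_pct, se, ci)
```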

Fitting Models

From its origins in the 19th century through about the 1960s, statistics was split between developing general ideas about how to draw and evaluate statistical inferences, and working out the properties of inferential procedures in tractable special cases (like the one we just went through) or under asymptotic approximations. This yoked a very broad and abstract theory of inference to very narrow and concrete practical formulas, an uneasy combination often preserved in basic statistics classes.

The arrival of (comparatively) cheap and fast computers made it feasible for scientists and statisticians to record lots of data and to fit models to them. Sometimes the models were conventional ones, including the special-case assumptions, which often enough turned out to be detectably, and consequentially, wrong. At other times, scientists wanted more complicated or flexible models, some of which had been proposed long before but now moved from being theoretical curiosities to stuff that could run overnight. In principle, asymptotics might handle either kind of problem, but convergence to the limit could be unacceptably slow, especially for more complex models.

Figure 3. An empirical distribution (left, in red, smoothed for visual clarity) of the log returns from a stock-market index is more peaked and has substantially more large-magnitude returns than a Gaussian fit (blue). The black marks on the horizontal axis show all the observed values. The distribution of q0.01 based on 100,000 nonparametric replications is very non-Gaussian (right, in red). The empirical estimate is marked by the blue dashed line.

Figure 4. A scatter plot of black circles shows log returns from a stock-market index on successive days. The best-fit line (blue) is a linear function that minimizes the mean-squared prediction error. Its negative slope indicates that days with below-average returns tend to be followed by days with above-average returns, and vice versa. The red line shows an optimization procedure, called spline smoothing, that will become more or less curved depending on looser or tighter constraints.

By the 1970s statistics faced the problem of quantifying the uncertainty of inferences without using either implausibly helpful assumptions or asymptotics; all of the solutions turned out to demand even more computation. Perhaps the most successful was a proposal by Stanford University statistician Bradley Efron, in a now-famous paper (a 1977 lecture, published in 1979), to combine estimation with simulation. Over the last three decades, Efron's "bootstrap" has spread into all areas of statistics, sprouting endless elaborations; here I'll stick to its most basic forms.

Remember that the key to dealing with uncertainty in parameters is the sampling distribution of estimators. Knowing what distribution we'd get for our estimates on repeating the experiment would give us quantities, such as standard errors. Efron's insight was that we can simulate replication. After all, we have already fitted a model to the data, which is a guess at the mechanism that generated the data. Running that mechanism generates simulated data that, by hypothesis, have nearly the same distribution as the real data. Feeding the simulated data through our estimator gives us one draw from the sampling distribution; repeating this many times yields the sampling distribution as a whole. Because the method gives itself its own uncertainty, Efron called this "bootstrapping"; unlike Baron von Münchhausen's plan for getting himself out of a swamp by pulling himself out by his bootstraps, it works.

Let's see how this works with the stock-index returns. Figure 2 shows the overall process: Fit a model to data, use the model to calculate the parameter, then get the sampling distribution by generating new, synthetic data from the model and repeating the estimation on the simulation output. The first time I recalculate q0.01 from a simulation, I get -0.0323. Replicated 100,000 times, I get a standard error of 0.00104, and a 95 percent confidence interval of (-0.0347, -0.0306), matching the theoretical calculations to three significant digits. This close agreement shows that I simulated properly! But the point of the bootstrap is that it doesn't rely on the Gaussian assumption, just on our ability to simulate.
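
A minimal sketch of that model-based loop, again assuming the `log_returns` array from the earlier snippets. The interval below is the simple percentile interval, one common choice; the article does not say which variant it used:

```python
# Model-based ("parametric") bootstrap for q_0.01, as in the left panel of
# Figure 2: simulate synthetic data sets from the fitted Gaussian and
# re-estimate the quantile on each one.
import numpy as np
from scipy import stats

def estimate_q01(x, p=0.01):
    """Fit an IID Gaussian to x by maximum likelihood, return its p-th quantile."""
    return stats.norm.ppf(p, loc=x.mean(), scale=x.std(ddof=0))

rng = np.random.default_rng(0)
n = log_returns.size
mu_hat, sigma_hat = log_returns.mean(), log_returns.std(ddof=0)
q_hat = estimate_q01(log_returns)

B = 100_000                                       # bootstrap replications
boot = np.empty(B)
for b in range(B):
    sim = rng.normal(mu_hat, sigma_hat, size=n)   # synthetic data from the fitted model
    boot[b] = estimate_q01(sim)                   # re-estimate on the simulation output

se = boot.std(ddof=1)                      # bootstrap standard error (article: ~0.00104)
ci = np.percentile(boot, [2.5, 97.5])      # simple percentile interval
print(q_hat, se, ci)
```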

Bootstrapping

The bootstrap approximates the sampling distribution, with three sources of approximation error. First there's simulation error: using finitely many replications to stand for the full sampling distribution. Clever simulation design can shrink this, but brute force (just using enough replications) can also make it arbitrarily small. Second, there's statistical error: The sampling distribution of the bootstrap reestimates under our fitted model is not exactly the same as the sampling distribution of estimates under the true data-generating process. The sampling distribution changes with the parameters, and our initial fit is not completely accurate. But it often turns out that the distribution of estimates around the truth is more nearly invariant than the distribution of estimates themselves, so subtracting the initial estimate from the bootstrapped values helps reduce the statistical error; there are many subtler tricks to the same end. The final source of error in bootstrapping is specification error: The data source doesn't exactly follow our model at all. Simulating the model then never quite matches the actual sampling distribution.
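
The centering trick can be written compactly (standard bootstrap notation, not the article's): treat the fluctuation of a bootstrap re-estimate around the original estimate as a stand-in for the fluctuation of the original estimate around the truth,

```latex
\hat{\theta}^{*} - \hat{\theta} \;\overset{d}{\approx}\; \hat{\theta} - \theta
\quad\Longrightarrow\quad
\theta \in \bigl[\, 2\hat{\theta} - q^{*}_{1-\alpha/2},\;\; 2\hat{\theta} - q^{*}_{\alpha/2} \,\bigr]
\text{ with probability } \approx 1-\alpha ,
```

where q*_gamma is the gamma quantile of the bootstrapped re-estimates; this is the so-called basic bootstrap confidence interval.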

Here Efron had a second brilliant idea, which is to address specification error by replacing simulation from the model with resampling from the data. After all, our initial collection of data gives us a lot of information about the relative probabilities of different values, and in certain senses this empirical distribution is actually the least prejudiced estimate possible of the underlying distribution; anything else imposes biases or preconceptions, which are possibly accurate but also potentially misleading. We could estimate q0.01 directly from the empirical distribution, without the mediation of the Gaussian model. Efron's "nonparametric bootstrap" treats the original data set as a complete population and draws a new, simulated sample from it, picking each observation with equal probability (allowing repeated values) and then re-running the estimation (as shown in Figure 2).

This new method matters here because the Gaussian model is inaccurate; the true distribution is more sharply peaked around zero and has substantially more large-magnitude returns, in both directions, than the Gaussian (see the left graph in Figure 3). For the empirical distribution, q0.01 = -0.0392. This may seem close to our previous point estimate of -0.0326, but it's well beyond the confidence interval, and under the Gaussian model we should see values that negative only 0.25 percent of the time, not 1 percent of the time. Doing 100,000 nonparametric replicates, that is, resampling from the data and reestimating q0.01 that many times, gives a very non-Gaussian sampling distribution (as shown in the right graph of Figure 3), yielding a standard error of 0.00364 and a 95 percent confidence interval of (-0.0477, -0.0346).
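
A sketch of the nonparametric version, assuming `log_returns` as before; the interval here is built with the centering trick from the earlier aside, since the article does not specify which construction it used:

```python
# Nonparametric bootstrap for q_0.01, as in the right panel of Figure 2:
# resample the observed returns with replacement and re-estimate the
# empirical 1 percent quantile each time.
import numpy as np

rng = np.random.default_rng(0)
n = log_returns.size
q_hat = np.quantile(log_returns, 0.01)      # empirical estimate (article: about -0.0392)

B = 100_000
boot = np.empty(B)
for b in range(B):
    resample = rng.choice(log_returns, size=n, replace=True)
    boot[b] = np.quantile(resample, 0.01)

se = boot.std(ddof=1)                        # article reports about 0.00364
lo, hi = np.percentile(boot, [2.5, 97.5])
ci_basic = (2 * q_hat - hi, 2 * q_hat - lo)  # "basic" interval via the centering trick
print(q_hat, se, ci_basic)
```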

Although this is more accurate than the Gaussian model, it's still a really simple problem. Conceivably, some other nice distribution fits the returns better than the Gaussian, and it might even have analytical sampling formulas. The real strength of the bootstrap is that it lets us handle complicated models, and complicated questions, in exactly the same way as this simple case.

To continue with the financial example, a question of perennial interest is predicting the stock market. Figure 4 is a scatter plot of the log returns on successive days, the return for today being on the horizontal axis and that of tomorrow on the vertical. It's mostly just a big blob, because the market is hard to predict, but I have drawn two lines through it: a straight one in blue, and a curved one in black. These lines try to predict the average return tomorrow as functions of today's return; they're called regression lines or regression curves. The straight line is the linear function that minimizes the mean-squared prediction error, that is, the sum of the squared differences between predicted and observed returns (the method of least squares). Its slope is negative (-0.0822), indicating that days with below-average returns tend to be followed by ones with above-average returns and vice versa, perhaps because people try to buy cheap after the market falls (pushing it up) and sell dear when it rises (pulling it down). Linear regressions with Gaussian fluctuations around the prediction function are probably the best-understood of all statistical models; their oldest forms go back two centuries now, but they're more venerable than accurate.
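
A minimal sketch of that least-squares fit, assuming `log_returns` as before:

```python
# Least-squares line for predicting tomorrow's return from today's (Figure 4).
import numpy as np

today, tomorrow = log_returns[:-1], log_returns[1:]

# np.polyfit with degree 1 minimizes the sum of squared prediction errors,
# i.e., ordinary least squares for the line tomorrow ~ slope * today + intercept.
slope, intercept = np.polyfit(today, tomorrow, deg=1)
print(slope, intercept)   # the article reports a slope of about -0.0822
```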

The black curve is a nonlinear estimate of the regression function, coming from a constrained optimization procedure called spline smoothing: Find the function that minimizes the prediction error, while capping the value of the average squared second derivative. As the constraint tightens, the optimal curve, the spline, straightens out, approaching the linear regression; as the constraint loosens, the spline wiggles to try to pass through each data point. (A spline was originally a flexible length of wood craftsmen used to draw smooth curves, fixing it to the points the curve had to go through and letting it flex to minimize elastic energy; stiffer splines yielded flatter curves, corresponding mathematically to tighter constraints.)
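
In the usual Lagrangian form of that constrained problem (standard notation, not quoted from the article), the spline minimizes a penalized sum of squares, with the multiplier lambda playing the role of the constraint level:

```latex
\hat{f} \;=\; \operatorname*{arg\,min}_{f}\;
\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^{2}
\;+\; \lambda \int \bigl(f''(x)\bigr)^{2}\,dx .
```

Large lambda (a tight cap on curvature) pushes the fit toward the least-squares line; small lambda (a loose cap) lets it chase individual points.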

To actually get the spline, I need to pick the level of the constraint. Too little smoothing, and I get an erratic curve that memorizes the sample but won't generalize to new data; but too much smoothing erases real and useful patterns. I set the constraint through cross-validation: Remove one point from the data, fit multiple curves with multiple values of the constraint to the other points, and then see which curve best predicts the left-out point. Repeating this for each point in turn shows how much curvature the spline needs in order to generalize properly. In this case, we can see that we end up selecting a moderate amount of wiggliness; like the linear model, the spline predicts reversion in the returns but suggests that it is asymmetric, with days of large negative returns being followed, on average, by bigger positive returns than the other way around. This might be because people are more apt to buy low than to sell high, but we should check that this is a real phenomenon before reading much into it.

There are three things we should note about spline smoothing. First, it's much more flexible than just fitting a straight line to the data; splines can approximate a huge range of functions to an arbitrary tolerance, so they can discover complicated nonlinear relationships, such as asymmetry, without guessing in advance what to look for. Second, there was no hope of using a smoothing spline on substantial data sets before fast computers, although now the estimation, including cross-validation, takes less than a second on a laptop. Third, the estimated spline depends on the data in two ways: Once we decide how much smoothing to do, it tries to match the data within the constraint; but we also use the data to decide how much smoothing to do. Any quantification of uncertainty here should reckon with both effects.
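
A sketch of that selection step, under stated simplifications: rather than the article's exact smoothing spline, this uses a simple penalized spline (a cubic truncated-power basis with a ridge penalty on the knot coefficients) as a stand-in with the same flexibility-versus-smoothness trade-off, and it evaluates the leave-one-out error with the standard linear-smoother shortcut instead of literally refitting once per point. `log_returns` is assumed from the earlier snippets.

```python
# Choosing the amount of smoothing by leave-one-out cross-validation,
# using a penalized spline as a stand-in for the article's smoothing spline.
import numpy as np

today, tomorrow = log_returns[:-1], log_returns[1:]
x = (today - today.mean()) / today.std()      # standardize for numerical conditioning
y = tomorrow
knots = np.quantile(x, np.linspace(0.05, 0.95, 15))

def basis(z):
    """Cubic truncated-power spline basis: 1, z, z^2, z^3, (z - k)_+^3."""
    cols = [np.ones_like(z), z, z**2, z**3]
    cols += [np.clip(z - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

P = np.diag([0.0] * 4 + [1.0] * len(knots))   # penalize only the knot coefficients

def fit(lam):
    """Penalized least squares; returns coefficients and hat-matrix diagonal."""
    B = basis(x)
    M = np.linalg.solve(B.T @ B + lam * P, B.T)   # smoother matrix, shape (k, n)
    coef = M @ y
    hat_diag = np.einsum("ij,ji->i", B, M)        # diagonal of H = B M
    return coef, hat_diag

def loo_error(lam):
    """Leave-one-out mean squared error via e_(i) = e_i / (1 - H_ii)."""
    coef, h = fit(lam)
    resid = y - basis(x) @ coef
    return np.mean((resid / (1.0 - h)) ** 2)

lambdas = 10.0 ** np.arange(-6, 5)
best_lam = min(lambdas, key=loo_error)
coef, _ = fit(best_lam)
print("chosen penalty:", best_lam)
```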

Figure 5. The same spline fit from the previous figure (black line) is combined with 800 splines fit to bootstrapped resamples of the data (blue curves) and the resulting 95 percent confidence limits for the true regression curve (red lines).

There are multiple ways to use bootstrapping to get uncertainty estimates for the spline, depending on what we're willing to assume about the system. Here I will be cautious and fall back on the safest and most straightforward procedure: Resample the points of the scatter plot (possibly getting multiple copies of the same point), and rerun the spline smoother on this new data set. Each replication will give a different amount of smoothing and ultimately a different curve. Figure 5 shows the individual curves from 800 bootstrap replicates, indicating the sampling distribution, together with 95 percent confidence limits for the curve as a whole. The overall negative slope and the asymmetry between positive and negative returns are still there, but we can also see that our estimated curve is much better pinned down for small-magnitude returns, where there are lots of data, than for large-magnitude returns, where there's little information and small perturbations can have more effect.
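
A sketch of that resampling loop, with stated simplifications: the bands are pointwise 2.5/97.5 percentiles of the resampled curves (a simpler stand-in for limits on the curve as a whole), and `fit_curve` is a plain cubic polynomial fit purely so the block runs on its own; in practice you would plug in the cross-validated smoother sketched above, so that each resample re-chooses its own amount of smoothing. `log_returns` is assumed as before.

```python
# Resample (today, tomorrow) pairs, refit a curve each time, and collect
# pointwise 95 percent bands on a grid, in the spirit of Figure 5.
import numpy as np

def fit_curve(x, y):
    """Stand-in smoother: an unpenalized cubic polynomial fit."""
    coef = np.polyfit(x, y, deg=3)
    return lambda grid: np.polyval(coef, grid)

today, tomorrow = log_returns[:-1], log_returns[1:]
rng = np.random.default_rng(0)
grid = np.linspace(today.min(), today.max(), 200)
n = today.size

B = 800                                   # as in Figure 5
curves = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, n, size=n)      # resample the scatter-plot points
    curves[b] = fit_curve(today[idx], tomorrow[idx])(grid)

central = fit_curve(today, tomorrow)(grid)
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)   # pointwise 95% limits
```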

Smoothing Things Out

Bootstrapping has been ramified tremendously since Efron's original paper, and I have sketched only the crudest features. Nothing I've done here actually proves that it works, although I hope I've made that conclusion plausible. And indeed sometimes the bootstrap fails; it gives very poor answers, for instance, to questions about estimating the maximum (or minimum) of a distribution. Understanding the difference between that case and that of q0.01, for example, turns out to involve rather subtle math. Parameters are functions of the distribution generating the data, and estimates are functions of the data or of the empirical distribution. For the bootstrap to work, the empirical distribution has to converge rapidly on the true distribution, and the parameter must smoothly depend on the distribution, so that no outlier ends up unduly influencing the estimates. Making "influence" precise here turns out to mean taking derivatives in infinite-dimensional spaces of probability distribution functions, and the theory of the bootstrap is a delicate combination of functional analysis with probability theory. This sort of theory is essential to developing new bootstrap methods for new problems, such as ongoing work on resampling spatial data, or model-based bootstraps where the model grows in complexity with the data.
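
A quick numerical illustration of that failure mode (my example, not the article's): for the maximum of a Uniform(0, 1) sample, a resample with replacement reproduces the observed maximum a large fraction of the time (about 1 - 1/e, roughly 63 percent), so the bootstrap distribution of the maximum is far too concentrated compared with how the maximum actually varies across fresh samples.

```python
# Demonstrate the nonparametric bootstrap failing for the sample maximum.
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 10_000

x = rng.uniform(0.0, 1.0, size=n)
boot_max = np.array([rng.choice(x, size=n, replace=True).max() for _ in range(B)])
true_max = np.array([rng.uniform(0.0, 1.0, size=n).max() for _ in range(B)])

print("share of resamples whose maximum equals the observed maximum:",
      np.mean(boot_max == x.max()))
print("bootstrap std of the maximum:", boot_max.std(),
      "versus true sampling std:", true_max.std())
```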

The bootstrap has earned its place in the statistician's toolkit because, of all the ways of handling uncertainty in complex models, it is at once the most straightforward and the most flexible. It will not lose that place so long as the era of big data and fast calculation endures.

Bibliography

Efron, B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7:1-26.