
Perception & Psychophysics
2001, 63 (8), 1314-1329
Copyright 2001 Psychonomic Society, Inc.

The psychometric function: II. Bootstrap-based confidence intervals and sampling

FELIX A. WICHMANN and N. JEREMY HILL
University of Oxford, Oxford, England

The psychometric function relates an observer's performance to an independent variable, usually a physical quantity of an experimental stimulus. Even if a model is successfully fit to the data and its goodness of fit is acceptable, experimenters require an estimate of the variability of the parameters to assess whether differences across conditions are significant. Accurate estimates of variability are difficult to obtain, however, given the typically small size of psychophysical data sets: Traditional statistical techniques are only asymptotically correct and can be shown to be unreliable in some common situations. Here and in our companion paper (Wichmann & Hill, 2001), we suggest alternative statistical techniques based on Monte Carlo resampling methods. The present paper's principal topic is the estimation of the variability of fitted parameters and derived quantities, such as thresholds and slopes. First, we outline the basic bootstrap procedure and argue in favor of the parametric, as opposed to the nonparametric, bootstrap. Second, we describe how the bootstrap bridging assumption, on which the validity of the procedure depends, can be tested. Third, we show how one's choice of sampling scheme (the placement of sample points on the stimulus axis) strongly affects the reliability of bootstrap confidence intervals, and we make recommendations on how to sample the psychometric function efficiently. Fourth, we show that, under certain circumstances, the (arbitrary) choice of the distribution function can exert an unwanted influence on the size of the bootstrap confidence intervals obtained, and we make recommendations on how to avoid this influence. Finally, we introduce improved confidence intervals (bias corrected and accelerated) that improve on the parametric and percentile-based bootstrap confidence intervals previously used. Software implementing our methods is available at http://users.ox.ac.uk/~sruoxfor/psychofit/.

Part of this work was presented at the Computers in Psychology (CiP) Conference in York, England, during April 1998 (Hill & Wichmann, 1998). This research was supported by a Wellcome Trust Mathematical Biology Studentship, a Jubilee Scholarship from St. Hugh's College, Oxford, and a Fellowship by Examination from Magdalen College, Oxford, to F.A.W., and by a grant from the Christopher Welch Trust Fund and a Maplethorpe Scholarship from St. Hugh's College, Oxford, to N.J.H. We are indebted to Andrew Derrington, Karl Gegenfurtner, Bruce Henning, Larry Maloney, Eero Simoncelli, and Stefaan Tibeau for helpful comments and suggestions. This paper benefited considerably from conscientious peer review, and we thank our reviewers, David Foster, Stan Klein, Marjorie Leek, and Bernhard Treutwein, for helping us to improve our manuscript. Software implementing the methods described in this paper is available (MATLAB); contact F.A.W. at the address provided or see http://users.ox.ac.uk/~sruoxfor/psychofit/. Correspondence concerning this article should be addressed to F. A. Wichmann, Max-Planck-Institut für Biologische Kybernetik, Spemannstraße 38, D-72076 Tübingen, Germany (e-mail: [email protected]).

The performance of an observer on a psychophysical task is typically summarized by reporting one or more response thresholds (stimulus intensities required to produce a given level of performance) and by a characterization of the rate at which performance improves with increasing stimulus intensity. These measures are derived from a psychometric function, which describes the dependence of an observer's performance on some physical aspect of the stimulus.

Fitting psychometric functions is a variant of the more general problem of modeling data. Modeling data is a three-step process: First, a model is chosen, and the parameters are adjusted to minimize the appropriate error metric or loss function. Second, error estimates of the parameters are derived, and third, the goodness of fit between model and data is assessed. This paper is concerned with the second of these steps, the estimation of variability in fitted parameters and in quantities derived from them. Our companion paper (Wichmann & Hill, 2001) illustrates how to fit psychometric functions while avoiding bias resulting from stimulus-independent lapses, and how to evaluate goodness of fit between model and data.

We advocate the use of Efron's bootstrap method, a particular kind of Monte Carlo technique, for the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993). Bootstrap techniques are not without their own assumptions and potential pitfalls. In the course of this paper, we shall discuss these and examine their effect on the estimates of variability we obtain. We describe and examine the use of parametric bootstrap techniques in finding confidence intervals for thresholds and slopes. We then explore the sensitivity of the estimated confidence interval widths to (1) sampling schemes, (2) mismatch of the objective function, and (3) accuracy of the originally fitted parameters. The last of these is particularly important, since it provides a test of the validity of the bridging assumption on which the use of parametric bootstrap techniques relies. Finally, we recommend, on the basis of the theoretical work of others, the use of a technique called bias correction with acceleration (BCa) to obtain stable and accurate confidence interval estimates.

    BACKGROUND

The Psychometric Function

Our notation will follow the conventions we have outlined in Wichmann and Hill (2001). A brief summary of terms follows.

Performance on K blocks of a constant-stimuli psychophysical experiment can be expressed using three vectors, each of length K. An x denotes the stimulus values used, an n denotes the numbers of trials performed at each point, and a y denotes the proportion of correct responses (in n-alternative forced-choice [n-AFC] experiments) or positive responses (single-interval or yes/no experiments) on each block. We often use N to refer to the total number of trials in the set, N = Σᵢ nᵢ.

The number of correct responses yᵢnᵢ in a given block i is assumed to be the sum of random samples from a Bernoulli process with probability of success pᵢ. A psychometric function ψ(x) is the function that relates the stimulus dimension x to the expected performance value p.

A common general form for the psychometric function is

$$\psi(x; \alpha, \beta, \gamma, \lambda) = \gamma + (1 - \gamma - \lambda)\,F(x; \alpha, \beta). \quad (1)$$

The shape of the curve is determined by our choice of a functional form for F and by the four parameters {α, β, γ, λ}, to which we shall refer collectively by using the parameter vector θ. F is typically a sigmoidal function, such as the Weibull, cumulative Gaussian, logistic, or Gumbel. We assume that F describes the underlying psychological mechanism of interest: The parameters γ and λ determine the lower and upper bounds of the curve, which are affected by other factors. In yes/no paradigms, γ is the guess rate and λ the miss rate. In n-AFC paradigms, γ usually reflects chance performance and is fixed at the reciprocal of the number of intervals per trial, and λ reflects the stimulus-independent error rate or lapse rate (see Wichmann & Hill, 2001, for more details).
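For concreteness, Equation 1 can be written out in a few lines of code. The sketch below uses the Weibull as F, the choice made throughout our simulations; the function names are illustrative and are not those of the published MATLAB implementation.

```python
import numpy as np

def weibull(x, alpha, beta):
    """Weibull distribution function F(x; alpha, beta), defined for x > 0."""
    return 1.0 - np.exp(-(np.asarray(x, dtype=float) / alpha) ** beta)

def psi(x, alpha, beta, gamma, lam):
    """Psychometric function of Equation 1: gamma + (1 - gamma - lambda) * F."""
    return gamma + (1.0 - gamma - lam) * weibull(x, alpha, beta)

# The paper's standard 2AFC generating function: theta_gen = {10, 3, .5, .01}.
p = psi(np.array([4.0, 10.0, 20.0]), 10.0, 3.0, 0.5, 0.01)
```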

When a parameter set has been estimated, we will usually be interested in measurements of the threshold (displacement along the x-axis) and slope of the psychometric function. We calculate thresholds by taking the inverse of F at a specified probability level, usually .5. Slopes are calculated by finding the derivative of F with respect to x, evaluated at a specified threshold. Thus, we shall use the notation threshold_0.8, for example, to mean F⁻¹(0.8), and slope_0.8 to mean dF/dx evaluated at F⁻¹(0.8). When we use the terms threshold and slope without a subscript, we mean threshold_0.5 and slope_0.5: In our 2AFC examples, this will mean the stimulus value and slope of F at the point where performance is approximately 75% correct, although the exact performance level is affected slightly by the (small) value of λ.
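Because the Weibull has closed forms for F⁻¹ and dF/dx, threshold_p and slope_p can be read directly off a fitted parameter set; a minimal sketch under the same assumptions as the snippet above:

```python
import numpy as np

def weibull_threshold(p, alpha, beta):
    """threshold_p = F^-1(p) for the Weibull."""
    return alpha * (-np.log(1.0 - p)) ** (1.0 / beta)

def weibull_slope(p, alpha, beta):
    """slope_p = dF/dx evaluated at x = F^-1(p)."""
    x = weibull_threshold(p, alpha, beta)
    return (beta / alpha) * (x / alpha) ** (beta - 1.0) * np.exp(-(x / alpha) ** beta)

# "threshold" and "slope" without subscript mean the p = .5 versions,
# i.e., approximately 75% correct in 2AFC.
t_05 = weibull_threshold(0.5, 10.0, 3.0)
s_05 = weibull_slope(0.5, 10.0, 3.0)
```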

Where an estimate of a parameter set is required, given a particular data set, we use a maximum-likelihood search algorithm, with Bayesian constraints on the parameters based on our beliefs about their possible values. For example, λ is constrained within the range [0, .06], reflecting our belief that normal, trained observers do not make stimulus-independent errors at high rates. We describe our method in detail in Wichmann and Hill (2001).

Estimates of Variability: Asymptotic Versus Monte Carlo Methods

In order to be able to compare response thresholds or slopes across experimental conditions, experimenters require a measure of their variability, which will depend on the number of experimental trials taken and their placement along the stimulus axis. Thus, a fitting procedure must provide not only parameter estimates, but also error estimates for those parameters. Reporting error estimates on fitted parameters is unfortunately not very common in psychophysical studies. Sometimes probit analysis has been used to provide variability estimates (Finney, 1952, 1971). In probit analysis, an iteratively reweighted linear regression is performed on the data once they have undergone transformation through the inverse of a cumulative Gaussian function. Probit analysis relies, however, on asymptotic theory: Maximum-likelihood estimators are asymptotically Gaussian, allowing the standard deviation to be computed from the empirical distribution (Cox & Hinkley, 1974). Asymptotic methods assume that the number of datapoints is large; unfortunately, however, the number of points in a typical psychophysical data set is small (between 4 and 10, with between 20 and 100 trials at each), and in these cases, substantial errors have accordingly been found in the probit estimates of variability (Foster & Bischof, 1987, 1991; McKee, Klein, & Teller, 1985). For this reason, asymptotic theory methods are not recommended for estimating variability in most realistic psychophysical settings.

An alternative method, the bootstrap (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993), has been made possible by the recent sharp increase in the processing speed of desktop computers. The bootstrap method is a Monte Carlo resampling technique relying on a large number of simulated repetitions of the original experiment. It is potentially well suited to the analysis of psychophysical data, because its accuracy does not rely on large numbers of trials, as do methods derived from asymptotic theory (Hinkley, 1988). We apply the bootstrap to the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions, following Maloney (1990), Foster and Bischof (1987, 1991, 1997), and Treutwein (1995; Treutwein & Strasburger, 1999).

The essence of Monte Carlo techniques is that a large number, B, of synthetic data sets y₁*, ..., y_B* are generated. For each data set yᵢ*, the quantity of interest ϑ (threshold or slope, e.g.) is estimated to give ϑᵢ*. The process for obtaining ϑᵢ* is the same as that used to obtain the first estimate ϑ̂. Thus, if our first estimate was obtained as ϑ̂ = t(θ̂), where θ̂ is the maximum-likelihood parameter estimate from a fit to the original data y, the simulated estimates ϑᵢ* will be given by ϑᵢ* = t(θᵢ*), where θᵢ* is the maximum-likelihood parameter estimate from a fit to the simulated data yᵢ*.

Sometimes it is erroneously assumed that the intention is to measure the variability of the underlying ϑ itself. This cannot be the case, however, because repeated computer simulation of the same experiment is no substitute for the real repeated measurements this would require. What Monte Carlo simulations can do is estimate the variability inherent in (1) our sampling, as characterized by the distribution of sample points (x) and the size of the samples (n), and (2) any interaction between our sampling strategy and the process used to estimate ϑ, that is, assuming a model of the observer's variability, fitting a function to obtain θ̂, and applying t(θ̂).
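In code, the whole Monte Carlo scheme is a short loop; a sketch reusing psi and fit_psychometric from the snippets above, with B and the derived quantity t supplied by the user:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_estimates(x, n, theta_hat, t, B=2000, gamma=0.5):
    """Parametric bootstrap: simulate B data sets from psi(x; theta_hat),
    refit each one, and return the B derived estimates t(theta_i*)."""
    alpha, beta, lam = theta_hat
    p_gen = psi(x, alpha, beta, gamma, lam)       # generating probabilities
    estimates = np.empty(B)
    for i in range(B):
        y_star = rng.binomial(n, p_gen) / n       # simulated proportions correct
        theta_star = fit_psychometric(x, n, y_star, gamma)
        estimates[i] = t(theta_star)
    return estimates

# Example derived quantity: threshold_0.5,
# t = lambda th: weibull_threshold(0.5, th[0], th[1])
```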

Bootstrap Data Sets: Nonparametric and Parametric Generation

In applying Monte Carlo techniques to psychophysical data, we require, in order to obtain a simulated data set yᵢ*, some system that provides generating probabilities p for the binomial variates yᵢ₁*, ..., y_iK*. These should be the same generating probabilities that we hypothesize to underlie the empirical data set y.

Efron's bootstrap offers such a system. In the nonparametric bootstrap method, we would assume p = y. This is equivalent to resampling, with replacement, the original set of correct and incorrect responses on each block of observations j in y to produce a simulated sample y_ij*.

Alternatively, a parametric bootstrap can be performed. In the parametric bootstrap, assumptions are made about the generating model from which the observed data are believed to arise. In the context of estimating the variability of parameters of psychometric functions, the data are generated by a simulated observer whose underlying probabilities of success are determined by the maximum-likelihood fit to the real observer's data [ψ_fit = ψ(x; θ̂)]. Thus, where the nonparametric bootstrap uses y, the parametric bootstrap uses ψ_fit as the generating probabilities p for the simulated data sets.
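The difference between the two regimes amounts to a one-line change in how the generating probabilities p are chosen; a sketch (again with psi as defined earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_data_set(x, n, y, theta_hat=None, gamma=0.5):
    """One simulated data set. Nonparametric if theta_hat is None (p = y),
    parametric otherwise (p = psi_fit = psi(x; theta_hat))."""
    if theta_hat is None:
        p = np.asarray(y, dtype=float)            # nonparametric: p = y
    else:
        alpha, beta, lam = theta_hat
        p = psi(x, alpha, beta, gamma, lam)       # parametric: p = psi_fit
    return rng.binomial(n, p) / n
```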

As is frequently the case in statistics, the choice of parametric versus nonparametric analysis concerns how much confidence one has in one's hypothesis about the underlying mechanism that gave rise to the raw data, as against the confidence one has in the raw data's precise numerical values. Choosing the parametric bootstrap for the estimation of variability in psychometric function fitting appears the natural choice for several reasons. First and foremost, in fitting a parametric model to the data, one has already committed oneself to a parametric analysis. No additional assumptions are required to perform a parametric bootstrap beyond those required for fitting a function to the data: specification of the source of variability (binomial variability) and the model from which the data are most likely to come (parameter vector θ and distribution function F). Second, given the assumption that data from psychophysical experiments are binomially distributed, we expect data to be variable (noisy). The nonparametric bootstrap treats every datapoint as if its exact value reflected the underlying mechanism.¹ The parametric bootstrap, on the other hand, allows the datapoints to be treated as noisy samples from a smooth and monotonic function, determined by θ and F.

One consequence of the two different bootstrap regimes is as follows. Assume two observers performing the same psychophysical task at the same stimulus intensities x, and assume that it happens that the maximum-likelihood fits to the two data sets yield identical parameter vectors θ̂. Given such a scenario, the parametric bootstrap returns identical estimates of variability for both observers, since it depends only on x, θ̂, and F. The nonparametric bootstrap's estimates would, on the other hand, depend on the individual differences between the two data sets y₁ and y₂, something we consider unconvincing: A method for estimating variability in parameters and thresholds should return identical estimates for identical observers performing the identical experiment.²

Treutwein (1995) and Treutwein and Strasburger (1999) used the nonparametric bootstrap, and Maloney (1990) used the parametric bootstrap, to compare bootstrap estimates of variability with the real-world variability found in repeated psychophysical experiments. All of the above studies found the bootstrap estimates to be in agreement with the human data. Keeping in mind that the number of repeats in the above-quoted cases was small, this is nonetheless encouraging, suggesting that bootstrap methods are a valid method of variability estimation for parameters fitted to psychophysical data.

Testing the Bridging Assumption

Asymptotically, that is, for large K and N, θ̂ will converge toward θ, since maximum-likelihood estimation is asymptotically unbiased³ (Cox & Hinkley, 1974; Kendall & Stuart, 1979). For the small K typical of psychophysical experiments, however, we can only hope that our estimated parameter vector θ̂ is close enough to the true parameter vector θ for the estimated variability in the parameter vector obtained by the bootstrap method to be valid. We call this the bootstrap bridging assumption.

Whether θ̂ is indeed sufficiently close to θ depends, in a complex way, on the sampling, that is, on the number of blocks of trials (K), the numbers of observations at each block of trials (n), and the stimulus intensities (x) relative to the true parameter vector θ. Maloney (1990) summarized these dependencies for a given experimental design by plotting the standard deviation of β̂ as a function of α and β as a contour plot (Maloney, 1990, Figure 3, p. 129). Similar contour plots for the standard deviation of α̂ and for bias in both α̂ and β̂ could be obtained. If, owing to small K, bad sampling, or otherwise, our estimation procedure is inaccurate, the distribution of bootstrap parameter vectors θ₁*, ..., θ_B*, centered around θ̂, will not be centered around the true θ. As a result, the estimates of variability are likely to be incorrect unless the magnitude of the standard deviation is similar around θ̂ and θ, despite the fact that the points are some distance apart in parameter space.

One way to assess the likely accuracy of bootstrap estimates of variability is to follow Maloney (1990) and to examine the local flatness of the contours around θ̂, our best estimate of θ. If the contours are sufficiently flat, the variability estimates will be similar, assuming that the true θ is somewhere within this flat region. However, the process of local contour estimation is computationally expensive, since it requires a very large number of complete bootstrap runs for each data set.

A much quicker alternative is the following. Having obtained θ̂ and performed a bootstrap using ψ(x; θ̂) as the generating function, we move to eight different points f₁, ..., f₈ in (α, β) space. Eight Monte Carlo simulations are performed, using f₁, ..., f₈ as the generating functions, to explore the variability in those parts of the parameter space (only the generating parameters of the bootstrap are changed; x remains the same for all of them). If the contours of variability around θ̂ are sufficiently flat, as we hope they are, confidence intervals at f₁, ..., f₈ should be of the same magnitude as those obtained at θ̂. Prudence should lead us to accept the largest of the nine confidence intervals obtained as our estimate of variability.

A decision has to be made as to which eight points in (α, β) space to use for the new set of simulations. Generally, provided that the psychometric function is at least reasonably well sampled, the contours vary smoothly in the immediate vicinity of θ̂, so that the precise placement of the sample points f₁, ..., f₈ is not critical. One suggested and easy way to obtain a set of additional generating parameters is shown in Figure 1.

Figure 1 shows B = 2,000 bootstrap parameter pairs as dark filled circles plotted in (α, β) space. Simulated data sets were generated from ψ_gen with the Weibull as F and θ_gen = {10, 3, 0.5, 0.01}; sampling scheme s7 (triangles; see Figure 2) was used, and N was set to 480 (nᵢ = 80). The large central triangle at (10, 3) marks the generating parameter set; the solid and dashed line segments adjacent to the x- and y-axes mark the 68% and 95% confidence intervals for α and β, respectively.⁴ In the following, we shall use WCI to stand for the width of a confidence interval, with a subscript denoting its coverage percentage; that is, WCI68 denotes the width of the 68% confidence interval.⁵

The eight additional generating parameter pairs f₁, ..., f₈ are marked by the light triangles. They form a rectangle whose sides have length WCI68 in α and β. Typically, this central rectangular region contains approximately 30%-40% of all (α, β) pairs and could thus be viewed as a crude joint 35% confidence region for α and β. A coverage percentage of 35% represents a sensible compromise between erroneously accepting the estimate around θ̂, very likely underestimating the true variability, and performing additional bootstrap replications too far in the periphery, where variability, and thus the estimated confidence intervals, become erroneously inflated owing to poor sampling. Recently, one of us (Hill, 2001b) performed Monte Carlo simulations to test the coverage of various bootstrap confidence interval methods and found that the above method, based on 25% of points or more, was a good way of guaranteeing that both sides of a two-tailed confidence interval had at least the desired level of coverage.

    MONTE CARLO SIMULATIONS

In both our papers, we use only the specific case of the 2AFC paradigm in our examples: Thus, γ is fixed at .5. In our simulations, where we must assume a distribution of true generating probabilities, we always use the Weibull function in conjunction with the same fixed set of generating parameters θ_gen: {α_gen = 10, β_gen = 3, γ_gen = .5, λ_gen = .01}. In our investigation of the effects of sampling patterns, we shall always use K = 6 and nᵢ constant, that is, six blocks of trials with the same number of points in each block. The number of observations per point, nᵢ, could be set to 20, 40, 80, or 160, and with K = 6, this means that the total number of observations N could take the values 120, 240, 480, and 960.

We have introduced these limitations purely for the purposes of illustration, to keep our explanatory variables down to a manageable number. We have found, in many other simulations, that this can be done in most cases without loss of generality of our conclusions.

Confidence intervals of parameters, thresholds, and slopes that we report in the following were always obtained using a method called bias-corrected and accelerated (BCa) confidence intervals, which we describe later in our Bootstrap Confidence Intervals section.

The Effects of Sampling Schemes and Number of Trials

One of our aims in this study was to examine the effect of N and one's choice of sample points x on both the size of one's confidence intervals for ϑ and their sensitivity to errors in θ̂.

Seven different sampling schemes were used, each dictating a different distribution of datapoints along the stimulus axis; they are the same schemes as those used and described in Wichmann and Hill (2001), and they are shown in Figure 2. Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows the psychometric function used, that is, 0.5 + 0.5 F(x; {α_gen, β_gen}), with the 55%, 75%, and 95% performance levels marked by dotted lines.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment (companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using the seven sampling schemes shown in Figure 2, with the generating parameters θ_gen, as well as N and K as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ*, from which we subsequently derived the x values corresponding to threshold_0.5 and threshold_0.8, as well as the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θ_gen and eight more at the points f₁, ..., f₈ specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N) required 9 × 1,000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10⁸ simulated 2AFC trials.

Figures 3, 4, and 5 show the results of the simulations dealing with slope_0.5, threshold_0.5, and threshold_0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N. Data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θ_gen, that is, max{WCI_f1/WCI_θgen, ..., WCI_f8/WCI_θgen}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ: The closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68 measurements at θ_gen and f₁, ..., f₈ for a given sampling scheme and N; it is, thus, the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote as MWCI68, standing for maximum width of the 68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows the MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.

Figure 1. B = 2,000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θ_gen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2,000 estimated parameter pairs (α̂, β̂), shown as dark circles in (α, β) parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. The sampling scheme s7 was used to generate the data sets (see Figure 2 for details), with N = 480. Solid lines mark the 68% confidence interval width (WCI68) separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the (α, β) parameter sets f₁, ..., f₈ from which each bootstrap is repeated during sensitivity analysis, while keeping the x-values of the sampling scheme unchanged.

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and, thus, the data constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions between different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θ_gen. A perfectly flat local contour would result in a horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7. For N < 240, however, the elevation rises to nearly 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of the WCI68 at θ_gen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N < 480; these three sampling schemes are the ones that include two sampling points at p > .92.

Figure 4, showing the equivalent data for the estimates of threshold_0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time by providing MWCI68s that are more compact than others by a factor of 3.2.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes: Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 ≤ 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N < 240 (MWCI68 > 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ̂, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: The WCI68 is unstable in the vicinity of θ_gen, even though the WCI68 at θ_gen itself is small.

Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0} on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1-s7, used throughout the remainder of the paper.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold_0.8. The trends found in Figure 4 are even more exaggerated here: The sampling schemes without high sampling points (s1, s4) are unstable to the point of being meaningless for N < 480. In addition, their WCI68 is inflated, relative to that of the other sampling schemes, even at θ_gen.

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of the slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θ_gen = {10, 3, .5, .01}, as a function of the total number of observations, N. (B) The maximal elevation of the WCI68 in the vicinity of θ_gen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θ_gen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor √(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold_0.8, and the following four those for slope_0.5. The scheme with the lowest MWCI68 in each row is given the score of 100. The others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

Table 1
Summary of Results of Monte Carlo Simulations

Measure                      N      s1    s2    s3    s4    s5    s6    s7
MWCI68 at x = F⁻¹(0.5)      120    331   135   143   369   162   100   110
                            240    209   163   126   339   133   100   113
                            480    156   179   143   193   158   100   155
                            960    124   141   140   138   152   100   119
MWCI68 at x = F⁻¹(0.8)      120   5725   428   100  1761   180   263   115
                            240   1388   205   100  1257   110   125   109
                            480    385   181   115   478   100   116   137
                            960    215   149   100   291   102   119   101
MWCI68 of dF/dx             120    212   305   119   321   139   217   100
  at x = F⁻¹(0.5)           240    228   263   100   254   128   185   121
                            480    186   213   119   217   100   148   134
                            960    186   175   113   201   100   151   118
Mean                               779   211   118   485   130   140   119
Standard deviation                1595    85    17   499    28    49    16
Median                             214   180   117   306   131   122   117

Note. Columns correspond to the seven sampling schemes, marked in the figures by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold_0.8 and slope_0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θ_gen, as sampled at the points θ_gen and f₁, ..., f₈. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(0.5) (approximately equal to 75% correct during two-alternative forced-choice).

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way the worst are sampling schemes s1 and s4, with mean and median MWCI68 > 210%. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of around 130%. Each of these has at least one sample at p ≥ .95, and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal⁶ on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(0.8) (approximately equal to 90% correct during two-alternative forced-choice).

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold_0.5, in order to obtain tight confidence intervals for threshold_0.5), because estimation is done via the whole psychometric function, which, in turn, is estimated from the entire data set: Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
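To make the point concrete, compare the binomial standard errors of observed proportions correct at the two ends of one of our blocks of nᵢ = 80 trials (a worked example of ours, not taken from the paper):

$$\sigma(p) = \sqrt{\frac{p(1-p)}{n_i}}, \qquad \sigma(.95) = \sqrt{\frac{.95 \times .05}{80}} \approx .024, \qquad \sigma(.75) = \sqrt{\frac{.75 \times .25}{80}} \approx .048.$$

A point observed at p = .95 is thus roughly twice as well determined as one observed at p = .75, and the fitted function must pass correspondingly closer to it.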

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting; we used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes⁷ but would not change them qualitatively.

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this requirement can hardly be overstated: A strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question, because, as experimenters, we do not know, and never will know, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6; Figure 6A shows four different psychometric functions: (1) ψ_W(x; θ_W), using the Weibull as F, and θ_W = {10, 3, .5, .01} (our standard generating function ψ_gen); (2) ψ_CG(x; θ_CG), using the cumulative Gaussian with θ_CG = {8.875, 3.278, .5, .01}; (3) ψ_L(x; θ_L), using the logistic with θ_L = {8.957, 2.014, .5, .01}; and, finally, (4) ψ_G(x; θ_G), using the Gumbel and θ_G = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that, at most, one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.⁸ Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated from one of such similar psychometric functions during the bootstrap procedure: Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our standard generating function ψ_gen, using sampling scheme s2 with N = 120.

Slope, threshold_0.8, and threshold_0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.
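For reference, the code below writes down the four sigmoids in parameterizations that are consistent with the parameter values quoted above (the paper does not spell out its exact forms, so treat these as assumptions), together with the mean-of-four generating function used in the simulations that follow.

```python
import numpy as np
from scipy.stats import norm

def F_weibull(x, a, b):
    return 1.0 - np.exp(-(x / a) ** b)

def F_cum_gauss(x, a, b):
    return norm.cdf((x - a) / b)

def F_logistic(x, a, b):
    return 1.0 / (1.0 + np.exp(-(x - a) / b))

def F_gumbel(x, a, b):
    return 1.0 - np.exp(-np.exp((x - a) / b))

# Mean-of-four generating function of this section's simulations,
# with the parameter values quoted in the text (gamma = .5, lambda = .01):
def psi_mean(x):
    gamma, lam = 0.5, 0.01
    Fs = (F_weibull(x, 10.0, 3.0) + F_cum_gauss(x, 8.875, 3.278)
          + F_logistic(x, 8.957, 2.014) + F_gumbel(x, 10.022, 2.906))
    return gamma + (1.0 - gamma - lam) * (Fs / 4.0)
```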

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations, using

$$\psi = \tfrac{1}{4}\left(\psi_{\mathrm{W}} + \psi_{\mathrm{CG}} + \psi_{\mathrm{L}} + \psi_{\mathrm{G}}\right)$$

as the generating function, and fitted psychometric functions, using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions, to each data set. From the fitted psychometric functions, we obtained estimates of threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 N values). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,⁹ for a total of 4,480 bootstrap repeats affording 8,960,000 psychometric function fits (4.032 × 10⁹ simulated 2AFC trials).

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F: Weibull, θ = {10, 3, .5, .01}; cumulative Gaussian, θ = {8.9, 3.3, .5, .01}; logistic, θ = {9.0, 2.0, .5, .01}; Gumbel, θ = {10.0, 2.9, .5, .01}. See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, θ̂ = {8.4, 3.4, .5, .05}; logistic, θ̂ = {7.8, 1.0, .5, .05}) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so too were the distribution function F and all possible interactions: The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates: Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold_0.5 and slope_0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.¹⁰ The choice of distribution function F does not have, despite being a significant factor, a large effect on the WCI68 for threshold_0.5 and slope_0.5.

The same is not true, however, for the WCI68 of threshold_0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically: Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold_0.5 and slope_0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold_0.2, the choice of F has a profound influence on the WCI68 (e.g., in Figure 7A, there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold_0.8: Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F. WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on the WCI68. It was generally the case for threshold_0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose threshold and slope measures corresponding to threshold_0.5 and slope_0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish; in particular, at lower proportions correct, threshold_0.2 is much more affected by the choice of F than threshold_0.8 (cf. Figures 7A and 7C). If very low (or, perhaps, very high) response thresholds must be used when comparing experimental conditions (for example, 60% or 90% correct in 2AFC) and only small differences exist between the experimental conditions, one must explore a number of distribution functions F, to avoid finding significant differences between conditions merely because the (arbitrary) choice of a distribution function happened to yield comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                            Threshold_0.2  Threshold_0.5  Threshold_0.8  Slope_0.5
Number of trials N                                    76.24          87.30          65.14        72.86
Sampling schemes s1 ... s7                             6.60          10.00          19.93        19.69
Distribution function F                               11.36           0.13           3.87         0.92
Error (numerical variability in bootstrap)             0.34           0.35           0.46         0.34
Interaction of N and sampling schemes s1 ... s7        1.18           0.98           5.97         3.50
Sum of interactions involving F                        4.28           1.24           4.63         2.69
Percentage of SS accounted for without F              84.36          98.63          91.50        96.39

Note. The columns refer to threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.


Figure 7. The four panels show the width of the WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of the WCI68 for a given sampling scheme and N: Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold_0.2. (B) WCI68 for threshold_0.5. (C) WCI68 for threshold_0.8. (D) WCI68 for slope_0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates¹¹ of the variability of a distribution ϑ*. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂. Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however: Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ̂) are not robust; they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity ϑ might be seriously in error if only a single bootstrap estimate ϑᵢ* is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.¹²

Nonparametric plug-in estimates are also not without problems. Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

    BCa confidence intervals are necessary because thedistribution of sampling points x along the stimulus axismay cause the distribution of bootstrap estimates q* to bebiased and skewed estimators of the generating values q.The same applies to the bootstrap distributions of estimatesJ * of any quantity of interest, be it thresholds, slopes, or

    whatever. Maloney (1990) found that skew and bias par-ticularly raised problems for the distribution of the bpa-rameter of the Weibull, b* (N5 210, K5 7). We alsofound in our simulations that b*and thus, slopes s*were skewed and biased forNsmaller than 480, evenusing the best of our sampling schemes. The BCa methodattempts to correct both bias and skew by assuming thatan increasing transformation, m, exists to transform thebootstrap distribution into a normal distribution. Hence,we assume that F 5 m(q) and F5 m(q ) resulting in

    (2)

    where

    (3)

    and kF0 any reference point on the scale ofF values. InEquation3,z0 is the bias correction term, and a in Equa-tion 4 is theacceleration term. Assuming Equation2 to becorrect, it has been shown that an e-level confidence in-terval endpoint of the BCa interval can be calculated as

    (4)

    where CG is the cumulative Gaussian distribution function, G⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ*, z^(ε) = CG⁻¹(ε) is the 100εth percentile of the standard Gaussian, z₀ is our estimate of the bias, and a is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
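    As an illustration of how Equation 4 might be implemented, the following minimal Python sketch computes one BCa endpoint from a vector of bootstrap replications. This is our own sketch, not the authors' software: the function name bca_endpoint and its inputs are hypothetical, z₀ is estimated from the proportion of replications below the original estimate, and a is estimated by a jackknife, following Efron and Tibshirani (1993, chaps. 14, 22).

        import numpy as np
        from scipy.stats import norm  # norm.cdf plays the role of CG, norm.ppf of CG^-1

        def bca_endpoint(theta_hat, theta_star, theta_jack, eps):
            # Bias-correction term z0: from the fraction of bootstrap
            # replications that fall below the original estimate.
            z0 = norm.ppf(np.mean(theta_star < theta_hat))

            # Acceleration term a: from the skewness of the leave-one-out
            # (jackknife) estimates of the same quantity.
            d = np.mean(theta_jack) - theta_jack
            a = np.sum(d**3) / (6.0 * np.sum(d**2) ** 1.5)

            # Equation 4: corrected level, then the empirical percentile
            # G^-1 of the bootstrap distribution at that level.
            z_eps = norm.ppf(eps)
            eps_corrected = norm.cdf(z0 + (z0 + z_eps) / (1.0 - a * (z0 + z_eps)))
            return np.percentile(theta_star, 100.0 * eps_corrected)

        # Example: a WCI68-style interval from the two endpoints
        # lo = bca_endpoint(th, th_star, th_jack, .159)
        # hi = bca_endpoint(th, th_star, th_jack, .841)

    Note that with z₀ = 0 and a = 0 the corrected level reduces to ε, so the BCa endpoint collapses to the ordinary percentile endpoint discussed next.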

    For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply G⁻¹[ε] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge BCa confidence intervals have been shown only to be superior, or at worst equal, in performance to standard percentile intervals, never significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable: Its coverage was least affected by variations in sampling scheme and in N, and it was the least imbalanced (i.e., the probabilities of missing the true parameter value above and below the interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for sensitivity analysis, as described above.
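    The logic of such a coverage test can be sketched briefly. The Python fragment below is our own simplified illustration, not Hill's actual simulations: a plain binomial proportion stands in for a fitted threshold, and all values are invented; real simulations would instead refit a psychometric function to every synthetic data set.

        import numpy as np

        rng = np.random.default_rng(1)
        p_true, n, B = 0.75, 50, 999    # generating value, trials, bootstrap size

        miss_low = miss_high = 0
        for _ in range(2000):                    # 2,000 synthetic experiments
            p_hat = rng.binomial(n, p_true) / n  # "fit" to one synthetic data set
            # Parametric bootstrap: resample from the fitted, not the true, value.
            p_star = rng.binomial(n, p_hat, size=B) / n
            lo, hi = np.percentile(p_star, [2.5, 97.5])
            miss_low += p_true < lo              # true value falls below the interval
            miss_high += p_true > hi             # true value falls above the interval

        coverage = 1 - (miss_low + miss_high) / 2000
        print(coverage, miss_low, miss_high)     # near .95? misses balanced?

    Coverage near the nominal level, with roughly equal misses on the two sides, is what distinguishes a reliable interval method in such tests.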

    CONCLUSIONS

    In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and the derived measures, such as thresholds and slopes of psychometric functions.

    First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived
    variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if, and only if, threshold and slope values are chosen around the midpoint of the distribution function F; particularly for low thresholds (e.g., those defined near F = .2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

    With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

    REFERENCES

    Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.
    Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
    Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
    Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.
    Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.
    Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.
    Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.
    Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.
    Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
    Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.
    Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.
    Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.
    Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.
    Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.
    Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.
    Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.
    Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.
    Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, U.K.
    Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.
    Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.
    Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.
    Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.
    McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.
    Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.
    Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad. Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.
    Strube, M. J. (1988). Bootstrap type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.
    Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference of Visual Perception, Tübingen.
    Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.
    Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

    NOTES

    1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.
    2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.
    3. This holds only if our model is correct: Maximum-likelihood parameter estimation for two-parameter psychometric functions applied to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).
    4. Confidence intervals here are computed by the bootstrap percentile method: The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.
    5. Because this is the approximate coverage of the familiar standard error bar (denoting one's original estimate ± 1 standard deviation of a Gaussian), 68% was chosen.
    6. Optimal here, of course, means relative to the sampling schemes explored.
    7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under unconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.
    8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly
    better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

    9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.
    10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were reasonable; inclusion of unrealistic or unusual sampling schemes (for example, all x values such that nominal y values are below 0.55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.
    11. A straightforward way to estimate a quantity J that is derived from a probability distribution F by J = t(F) is to obtain an estimate F̂ of F from empirical data and then use Ĵ = t(F̂) as the estimate. This is called a plug-in estimate.
    12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by virtue of their resilience to such infrequent but extreme outliers.

    (Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)