
Global Model Analysis by Landscaping

Daniel J. Navarro, In Jae Myung, Mark A. Pitt and Woojae Kim
{navarro.20, myung.1, pitt.2, kim.1124}@osu.edu

Department of Psychology, Ohio State University
1827 Neil Avenue Mall

Columbus, OH 43210, USA

Abstract

How well do you know your favorite computational model of cognition? Most likely your knowledge of its behavior has accrued from tests of its ability to mimic human data, what we call local analyses because performance is assessed in a specific testing situation. Global model analysis by landscaping is an approach that "sketches out" the performance of a model at all of its parameter values, creating a landscape of the relative performance abilities of the model and a competing model. We demonstrate the usefulness of landscaping by comparing two models of information integration (the Fuzzy Logic Model of Perception and the Linear Integration Model). The results show that model distinguishability is akin to power, and may be improved by increasing the sample size, using better statistics, or redesigning the experiment. We show how landscaping can be used to measure this improvement.

Introduction

The development and testing of theories is one of the most important aspects of scientific inquiry. Theories provide us with the tools we need to understand the world (Kuhn, 1962), and frequently spark new and exciting experimental work (Estes, 2002). When we develop a mathematical or computational model of a cognitive process, it is generally an instantiation of a few fundamental properties of a verbal theory. This translation process requires some degrees of freedom, because data sets can vary in many ways and still be consistent with the qualitative ideas in the model. For instance, while forgetting curves generally look like decreasing exponential or power functions (e.g., Rubin & Wenzel, 1996), some people remember items more easily than others. We capture this idea by proposing models that have a number of free parameters which may be fine-tuned to fit the data. This idea has been widely adopted, and has led to successful models of a wide variety of cognitive phenomena.

One potential drawback with this approach is that, as the models we propose become more elaborate, the task of understanding the model itself becomes increasingly difficult. When assessing the performance of a model in light of some observed data, we generally try to find a single set of parameters which allow the model to fit the data best. Maximum likelihood estimation and least-squares methods are examples of this approach. Because a model is evaluated by identifying the single, best-fitting parameter set, these methods are examples of local model analysis. Like an iceberg, the vast majority of the model is hidden beneath the waves: only those few parameter sets that provide best fits ever allow the model to come to the surface and show us how it behaves. As a result, our experience of the model is limited to a few, possibly unrepresentative cases.

As modelers, we want to learn something about how our models behave in general, not just at the few specific points that come to light when we use local methods such as fitting data sets generated in an experiment. The limitations of such methods leave a number of interesting and important questions unanswered. We may be concerned that our model makes so many predictions that it could provide a good fit to almost any data (model complexity), that many different models make essentially the same predictions (model distinguishability), or that the predictions do not conform to the original qualitative theory (model faithfulness). Questions such as these can be referred to as issues of global model analysis. Global model analysis, as we conceive it, refers to the task of discovering what a model can and cannot do, particularly in light of empirical data and other models. In this paper, we introduce a simple global analysis method that we call landscaping.

Sketching a Landscape

The idea underlying landscaping is remarkably simple. Find all the things that a model can do, compare them to the things that other models can do, and see how these things relate to empirical data. In one sense it is the very opposite of parameter optimization (i.e., finding the best fit): rather than look for a single set of parameters (and a single prediction), we look at them all. When we do this, we obtain the full range of predictions made by the model, which we call the landscape.

Landscaping is a modeling tool, not a statistic, and can be adapted to answer many questions. In this paper we are concerned primarily with model distinguishability and its relationship to experimental design.

The Domain of Application

Models of information integration are concerned primarily with stimulus identification. Given some potentially ambiguous information from multiple sources regarding a stimulus, what is the stimulus most likely to be? The classic example is phonemic identification, in which visual and auditory cues are combined in order to make a decision (e.g., was it /ba/ or /da/?). In this study we compared the landscapes of two models of information integration: Oden and Massaro's (1978) Fuzzy Logic Model of Perception (FLMP) and Anderson's (1981) Linear Integration Model (LIM).

To briefly summarize an extensive literature: FLMP provides a remarkably good account of a wide range of phenomena, including some that look rather LIM-like and many that do not. There are a number of local analyses of both these models. It would be useful to unify them in some way to understand better how their performance is related. Landscaping does just this. In the process it provides answers to related questions. For instance, are there large numbers of FLMP patterns that are never observed in experimental data? Are there LIM patterns that FLMP cannot mimic, or is FLMP flexible enough to fit all LIM-like patterns? Is experimental design important? In short, we want to find out what lies beneath the surface of all these local analyses.

Furthermore, we want to answer these questions with respect to the kind of small sample sizes that characterize real experiments. With that in mind, we approach the comparison as experimentalists. We have in mind a particular experiment that we wish to conduct, and have a number of questions about the relationships between FLMP, LIM, and experimental data. For example, what kinds of data sets are consistent with each of the two models? Are there some kinds of data that are consistent with both models? How successfully can our experiment tell the two models apart? What statistics will we need to do so? These questions can be very difficult to answer using local methods, but readily fall out of a landscaping analysis.

Experiment One

The experiment that we have in mind is a two-choice identification task (i.e., choose A or B) with a 2 × 8 design. In other words, there are two different levels of one source (e.g., visual), i ∈ {i1, i2}, and eight different levels of the other source (e.g., auditory), j ∈ {j1, ..., j8}. In total, there are 16 stimuli that may be produced by combining the two evidence sources.¹ Furthermore, we anticipate a sample size of N = 24 (not unusual in psychological experiments). Letting p_ij denote the probability of responding "A" when presented with the i-th level of one source and the j-th level of the other, FLMP is characterized by the equation

p_ij = θ_i λ_j / (θ_i λ_j + (1 − θ_i)(1 − λ_j)),

whereas LIM predicts that

p_ij = (θ_i + λ_j) / 2.

In both cases we assume that θ_i ≤ θ_{i+1} and λ_j ≤ λ_{j+1} for all i and j.
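To make the two equations concrete, here is a minimal sketch in Python. The function names (flmp_prob, lim_prob) are our own, invented for illustration; they are not code from the original study.

    def flmp_prob(theta_i, lam_j):
        """FLMP: multiplicative (fuzzy-logical) integration of two sources."""
        num = theta_i * lam_j
        return num / (num + (1.0 - theta_i) * (1.0 - lam_j))

    def lim_prob(theta_i, lam_j):
        """LIM: simple averaging of the two sources."""
        return (theta_i + lam_j) / 2.0

Both functions work equally well on scalars or on NumPy arrays, which the later sketches rely on.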

The Landscape of Model Fits

Our landscaping analysis consists of generating a large number of "experimental" data sets from each model: that is, the kind of data we would expect to observe if FLMP (or LIM) were the true model of human performance. The comparison between the models is then based on how well each model fits all of these data sets. The results reveal the distinguishability of the models and their similarities and differences.

Generating hypothetical experimental data is simple. In a two-choice task, both FLMP and LIM assume that the sampling error follows a binomial distribution (with N = 24 in this case). To sketch a landscape of FLMP data, we randomly generated a large number of FLMP parameter sets (10,000 in this case), found the p_ij values, and added sampling error.² This was then repeated using LIM.
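A rough sketch of this generation step follows. For simplicity it samples parameters uniformly (the simpler alternative noted in footnote 2) rather than from Jeffreys' distribution, and it reuses the hypothetical flmp_prob and lim_prob helpers sketched above.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 24          # binomial sample size per condition
    n_sets = 10000  # number of hypothetical experiments

    def generate_data(model, n_sets, N, rng):
        # sorting enforces the ordering constraints on theta and lambda
        theta = np.sort(rng.uniform(0, 1, size=(n_sets, 2)), axis=1)
        lam = np.sort(rng.uniform(0, 1, size=(n_sets, 8)), axis=1)
        p = model(theta[:, :, None], lam[:, None, :])  # (n_sets, 2, 8)
        # binomial sampling error: observed counts of "A" responses per cell
        return rng.binomial(N, p)

    flmp_data = generate_data(flmp_prob, n_sets, N, rng)
    lim_data = generate_data(lim_prob, n_sets, N, rng)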

Because each of these data sets represents the potential outcome of an information integration experiment, by fitting both FLMP and LIM to them (using maximum likelihood estimation) we can determine how effectively an experiment of this kind will discriminate between the two models. Intuitively, one expects the generating model to fit its own data better than the competing model, but due to the joint effects of sampling error and differences between the models, this is not always the case.
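The fitting step can be sketched with an off-the-shelf optimizer. The multi-start scheme below is our illustration rather than the authors' procedure, and for brevity it ignores the identifiability issue discussed in footnote 1.

    from scipy.optimize import minimize

    def neg_log_lik(params, counts, N, model):
        # params holds the 2 theta values followed by the 8 lambda values
        theta, lam = params[:2], params[2:]
        p = np.clip(model(theta[:, None], lam[None, :]), 1e-10, 1 - 1e-10)
        return -np.sum(counts * np.log(p) + (N - counts) * np.log(1 - p))

    def log_ml(counts, N, model, n_starts=10):
        # restart a local optimizer from several points to approximate
        # the global maximum likelihood
        best = np.inf
        for _ in range(n_starts):
            x0 = rng.uniform(0.05, 0.95, size=10)
            res = minimize(neg_log_lik, x0, args=(counts, N, model),
                           bounds=[(1e-6, 1 - 1e-6)] * 10)
            best = min(best, res.fun)
        return -best  # logarithm of the maximized likelihood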

¹ It is well known that FLMP is non-identifiable for this experimental design, but we may fix one parameter value (say, θ_1) without loss of generality (Crowther, Batchelder & Hu, 1995). Although this technique does not work for LIM, LIM may be reparameterized as the identifiable model p_ij = α_i + β_j + c, where α_min = β_min = 0, α_i ∈ [0, 1/2], β_j ∈ [0, 1/2], and c ∈ [0, 1 − α_max − β_max].

² Clearly, the data sets depend on the distribution from which one samples. In this case we sampled from Jeffreys' distribution (see Robert, 2001, for instance), corresponding to the assumption of maximum uncertainty about the data. However, Jeffreys' distribution is difficult to sample from in many situations, and one may wish to specify a different distribution. Another principled choice is the uniform distribution, which corresponds to maximum uncertainty about the parameters, and is trivial to sample from.
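As an aside on footnote 2, the two sampling assumptions can be contrasted for a single binomial rate, for which the Jeffreys prior is the Beta(1/2, 1/2) distribution. This per-parameter version is only a rough stand-in for the joint Jeffreys' distribution over a model's full parameter space, but it conveys the difference:

    # Jeffreys prior for one binomial rate: maximum uncertainty about the data
    theta_jeffreys = rng.beta(0.5, 0.5, size=n_sets)
    # uniform prior: maximum uncertainty about the parameter itself
    theta_uniform = rng.uniform(0, 1, size=n_sets)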

Figure 1: Scatterplots of the (logarithm of) maximum likelihood estimates for 10,000 data sets generated by FLMP (left panel) and LIM (right panel). Values on the x-axis denote the fit of FLMP to the data, and y-values denote LIM fit. Data come from a 2 × 8 experimental design with N = 24. The inset panels (a) through (h) display typical data sets sampled from different regions. The solid line depicts the ML decision threshold, and the broken line is the MDL threshold.


The Lay of the Land

Figure 1 displays the 10,000 data sets generated by FLMP (left panel) and LIM (right panel), plotted as a function of the maximum likelihood estimate for each model. The solid line marks the decision boundary for the maximum likelihood (ML) criterion: LIM provides a better fit to all data sets that lie above the solid line, whereas FLMP provides the better fit to the points below. The inset panels display p_ij values (on the y-axis) as a function of the 8 levels of j (on the x-axis). The two lines correspond to the two levels of i (the upper line represents i2).

These plots are remarkably informative because the relative performance (i.e., fits) of the two models across the entire range of data patterns for each model is visible. Inspection of the left panel reveals that data generated by FLMP are almost always better fit by FLMP than by LIM. In fact, there are only 6 points (of 10,000) above the solid line. In other words, if we used ML as a method to guess which model generated the data, we would only be incorrect in a tiny proportion of cases. Not only that, the vast majority of FLMP patterns are nowhere near the solid line, indicating that in most cases the decision is clear-cut: FLMP provides the better fit. Interestingly, we note that the scatterplot tapers in the lower right-hand area: when the LIM fit is exceptionally poor, the FLMP fit is especially good. In short, FLMP will almost never be confused for LIM.
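In code, the ML decision rule amounts to comparing the two fit values for each data set. Continuing the hypothetical sketches above, the misclassification count for the FLMP landscape would be tallied like this:

    # fit both models to every FLMP-generated data set
    flmp_fits = np.array([log_ml(d, N, flmp_prob) for d in flmp_data])
    lim_fits = np.array([log_ml(d, N, lim_prob) for d in flmp_data])

    # ML rule: attribute each data set to whichever model fits it better
    n_errors = int(np.sum(lim_fits > flmp_fits))
    print(f"FLMP data misattributed to LIM: {n_errors} of {len(flmp_data)}")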

On another note, it appears that the variability in this scatterplot is interpretable in terms of human performance. Even a cursory examination of the types of response patterns that fall in different regions of the FLMP landscape is informative. Inset (d) in Figure 1 shows a sample data set drawn from the lower tail of the FLMP distribution, which displays a pattern that is typical of those observed in such experiments. The sigmoidal curves in (c), and to some extent (b), are not atypical of human data, though the step-function curves in (a) are not characteristic of human performance.

The right panel tells a different tale, in which LIM data sets tend to cluster in a tight region near the decision boundary. In fact, a total of 3,130 of the 10,000 data sets fall on the wrong side of it, meaning that FLMP can fit LIM data better than LIM itself almost one third of the time. Worse yet, the LIM data sets cluster in a direction parallel to the decision boundaries. This means that when LIM fits the data well, so does FLMP, even though the data sets came from LIM. In fact, since the data sets rarely fall far away from the solid line, it is clear that FLMP is capable of mimicking LIM all the way across LIM's parameter space. The models are highly confusable.

A cursory sweep through the LIM data is consistent with these findings. Since LIM is such a simple model, most response patterns look alike (parallel lines), and the major source of variation is sampling error. When both models fit well, as in inset (h), the data tend to look like parallel lines. Both models fit poorly when the noise heavily distorts the response pattern, as in (f). Occasionally, as in (e), sampling error is more damaging to the FLMP fit than the LIM fit. On the other hand, as in (g), it sometimes allows FLMP to fit better than LIM.

The major implication of the landscape analysis is this: if FLMP is the correct model, then it should be easy to perform 2 × 8 design experiments that support it over LIM. However, if LIM is the correct model, such an experiment will not be very effective in distinguishing it from FLMP, and another test will need to be devised in order to do so. Looked at another way, the inability to distinguish between two models is an issue of power, of determining how effective an experiment can possibly be. This insight makes it clear that power is asymmetric in the current experiment: it is easy to distinguish FLMP from LIM, but difficult to distinguish LIM from FLMP.

Remedying the Asymmetry

There are at least three ways of increasing the power of the experiment to overcome the asymmetry and make the comparison a more balanced test of the two models. The standard remedy is also the simplest: increase the sample size. By increasing the sample size, we decrease the amount of sampling error, and should therefore be better able to discriminate between the two. However, this approach can suffer from pragmatic and theoretical difficulties. The pragmatic problem is that it may not be feasible to increase the sample size, as in clinical studies for instance, where one may have limited opportunities to collect data. From a theoretical point of view, it is possible that increasing the sample size will yield limited returns. If FLMP can produce response patterns that look LIM-like even without sampling error (i.e., as N → ∞), then reducing the error may not help.

The second solution is to use more powerful statistics. Although ML is useful for measuring fits to data, it is outperformed in small samples by a great many other statistics. One of the more accurate of these is Rissanen's (1996) Minimum Description Length (MDL) criterion, which has recently been employed with some success in psychological modeling (e.g., Lee, 2002; Navarro & Lee, 2002), and is more effective at discriminating between FLMP and LIM (Pitt, Kim & Myung, in press). ML and MDL differ only by a constant "geometric complexity" (GC) term (Pitt, Myung & Zhang, 2002):

MDL = −ln(ML) + GC.

In this case, the geometric complexity of FLMP is greater than that of LIM by only 1.9, which can seem like a small difference in view of the variability in Figure 1. Nevertheless, when the MDL criterion is applied (shown by the broken lines) instead of ML, the asymmetry greatly diminishes. By introducing a complexity penalty, MDL makes a few more misclassifications for FLMP data, incorrectly choosing LIM 28 times out of 10,000. However, the major difference occurs for the LIM data, where the error rate falls ten-fold, from 3,130 to 356 out of 10,000. Because the LIM data sets cluster in such a tight region of the scatterplot, this small correction produces a massive improvement in classification: the overall error rate across the 20,000 data sets drops from 15.7% to 2.3%. On the basis of these results, it is tempting simply to recommend the use of MDL over ML, since it is the better statistic in general (see Grunwald, 2000). However, the GC term in the MDL criterion can be very difficult to evaluate even for simple models, due to an often-intractable integral term. For nonlinear models with many parameters, it can be nearly impossible.
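Because ML and MDL differ only by the constant GC term, the MDL decision can reuse the fits computed earlier. In the sketch below the individual GC values are placeholders; only their difference of 1.9 (FLMP minus LIM) is given in the text.

    GC_FLMP, GC_LIM = 1.9, 0.0  # placeholders; only the difference matters here

    # smaller MDL is better: MDL = -ln(ML) + GC
    mdl_flmp = -flmp_fits + GC_FLMP
    mdl_lim = -lim_fits + GC_LIM
    n_errors_mdl = int(np.sum(mdl_lim < mdl_flmp))  # FLMP data misattributed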

The third remedy relies on Lord Rutherford's assertion to the effect that "if your experiment needs statistics, you should have done a better experiment". It might be that, with only minor alterations, we could perform an experiment that would be more likely to distinguish between the models without requiring elaborate statistical inference or enormous sample sizes. Of course, inventing new experimental designs requires the kind of insight on the part of experimenters for which no methodology can substitute. On the other hand, once we have thought of a new experimental setup, it is simple enough to use landscaping to see whether the new design is likely to be more successful than the first. It is this possibility that we now examine.

Experiment Two

One of the difficulties with the original 2 × 8 design is that the experiment does not directly measure θ_i and λ_j. In other words, it does not ask how people would respond if only one source of evidence (auditory or visual) were provided. FLMP and LIM make different predictions in this regard: LIM predicts p_i = θ_i, whereas FLMP predicts that p_i = θ_i / (1 − θ_i).

Figure 2: Scatterplots of FLMP fit versus LIM fit (again on a logarithmic scale) for the expanded 2 × 8 experimental design with unimodal conditions and N = 24. As before, the solid line represents equal fit, the broken line represents equal MDL values, and the inset panels display sample data sets.

This suggests an elegant alteration to the original design: include the 10 extra "unimodal" stimuli as additional conditions. This redesign accomplishes three objectives. First, the non-identifiability problem that we alluded to in footnote 1 vanishes. Second, the recovered parameter values are more easily interpreted as measurements of the evidence provided by each source. Third, the power of the experiment is increased, as we show below.

Performing the same landscaping analysis with 10,000 data sets on our new 2 × 8 (plus unimodal) design shows the effect of adding these conditions, displayed in Figure 2. Given that the shapes of the scatterplots are quite similar to those in Figure 1, it seems likely that there are no substantial qualitative differences between the models across experiments. Rather, the change in design has constrained their behavior (i.e., data-fitting ability) to regions of the landscape in which LIM is distinguishable from FLMP and vice versa.

As before, the FLMP data sets are generally quite distant from the decision thresholds, and both ML and MDL are very successful in selecting FLMP as the correct model: ML makes no misclassifications at all, whereas MDL makes only 18 errors. An informal scan across the scatterplot supports our intuition that the qualitative features of FLMP are unchanged. The lower tail of the FLMP distribution still contains the classic FLMP-like data sets, illustrated in panels (c) and (d), whereas the patterns closest to the decision boundaries, as in (b), are much more linear. It is not clear, however, whether the pattern in (a) represents a difference from its counterpart in Figure 1.

Inspection of the panel on the right reveals that the expanded design has allowed the LIM data to distinguish itself from FLMP. Although the distribution of data sets is still parallel to the decision thresholds, indicating that FLMP can still mimic LIM, it is shifted away from the decision criteria, indicating that the extent of the mimicry has been substantially reduced. In this experimental design ML makes far fewer mistakes, only 100, and MDL makes only a single misclassification. That is, by adopting this expanded design, the ML error rate drops from 15.7% to 0.5%, and the MDL error rate drops from 2.3% to 0.1%. Again, a brief survey of the landscape shows that the patterns illustrated by (e), (f) and (g) match their counterparts in Figure 1. However, since LIM has developed a long tail, we display two patterns, (h) and (i), both of which resemble panel (h) in Figure 1.



General Discussion

Landscaping is a simple and powerful tool for model evaluation and comparison. It is a method for viewing the relationship between two models across the space of all possible data patterns that the models can generate within a particular experimental design. FLMP encompasses a larger range of data sets than LIM, a range that includes patterns produced by participants. Furthermore, these patterns fall in the main body of the FLMP scatterplot, and appear to be as representative of FLMP as they are unrepresentative of LIM.

Second, by plotting the LIM data sets, we became aware of the potential difficulty of distinguishing between FLMP and LIM. Doing so required the use of very powerful statistical methods (MDL) or an expanded experimental design.

Third, despite the change in experimental design, the shape and composition of the model landscapes seem to be more or less invariant. We speculate that, although data fits and model distinguishability vary substantially across experimental designs, the qualitative flavor of the landscape is constant.

Among the strengths of landscaping is its adaptability. It is a tool that can and should be modified to suit the circumstances. For instance, in this paper we sampled parameter values (with some pain) from a Jeffreys' distribution. In many cases a simple uniform distribution is appropriate, particularly if the parameters are assumed to correspond to real psychological variables. Similarly, few modeling situations require 10,000 data sets. If the aim is only to estimate the power of an experimental design, a few hundred would likely suffice, since the fine detail of the landscape is irrelevant. If we are interested in looking at the types of response patterns predicted by the models (rather than the data sets that we would expect to observe), there is no need to add sampling error.

In general, we suspect that landscaping analyses on the scale we have undertaken here will rarely be required, and that smaller, simpler evaluations will suffice. Even a little landscaping may go a long way. If model indistinguishability is unavoidable, we are alerted to the necessity of tools such as MDL. On the other hand, high distinguishability suggests that smaller samples, simpler designs, or more convenient statistics will be adequate.

If local analyses such as maximum likelihood show us the tip of the iceberg, then global methods such as landscaping allow us to look beneath the surface to the model below. We hope that in doing so, global methods may actually simplify the work required to distinguish models. Landscaping allows one to quickly "sketch out" all the possible outcomes of an experiment that we are thinking about conducting. Should it reveal a problem such as indistinguishable models, landscaping can be used to estimate the effectiveness of a proposed solution for increasing the power of the experiment. Every remedy requires something extra to be added, be it statistical machinery, participants, or experimental conditions. Perhaps this is unavoidable. Even so, while there may be no free lunches in model evaluation and testing, we can often choose a preferred method of payment.

Acknowledgments

This research was supported by research grant R01 MH57472 from the National Institute of Mental Health. We thank Nancy Briggs, Michael Lee and Yong Su for many helpful comments and discussions.

References

Anderson, N. H. (1981). Foundations of Information Integration Theory. New York: Academic Press.

Crowther, C. S., Batchelder, W. H., & Hu, X. (1995). A measurement-theoretic analysis of the fuzzy logic model of perception. Psychological Review, 102, 396-408.

Estes, W. K. (2002). Traps in the route to models of memory and decision. Psychonomic Bulletin & Review, 9, 3-25.

Grunwald, P. (2000). Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133-152.

Kuhn, T. S. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.

Lee, M. D. (2002). Generating additive clustering models with limited stochastic complexity. Journal of Classification, 19(1), 69-85.

Navarro, D. J., & Lee, M. D. (2002). Commonalities and distinctions in featural stimulus representations. In W. D. Gray & C. D. Schunn (Eds.), Proceedings of the 24th Annual Conference of the Cognitive Science Society (pp. 685-690). Mahwah, NJ: Lawrence Erlbaum.

Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

Pitt, M. A., Kim, W., & Myung, I. J. (in press). Flexibility versus generalizability in model selection. Psychonomic Bulletin & Review.

Pitt, M. A., Myung, I. J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40-47.

Robert, C. P. (2001). The Bayesian Choice (2nd ed.). New York: Springer.

Rubin, D. C., & Wenzel, A. E. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734-760.