Modeling Spatial Dependencies for Mining Geospatial Data197

17
Modeling Spatial Dependencies for Mining Geospatial Data Sanjay Chawla , Shashi Shekhar , Weili Wu § , and Uygar Ozesmi 1 Introduction Widespread use of spatial databases[24] is leading to an increasing interest in min- ing interesting and useful but implicit spatial patterns[14, 17, 10, 22]. Efficient tools for extracting information from geo-spatial data, the focus of this work, are crucial to organizations which make decisions based on large spatial data sets. These organizations are spread across many domains including ecology and envi- ronment management, public safety, transportation, public health, business, travel and tourism[2, 12]. Classical data mining algorithms[1] often make assumptions (e.g. independent, identical distributions) which violate Tobler’s first law of Geography: everything is related to everything else but nearby things are more related than distant things[25]. In other words, the values of attributes of nearby spatial objects tend to systemati- cally affect each other. In spatial statistics, an area within statistics devoted to the analysis of spatial data, this is called spatial autocorrelation[6]. Knowledge discov- ery techniques which ignore spatial autocorrelation typically perform poorly in the Support in part by the Army High Performance Computing Research Center under the auspices of Department of the Army, Army Research Laboratory Cooperative agreement number DAAH04- 95-2-0003/contract number DAAH04-95-C-0008, and by the National Science Foundation under grant 963 1539. Vignette Corporation, Waltham MA 02451. Email:[email protected] Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected] § Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected] Department of Environmental Sciences, Ericyes University, Kayseri, Turkey. Email:[email protected] 1 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited

description

modeling-spatial-dependencies-for-mining-geospatial-data

Transcript of Modeling Spatial Dependencies for Mining Geospatial Data197

ModelingSpatialDependenciesforMiningGeospatial DataSanjay Chawla, Shashi Shekhar, Weili Wu,andUygar Ozesmi1 IntroductionWidespreaduseofspatialdatabases[24]isleadingtoanincreasinginterestinmin-inginterestinganduseful but implicit spatial patterns[14, 17, 10, 22]. Ecienttoolsforextractinginformationfromgeo-spatial data, thefocusof thiswork, arecrucial to organizations whichmake decisions basedonlarge spatial data sets.Theseorganizationsarespreadacrossmanydomainsincludingecologyandenvi-ronmentmanagement,publicsafety,transportation,publichealth,business,travelandtourism[2,12].Classical data mining algorithms[1] often make assumptions (e.g. independent,identicaldistributions)whichviolateToblersrstlawofGeography: everythingisrelated to everything else but nearby things are more related than distant things[25].In other words, the values of attributes of nearby spatial objects tend to systemati-cally aect each other. In spatial statistics,an area within statistics devoted to theanalysisofspatialdata,thisiscalledspatialautocorrelation[6]. Knowledgediscov-erytechniqueswhichignorespatialautocorrelationtypicallyperformpoorlyintheSupport in part by the Army High Performance Computing Research Center under the auspicesof Department of the Army, Army Research Laboratory Cooperative agreement number DAAH04-95-2-0003/contractnumberDAAH04-95-C-0008, andbytheNational ScienceFoundationundergrant9631539.VignetteCorporation,WalthamMA02451. Email:[email protected] of Computer Science, Universityof Minnesota, Minneapolis, MN55455, USA.Email: [email protected] of Computer Science, Universityof Minnesota, Minneapolis, MN55455, USA.Email: [email protected] of Environmental Sciences, Ericyes University, Kayseri, Turkey.Email:[email protected] by SIAM. Unauthorized reproduction of this article is prohibited2presenceofspatial data. Spatial statisticstechniques, ontheotherhand, dotakespatial autocorrelationdirectlyintoaccount[3], buttheresultingmodelsarecom-putationallyexpensiveandaresolvedviacomplexnumerical solversorsamplingbasedMarkovChainMonteCarlo(MCMC)methods[15].In this paper we rst review spatial statistical methods which explictly modelspatial autocorrelationandweproposePLUMS(PredictingLocationsUsingMapSimilarity),anewapproachforsupervisedspatialdataminingproblems. PLUMSsearchestheparameterspaceof modelsusingamap-similaritymeasurewhichismoreappropriateinthecontextof spatial data. Wewill showthatcomparedtostate-of-the-artspatialstatisticsapproaches,PLUMSachivescomparableaccuracybutatafractionofthecost(twoordersofmagnitude). Furthermore,PLUMSpro-videsageneralframeworkforspecializingotherdataminingtechniquesforminingspatialdata.1.1 Uniquefeaturesofspatial dataminingThe dierence betweenclassical andspatial dataminingparallels the dierencebetweenclassical andspatial statistics. Oneofthefundamental assumptionsthatguides statistical analysis is that thedatasamples areindependentlygenerated:theyarelikethesuccessivetossesofacoin,ortherollingofadie. Whenitcomestotheanalysisofspatialdata,theassumptionabouttheindependenceofsamplesis generally false. In fact,spatial data tends to be highly self correlated. For exam-ple,peoplewithsimilarcharacteristics,occupationandbackgroundtendtoclustertogether in the same neighborhoods. The economies of a region tend to be similiar.Changesinnaturalresources,wildlife,andtemperaturevarygraduallyoverspace.Infact, thispropertyof likethingstoclusterinspaceissofundamental that, asnoted earlier, geographers have elevated it to the status of the rst law of geography.Thispropertyofselfcorrelationiscalledspatialautocorrelation. Anotherdistinctpropertyofspatialdata,spatial heterogeneity,impliesthatthevariationinspatialdataisafunctionofitslocation. Spatialheterogeneityismeasuredvialocalmea-suresofspatialautocorrelation[18]. WediscussmeasuresofspatialautocorrelationinSection2.1.2 FamousHistorical ExamplesofSpatial DataExplorationSpatial dataminingis aprocess of automatingthesearchfor potentiallyusefulpatterns. Wenowlistthreehistoricalexamplesofspatialpatternswhichhavehadaprofoundeectonsocietyandscienticdiscourse[11].1. In1855, whenthe Asiatic cholerawas sweepingthroughLondon, anepi-demiologistmarkedalllocationsonamapwherethediseasehadstruckanddiscovered that the locations formed a cluster whose centroid turned out to beawater-pump. Whenthegovernmentauthoritiesturned-othewaterpump,thecholerabegantosubside. Laterscientistsconrmedthewater-bornena-tureofthedisease.2. ThetheoryofGondwanaland,whichsaysthatallthecontinentsonceformedCopyright by SIAM. Unauthorized reproduction of this article is prohibited3onelandmass, was postulatedafter R. Lenzdiscovered(usingmaps) thatall thecontinentscouldbettedtogetherintoonepiecelikeonegiantjig-sawpuzzle. Laterfossil studiesprovidedadditional evidencesupportingthehypothesis.3. In1909agroupofdentistsdiscoveredthattheresidentsofColoradoSpringshad unusually healthy teeth, and they attributed this to high levels of naturalourideinthelocal drinkingwatersupply. Researcherslaterconrmedthepositive role of ouride incontrollingtooth-decay. Nowall municipalitiesintheUnitedStates ensurethat drinkingwater supplies arefortiedwithouride.Ineachof thesethreeinstances, spatial dataexplorationresultedinasetof un-expectedhypotheses (or patterns) whichwere later validatedbyspecialists andexperts. Thegoal of spatial dataminingis toautomatethediscoveries of suchpatternswhichcanthenbeexaminedbydomainexpertsforvalidation. Validationis usuallyaccomplishedbyacombinationof domainexpertise andconventionalstatisticaltechniques.1.3 AnIllustrativeApplicationDomainWe now introduce an example which will be used throughout this paper to illustratethe dierent concepts in spatial data mining. We are given data about two wetlands,named Darr and Stubble, on the shores of Lake Erie in Ohio USA in order to predictthe spatial distribution of a marsh-breeding bird, the red-winged blackbird (Agelaiusphoeniceus). The data was collected from April to June in two successive years, 1995and1996.Auniformgridwasimposedonthetwowetlandsanddierenttypesofmea-surementswererecordedateachcell orpixel. Intotal, valuesof sevenattributeswere recorded at each cell. Of course domain knowledge is crucial in deciding whichattributes areimportant andwhicharenot. For example, VegetationDurabilitywaschosenoverVegetationSpeciesbecausespecializedknowledgeaboutthebird-nesting habits of the red-winged blackbird suggested that the choice of nest locationis more dependent on plant structure and plant resistance to wind and wave actionthanontheplantspecies.Ourgoal istobuildamodel forpredictingthelocationof birdnestsinthewetlands. Typicallythe model is built using a portionof the data, calledtheLearning or Tranining data, and then tested on the remainder of the data, calledthe Testing data. For example,later on we will build a model using the 1995 dataon the Darr wetland and then test it on either the 1996 Darr or 1995 Stubble wetlanddata. Inthelearningdata, all theattributesareusedtobuildthemodel andinthetraininngdata, onevalueishidden, inourcasethelocationof thenests, andusingknowledgegainedfromthe1995Darrdataandthevalueoftheindependentattributes in the test data, we want to predict the location of the nests in Darr 1996orinStubble1995.Inthis paper wefocus onthreeindependent attributes, namelyVegetationDurability, DistancetoOpenWater, andWaterDepth. Thesignicanceof theseCopyright by SIAM. Unauthorized reproduction of this article is prohibited4three variables was established using classical statistical analysis. The spatial distri-bution of these variables and the actual nest locations for the Darr wetland in 1995areshowninFigure1. Thesemapsillustratetwoimportantpropertiesinherentinspatialdata.1. Thevalueofattributeswhicharereferencedbyspatiallocationtendtovarygraduallyover space. Whilethis mayseemobvious, classical dataminingtechniques, either explictly or implicitly, assume that the data is independentlygenerated. Forexample, themapsinFigure2showthespatial distributionof attributesif theywereindependentlygenerated. Oneof theauthorshasapplied classical data mining techniques like logistic regression[20] and neuralnetworks[19] to build spatial habitat models. Logistic regression was used be-cause the dependent variable is binary (nest/no-nest) and the logistic functionsquashes the real line onto the unit-interval. The values in the unit-intervalcanthenbeinterpretedasprobabilities. Thestudyconcludedthatwiththeuseof logisticregression, thenestscouldbeclassiedatarate24%betterthanrandom[19].2. Thespatial distributions of attributes sometimes havedistinct local trendswhichcontradicttheglobaltrends. ThisisseenmostvividlyinFigure1(b),wherethespatial distributionofVegetationDurabilityisjaggedinthewest-ern section of the wetland as compared to the overall impression of uniformityacrossthewetland. Thispropertyiscalledspatial heterogeneity. InSection2.2wedescribetwomeasureswhichquantifythenotionofspatialautocorre-lationandspatialheterogeneity.Thefactthatclassical dataminingtechniquesignorespatial autocorrelationandspatial heterogeneityinthemodel buildingprocess is onereasonwhythesetechniquesdoapoorjob. Asecond, moresubtlebutequallyimportantreasonisrelatedtothechoiceof theobjectivefunctiontomeasureclassicationaccuracy.Foratwo-classproblem,thestandardwaytomeasureclassicationaccuracyistocalcuatethepercentageof correctlyclassiedobjects. Thismeasuremaynotbethe most suitable in a spatial context. Spatialaccuracyhow far the predictions arefromtheactualsisasimportantinthisapplicationdomainduetotheeectsofdiscretizationsof acontinuouswetlandintodiscretepixels, asshowninFigure3.Figure3(a) shows theactual locations of nests and 3(b) shows thepixels withactual nests. Notethelossof informationduringthediscretizationof continuousspaceintopixels. ManynestlocationsbarelyfallwithinthepixelslabeledAandarequiteclosetootherblankpixels, whichrepresentno-nest. Nowconsidertwopredictions shown in Figure 3(c) and3(d). Domain scientists prefer prediction3(d)over 3(c),sincepredictednestlocationsarecloseronaveragetosomeactualnestlocations. The classication accuracy measure cannot distinguish between 3(c) and3(d),andameasureofspatialaccuracyisneededtocapturethispreference.A simple and intuitive measure of spatial accuracy is the Average Distance toCopyright by SIAM. Unauthorized reproduction of this article is prohibited50 20 40 60 80 100 120 140 16001020304050607080nz = 85Nest sites for 1995 Darr locationMarsh landNest sites(a)NestLocations0 20 40 60 80 100 120 140 16001020304050607080nz = 5372Vegetation distribution across the marshland0 10 20 30 40 50 60 70 80 90(b)VegetationDurability0 20 40 60 80 100 120 140 16001020304050607080nz = 5372Water depth variation across marshland0 10 20 30 40 50 60 70 80 90(c)WaterDepth0 20 40 60 80 100 120 140 16001020304050607080nz = 5372Distance to open water0 10 20 30 40 50 60(d)DistancetoOpenWaterFigure 1. (a) Learning dataset: The geometryof the wetlandandthelocations of the nests, (b) The spatial distributionof vegetationdurabilityoverthe marshland, (c) The spatial distributionof water depth, and(d) The spatialdistributionofdistancetoopenwater.NearestPrediction(ADNP)fromtheactualnestsites,whichcanbedenedasADNP(A, P) =1KK

k=1d(Ak, Ak.nearest(P)).HereAkrepresentstheactualnestlocations, PisthemaplayerofpredictednestlocationsandAk.nearest(P)denotesthenearestpredictedlocationtoAk. Kisthenumberofactualnestsites. InSection3wewillintegratetheADNPmeasureinto the PLUMS framework. We now formalize the spatial data mining problem byincorporating notions of spatial autocorrelation and spatial accuracy in the problemdenition.1.4 LocationPrediction: ProblemFormulationTheLocationPredictionproblemisageneralizationofthenestlocationpredictionproblem. It capturestheessential propertiesof similar problemsfromother do-Copyright by SIAM. Unauthorized reproduction of this article is prohibited60 20 40 60 80 100 120 140 16001020304050607080nz = 5372White Noise No spatial autocorrelation0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9(a) pixel property with independentidenticaldistribution0 20 40 60 80 100 120 140 16001020304050607080nz = 195Random distributed nest sites(b)RandomnestlocationsFigure2. Spatial distributionsatisfyingrandomdistributionassumptionsofclassical regressionA= nest locationP = predicted nest in pixelA=actual nest in pixelP PAA P PA AA(a)AA A(b)(d) (c)PPLegendFigure 3. (a)The actual locations of nest, (b)Pixels withactual nests,(c)Locationpredictedbyamodel, (d)Locationpredictedbyanothermodel. Predic-tion(d)isspatiallymoreaccuratethan(c).mainsincludingcrimepreventionandenvironmentalmanagement. Theproblemisformallydenedasfollows:Given: AspatialframeworkSconsistingofsites {s1, . . . , sn}foranunderlyinggeographicspaceG. AcollectionofexplanatoryfunctionsfXk:S Rk, k=1, . . . K. Rkistherangeofpossiblevaluesfortheexplanatoryfunctions. AdependentfunctionfY: S RY Afamily FoflearningmodelfunctionsmappingR1 . . . RK RY.Find: AfunctionfY F.Objective: maximizesimilarity(mapsiS( fY(fX1, . . . , fXK)), map(fY (si)))=(1 ) classicationaccuracy( fY, fY ) + ()spatialaccuracy(( fY, fY )Constraints:Copyright by SIAM. Unauthorized reproduction of this article is prohibited71. GeographicSpaceSisamulti-dimensionalEuclideanSpace1.2. Thevaluesof theexplanatoryfunctions, fX1, ..., fXKandtheresponsefunctionfYmaynot beindependent withrespect tothoseof nearbyspatialsites,i.e.,spatialautocorrelationexists.3. ThedomainRkoftheexplanatoryfunctionsistheone-dimensionaldo-mainofrealnumbers.4. Thedomainofthedependentvariable,RY= {0, 1}.The above formulation highlights two important aspects of location prediction.Itexplicitlyindicatesthat(i)thedatasamplesmayexhibitspatialautocorrelationand, (ii)anobjectivefunctioni.e., amapsimilaritymeasureisacombinationofclassicationaccuracyandspatialaccuracy. Thesimilaritybetweenthedependentvariable fYand the predicted variablefYis a combination of the traditional classi-cation accuracy and a representation dependent spatial classication accuracy.Theregularizationtermcontrolsthedegreeofimportanceofspatial accuracyandis typicallydomaindependent. As 0, themapsimilaritymeasureap-proaches the traditional classication accuracy measure. Intuitively, captures thespatialautocorrelationpresentinspatialdata.Thestudyof thenestinglocations of red-wingedblackbirds [19, 20] is aninstance of the location prediction problem. The underlying spatial framework is thecollection of 5m5m pixels in the grid imposed on marshes. Explanatory variables,e.g. waterdepth, vegetationdurabilityindex, distancetoopenwater, mappixelstoreal numbers. Dependentvariable, i.e. nestlocations, mapspixelstoabinarydomain. Theexplanatoryanddependentvariablesexhibitspatialautocorrelation,e.g., gradual variationoverspace, asshowninFigure1. Domainscientistspreferspatiallyaccuratepredictionswhichareclosertoactualnests,i.e, > 0.Finally,itisimportanttonotethatinspatialstatisticsthegeneralapproachformodelingspatial autocorrelationistoenlarge F, thefamilyof learningmodelfunctions (see Section 3). The PLUMS approach (See Section 3) allows the exibilityofincorporatingspatial autocorrelationinthemodel, intheobjectivefunctionorinboth. Lateronwewill showthatretainingtheclassical regressionmodel as Fbut modifying the objective function leads to results which are comparable to thosefrom spatial statistical methods but which incur only a fraction of the computationalcosts.1.5 RelatedWorkandOurContributionsRelatedworkexaminestheareaofspatialstatisticsandspatialdatamining.Spatial Statistics: The goal of spatial statistics is to model the special prop-ertiesofspatial data. Theprimarydistinguishingpropertyofspatial dataisthatneighboringdatasamplestendtosystematicallyaecteachother. Thustheclas-sical assumptionthatdatasamplesaregeneratedfromindependentandidenticaldistributionsisnotvalid. Currentresearchinspatial econometrics, geo-statistics,1The entire surface of the Earthcannot be modeledas a Euclideanspace but locallytheapproximationholdstrue.Copyright by SIAM. Unauthorized reproduction of this article is prohibited8and ecological modeling[3, 16, 11] has focused on extending classical statistical tech-niquesinordertocapturetheuniquecharacteristicsinherentinspatial data. InSection2webrieyreviewsomebasicspatialstatisticalmeasuresandtechniques.Spatial DataMining: Spatial datamining[9, 13, 14, 22, 4], asubeldofdatamining[1], isconcernedwiththediscoveryof interestinganduseful butim-plicit knowledge in spatial databases. Challenges in Spatial Data Mining arise fromthefollowingissues. First, classical datamining[1] dealswithnumbersandcate-gories; In contrast, spatial data is more complex and includes extended objects suchaspoints, lines, andpolygons. Second, classical dataminingworkswithexplicitinputs, whereasspatial predicates(e.g., overlap)areoftenimplicit. Third, classi-cal dataminingtreatseachinputindependentlyof otherinputs, whereasspatialpatternsoftenexhibitcontinuityandhighautocorrelationamongnearbyfeatures.Forexample,thepopulationdensitiesofnearbylocationsareoftenrelated. Inthepresenceof spatial data, thestandardapproachinthedataminingcommunityistomaterializespatial relationshipsasattributesandrebuildthemodel withthesenewspatialattributes[14]. Inpreviouswork[4]westudiedspatialstatisticstech-niques which explictly model spatial autocorrelation. In particular we described thespatial autoregressionregression(SAR)model whichextendslinearregressionforspatial data. WealsocomparedthelinearregressionandtheSARmodel onthebirdwetlanddataset.Ourcontributions: Inthispaper, weproposePredictingLocationsUsingMapSimilarity(PLUMS), anewframeworkfor supervisedspatial dataminingproblems. Thisframeworkconsistsofacombinationofastatisticalmodel, amapsimilaritymeasurealongwithasearchalgorithm, andadiscretizationof thepa-rameterspace. Weshowthatthecharacteristicpropertyof spatial data, namely,spatial autocorrelation, canbeincorporatedineitherthestatistical model ortheobjectivefunction. Wealsopresentresultsof experimentsonthebird-nestingdatatocompareourapproachwithspatialstatisticaltechniques.Outlineandscopeof Paper: Therestofthepaperisasfollows. Section2presentsareviewof spatial statistical techniquesincludingtheSpatial Autore-gressive Regressive (SAR) model[15], which extends regression modeling for spatialdata. In Section 3 we propose PLUMS, a new framework for supervised spatial dataminingandcompareitwithspatial statistical techniques. Inthispaperwefocusexclusivelyonclassicationtechniques. Section4presentsresultsof experimentsonthebirdnestingdatasetsandsection5concludesthewholepaper.2 BasicConcepts: ModelingSpatial Dependencies2.1 Spatial AutocorrelationandExamplesManymeasures are available for quantifying spatial autocorrelation. Eachhasstrengthsandweaknesses. HerewebrieydescribetheMoransImeasure.Inmostcases,theMoransImeasure(henceforthMI)rangesbetween-1and+1andthusissimilartotheclassicalmeasureofcorrelation. Intuitively,ahigherpositivevalueindicateshighspatial autocorrelation. Thisimpliesthatlikevaluestendtoclustertogetherorattracteachother. AlownegativevalueindicatesthatCopyright by SIAM. Unauthorized reproduction of this article is prohibited9BCDA B C D(a) Map b)Contiguity Matrix W0 1 00 1 00 1 0 01 0 1 111AABC DFigure4. Aspatial neighborhoodanditscontiguitymatrixhighandlowvaluesareinterspersed. Thuslikevaluesarede-clusteredandtendtorepel eachother. Avalueclosetozeroisanindicationthatnospatial trend(randomdistribution)isdiscernibleusingthegivenmeasure.All spatial autocorrelationmeasures are cruciallydependent onthe choiceanddesignof the contiguitymatrixW. The designof the matrixitself reectstheinuenceof neighborhood. Twocommonchoices arethefour andtheeightneighborhood. ThusgivenalatticestructureandapointSinthelattice, afour-neighborhoodassumesthatSinuencesallcellswhichshareanedgewithS.Inaneight-neighborhood, itisassumedthatSinuencesall cellswhicheithershareanedgeoravertex. AneightneighborhoodcontiguitymatrixisshowninFigure 4.Thecontiguitymatrixoftheunevenlattice(left)isshownontherighthand-side.Thecontiguitymatrixplaysapivotalroleinthespatialextensionoftheregressionmodel.2.2 Spatial AutoregressionModels: SARWe now show how spatial dependencies are modeled in the framework of regressionanalysis. This framework may serve as a template for modeling spatial dependenciesinotherdataminingtechniques. Inspatialregression,thespatialdependenciesoftheerrorterm, or, thedependentvariable, aredirectlymodeledintheregressionequation[3]. Assumethatthedependentvaluesy

iarerelatedtoeachother, i.e.,yi= f(yj) i = j.Thentheregressionequationcanbemodiedasy = Wy +X + .Here Wis the neighborhood relationship contiguity matrix and is a parameter thatreects the strength of spatial dependencies between the elements of the dependentvariable. After the correctiontermWyis introduced, the components of theresidual error vectorarethenassumedtobegeneratedfromindependent andidenticalstandardnormaldistributions.WerefertothisequationastheSpatial AutoregressiveModel (SAR).Noticethatwhen=0, thisequationcollapsestotheclassical regressionmodel.Thebenetsofmodelingspatial autocorrelationaremany: (1)Theresidual errorwill havemuchlowerspatial autocorrelation, i.e., systematicvariation. WiththeCopyright by SIAM. Unauthorized reproduction of this article is prohibited10proper choice of W, the residual error should, at least theoretically, have no system-atic variation. (2) If the spatial autocorrelation coecient is statistically signicant,thenSARwillquantifythepresenceofspatialautocorrelation. Itwillindicatetheextenttowhichvariationsinthedependentvariable(y)areexplainedbytheaver-ageofneighboringobservationvalues. (3)Finally,themodelwillhaveabettert,i.e.,ahigherR-squaredstatistic.As in the case of classical regression, the SAR equation has to be transformedvia the logistic function for binary dependent variables. The estimates of and canbe derived using maximum likelihood theory or Bayesian statistics. We have carriedout preliminary experiments using the spatial econometrics matlab package2whichimplements a Bayesian approach using sampling based Markov Chain Monte Carlo(MCMC)methods[16]. ThegeneralapproachofMCMCmethodsisthatwhenthejoint-probabilitydistributionistoocomplicatedtobecomputedanalytically,thena suciently large number of samples from the conditional probability distributionscan be used to estimate the statistics of the full joint probability distribution. WhilethisapproachisveryexibleandtheworkhorseofBayesianstatistics,itisacom-putationallyexpensiveprocesswithslowconvergenceproperties. Furthermore, atleastfornon-statisticians,itisanon-trivialtasktodecidewhatpriorstochooseandwhatanalyticexpressionstousefortheconditionalprobabilitydistributions.3 PredictingLocationsUsingMapSimilarity(PLUMS)Recall thatweproposedageneral problemdenitionfortheLocationPredictionproblem, withtheobjectiveofmaximizingmapsimilarity, whichcombinesspa-tial accuracyandclassicationaccuracy. Inthissection, weproposethePLUMSframeworkforspatialdatamining.IndependentDiscretizedvar. mapsrasterDependentvar. mapbinary rasterDiscretizedLearned SpatialModelMeasuresMap Similarity Discretization graphfor parameter spaceAlgo. to searchparameter spaceFamily of functions(i.e. spatial models)Learning dataPLUMSFigure5. Theframeworkforthelocationpredictionprocess2We would like to thank James Lesage (http://www.spatial-econometrics.com/) for making thematlabtoolboxavailableontheweb.Copyright by SIAM. Unauthorized reproduction of this article is prohibited113.1 ProposedApproach: PredictingLocationsUsingMapSimilarity(PLUMS)PredictingLocationsUsingMapSimilarity(PLUMS)istheproposedsupervisedlearningapproach. Figure 5shows the context andcomponents of PLUMS. Ittakes a set of maps for explanatory variables and a map for the dependent variable.Themapsmustuseacommonspatial framework, i.e., commongeographicspaceandcommondiscretization, andproducealearnedspatial modeltopredictthedependent variable using explanatory variables. PLUMS has four basic components:amapsimilaritymeasure, afamilyof parametric functions representingspatialmodels,adiscretizationofparameterspace,andasearchalgorithm. PLUMSusesthesearchalgorithmtoexploretheparameterspacetondtheparametervaluetuplewhichmaximizethegivenmapsimilaritymeasure. Eachparameter valuetuplespeciesafunctionfromthegivenfamilyasacandidatespatialmodel.A simple map similarity measure focusing on spatial accuracy for nest-locationmaps (or point sets in general) is the average distance from an actual nest site to theclosest predicted nest-site. Other spatial accuracy and map similarity measures canbe dened using techniques such as the nearest neighbor index[7], and the principalcomponentanalysisofapairofrastermaps.3.2 GreedySearchalgorithmofPLUMSAlgorithm1greedy-search-algorithmparameter-value-set nd-A-local-maxima(parameter-value-set PVS, discretization-of-parameter-spaceSF,map-similarity-measure-functionMSM,learning-map-setLMS) {parameter-value-setbest-neighbor,a-neighbor;real best-improvement=1,an-improvement;while(best-improvement>0)do {best-neighbor=PVS.get-a-neighbor(SF);best-improvement=MSM(best-neighbor,LMS)-MSM(PVS,LMS);foreacha-neighborinPVS.get-all-neighbors(SF)do {an-improvement=MSM(a-neighbor,LMS)-MSM(PVS,LMS);if(an-improvement>best-improvement) {best-neighbor=a-neighbor;best-improvement=an-improvement;}}if(best-improvement>0)thenPVS=best-neighbor;}/*foundalocal maximainparameterspace*/returnPVS;}A special case of PLUMS using greedy search is described in Algorithm 1. Thefunction nd-A-local-maxima, takes a seed value-tuple of parameters, a discretiza-tion of parameter space, a map-similarity function, and a learning data set consistingofmapsofexplanatoryanddependentvariables. Itevaluatestheparameter-valuetupleintheimmediateneighborhoodofcurrentparameter-valuetupleinthegivendiscretization. An example of a current parameter-value tuple in a red-winged-blackbirdapplicationwiththreeexplanatoryvariablesis(a,b,c). ItsneighborhoodmayCopyright by SIAM. Unauthorized reproduction of this article is prohibited12includethefollowingparameter valuetuples: (a+,b,c), (a-,b,c),(a,b+,c),(a,b-,c),(a,b,c+), and(a,b,c-)givenauniformgridwithcell-sizediscretizationofparameterspace. Amoresophisticateddiscretizationmayusenon-uniformgrids.PLUMSevaluates themapsimilaritymeasureoneachparameter valuetupleintheneighborhood. If someof theneighborshavehighervaluesforthemapsim-ilaritymeasure, theneighborwiththehighestvalueof mapsimilaritymeasureischosen. This process is repeated until no neighbor has a higher map similarity mea-surevalue,i.e.,alocalmaximahasbeenfound. Clearly,thissearchalgorithmcanbeimprovedusingavarietyof ideasincludinggradientdescent[5] andsimulatedannealing[23]. Asimplefunctionfamilyisthefamilyofgeneralizedlinearmodels,e.g., logisticregression[15], withorwithoutautocorrelationterms. Otherinterest-ingfamiliesincludenon-linearfunctions. Inthespatial statisticsliterature, manyfunctionshavebeenproposedtocapturethespatialautocorrelationproperty. Forexample, econometricians use the familyof spatial autoregressionmodels[3, 16],geo-statisticians[12] useCo-Krigingandecologists usetheAuto-Logisticmodels.Table 1 summarizes several special cases of PLUMS by enumerating various choicesforthefourcomponents.ThedesignspaceofPLUMSisshowninFigure 6. EachinstanceofPLUMSisapointinthefourdimensionalconceptualspacespannedbysimilaritymeasure,family of functions, discretization of parameter space, and external search algorithm.Forexample, thePLUMSimplementationlabeledAinFigure 6correspondstothespatial accuracymeasure(ADNP), generalizedlinearmodel (forthefamilyoffunctions),agreedysearchalgorithmanduniformdiscretization.PLUMSComponentChoicesComponent ChoicesMapsimilarity avg. distancetonearestpredictionfromactual(ADNP),...Searchalgorithm greedy,gradientdescent,simulatedannealing,...Functionfamily generalizedlinear(GL)(logit,probit),non-linear,GLwithautocorrelationDiscretizationofparameterspace Uniform,non-uniform,multi-resolution,...Table1. PLUMSComponentChoices4 ExperimentDesignandEvaluationWecarriedoutexperimentstocomparetheclassicalregressionandspatialautore-gressiveregression(SAR)models[4]andaninstanceofthePLUMSframework.Goals: The goals of the experiments were (1) to evaluate the eects of including thespatialautoregressiveterm,Wy,inthelogisticregressionmodeland(2)comparetheaccuracyandperformanceof aninstanceof PLUMSwithspatial regressionmodels.The 1995 Darr wetland data was used as the learning set to build the classicalandspatial models. Theparametersoftheclassical logisticandspatial regressionmodel were derivedusingmaximumlikelihoodestimationandMCMCmethods(Gibbs Sampling). The two models were evaluated based on their ability to predictCopyright by SIAM. Unauthorized reproduction of this article is prohibited13Generalized Linearwith AutocorrelationGreedy(G)Non-Linearwith AutocorrelationUniform(NU)Non-SimulatedAnnealing(SA)SearchGeneralized LinearG SADiscretizationUniform(U)NU(2) (3)G SA(5)UNU(4)Spatial accuracymeasureMap similarityClassification accuracymeasuremeasure(0