Agenda Today


Page 1: Agenda Today

1

Agenda Today

We will discuss a few interesting spatial data mining patterns
Then come back to summarize what we have learned in this course so far

Page 2: Agenda Today

2

Spatial Data Management: Summary

Page 3: Agenda Today

3

Course Summary

1. Introduction to Spatial Databases
2. Spatial Concepts and Data Models
3. Spatial Query Languages: SQL3
4. Spatial Storage and Indexing: R-tree, Grid File
5. Query Processing and Query Optimization
   • Strategies for range query, nearest neighbor query
   • Spatial joins (e.g. tree matching), cost models
6. Spatial Network Model
7. Spatial Data Mining
   • Spatial auto-correlation, co-location patterns, spatial outliers, classification methods
8. Trends in Spatial Databases (Moving Objects)

Page 4: Agenda Today

4

1. Introduction

Traditional (non-spatial) database management systems provide:
• Persistence across failures
• Concurrent access to data
• Scalability to search queries on very large datasets that do not fit in the main memory of a computer
• Efficient processing of non-spatial queries, but not of spatial queries

Non-spatial queries:
• List the names of all bookstores with more than ten thousand titles.
• List the names of the top ten customers, in terms of sales, in the year 2001.
• Use an index to narrow down the search

Spatial queries:
• List the names of all bookstores within ten miles of Minneapolis.
• List all customers who live in Tennessee and its adjoining states.
• List all the customers who reside within fifty miles of the company headquarters.

Page 5: Agenda Today

5

1. Spatial Data Examples

Examples of non-spatial data
• Names, phone numbers, …

Examples of spatial data
• Census data
• NASA satellite imagery - terabytes of data per day
• Weather and climate data
• Rivers, farms, ecological impact
• Medical imaging

Page 6: Agenda Today

6

2. Spatial Object Model

Object model concepts
• Objects: distinct identifiable things relevant to an application
• Objects have attributes and operations
• Attribute: a simple (e.g. numeric, string) property of an object
• Operation: a function that maps object attributes to other objects

Example from a roadmap
• Objects: roads, landmarks, ...
• Attributes of road objects:
  • spatial: location, e.g. polygon boundary of land-parcel
  • non-spatial: name (e.g. Route 66), type (e.g. interstate, residential street), number of lanes, speed limit, …
• Operations on road objects: determine center line, determine length, determine intersection with other roads, ...

Page 7: Agenda Today

7

2. Classifying Spatial objects

• Spatial objects are spatial attributes of general objects
• Spatial objects are of many types
  • Simple
    • 0-dimensional (points), 1-dimensional (curves), 2-dimensional (surfaces)
    • Examples are given in the table below
  • Collections
    • Polygon collection (e.g. boundary of Japan or Hawaii), …
    • See a more complete list in Figure 2.2

Spatial Object Type    Example Object    Dimension
Point                  City              0
Curve                  River             1
Surface                Country           2

Page 8: Agenda Today

8

2. Spatial Object Types in OGIS Data Model

Fig 2.2: Each rectangle shows a distinct spatial object type

Page 9: Agenda Today

9

2. Classifying Operations on spatial objects in Object Model

Set theory based: Union, Intersection, Containment
Topological: Touches, Disjoint, Overlap, etc.
Directional: East, North-West, etc.
Metric: Distance

• Classifying operations
  • Set based: 2-dimensional spatial objects (e.g. polygons) are sets of points
    • A set operation (e.g. intersection) of 2 polygons produces another polygon
  • Topological: the boundary of the USA touches the boundary of Canada
  • Directional: New York City is to the east of Chicago
  • Metric: Chicago is about 700 miles from New York City

Page 10: Agenda Today

10

2. Specifying topological operation

Fig 2.3: 9-intersection matrices for a few topological operations

Page 11: Agenda Today

11

2. Conceptual DM: The ER Model

3 basic concepts
• Entities have an independent conceptual or physical existence.
  • Examples: Forest, Road, Manager, ...
• Entities are characterized by attributes.
  • Example: Forest has attributes of name, elevation, etc.
• An entity interacts with another entity through relationships.
  • Roads allow access to forest interiors.
  • This relationship may be named “Accesses”.

Page 12: Agenda Today

12

2. ER Diagram for “State-Park”

Fig 2.4

Page 13: Agenda Today

13

Pictorial Enhanced ER Diagram for “State-Park”

Page 14: Agenda Today

14

2. Mapping ER to Relational

• Highlights of translation rules
  • An entity becomes a relation
  • Attributes become columns in the relation
  • Multi-valued attributes become a new relation
    • which includes a foreign key to link to the relation for the entity
  • Relationships (1:1, 1:N) become foreign keys
  • M:N relationships become a relation
    • containing foreign keys of the relations from participating entities

Page 15: Agenda Today

15

3. Three Components of SQL?

Data Definition Language (DDL)
• Creation and modification of relational schema
• Schema objects include relations, indexes, etc.

Data Manipulation Language (DML)
• Insert, delete, update rows in tables
• Query data in tables

Data Control Language (DCL)
• Concurrency control, transactions
• Administrative tasks, e.g. set up database users, security permissions

Page 16: Agenda Today

16

3. Creating Tables in SQL

• Table definition
  • “CREATE TABLE” statement
  • Specifies table name, attribute names and data types
  • Creates a table with no rows
  • See an example at the bottom
• Related statements
  • ALTER TABLE statement modifies the table schema if needed
  • DROP TABLE statement removes a table, along with any rows it contains

Page 17: Agenda Today

17

3. Populating Tables in SQL

• Adding a row to an existing table
  • “INSERT INTO” statement
  • Specifies table name, attribute names and values
  • Example: INSERT INTO River(Name, Origin, Length) VALUES('Mississippi', 'USA', 6000)
• Related statements
  • SELECT statement with INTO clause can insert multiple rows in a table
  • Bulk load, import commands also add multiple rows
  • DELETE statement removes rows
  • UPDATE statement can change values within selected rows

Page 18: Agenda Today

18

3. SELECT Statement- General Information

• Clauses
  • SELECT specifies desired columns
  • FROM specifies relevant tables
  • WHERE specifies qualifying conditions for rows
  • ORDER BY specifies sorting columns for results
  • GROUP BY, HAVING specify aggregation and statistics
• Operators and functions
  • arithmetic operators, e.g. +, -, …
  • comparison operators, e.g. =, <, >, BETWEEN, LIKE, …
  • logical operators, e.g. AND, OR, NOT, EXISTS, …
  • set operators, e.g. UNION, IN, ALL, ANY, …
  • statistical functions, e.g. SUM, COUNT, ...
  • many other operators on strings, date, currency, ...
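The CREATE TABLE, INSERT INTO and SELECT statements from the last three slides can be tried end to end. The following is a minimal sketch using Python's built-in sqlite3 module. The River columns follow the INSERT example on the earlier slide; the column types and the two extra rows ("Alpha", "Beta") are made up for illustration and are not from the course material.

```python
import sqlite3

def run_river_demo():
    """Create, populate and query a small River table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # DDL: CREATE TABLE defines the schema and yields a table with no rows.
    cur.execute("CREATE TABLE River (Name TEXT PRIMARY KEY, Origin TEXT, Length INTEGER)")

    # DML: the INSERT statement from the slide, plus two hypothetical rows.
    cur.execute("INSERT INTO River(Name, Origin, Length) VALUES ('Mississippi', 'USA', 6000)")
    cur.executemany(
        "INSERT INTO River(Name, Origin, Length) VALUES (?, ?, ?)",
        [("Alpha", "Xland", 1200), ("Beta", "Xland", 300)],  # made-up rivers
    )

    # SELECT / FROM / WHERE / ORDER BY
    cur.execute("SELECT Name FROM River WHERE Length > 1000 ORDER BY Length DESC")
    long_rivers = [row[0] for row in cur.fetchall()]

    # GROUP BY / HAVING with statistical functions COUNT and SUM
    cur.execute(
        "SELECT Origin, COUNT(*), SUM(Length) FROM River "
        "GROUP BY Origin HAVING COUNT(*) > 1"
    )
    per_origin = cur.fetchall()

    conn.close()
    return long_rivers, per_origin

long_rivers, per_origin = run_river_demo()
print(long_rivers)
print(per_origin)
```

SQLite is used here only because it ships with Python; it has no spatial data types, so the sketch covers the non-spatial part of SQL only.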

Page 19: Agenda Today

19

4. Query Operation & Spatial Index

Filter step: select the objects whose mbb (minimal bounding box) satisfies the spatial predicate
• Traverse the index, applying the spatial test to the mbb
• Output: set of oids (object identifiers)

Refinement step: the spatial test is done on the actual geometries of the objects whose mbb satisfied the filter step
• Costly operation
• Executed only on a limited number of objects

Concentrate on the design of efficient spatial access methods (SAMs) for the filter step
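The two steps can be sketched for a point query ("which polygons contain point p?"). This is an illustration under simplifying assumptions: a linear scan over bounding boxes stands in for the index traversal, polygons are plain vertex lists, and all names and data are made up.

```python
def mbb(polygon):
    """Minimal bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

def mbb_contains(box, p):
    """Cheap test used by the filter step."""
    xmin, ymin, xmax, ymax = box
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax

def polygon_contains(polygon, p):
    """Exact test used by the refinement step (ray casting, even-odd rule)."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def point_query(objects, p):
    # Filter step: test bounding boxes only; a real system would walk an index here.
    candidates = [oid for oid, poly in objects.items() if mbb_contains(mbb(poly), p)]
    # Refinement step: exact geometry test, run on the candidates only.
    answers = [oid for oid in candidates if polygon_contains(objects[oid], p)]
    return candidates, answers

objects = {
    "tri": [(0, 0), (4, 0), (0, 4)],             # mbb contains (3, 3), the triangle does not
    "sq": [(2, 2), (5, 2), (5, 5), (2, 5)],      # contains (3, 3)
    "far": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
candidates, answers = point_query(objects, (3, 3))
print(candidates, answers)
```

The triangle is a false hit: it passes the filter because its bounding box contains the point, and only the refinement step rejects it.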

Page 20: Agenda Today

20

4. Why spatial index method?

B-trees and hash tables
• Guarantee that the number of I/O operations is, respectively, logarithmic and constant in the collection size
• Index a collection on a key
• Rely on a total order on the key domain, e.g. the order of natural numbers or the lexicographic order on strings

There is no such total order for geometric objects
• SAMs were designed to preserve spatial object proximity as much as possible

Page 21: Agenda Today

21

4. Space-Driven vs. Data-Driven SAMs

Space-driven structures:
• Partition the embedding 2D space into rectangular cells
• Independently of the distribution of the objects
• Objects are mapped to the cells based on some geometric criterion
• Grid file, linear structure

Data-driven structures:
• Organized by partitioning the set of objects, as opposed to the embedding space
• Adapt to the objects’ distribution in the embedding space
• R-tree, R* tree, R+ tree

Page 22: Agenda Today

22

4. Grid File – point indexing

One page is associated with each cell
• When a cell overflows, it is split into two cells and its points are distributed between them

Two adjacent cells can reference the same page

The cells are of different sizes and the partition adapts to the point distribution

Page 23: Agenda Today

23

4. The Quad tree

• The index is represented as a quaternary tree
• Each internal node has four children, one per quadrant: NW, NE, SW, SE
• Each leaf is associated with a disk page, which stores the index entries
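A small point quadtree shows how the structure behaves. This is a sketch, not the variant from the course text: the leaf capacity of 2 stands in for a disk page, points are assumed to be distinct, and the class and method names are made up.

```python
class QuadTree:
    def __init__(self, xmin, ymin, xmax, ymax, capacity=2):
        self.box = (xmin, ymin, xmax, ymax)
        self.capacity = capacity   # stand-in for the size of a disk page
        self.points = []           # entries stored in a leaf
        self.children = None       # [NW, NE, SW, SE] once the node is internal

    def _child_for(self, p):
        xmin, ymin, xmax, ymax = self.box
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        north, east = p[1] >= ym, p[0] >= xm
        return self.children[(0 if north else 2) + (1 if east else 0)]

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        # Assumes distinct points; identical points would never separate.
        xmin, ymin, xmax, ymax = self.box
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        self.children = [
            QuadTree(xmin, ym, xm, ymax, self.capacity),   # NW
            QuadTree(xm, ym, xmax, ymax, self.capacity),   # NE
            QuadTree(xmin, ymin, xm, ym, self.capacity),   # SW
            QuadTree(xm, ymin, xmax, ym, self.capacity),   # SE
        ]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def range_query(self, qxmin, qymin, qxmax, qymax):
        xmin, ymin, xmax, ymax = self.box
        if qxmax < xmin or qxmin > xmax or qymax < ymin or qymin > ymax:
            return []  # quadrant does not meet the window: prune the subtree
        if self.children is None:
            return [p for p in self.points
                    if qxmin <= p[0] <= qxmax and qymin <= p[1] <= qymax]
        hits = []
        for child in self.children:
            hits.extend(child.range_query(qxmin, qymin, qxmax, qymax))
        return hits

tree = QuadTree(0, 0, 8, 8)
for point in [(1, 1), (2, 7), (6, 6), (7, 1), (6, 7)]:
    tree.insert(point)
print(tree.range_query(5, 5, 8, 8))
```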

Page 24: Agenda Today

24

4. The original R-Tree

• A leaf entry is a pair (mbb, oid)
• A non-leaf node contains an array of node entries
• The number of entries is between m and M
• For each entry (dr, node_id) in a non-leaf node N, dr is the directory rectangle of a child node of N, whose page address is node_id
• All leaves are at the same level
• An object appears in one, and only one, of the tree leaves
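The node layout above is enough to sketch a window search. The tree below is built by hand (insertion and node splitting are left out), child nodes are held as Python references rather than page addresses, and all names and rectangles are made up for illustration.

```python
def intersects(a, b):
    """True if rectangles a and b, each (xmin, ymin, xmax, ymax), share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def union(boxes):
    """Smallest rectangle enclosing all given rectangles."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

class Node:
    def __init__(self, entries, is_leaf):
        # Leaf: [(mbb, oid)]. Non-leaf: [(dr, child Node)].
        self.entries = entries
        self.is_leaf = is_leaf

    def rect(self):
        """Directory rectangle of this node: the union of its entries' rectangles."""
        return union([r for r, _ in self.entries])

def search(node, window):
    """Return (oids whose mbb intersects the window, number of nodes visited)."""
    visited = 1
    hits = []
    for rect, item in node.entries:
        if not intersects(rect, window):
            continue  # prune: nothing below this entry can match
        if node.is_leaf:
            hits.append(item)
        else:
            sub_hits, sub_visited = search(item, window)
            hits.extend(sub_hits)
            visited += sub_visited
    return hits, visited

leaf1 = Node([((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")], is_leaf=True)
leaf2 = Node([((8, 8, 9, 9), "c"), ((6, 7, 7, 9), "d")], is_leaf=True)
root = Node([(leaf1.rect(), leaf1), (leaf2.rect(), leaf2)], is_leaf=False)

print(search(root, (2.5, 2.5, 4, 4)))
```

In the original R-tree, directory rectangles may overlap, so a search can descend into several subtrees; the result is still only the filter-step answer (mbb hits), not the refined one.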

Page 25: Agenda Today

25

4. The R+ Tree

• The directory rectangles at a given level do not overlap
• For a point query, a single path is followed from the root to a leaf
• The I/O complexity is bounded by the depth of the tree

Page 26: Agenda Today

26

5. What is Query Processing and Optimization (QPO)?

Basic idea of QPO
• In SQL, queries are expressed in a high-level declarative form
• QPO translates a SQL query to an execution plan
  • over the physical data model
  • using operations on file structures, indices, etc.
• The ideal execution plan answers the query Q in as little time as possible
• Constraint: QPO overheads are small
  • Computation time for QPO steps << that for the execution plan

Page 27: Agenda Today

27

5. QPO Challenges in SDBMS

Building blocks for spatial queries
• Rich set of spatial data types and operations
• A consensus on “building blocks” is lacking
• Current choices include spatial select, spatial join, nearest neighbor

Choice of strategies
• Limited choice for some building blocks, e.g. nearest neighbor

Choosing best strategies
• Cost models are more complex since
  • spatial queries are both CPU and I/O intensive
  • while traditional queries are I/O intensive
• Cost models of spatial strategies are not mature

Page 28: Agenda Today

28

5. Choice of building blocks

Choice of building blocks
• Varies across software vendors and products

List of representative building blocks
• Point query: Name a highlighted city on a digital map.
  • Returns one spatial object out of a table
• Range query: List all countries crossed by the river Amazon.
  • Returns several objects within a spatial region from a table
• Spatial join: List all pairs of overlapping rivers and countries.
  • Returns pairs from 2 tables satisfying a spatial predicate
• Nearest neighbor: Find the city closest to Mount Everest.
  • Returns one spatial object from a collection

Page 29: Agenda Today

29

5. Strategies for Spatial Joins

Recall the spatial join example:
• List all pairs of overlapping rivers and countries.
• Returns pairs from 2 tables satisfying a spatial predicate

List of strategies
• Nested loop:
  • Test all possible pairs for the spatial predicate
  • All rivers are paired with all countries
• Space partitioning:
  • Test pairs of objects from common spatial regions only
  • Rivers in Africa are tested with countries in Africa only!
• Tree matching:
  • Hierarchical pairing of object groups from each table, Section 5.1.6, pp. 121
• Other, e.g. spatial-join-index based, external plane-sweep, …
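The first two strategies can be compared by counting how often the spatial predicate is evaluated. The sketch below makes several assumptions: rivers and countries are reduced to bounding rectangles, the partitioning is a uniform grid with a hypothetical cell size of 10, and the data are made up.

```python
from itertools import product

def overlaps(a, b):
    """Spatial predicate on rectangles (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def nested_loop_join(rivers, countries):
    """Test every river against every country."""
    tests, result = 0, set()
    for (r, rbox), (c, cbox) in product(rivers.items(), countries.items()):
        tests += 1
        if overlaps(rbox, cbox):
            result.add((r, c))
    return result, tests

def cells_of(box, cell):
    """Grid cells (i, j) that a rectangle touches."""
    x0, y0 = int(box[0] // cell), int(box[1] // cell)
    x1, y1 = int(box[2] // cell), int(box[3] // cell)
    return [(i, j) for i in range(x0, x1 + 1) for j in range(y0, y1 + 1)]

def partition_join(rivers, countries, cell=10):
    """Test only pairs of objects that fall into a common grid cell."""
    buckets = {}
    for side, table in ((0, rivers), (1, countries)):
        for oid, box in table.items():
            for key in cells_of(box, cell):
                buckets.setdefault(key, ([], []))[side].append(oid)
    tests, result, seen = 0, set(), set()
    for r_ids, c_ids in buckets.values():
        for pair in product(r_ids, c_ids):
            if pair in seen:
                continue  # pair already tested in another shared cell
            seen.add(pair)
            tests += 1
            if overlaps(rivers[pair[0]], countries[pair[1]]):
                result.add(pair)
    return result, tests

rivers = {"r1": (1, 1, 4, 2), "r2": (21, 21, 28, 23), "r3": (12, 3, 13, 8)}
countries = {"c1": (0, 0, 9, 9), "c2": (20, 20, 29, 29), "c3": (10, 0, 19, 9)}

print(nested_loop_join(rivers, countries))
print(partition_join(rivers, countries))
```

Both functions return the same pairs; on this data the nested loop evaluates the predicate 9 times and the partitioned version 3 times. The gap depends on how well the grid matches the data distribution.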

Page 30: Agenda Today

30

5. Query Processing and Optimizer process

Fig 5.2

• A sight-seeing trip
  • Start: a SQL query
  • End: an execution plan
  • Intermediate stopovers
    • query trees
    • logical tree transforms
    • strategy selection
• What happens after the journey?
  • The execution plan is executed
  • The query answer is returned

Page 31: Agenda Today

31

5. Query Trees

Fig 5.3

• Nodes = building blocks of (spatial) queries
  • See Section 3.2 (pp. 55) for the symbols sigma, pi and join
• Children = inputs to a building block
• Leaves = tables
• An example SQL query and its query tree follow:

Page 32: Agenda Today

32

5. Logical Transformation of Query Trees

• Motivation
  • Transformations do not change the answer of the query
  • But can reduce computational cost by
    • reducing data produced by sub-queries
    • reducing computation needs of the parent node
• Example transformation
  • Push down select operation below join
  • Example: Fig. 5.4 (compare with Fig. 5.3, last slide)
  • Reduces size of table for join operation
• Other common transformations
  • Push project down
  • Reorder join operations
  • ...

Fig 5.4
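Pushing a select below a join can be illustrated by counting join-predicate evaluations before and after the transformation. The sketch assumes a query of the form "campgrounds within distance 50 of a lake with area above 20". Facilities and lakes are reduced to points with attributes, and all rows are made up.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def join_then_select(facilities, lakes):
    """Join first, apply the selections to the joined pairs afterwards."""
    tests, out = 0, []
    for f in facilities:
        for l in lakes:
            tests += 1  # one distance computation per pair
            if dist(f["loc"], l["loc"]) < 50:
                if f["name"] == "Campground" and l["area"] > 20:
                    out.append((f["id"], l["id"]))
    return out, tests

def select_then_join(facilities, lakes):
    """Selections pushed below the join: the join sees smaller inputs."""
    fa = [f for f in facilities if f["name"] == "Campground"]
    la = [l for l in lakes if l["area"] > 20]
    tests, out = 0, []
    for f in fa:
        for l in la:
            tests += 1
            if dist(f["loc"], l["loc"]) < 50:
                out.append((f["id"], l["id"]))
    return out, tests

facilities = [
    {"id": 1, "name": "Campground", "loc": (0, 0)},
    {"id": 2, "name": "Marina", "loc": (10, 0)},
    {"id": 3, "name": "Campground", "loc": (200, 200)},
]
lakes = [
    {"id": "A", "area": 30, "loc": (30, 0)},
    {"id": "B", "area": 5, "loc": (5, 5)},
    {"id": "C", "area": 50, "loc": (210, 230)},
]

print(join_then_select(facilities, lakes))
print(select_then_join(facilities, lakes))
```

Both plans return the same answer, which is the point of a logical transformation; the second one evaluates the distance predicate 4 times instead of 9.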

Page 33: Agenda Today

33

5. Execution Plans

An execution plan has 3 components
• A query tree
• An ordering of evaluation of non-leaf nodes
• A strategy selected for each non-leaf node

Example
• Strategies for the query tree in Fig. 5.5
  • Use scan for Area(L.Geometry) > 20
  • Use index for Fa.Name = 'Campground'
  • Use space-partitioning join for Distance(Fa, L) < 50
  • Use on-the-fly for projection
• Ordering
  • As listed above

Fig 5.5

Page 34: Agenda Today

34

7. What is Spatial Data Mining?

Non-trivial search for interesting and unexpected spatial patterns

Non-trivial search
• Large (e.g. exponential) search space of plausible hypotheses
• Ex. Asiatic cholera: causes: water, food, air, insects, …; water delivery mechanisms - numerous pumps, rivers, ponds, wells, pipes, ...

Interesting
• Useful in a certain application domain
• Ex. Shutting off the identified water pump => saved human lives

Unexpected
• Pattern is not common knowledge
• May provide a new understanding of the world
• Ex. Water pump - cholera connection led to the “germ” theory

Page 35: Agenda Today

35

7. Choice of Methods

Two Approaches to mining Spatial Data

• Pick spatial features; use classical DM methods
• Use novel spatial data mining techniques

Possible approach:
• Define the problem: capture special needs
• Explore data using maps, other visualization
• Try reusing classical DM methods
• If classical DM methods perform poorly, try new methods
• Evaluate chosen methods rigorously
• Performance tuning as needed

Page 36: Agenda Today

36

7. Location Prediction as a classification problem

Given:
1. Spatial framework: $S = \{s_1, \ldots, s_n\}$
2. Explanatory functions: $f_{X_k}: S \to R$
3. A dependent class: $f_C: S \to C = \{c_1, \ldots, c_M\}$
4. A family of function mappings: $R \times \ldots \times R \to C$

Find: classification model $\hat{f}_C$

Objective: maximize classification_accuracy$(\hat{f}_C, f_C)$

Constraints: spatial autocorrelation exists

Figure panels: Nest locations, Distance to open water, Vegetation durability, Water depth

Color version of Fig. 7.3, pp. 188

Page 37: Agenda Today

37

7. Techniques for Location Prediction

Classical methods: logistic regression, decision trees, Bayesian classifier
• Assume learning samples are independent of each other
• Spatial auto-correlation violates this assumption!
• Q? What will a map look like where the properties of a pixel were independent of the properties of other pixels? (see Fig. 7.4, pp. 189)

New spatial methods
• Spatial auto-regression (SAR), Markov random field
  • Bayesian classifier

Page 38: Agenda Today

38

7. Spatial AutoRegression (SAR)

• Spatial Autoregression Model (SAR)
  • $y = \rho W y + X\beta + \epsilon$
    • $W$ models neighborhood relationships
    • $\rho$ models strength of spatial dependencies
    • $\epsilon$ is the error vector
• Solutions
  • $\rho$ and $\beta$ can be estimated using ML or Bayesian statistics
  • e.g., the spatial econometrics package uses a Bayesian approach based on the sampling-based Markov Chain Monte Carlo (MCMC) method
  • Likelihood-based estimation requires $O(n^3)$ operations
  • Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
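To see what the model says, it helps to evaluate it forward. Assuming the usual SAR form y = ρWy + Xβ + ε with a row-normalized neighborhood matrix W and |ρ| < 1, the sketch below computes y for given ρ and β by fixed-point iteration. It does not estimate ρ or β (the ML and Bayesian estimation mentioned on the slide is out of scope), and the 4-site path graph and all parameter values are made up.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def row_normalize(A):
    """Scale each row of a 0/1 adjacency matrix to sum to 1 (rows of zeros stay zero)."""
    out = []
    for row in A:
        s = sum(row)
        out.append([a / s if s else 0.0 for a in row])
    return out

def sar_solve(W, X, beta, rho, eps=None, iters=200):
    """Solve y = rho*W*y + X*beta + eps by iterating the right-hand side.

    With a row-normalized W and |rho| < 1 the iteration is a contraction,
    so it converges to the unique solution.
    """
    n = len(W)
    eps = eps or [0.0] * n
    trend = [xb + e for xb, e in zip(matvec(X, beta), eps)]
    y = trend[:]
    for _ in range(iters):
        y = [rho * wy + t for wy, t in zip(matvec(W, y), trend)]
    return y

# Hypothetical example: 4 sites on a line, each neighboring the adjacent sites.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
W = row_normalize(A)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # intercept + one feature
beta = [1.0, 2.0]

y_independent = sar_solve(W, X, beta, rho=0.0)  # reduces to classical regression
y_dependent = sar_solve(W, X, beta, rho=0.5)    # each site pulled toward its neighbors
print(y_independent)
print(y_dependent)
```

With ρ = 0 the model collapses to y = Xβ + ε, the independent-samples case assumed by classical methods; a non-zero ρ makes each site's value depend on its neighbors.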

Page 39: Agenda Today

39

Answers: and

7. Associations, Spatial associations, Co-location

Page 40: Agenda Today

40

7. Association Rules: Formal Definitions

Consider a set of items, $I = \{i_1, \ldots, i_k\}$

Consider a set of transactions $T = \{t_1, \ldots, t_n\}$ where each $t_i$ is a subset of $I$.

Support of $C$: $\sigma(C) = \{t \mid t \in T, C \subseteq t\}$

Then $i_1 \Rightarrow i_2$ iff
• Support: $i_1 \cup i_2$ occurs in at least s percent of the transactions: $|\sigma(i_1 \cup i_2)| / |T|$
• Confidence: at least c%: $|\sigma(i_1 \cup i_2)| / |\sigma(i_1)|$

Example: Table 7.4 (pp. 202) using data in Section 7.4
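Support and confidence are simple to compute directly from their definitions. In the sketch below, sigma(C) is the set of transactions containing itemset C, support is |sigma(lhs ∪ rhs)| / |T|, and confidence is |sigma(lhs ∪ rhs)| / |sigma(lhs)|. The four transactions are made up and are not the data of Table 7.4.

```python
def sigma(itemset, transactions):
    """Transactions that contain every item of the itemset."""
    return [t for t in transactions if itemset <= t]

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return len(sigma(itemset, transactions)) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of the transactions containing lhs that also contain rhs.

    Assumes lhs occurs in at least one transaction.
    """
    return len(sigma(lhs | rhs, transactions)) / len(sigma(lhs, transactions))

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 with bread
print(s, c)
```

The rule bread => milk would be accepted if s and c reach the chosen thresholds, for example s >= 0.5 and c >= 0.6 here.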

Page 41: Agenda Today

41

7. Co-location rules vs. association rules

                                  Association rules          Co-location rules
Underlying space                  discrete sets              continuous space
item-types                        item-types                 events / Boolean spatial features
collection                        Transaction (T)            Neighborhood (N)
prevalence measure                support                    participation index
conditional probability metric    Pr.[ A in T | B in T ]     Pr.[ A in N(L) | B at location L ]

Participation index = min{pr(fi, c)}

where pr(fi, c), the participation ratio of feature fi in co-location c = {f1, f2, …, fk}, is the fraction of instances of fi with features {f1, …, fi-1, fi+1, …, fk} nearby

N(L) = neighborhood of location L
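The participation index can be computed by brute force on a small example. The sketch assumes that "nearby" means Euclidean distance at most d, and that an instance of a feature participates when it appears in at least one group holding one instance of every feature of the co-location, all pairwise within d. The feature names, coordinates and d are made up.

```python
from itertools import combinations, product
import math

def participation_index(instances, colocation, d):
    """Return (participation index, participation ratio per feature).

    instances: dict mapping a feature name to a list of distinct (x, y) points.
    colocation: iterable of feature names, e.g. {"A", "B"}.
    """
    feats = sorted(colocation)
    participating = {f: set() for f in feats}
    # Enumerate one instance per feature; keep groups that are pairwise neighbors.
    for row in product(*(instances[f] for f in feats)):
        if all(math.dist(p, q) <= d for p, q in combinations(row, 2)):
            for f, p in zip(feats, row):
                participating[f].add(p)
    ratios = {f: len(participating[f]) / len(instances[f]) for f in feats}
    return min(ratios.values()), ratios

# Hypothetical instances of two Boolean spatial features.
instances = {
    "A": [(0, 0), (10, 10), (20, 0)],   # the third A has no B nearby
    "B": [(1, 0), (11, 10)],
}
pi, ratios = participation_index(instances, {"A", "B"}, d=2)
print(pi, ratios)
```

Here two of the three A instances and both B instances participate, so pr(A, c) = 2/3, pr(B, c) = 1 and the participation index is 2/3. The exhaustive enumeration is only meant for tiny inputs.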

Page 42: Agenda Today

42

7. Spatial Outlier Detection

• Compute $Z_{S(x)} = \left| \dfrac{S(x) - \mu_s}{\sigma_s} \right|$, where $S(x) = f(x) - E_{y \in N(x)}(f(y))$

• Select points (e.g. S) with $Z_{S(x)}$ above 3

Page 43: Agenda Today

43

7. Spatial Outlier Detection: Example

Given
• A spatial graph G = {V, E}
• A neighbor relationship (K neighbors)
• An attribute function $f: V \to R$

Find
• $O = \{v_i \mid v_i \in V, v_i \text{ is a spatial outlier}\}$

Spatial outlier detection test
1. Choice of spatial statistic: $S(x) = f(x) - E_{y \in N(x)}(f(y))$
2. Test for outlier detection: $\left| (S(x) - \mu_s) / \sigma_s \right| > \theta$

Rationale
• Theorem: S(x) is normally distributed if f(x) is normally distributed

Color version of Fig. 7.19 pp. 219
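The test can be sketched in a few lines. It takes S(x) as f(x) minus the mean of f over the neighbors of x, standardizes S over all nodes with its mean and standard deviation, and flags nodes whose absolute standardized value exceeds a threshold theta. The ring graph, the attribute values and theta = 3 are made up for illustration and are not the data behind Fig. 7.19.

```python
from statistics import mean, pstdev

def spatial_outliers(f, neighbors, theta=3.0):
    """Return (list of outlier nodes, dict of standardized statistics).

    f: dict node -> attribute value.
    neighbors: dict node -> list of neighboring nodes.
    Assumes the S values are not all identical (non-zero standard deviation).
    """
    # S(x): difference between a node's value and its neighborhood average.
    S = {v: f[v] - mean(f[u] for u in neighbors[v]) for v in f}
    mu, sd = mean(S.values()), pstdev(S.values())
    Z = {v: abs((S[v] - mu) / sd) for v in S}
    return [v for v in f if Z[v] > theta], Z

# Hypothetical data: 20 nodes on a ring, constant attribute except at node 5.
n = 20
f = {v: 10 for v in range(n)}
f[5] = 50
neighbors = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

outliers, Z = spatial_outliers(f, neighbors)
print(outliers)
```

Node 5 differs sharply from its neighborhood and is flagged. Its two neighbors also get non-zero S values, because node 5 inflates their neighborhood averages, but they stay below the threshold.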

Page 44: Agenda Today

44

8. Spatiotemporal Data

Two types of problems:
• Indexing the current positions and movements of objects and querying their anticipated future positions
• Indexing and querying the past movements of mobile objects

On Indexing Mobile Objects
Indexing the Positions of Continuously Moving Objects

Page 45: Agenda Today

45

Spatiotemporal Data (cont’d)

Indexing current/future locations of mobile objects

The TPR-tree
• Like the R-tree, but the MBRs are time-parameterized to conservative bounding intervals (CBI).
• How are the CBI computed? What is the best way to group objects into a CBI?
  – By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid.
• How do we answer queries using the TPR-tree?
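One common way to build a conservative bounding interval is sketched below for one dimension. Each object is given as a position at reference time 0 and a constant velocity. The lower bound starts at the minimum position and moves with the minimum velocity, and the upper bound starts at the maximum position and moves with the maximum velocity. This is an illustration only: it covers a single node, not the grouping or tree maintenance questions raised on the slide, and the objects are made up.

```python
def conservative_interval(objects):
    """Build a time-parameterized interval from (x0, v) pairs at reference time 0.

    Returns (lo0, vlo, hi0, vhi): the bounds at time 0 and their velocities.
    """
    lo0 = min(x for x, _ in objects)
    hi0 = max(x for x, _ in objects)
    vlo = min(v for _, v in objects)
    vhi = max(v for _, v in objects)
    return lo0, vlo, hi0, vhi

def interval_at(cbi, t):
    """Bounds of the interval at time t >= 0."""
    lo0, vlo, hi0, vhi = cbi
    return lo0 + vlo * t, hi0 + vhi * t

def may_intersect(cbi, q_lo, q_hi, t):
    """Filter test for a future query: can any enclosed object be in [q_lo, q_hi] at time t?"""
    lo, hi = interval_at(cbi, t)
    return lo <= q_hi and q_lo <= hi

objects = [(0, 1), (5, -1), (10, 2)]  # hypothetical (position, velocity) pairs
cbi = conservative_interval(objects)

print(interval_at(cbi, 3))
print(may_intersect(cbi, 20, 25, 3), may_intersect(cbi, 20, 25, 6))
```

The interval contains every object at all times t >= 0 but grows looser as t increases, which is why the choice of grouping and the time horizon over which the tree is valid matter.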

Page 46: Agenda Today

46

Conclusion

Good progress… still more work is needed:
• Devising clean and complete semantics for data models and operators for spatial data and spatial-temporal data
• Efficient implementation
  • Indexing, query processing, query optimization, cost models
• Developing efficient algorithms to mine spatial data
• Alternative architectures
  • spatial-temporal data, moving objects
  • mobile, wireless applications
  • web GIS