Agenda Today
1
Agenda Today
We will discuss a few interesting spatial data mining patterns.
Then we will come back to summarize what we have learned in this course so far.
2
Spatial Data Management: Summary
3
Course Summary
1. Introduction to Spatial Databases
2. Spatial Concepts and Data Models
3. Spatial Query Languages: SQL3
4. Spatial Storage and Indexing: R-tree, Grid File
5. Query Processing and Query Optimization
   • Strategies for range query, nearest neighbor query
   • Spatial joins (e.g. tree matching), cost models
6. Spatial Network Model
7. Spatial Data Mining
   • Spatial auto-correlation, co-location patterns, spatial outliers, classification methods
8. Trends in Spatial Databases (Moving Objects)
4
1. Introduction
Traditional (non-spatial) database management systems provide:
• Persistence across failures
• Concurrent access to data
• Scalability to search queries on very large datasets that do not fit in the main memory of a computer
• Efficient processing of non-spatial queries, but not of spatial queries
Non-spatial queries:
• List the names of all bookstores with more than ten thousand titles.
• List the names of the top ten customers, in terms of sales, in the year 2001.
• An index can be used to narrow down the search.
Spatial queries:
• List the names of all bookstores within ten miles of Minneapolis.
• List all customers who live in Tennessee and its adjoining states.
• List all the customers who reside within fifty miles of the company headquarters.
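The contrast between the two query types can be sketched in a few lines of Python. This is only an illustration: the bookstore names, title counts and coordinates are made up, and a real spatial DBMS would use a spatial index rather than computing a distance for every row.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Hypothetical bookstores: (name, number of titles, latitude, longitude)
bookstores = [
    ("Downtown Books", 25000, 44.9778, -93.2650),
    ("Suburb Reads", 8000, 44.8600, -93.4200),
    ("Faraway Pages", 40000, 46.7867, -92.1005),
]
minneapolis = (44.9778, -93.2650)

# Non-spatial predicate: a comparison on one totally ordered attribute.
big = [name for name, titles, _, _ in bookstores if titles > 10000]

# Spatial predicate: needs a distance computation on two coordinates.
near = [name for name, _, lat, lon in bookstores
        if haversine_miles(lat, lon, *minneapolis) <= 10]
```

The first predicate can be answered with an ordinary index on the number of titles; the second cannot, which is the motivation for the spatial access methods discussed later.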
5
1. Spatial Data Examples
Examples of non-spatial data
• Names, phone numbers, …
Examples of spatial data
• Census data
• NASA satellite imagery: terabytes of data per day
• Weather and climate data
• Rivers, farms, ecological impact
• Medical imaging
6
2. Spatial Object Model
Object model concepts
• Objects: distinct identifiable things relevant to an application
• Objects have attributes and operations
• Attribute: a simple (e.g. numeric, string) property of an object
• Operation: a function that maps object attributes to other objects
Example from a roadmap
• Objects: roads, landmarks, ...
• Attributes of road objects:
   • spatial: location, e.g. polygon boundary of land-parcel
   • non-spatial: name (e.g. Route 66), type (e.g. interstate, residential street), number of lanes, speed limit, …
• Operations on road objects: determine center line, determine length, determine intersection with other roads, ...
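As a rough sketch of the roadmap example, a road object can be written as a Python class with non-spatial attributes, one spatial attribute and one operation. The class name, field names and coordinates below are illustrative assumptions, not part of the course material.

```python
import math
from dataclasses import dataclass

@dataclass
class Road:
    # Non-spatial attributes
    name: str
    road_type: str
    lanes: int
    speed_limit_mph: int
    # Spatial attribute: center line as a list of (x, y) vertices
    center_line: list

    def length(self):
        """Operation: sum of the segment lengths along the center line."""
        return sum(math.dist(a, b)
                   for a, b in zip(self.center_line, self.center_line[1:]))

# Hypothetical road with a three-vertex center line
route = Road("Route 66", "interstate", 4, 65, [(0, 0), (3, 4), (3, 10)])
route_length = route.length()
```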
7
2. Classifying Spatial objects
• Spatial objects are spatial attributes of general objects
• Spatial objects are of many types
   • Simple
      • 0-dimensional (points), 1-dimensional (curves), 2-dimensional (surfaces)
      • Examples are given in the table below
   • Collections
      • Polygon collection (e.g. boundary of Japan or Hawaii), …
      • See a more complete list in Figure 2.2

Spatial Object Type   Example Object   Dimension
Point                 City             0
Curve                 River            1
Surface               Country          2
8
2. Spatial Object Types in OGIS Data Model
Fig 2.2: Each rectangle shows a distinct spatial object type
9
2. Classifying Operations on spatial objects in Object Model
Operation class     Example operations
Set theory based    Union, Intersection, Containment
Topological         Touches, Disjoint, Overlap, etc.
Directional         East, North-West, etc.
Metric              Distance

Classifying operations
• Set based: 2-dimensional spatial objects (e.g. polygons) are sets of points
   • A set operation (e.g. intersection) of 2 polygons produces another polygon
• Topological: the boundary of the USA touches the boundary of Canada
• Directional: New York City is to the east of Chicago
• Metric: Chicago is about 700 miles from New York City
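The four classes of operations can be illustrated on axis-aligned rectangles, a deliberately simple stand-in for general polygons. The `Rect` type and the function names are assumptions made for this sketch; the directional and metric operations compare rectangle centers, which is only one of several possible conventions.

```python
import math
from typing import NamedTuple

class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def intersection(a, b):
    """Set-based: the common region, or None if there is no 2-D overlap."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin < xmax and ymin < ymax:
        return Rect(xmin, ymin, xmax, ymax)
    return None

def disjoint(a, b):
    """Topological: the rectangles share no point at all."""
    return (a.xmax < b.xmin or b.xmax < a.xmin or
            a.ymax < b.ymin or b.ymax < a.ymin)

def touches(a, b):
    """Topological: boundaries meet but interiors do not overlap."""
    return not disjoint(a, b) and intersection(a, b) is None

def center(r):
    return ((r.xmin + r.xmax) / 2, (r.ymin + r.ymax) / 2)

def is_east_of(a, b):
    """Directional: center of a lies east of center of b."""
    return center(a)[0] > center(b)[0]

def distance(a, b):
    """Metric: Euclidean distance between centers."""
    return math.dist(center(a), center(b))

a, b, c = Rect(0, 0, 2, 2), Rect(1, 1, 3, 3), Rect(2, 0, 4, 2)
```

Here `a` and `b` overlap, while `a` and `c` only share an edge, so `touches(a, c)` holds and `intersection(a, c)` is empty.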
10
2. Specifying topological operation
Fig 2.3: 9-intersection matrices for a few topological operations
11
2. Conceptual DM: The ER Model
3 basic concepts
• Entities have an independent conceptual or physical existence.
   • Examples: Forest, Road, Manager, ...
• Entities are characterized by attributes.
   • Example: Forest has attributes of name, elevation, etc.
• An entity interacts with another entity through relationships.
   • Roads allow access to Forest interiors.
   • This relationship may be named “Accesses”.
12
2. ER Diagram for “State-Park”
Fig 2.4
13
Pictorial Enhanced ER Diagram for “State-Park”
14
2. Mapping ER to Relational
Highlights of translation rules
• Entity becomes a relation
• Attributes become columns in the relation
• Multi-valued attributes become a new relation
   • which includes a foreign key linking it to the relation for the entity
• Relationships (1:1, 1:N) become foreign keys
• M:N relationships become a relation
   • containing foreign keys of the relations from participating entities
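A minimal sketch of these rules, using Python's built-in `sqlite3` module and the Forest/Road entities mentioned earlier. The column names and rows are invented, and treating "Accesses" as M:N is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each entity becomes a relation; attributes become columns.
CREATE TABLE Forest (name TEXT PRIMARY KEY, elevation REAL);
CREATE TABLE Road   (name TEXT PRIMARY KEY, lanes INTEGER);
-- The M:N relationship becomes its own relation holding two foreign keys.
CREATE TABLE Accesses (
    road_name   TEXT REFERENCES Road(name),
    forest_name TEXT REFERENCES Forest(name),
    PRIMARY KEY (road_name, forest_name)
);
""")
conn.executemany("INSERT INTO Forest VALUES (?, ?)", [("Pine", 300.0), ("Oak", 150.0)])
conn.executemany("INSERT INTO Road VALUES (?, ?)", [("R1", 2), ("R2", 4)])
conn.executemany("INSERT INTO Accesses VALUES (?, ?)",
                 [("R1", "Pine"), ("R1", "Oak"), ("R2", "Pine")])

# Which roads access the (hypothetical) forest 'Pine'?
roads_into_pine = [row[0] for row in conn.execute(
    "SELECT road_name FROM Accesses WHERE forest_name = ? ORDER BY road_name",
    ("Pine",))]
```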
15
3. Three Components of SQL?
Data Definition Language (DDL)
• Creation and modification of relational schema
• Schema objects include relations, indexes, etc.
Data Manipulation Language (DML)
• Insert, delete, update rows in tables
• Query data in tables
Data Control Language (DCL)
• Concurrency control, transactions
• Administrative tasks, e.g. set up database users, security permissions
16
3. Creating Tables in SQL
• Table definition
   • “CREATE TABLE” statement
   • Specifies table name, attribute names and data types
   • Creates a table with no rows
   • See an example at the bottom
• Related statements
   • ALTER TABLE statement modifies table schema if needed
   • DROP TABLE statement removes a table, including any rows it contains
17
3. Populating Tables in SQL
• Adding a row to an existing table
   • “INSERT INTO” statement
   • Specifies table name, attribute names and values
   • Example: INSERT INTO River(Name, Origin, Length) VALUES('Mississippi', 'USA', 6000)
• Related statements
   • SELECT statement with INTO clause can insert multiple rows in a table
   • Bulk load, import commands also add multiple rows
   • DELETE statement removes rows
   • UPDATE statement can change values within selected rows
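The table-definition and row-manipulation statements can be tried end to end with Python's built-in `sqlite3` module. The River table and the Mississippi row follow the slide; the column types, the second row and the added `Basin` column are assumptions for the sketch, and SQLite's dialect differs from SQL3 in some details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE produces a table with no rows.
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
empty_count = conn.execute("SELECT COUNT(*) FROM River").fetchone()[0]

# INSERT INTO adds one row at a time.
conn.execute("INSERT INTO River(Name, Origin, Length) "
             "VALUES ('Mississippi', 'USA', 6000)")
conn.execute("INSERT INTO River(Name, Origin, Length) "
             "VALUES ('Demo', 'Nowhere', 1)")  # hypothetical row

# UPDATE changes values within selected rows; DELETE removes rows.
conn.execute("UPDATE River SET Length = 2 WHERE Name = 'Demo'")
demo_length = conn.execute(
    "SELECT Length FROM River WHERE Name = 'Demo'").fetchone()[0]
conn.execute("DELETE FROM River WHERE Name = 'Demo'")
rows = conn.execute("SELECT Name, Origin, Length FROM River").fetchall()

# ALTER TABLE modifies the schema; DROP TABLE removes the table.
conn.execute("ALTER TABLE River ADD COLUMN Basin TEXT")
conn.execute("DROP TABLE River")
tables_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```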
18
3. SELECT Statement- General Information
• Clauses
   • SELECT specifies desired columns
   • FROM specifies relevant tables
   • WHERE specifies qualifying conditions for rows
   • ORDER BY specifies sorting columns for results
   • GROUP BY, HAVING specify aggregation and statistics
• Operators and functions
   • arithmetic operators, e.g. +, -, …
   • comparison operators, e.g. =, <, >, BETWEEN, LIKE, …
   • logical operators, e.g. AND, OR, NOT, EXISTS, …
   • set operators, e.g. UNION, IN, ALL, ANY, …
   • statistical functions, e.g. SUM, COUNT, ...
   • many other operators on strings, date, currency, ...
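The clauses can be seen working together in one query. The sketch below uses Python's `sqlite3` module with a made-up Country table; the table name, columns and values are assumptions chosen only to exercise WHERE, GROUP BY, HAVING and ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (Name TEXT, Cont TEXT, Pop INTEGER)")
conn.executemany("INSERT INTO Country VALUES (?, ?, ?)", [
    ("A", "X", 10), ("B", "X", 20),   # hypothetical rows
    ("C", "Y", 5),  ("D", "Y", 50),
    ("E", "Z", 7),
])

query = """
SELECT Cont, COUNT(*), SUM(Pop)   -- desired columns and statistics
FROM Country                      -- relevant table
WHERE Pop > 4                     -- qualifying condition on rows
GROUP BY Cont                     -- one result row per continent
HAVING COUNT(*) >= 2              -- condition on each group
ORDER BY SUM(Pop) DESC            -- sort the result
"""
result = conn.execute(query).fetchall()
```

Continent Z is dropped by HAVING because only one of its rows survives, and the remaining groups are sorted by total population.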
19
4. Query Operation & Spatial Index
Filter step: select the objects whose minimal bounding box (mbb) satisfies the spatial predicate
• Traverse the index and apply the spatial test on the mbb
• Output: set of object identifiers (oids)
Refinement step:
• The spatial test is done on the actual geometries of the objects whose mbb passed the filter step
• Costly operation
• Executed only on a limited number of objects
We concentrate on the design of efficient spatial access methods (SAMs) for the filter step
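The two steps can be sketched as follows for a point query. The objects are circles so that the exact test is easy to write; the object ids and coordinates are made up, and the filter step here scans all boxes, whereas a real SAM would traverse an index instead.

```python
import math

# Hypothetical objects: oid -> circle (center x, center y, radius)
objects = {1: (0.0, 0.0, 1.0), 2: (3.0, 3.0, 2.0), 3: (10.0, 10.0, 1.0)}

def mbb(circle):
    """Minimal bounding box (xmin, ymin, xmax, ymax) of a circle."""
    cx, cy, r = circle
    return (cx - r, cy - r, cx + r, cy + r)

def filter_step(point):
    """Cheap test on bounding boxes only; returns candidate oids."""
    x, y = point
    candidates = []
    for oid, circle in objects.items():
        xmin, ymin, xmax, ymax = mbb(circle)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            candidates.append(oid)
    return candidates

def refine_step(point, candidates):
    """Exact test on the actual geometry, run only on the candidates."""
    return [oid for oid in candidates
            if math.dist(point, objects[oid][:2]) <= objects[oid][2]]

# A point in the corner of object 2's box but outside the circle itself
corner_candidates = filter_step((4.5, 4.5))
corner_hits = refine_step((4.5, 4.5), corner_candidates)
# A point well inside object 2
inner_hits = refine_step((3.5, 3.5), filter_step((3.5, 3.5)))
```

The corner point shows why refinement is needed: it passes the filter step as a false candidate and is rejected only by the exact test.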
20
4. Why spatial index method?
B-tree & hash tables
• Guarantee that the number of I/O operations is respectively logarithmic and constant in the collection size
• Index a collection on a key
• Rely on a total order on the key domain, e.g. the order of natural numbers or the lexicographic order on strings
There is no such total order for geometric objects
• SAMs were designed to preserve spatial object proximity as much as possible
21
4. Space-Driven vs. Data-Driven SAMs
Space-driven structures:
• Partition the embedding 2D space into rectangular cells
• Independently of the distribution of the objects
• Objects are mapped to the cells based on some geometric criterion
• Examples: grid file, linear structures
Data-driven structures:
• Organized by partitioning the set of objects, as opposed to the embedding space
• Adapt to the objects' distribution in the embedding space
• Examples: R-tree, R* tree, R+ tree
22
4. Grid File – point indexing
• One page is associated with each cell
• When a cell overflows, it is split into two cells and each point is assigned to the new cell that contains it
• Two adjacent cells can reference the same page
• The cells are of different sizes and the partition adapts to the point distribution
23
4. The Quad tree
• The index is represented as a quaternary tree
• Each internal node has four children, one per quadrant: NW, NE, SW, SE
• Each leaf is associated with a disk page, which stores the index entries
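A small in-memory sketch of this structure for points is given below. The leaf capacity stands in for the disk page size, and the class and method names are assumptions. The sketch does not handle more than `capacity` identical points, which would make a leaf split forever.

```python
class QuadTree:
    def __init__(self, xmin, ymin, xmax, ymax, capacity=2):
        self.box = (xmin, ymin, xmax, ymax)
        self.capacity = capacity   # stand-in for the page size
        self.points = []           # entries stored in a leaf "page"
        self.children = None       # None for a leaf, else dict NW/NE/SW/SE

    def _mid(self):
        xmin, ymin, xmax, ymax = self.box
        return (xmin + xmax) / 2, (ymin + ymax) / 2

    def _quadrant(self, x, y):
        xmid, ymid = self._mid()
        return ("N" if y >= ymid else "S") + ("E" if x >= xmid else "W")

    def _split(self):
        """Turn an overflowing leaf into an internal node with four leaves."""
        xmin, ymin, xmax, ymax = self.box
        xmid, ymid = self._mid()
        self.children = {
            "NW": QuadTree(xmin, ymid, xmid, ymax, self.capacity),
            "NE": QuadTree(xmid, ymid, xmax, ymax, self.capacity),
            "SW": QuadTree(xmin, ymin, xmid, ymid, self.capacity),
            "SE": QuadTree(xmid, ymin, xmax, ymid, self.capacity),
        }
        old_points, self.points = self.points, []
        for x, y in old_points:
            self.insert(x, y)

    def insert(self, x, y):
        if self.children is not None:
            self.children[self._quadrant(x, y)].insert(x, y)
            return
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._split()

    def contains(self, x, y):
        """Point query: follow one path from the root to a leaf."""
        if self.children is None:
            return (x, y) in self.points
        return self.children[self._quadrant(x, y)].contains(x, y)

tree = QuadTree(0, 0, 8, 8)
for p in [(1, 1), (7, 7), (6, 6)]:
    tree.insert(*p)
```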
24
4. The original R-Tree
• A leaf entry is a pair (mbb, oid)
• A non-leaf node contains an array of node entries
• The number of entries per node is between m and M
• For each entry (dr, node_id) in a non-leaf node N, dr is the directory rectangle of a child node of N, whose page address is node_id
• All leaves are at the same level
• An object appears in one, and only one, of the tree leaves
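Given this node layout, a window query can be sketched as a recursive descent that follows every entry whose rectangle intersects the query window. The tree below is built by hand with made-up rectangles and oids; insertion and node splitting are not shown, and child nodes are referenced directly rather than through page addresses.

```python
def intersects(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, window):
    """Return the oids whose mbb intersects the query window."""
    results = []
    for rect, ref in node["entries"]:
        if intersects(rect, window):
            if node["leaf"]:
                results.append(ref)                   # ref is an oid
            else:
                results.extend(search(ref, window))   # ref is a child node
    return results

# Hand-built two-level tree (hypothetical data)
leaf1 = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")]}
leaf2 = {"leaf": True, "entries": [((6, 6, 7, 7), "c"), ((8, 5, 9, 6), "d")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf1), ((6, 5, 9, 7), leaf2)]}

hits = search(root, (2.5, 2.5, 6.5, 6.5))
```

Because directory rectangles in the original R-tree may overlap, a query can descend into several children of the same node, as it does here.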
25
4. The R+ Tree
• The directory rectangles at a given level do not overlap
• For a point query, a single path is followed from the root to a leaf
• The I/O complexity is bounded by the depth of the tree
26
5. What is Query Processing and Optimization (QPO)?
Basic idea of QPO
• In SQL, queries are expressed in a high-level declarative form
• QPO translates a SQL query to an execution plan
   • over the physical data model
   • using operations on file structures, indices, etc.
• The ideal execution plan answers the query Q in as little time as possible
• Constraint: QPO overheads are small
   • Computation time for QPO steps << that for the execution plan
27
5. QPO Challenges in SDBMS
Building blocks for spatial queries
• Rich set of spatial data types and operations
• A consensus on “building blocks” is lacking
• Current choices include spatial select, spatial join, nearest neighbor
Choice of strategies
• Limited choice for some building blocks, e.g. nearest neighbor
Choosing best strategies
• Cost models are more complex since
   • spatial queries are both CPU and I/O intensive
   • while traditional queries are I/O intensive
• Cost models of spatial strategies are not mature
28
5. Choice of building blocks
Choice of building blocks
• Varies across software vendors and products
List of representative building blocks
• Point query: Name a highlighted city on a digital map.
   • Returns one spatial object out of a table
• Range query: List all countries crossed by the river Amazon.
   • Returns several objects within a spatial region from a table
• Spatial join: List all pairs of overlapping rivers and countries.
   • Returns pairs from 2 tables satisfying a spatial predicate
• Nearest neighbor: Find the city closest to Mount Everest.
   • Returns one spatial object from a collection
29
5. Strategies for Spatial Joins
Recall the spatial join example:
• List all pairs of overlapping rivers and countries.
• Return pairs from 2 tables satisfying a spatial predicate
List of strategies
• Nested loop:
   • Test all possible pairs for the spatial predicate
   • All rivers are paired with all countries
• Space partitioning:
   • Test pairs of objects from common spatial regions only
   • Rivers in Africa are tested with countries in Africa only!
• Tree matching:
   • Hierarchical pairing of object groups from each table (section 5.1.6, pp. 121)
• Other, e.g. spatial-join-index based, external plane-sweep, …
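The first two strategies can be compared on a toy dataset. In this sketch rivers and countries are reduced to bounding rectangles with invented coordinates, and space partitioning is done with a uniform grid whose cell size is an arbitrary assumption; both functions also report how many predicate tests they performed.

```python
def overlaps(a, b):
    """Spatial predicate on rectangles (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def nested_loop_join(rivers, countries):
    pairs, tests = set(), 0
    for rname, r in rivers.items():
        for cname, c in countries.items():
            tests += 1
            if overlaps(r, c):
                pairs.add((rname, cname))
    return pairs, tests

def cells(rect, size):
    """Grid cells (i, j) touched by a rectangle."""
    x0, y0, x1, y1 = rect
    return {(i, j)
            for i in range(int(x0 // size), int(x1 // size) + 1)
            for j in range(int(y0 // size), int(y1 // size) + 1)}

def partition_join(rivers, countries, size):
    buckets = {}
    for cname, c in countries.items():
        for cell in cells(c, size):
            buckets.setdefault(cell, []).append(cname)
    pairs, tested = set(), set()
    for rname, r in rivers.items():
        for cell in cells(r, size):
            for cname in buckets.get(cell, []):
                if (rname, cname) not in tested:   # avoid duplicate tests
                    tested.add((rname, cname))
                    if overlaps(r, countries[cname]):
                        pairs.add((rname, cname))
    return pairs, len(tested)

# Hypothetical bounding rectangles, not real geography
rivers = {"Nile": (1, 1, 2, 8), "Amazon": (21, 1, 28, 2)}
countries = {"Egypt": (0, 5, 4, 9), "Sudan": (0, 0, 4, 5),
             "Brazil": (20, 0, 29, 9), "Peru": (15, 0, 20.5, 9)}

nl_pairs, nl_tests = nested_loop_join(rivers, countries)
sp_pairs, sp_tests = partition_join(rivers, countries, size=10)
```

Both strategies return the same pairs, but the partitioned version never tests the Nile against the two South American rectangles.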
30
5. Query Processing and Optimizer process
Fig 5.2
• A sightseeing trip
   • Start: a SQL query
   • End: an execution plan
   • Intermediate stopovers
      • query trees
      • logical tree transforms
      • strategy selection
• What happens after the journey?
   • The execution plan is executed
   • The query answer is returned
31
5. Query Trees
Fig 5.3
• Nodes = building blocks of (spatial) queries
   • See section 3.2 (pp. 55) for the symbols sigma, pi and join
• Children = inputs to a building block
• Leaves = tables
• An example SQL query and its query tree follow:
32
5. Logical Transformation of Query Trees
Fig 5.4
• Motivation
   • Transformations do not change the answer of the query
   • But can reduce computational cost by
      • reducing data produced by sub-queries
      • reducing computation needs of the parent node
• Example transformation
   • Push down select operation below join
   • Example: Fig. 5.4 (compare w/ Fig 5.3, last slide)
   • Reduces size of table for join operation
• Other common transformations
   • Push project down
   • Reorder join operations
   • ...
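The effect of pushing selections below a join can be sketched with plain Python lists standing in for tables. The predicates mirror those named for the execution-plan example (area above 20, facility name 'Campground', distance below 50), but the lake and facility rows are invented, and the join is a simple nested loop in both plans.

```python
import math

# Hypothetical tables: lakes (name, area, location), facilities (name, location)
lakes = [("L1", 30, (0, 0)), ("L2", 10, (5, 5)), ("L3", 25, (100, 100))]
facilities = [("Campground", (10, 0)), ("Campground", (200, 200)), ("Marina", (1, 1))]

def select_after_join():
    """Join everything first, then apply the two selections."""
    out, join_tests = [], 0
    for lname, area, lloc in lakes:
        for fname, floc in facilities:
            join_tests += 1
            if math.dist(lloc, floc) < 50:
                if area > 20 and fname == "Campground":
                    out.append((lname, floc))
    return out, join_tests

def select_before_join():
    """Push both selections below the join, then join the smaller inputs."""
    big_lakes = [l for l in lakes if l[1] > 20]
    camps = [f for f in facilities if f[0] == "Campground"]
    out, join_tests = [], 0
    for lname, _, lloc in big_lakes:
        for _, floc in camps:
            join_tests += 1
            if math.dist(lloc, floc) < 50:
                out.append((lname, floc))
    return out, join_tests

late_result, late_tests = select_after_join()
early_result, early_tests = select_before_join()
```

The answer is unchanged, while the number of (expensive) distance tests drops from 9 to 4 on this toy input.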
33
5. Execution Plans
Fig 5.5
An execution plan has 3 components
• A query tree
• An ordering of evaluation of non-leaf nodes
• A strategy selected for each non-leaf node
Example
• Strategies for the query tree in Fig. 5.5
   • Use scan for Area(L.Geometry) > 20
   • Use index for Fa.Name = 'Campground'
   • Use space-partitioning join for Distance(Fa, L) < 50
   • Use on-the-fly for projection
• Ordering
   • As listed above
34
7. What is Spatial Data Mining?
Non-trivial search for interesting and unexpected spatial patterns
Non-trivial search
• Large (e.g. exponential) search space of plausible hypotheses
• Ex. Asiatic cholera: candidate causes: water, food, air, insects, …; water delivery mechanisms: numerous pumps, rivers, ponds, wells, pipes, ...
Interesting
• Useful in a certain application domain
• Ex. Shutting off the identified water pump => saved human lives
Unexpected
• Pattern is not common knowledge
• May provide a new understanding of the world
• Ex. The water pump-cholera connection led to the “germ” theory
35
7. Choice of Methods
Two Approaches to mining Spatial Data
• Pick spatial features; use classical DM methods
• Use novel spatial data mining techniques
Possible approach:
• Define the problem: capture special needs
• Explore data using maps, other visualization
• Try reusing classical DM methods
• If classical DM methods perform poorly, try new methods
• Evaluate chosen methods rigorously
• Performance tuning as needed
36
7. Location Prediction as a classification problem
Given:
1. Spatial framework: $S = \{s_1, \ldots, s_n\}$
2. Explanatory functions: $f_{X_k} : S \to R$
3. A dependent class: $f_C : S \to C = \{c_1, \ldots, c_M\}$
4. A family of function mappings: $R \times \ldots \times R \to C$
Find: classification model $\hat{f}_C$
Objective: maximize classification_accuracy$(\hat{f}_C, f_C)$
Constraints: spatial autocorrelation exists
Color version of Fig. 7.3, pp. 188 (panels: Nest locations, Distance to open water, Vegetation durability, Water depth)
37
7. Techniques for Location Prediction
Classical methods: logistic regression, decision trees, Bayesian classifier
• Assume learning samples are independent of each other
• Spatial auto-correlation violates this assumption!
• Q? What will a map look like where the properties of a pixel are independent of the properties of other pixels? (see below: Fig. 7.4, pp. 189)
New spatial methods
• Spatial auto-regression (SAR)
• Markov random field
   • Bayesian classifier
38
7. Spatial AutoRegression (SAR)
• Spatial Autoregression Model (SAR)
   • $y = \rho W y + X\beta + \epsilon$
      • $W$ models neighborhood relationships
      • $\rho$ models strength of spatial dependencies
      • $\epsilon$ is the error vector
• Solutions
   • $\rho$ and $\beta$ can be estimated using maximum likelihood (ML) or Bayesian statistics
   • e.g., the spatial econometrics package uses a Bayesian approach using a sampling-based Markov Chain Monte Carlo (MCMC) method
   • Likelihood-based estimation requires $O(n^3)$ operations
   • Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
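One way to see what the SAR model adds to classical regression is to solve it for $y$. The short derivation below assumes that $I - \rho W$ is invertible, which is not stated on the slide.

```latex
\begin{align*}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} (X\beta + \epsilon)
\end{align*}
```

For $\rho = 0$ this reduces to the classical linear regression model $y = X\beta + \epsilon$, so $\rho$ measures how far the data depart from the independence assumption. Working with the $n \times n$ matrix $I - \rho W$ is also what makes likelihood-based estimation expensive when done densely.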
39
7. Associations, Spatial associations, Co-location
40
7. Association Rules: Formal Definitions
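Consider a set of items, $I = \{i_1, \ldots, i_k\}$
Consider a set of transactions $T = \{t_1, \ldots, t_n\}$, where each $t_i$ is a subset of $I$.
Support of $C$: $\sigma(C) = \{t \mid t \in T, C \subseteq t\}$
Then $i_1 \Rightarrow i_2$ iff
• Support: $i_1 \cup i_2$ occurs in at least s percent of the transactions: $\frac{|\sigma(i_1 \cup i_2)|}{|T|}$
• Confidence: at least c% of the transactions containing $i_1$ also contain $i_2$: $\frac{|\sigma(i_1 \cup i_2)|}{|\sigma(i_1)|}$
Example: Table 7.4 (pp. 202) using data in Section 7.4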
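Support and confidence can be computed directly from these definitions. The transactions below are made up for illustration (they are not the data of Table 7.4), and itemsets are represented as Python sets.

```python
# Hypothetical transactions, each a subset of the item set I
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def sigma(itemset):
    """Transactions that contain every item of the itemset."""
    return [t for t in transactions if itemset <= t]

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return len(sigma(lhs | rhs)) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return len(sigma(lhs | rhs)) / len(sigma(lhs))

rule_support = support({"diapers"}, {"beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
```

For the rule diapers => beer, two of the five transactions contain both items, and two of the three transactions containing diapers also contain beer.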
41
7. Co-location rules vs. association rules

                                 Association rules        Co-location rules
Underlying space                 discrete sets            continuous space
Item-types                       item-types               events / Boolean spatial features
Collection                       Transaction (T)          Neighborhood (N)
Prevalence measure               support                  participation index
Conditional probability metric   Pr.[ A in T | B in T ]   Pr.[ A in N(L) | B at location L ]

Participation index = min{pr(fi, c)}
where pr(fi, c), the participation ratio of feature fi in co-location c = {f1, f2, …, fk}, is the fraction of instances of fi with features {f1, …, fi-1, fi+1, …, fk} nearby
N(L) = neighborhood of location L
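The participation index can be sketched as follows. The point instances and the distance threshold are invented, and "nearby" is taken to mean within distance `d` of at least one instance of each other feature. That reading matches the wording above and is adequate for a pair of features; for larger co-locations a stricter definition may be intended, so treat this as an approximation.

```python
import math

# Hypothetical instances of two Boolean spatial features
instances = {
    "A": [(0, 0), (10, 10), (20, 20)],
    "B": [(1, 0), (10, 11)],
}

def participation_ratio(feature, colocation, d):
    """Fraction of instances of `feature` with every other feature nearby."""
    others = [g for g in colocation if g != feature]

    def has_all_nearby(p):
        return all(any(math.dist(p, q) <= d for q in instances[g])
                   for g in others)

    points = instances[feature]
    return sum(1 for p in points if has_all_nearby(p)) / len(points)

def participation_index(colocation, d):
    """Minimum participation ratio over the features of the co-location."""
    return min(participation_ratio(f, colocation, d) for f in colocation)

pr_a = participation_ratio("A", ["A", "B"], d=2)
pr_b = participation_ratio("B", ["A", "B"], d=2)
pi_ab = participation_index(["A", "B"], d=2)
```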
42
7. Spatial Outlier Detection
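• Compute the spatial statistic $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$
• Compute $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right|$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$
• Select points (e.g. S) with $Z_{S(x)}$ above 3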
43
7. Spatial Outlier Detection: Example
Given
• A spatial graph $G = \{V, E\}$
• A neighbor relationship (K neighbors)
• An attribute function $f : V \to R$
Find
• $O = \{v_i \mid v_i \in V, v_i \text{ is a spatial outlier}\}$
Spatial Outlier Detection Test
1. Choice of spatial statistic: $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$
2. Test for outlier detection: $\left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$
Rationale
• Theorem: S(x) is normally distributed if f(x) is normally distributed
Color version of Fig. 7.19 pp. 219
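The test can be sketched on a small graph. The example below assumes a ring of 20 nodes, each with two neighbors, a constant attribute value with a single spike, and a threshold of 3; all of these are made-up inputs, and the population standard deviation is used for $\sigma_s$.

```python
import statistics

n = 20
f = [10.0] * n       # attribute function f: node -> value
f[5] = 100.0         # one node differs sharply from its neighbors
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def spatial_statistic(x):
    """S(x) = f(x) minus the average of f over the neighbors of x."""
    return f[x] - statistics.mean([f[y] for y in neighbors[x]])

S = [spatial_statistic(x) for x in range(n)]
mu_s = statistics.mean(S)
sigma_s = statistics.pstdev(S)
theta = 3

z = [abs((s - mu_s) / sigma_s) for s in S]
outliers = [x for x in range(n) if z[x] > theta]
```

Note that the neighbors of the spike also get a non-zero statistic (their neighborhood average is pulled up), but only the spike itself exceeds the threshold.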
44
8. Spatiotemporal Data
Two types of problems:
• Indexing the current positions and movements of objects and querying their anticipated future positions.
• Indexing and querying the past movements of mobile objects.
Papers:
• On Indexing Mobile Objects
• Indexing the Positions of Continuously Moving Objects
45
Spatiotemporal Data (cont’d)
Indexing current/future locations of mobile objects
The TPR-tree
• Like the R-tree, but the MBRs are time-parameterized to conservative bounding intervals (CBI).
• How are the CBI computed? What is the best way to group objects into a CBI?
   • By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid.
• How do we answer queries using the TPR-tree?
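The idea of a conservative bounding interval can be sketched in one dimension: the lower bound starts at the smallest position and moves with the smallest velocity, and the upper bound starts at the largest position and moves with the largest velocity. The function name and the sample objects are assumptions for this sketch, it is valid only for times at or after the reference time, and grouping objects into nodes is not addressed.

```python
def conservative_interval(objects, t_ref):
    """objects: list of (position at t_ref, velocity).
    Returns a function giving the bounding interval at any time t >= t_ref."""
    lo = min(x for x, _ in objects)
    hi = max(x for x, _ in objects)
    v_lo = min(v for _, v in objects)
    v_hi = max(v for _, v in objects)

    def interval_at(t):
        dt = t - t_ref
        return (lo + v_lo * dt, hi + v_hi * dt)

    return interval_at

# Hypothetical moving objects: (position at time 0, velocity)
moving = [(0.0, 1.0), (5.0, -1.0), (10.0, 2.0)]
cbi = conservative_interval(moving, t_ref=0.0)

bounds_now = cbi(0.0)
bounds_later = cbi(3.0)
positions_later = [x + v * 3.0 for x, v in moving]
```

The interval at time 3 is wider than the tight bounding interval of the actual positions, which is the price paid for not having to update the index as objects move.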
46
Conclusion
Good progress… still more work is needed:
• Devising clean and complete semantics for data models and operators for spatial data, spatial-temporal data
• Efficient implementation
   • Indexing, query processing, query optimization, cost model
• Developing efficient algorithms to mine spatial data
• Alternative architectures
   • spatial-temporal data, moving objects
   • mobile, wireless applications
   • web GIS