Data Mining in Spatial Data Sets


Hemant Kumar Jerath, B.Tech.

MS Project Student

Mangalore University

Advisors: Dr. B.K. Mohan & Dr. (Mrs.) P. Venkatachalam

CSRE, IIT Bombay

Contents

• Data Management System
• Data Mining – Concepts, Algorithms & Tasks
• Data Warehouse
• OLAP (On-line Analytical Processing)
• Knowledge Discovery Process
• Spatial Data Warehouse & OLAP
• Spatial Data Mining – Concept & Definition
• Case Studies – Data Mining Software
• Spatial Data Mining – Software Architecture

Data Base Management System

[Figure labels: Data warehouse; OLAP; SQL query interface; Output/Knowledge: explicit (trivial) knowledge]

Data mining techniques offer a way to explore the implicit knowledge.

DBMS Vs. Data Mining?

DBMS: SQL-driven exploration

Data Mining: automatic exploration

Data Mining

Definition:

Data mining is the analysis of (often large) observational data sets to find implicit relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. [Hand, et al.]

Keywords in Definition

• large data sets
• observational data: as opposed to experimental data
• relationships and summaries: referred to as models and patterns
– e.g. linear equations, rules, clusters, graphs, tree structures and recurrent patterns in time series

[Figure: DATA MINING turns "data tombs" into "golden nuggets" of implicit knowledge]

Transform your data into critical knowledge

Data Mining – a confluence of multiple disciplines

• Information Theory
• Machine Learning
• Artificial Intelligence
• Statistics

Knowledge Discovery Process (KDD)

Phase of real discovery

Data Preprocessing

• Data Cleaning
– Missing values
– Noisy data
  • Binning
  • Clustering
  • Combined computer and human interaction
  • Regression
– Inconsistent data
• Data Integration and Transformation
– Data Integration
– Data Transformation
• Data Transformation
– Smoothing
– Aggregation
– Generalization
– Normalization
– Attribute Construction
• Data Reduction
– Data Cube aggregation
– Dimension reduction
– Data Compression
– Numerosity reduction
– Discretization and concept hierarchy generation
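Two of the preprocessing steps listed above, binning for noisy data and normalization, can be sketched in a few lines. This is only an illustration: the function names, the sample `prices` list and the choice of equal-frequency bins smoothed by bin means are assumptions, not something the slides prescribe.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort, cut into bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map everything to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical noisy attribute values (already sorted here for readability).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
normalized_prices = min_max_normalize(prices)
print(smoothed_prices)
print(normalized_prices)
```

Smoothing by bin means is one of several binning variants; bin medians or bin boundaries would fit the same outline.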


Data Warehouse

Definition:

A data warehouse is a subject-oriented, integrated (from heterogeneous sources), time-variant and non-volatile collection of data in support of management's decision-making process. [W.H. Inmon]

STAR SCHEMA

SNOWFLAKE SCHEMA

Data Cube Technology

Example: in a cube with dimensions [address, time, item], one cell is <Canada, Q1, TV>.

OLAP Operations

• Roll up (drill up): summarize data
– by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): the reverse of roll-up
– from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
– reorient the cube for visualization, e.g. 3D to a series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

[Figures: Drill Down Operation; Roll Up Operation]
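The roll-up, drill-down and slice operations can be imitated on a tiny in-memory fact table. The sketch below is an assumption-laden illustration: the `facts` rows, their numbers and the helper names `aggregate` and `slice_rows` are all hypothetical, and a real OLAP server would work on precomputed cuboids rather than rescanning rows.

```python
from collections import defaultdict

# Hypothetical fact table: (country, city, quarter, item, units_sold).
facts = [
    ("Canada", "Vancouver", "Q1", "TV", 605),
    ("Canada", "Toronto", "Q1", "TV", 400),
    ("Canada", "Vancouver", "Q2", "TV", 680),
    ("Canada", "Toronto", "Q1", "Phone", 120),
    ("USA", "Chicago", "Q1", "TV", 854),
]
COLUMNS = ("country", "city", "quarter", "item")


def aggregate(rows, dims):
    """Sum the measure over every dimension that is not listed in dims."""
    idx = [COLUMNS.index(d) for d in dims]
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[i] for i in idx)] += row[-1]
    return dict(cube)


def slice_rows(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    i = COLUMNS.index(dim)
    return [row for row in rows if row[i] == value]


# Detailed view (what a drill-down from country level would return).
city_level = aggregate(facts, ("city", "quarter", "item"))
# Roll-up by climbing the address hierarchy: city -> country.
country_level = aggregate(facts, ("country", "quarter", "item"))
# Roll-up by dimension reduction: drop quarter and item entirely.
by_country = aggregate(facts, ("country",))
# Slice on quarter = Q1, then summarize by country and item.
q1_only = aggregate(slice_rows(facts, "quarter", "Q1"), ("country", "item"))

print(country_level[("Canada", "Q1", "TV")])
print(by_country)
```

The key `("Canada", "Q1", "TV")` corresponds to the cell <Canada, Q1, TV> used as the cube example in this deck.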

Mining technology today

• Data warehouse
– Extract data via ODBC
• Preprocessing utilities
– Sampling
– Attribute transformation
• Mining operations
– Scalable algorithms: association, classification, clustering, sequence mining
• Visualization tools
• Vendors (IDC 1999)
– SAS: 29%
– SPSS: 13.5%
– IBM: 6%

Data Mining Algorithms

Definition:

A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns.

Data Mining Algorithms

Reductionist approach:

A data mining algorithm can be thought of as a 'tuple' consisting of:

{model structure, score function, search method, data management techniques}

Component | CART | B.P. | A Priori
Task | Classification & regression | Regression | Rule pattern discovery
Structure | Decision tree | Neural network | Association rules
Score function | Cross-validated loss function | Squared error | Support/accuracy
Search method | Greedy search over structures | Gradient descent on parameters | Breadth-first with pruning
DMT* | Unspecified | Unspecified | Linear scans

* DMT: Data Management Technique

• So, in principle, we can generate a potentially infinite number of algorithms by combining different model structures, score functions, search methods and data management techniques.

Data Mining Task-Taxonomy

• Prediction: use some variables to predict unknown or future values of other variables
– Classification, regression and deviation detection
• Description: find human-interpretable patterns that describe the data
– Clustering, association rule discovery, sequential rule discovery
• Verification model: affirm or negate a hypothesis (an iterative process with progressive refinement of the hypothesis)
• Discovery-driven model: the system automatically finds the information

Data Mining Task-Taxonomy

Mining operations

• Classification
– Regression
– Classification trees
– Neural networks
– Bayesian learning
– Nearest neighbor
– Radial basis functions
– Support vector machines
– Meta learning methods: bagging, boosting
• Clustering
– hierarchical
– EM
– density based
• Sequence mining
– Time series similarity
– Temporal patterns
• Item set mining
– Association rules
– Causality
• Sequential classification
– Graphical models: Hidden Markov Models

Mining Tasks

• Discovery of association rules
– X => Y (s%, c%)
– s: support
– c: confidence
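As a small illustration of the two numbers attached to a rule X => Y: support is the fraction of transactions containing both X and Y, and confidence is the fraction of transactions containing X that also contain Y. The `baskets` data and the function name below are made up for the example.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    x, y = set(antecedent), set(consequent)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
s, c = support_and_confidence(baskets, {"bread"}, {"milk"})
print(f"bread => milk ({s:.0%}, {c:.0%})")
```

An algorithm such as A Priori would search for all rules whose support and confidence exceed user-given thresholds; this sketch only scores a single given rule.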


Clustering

Criteria: i. Available similarity

ii. Set function (optimizing technique)

Land use: finding areas of similar land use in an earth observation database

City planning: identifying groups of houses according to their house type, value and geographic location

• Classification
– Finding rules to partition data into disjoint groups


Classification

• Given old data about customers and payments, predict a new applicant's loan eligibility.
• Attributes: Age, Salary, Profession, Location, Customer type

[Figure: previous customers → classifier → decision rules (Salary > 5 L, Prof. = Exec); new applicant's data → decision rules → Good/bad]

Classification Vs Clustering

• Clustering: the methods generate the class labels. [descriptive]

• Classification: class labels are allocated to the data through a classifier. [predictive]

Frequent Episodes

• Sequences of events that occur frequently

• These are mainly used for temporal data.

Deviation detection

• Identification of outliers

Sequence Mining

• The sequence in which association rules occur.

Spatial Data Mining


Definition:

Spatial data mining is the extraction of implicit knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.

What is the difference between Data Mining and spatial data mining?

• Data Mining:
– non-spatial attributes
• Spatial Data Mining:
– Integration of both spatial and non-spatial dimensions in various KDD algorithms
– Spatial attributes (use of thematic maps)
– Non-spatial attributes (relational database)

Spatial Data Models

• Raster Model: pixel data sets

• Vector Model: point, line, polygon objects

Fundamental operations on vector data sets

• Spatial relations with neighbors are an important aspect of spatial data mining
– distance between points
– area of an object (a polygon)
– length of a chain or polygon
– intersection or union of objects
– mutual position of objects (they can intersect, overlap or touch)
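The first three operations in the list above (distance, area, length) reduce to short geometric formulas. The sketch below assumes planar coordinates and simple (non self-intersecting) polygons; the function names and the sample rectangle are illustrative only.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def chain_length(points, closed=False):
    """Length of a chain of points; closed=True adds the last-to-first edge."""
    pairs = list(zip(points, points[1:]))
    if closed:
        pairs.append((points[-1], points[0]))
    return sum(distance(p, q) for p, q in pairs)


# Hypothetical 4 x 3 rectangle.
rectangle = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(distance(rectangle[0], rectangle[2]))   # diagonal
print(polygon_area(rectangle))                # area
print(chain_length(rectangle, closed=True))   # perimeter
```

Intersection, union and topological tests (intersect, overlap, touch) need considerably more machinery and are usually delegated to a spatial database or geometry library.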

[Figure labels: SOLAP; ArcSDE; data mining; spatial and non-spatial data warehouse; attribute data; shape files]

Spatial Warehouse and OLAP

Definition:

The spatial data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of both spatial and non-spatial data in support of management's decision-making process.

SOLAP and SDW-Issues

• Spatial data format
– Structure specific
– Vendor specific
• OLAP processing
– Spatial indexing
– Access methods
• Spatial data cube model
– Use of spatial dimensions in the cube
• Star/Snowflake model

Construction of Spatial Warehouse and OLAP

Star Model of a spatial data warehouse: BC_weather

Concept Hierarchies

Agriculture
  Cash Crop
    Fruits: mango, kiwi
    Vegetation: kale, tomato
  Grains
    Rice: jasmine, basmati
    Wheat

The hierarchy of topological relations (levels from general to specific)

Level 1: g_close_to
Level 2: not_disjoint, close_to
Level 3: intersects, inside, contains, equal
Level 4: adjacent_to, intersects, covers, contains

Modeling dimension-Spatial Data Cube

• Non-spatial dimension
– e.g. temperature and precipitation, generalized to hot and wet
• Spatial to non-spatial dimension
– e.g. pacific_northwest, big_state
• Spatial to spatial dimension

What can we measure in a spatial data cube?

• Numerical measure
– e.g. the monthly revenue of a region; a roll-up may give the total revenue of the region
• Spatial measure
– a collection of pointers to spatial objects
– in a generalization (roll-up), regions with the same temperature and precipitation are grouped together

Spatial Data Mining: A Database Approach
Martin Ester, Hans-Peter Kriegel, Jorg Sander

• Step I: discover centers based on some non-spatial attribute [clustering, descriptive mining]

• Step II: determine the (theoretical) trend of some non-spatial attribute

• Step III: discover deviations from the theoretical trend

• Step IV: explain the deviations by spatial objects, e.g. the presence of some infrastructure

Associations look like this!

• Spatial association rules

is_a(X, school) ^ close_to(X, sports_centre) => close_to(X, parks) [0.5%, 80%]

is_a(X, city) ^ within(X, bc) ^ adjacent_to(X, water) => close_to(X, border) [0.5%, 92%]

Predicates like:

Close_to, far_away

Intersect, overlap and disjoint

Left_of, west_of
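To make the rule notation concrete, the sketch below evaluates a rule of the form is_a(X, school) ^ close_to(X, sports_centre) => close_to(X, park) on a handful of made-up objects. Everything here is an assumption for illustration: the object coordinates, the 1 km threshold for `close_to`, and the choice to measure support against all objects in the data set.

```python
import math

# Hypothetical spatial objects: (name, kind, x_km, y_km).
objects = [
    ("s1", "school", 0.0, 0.0),
    ("s2", "school", 5.0, 5.0),
    ("s3", "school", 10.0, 0.0),
    ("c1", "sports_centre", 0.5, 0.0),
    ("c2", "sports_centre", 5.0, 5.5),
    ("p1", "park", 0.0, 0.8),
    ("p2", "park", 10.0, 0.5),
]
CLOSE_KM = 1.0  # assumed distance threshold for the close_to predicate


def close_to(obj, kind):
    """True if some other object of the given kind lies within CLOSE_KM of obj."""
    _, _, x, y = obj
    return any(
        other is not obj
        and other[1] == kind
        and math.hypot(other[2] - x, other[3] - y) <= CLOSE_KM
        for other in objects
    )


schools = [o for o in objects if o[1] == "school"]
# Objects satisfying the antecedent: is_a(X, school) ^ close_to(X, sports_centre).
antecedent = [o for o in schools if close_to(o, "sports_centre")]
# Objects satisfying antecedent and consequent: ... ^ close_to(X, park).
both = [o for o in antecedent if close_to(o, "park")]

support = len(both) / len(objects)
confidence = len(both) / len(antecedent) if antecedent else 0.0
print(f"[{support:.1%}, {confidence:.0%}]")
```

A real miner would not test each pair by brute force; it would first filter candidates with a spatial index, which is the idea behind the top-down deepening approach.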


• Introduction to:
– neighborhood graphs
– neighborhood paths

• The predicate neighbor may be one of the neighborhood relations:

Top-Down Deepening Approach

• Find large patterns and strong implications at a coarse level of detail

• Then search down to low-level details (progressive deepening)

• The search continues until no large patterns are found.

Top-Down Deepening Approach

• Optimization technique: search for large patterns at a high concept level first
– R-tree or plane-sweep techniques operating on MBRs (minimum bounding rectangles)
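The coarse filtering step on MBRs can be sketched as follows. Note the swap: for brevity this example compares every pair of rectangles by brute force, whereas the slide names R-trees or plane sweep for that job. The polygons, the `gap` parameter and the function names are hypothetical.

```python
from itertools import combinations


def mbr(points):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_within(a, b, gap=0.0):
    """True if rectangles a and b intersect or are at most `gap` apart on both axes.

    If this returns False, the objects inside the rectangles are certainly
    farther apart than `gap`, so the pair can be discarded without an exact test.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return (ax1 - gap <= bx2 and bx1 - gap <= ax2
            and ay1 - gap <= by2 and by1 - gap <= ay2)


# Hypothetical polygons keyed by name.
shapes = {
    "town": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "lake": [(2.5, 0.5), (4, 0.5), (4, 1.5), (2.5, 1.5)],
    "mine": [(10, 10), (11, 10), (11, 11)],
}
boxes = {name: mbr(points) for name, points in shapes.items()}

# Coarse pass: keep only pairs whose MBRs are within 1 unit of each other.
candidates = [
    (a, b) for a, b in combinations(shapes, 2)
    if mbrs_within(boxes[a], boxes[b], gap=1.0)
]
print(candidates)
```

Only the surviving candidate pairs would then be passed to the expensive exact geometric predicates at the finer level.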

Generalization-based Spatial Data Mining

• Nonspatial-dominant generalization
– e.g. a temperature of -9C, in the range -10C to 0C, is generalized to COLD (attribute induction)

• Spatial-dominant generalization
– Quad-trees and R-trees (attribute induction)

Spatial Clustering

• Clustering algorithms can be applied to discover centers of high economic power.
– DBSCAN
– PAM, CLARA, CLARANS (spatial-data-dominant clustering and non-spatial-data-dominant clustering)
– CLARANS (neighbor graphs)
– DBLEARN on non-spatial data
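Of the algorithms listed, DBSCAN is compact enough to sketch. The version below is a minimal, unoptimized rendering of the density-based idea (core points, border points, noise) with a linear-scan neighborhood query; the sample points and the `eps` and `min_pts` values are assumptions, and a practical implementation would use a spatial index for the region query.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep growing
                seeds.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical locations: two dense groups and one isolated point.
locations = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10),
             (50, 50)]
labels = dbscan(locations, eps=1.5, min_pts=3)
print(labels)
```

Unlike PAM or CLARANS, this approach does not need the number of clusters in advance and leaves isolated objects unassigned.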

Spatial Classification

• Non-spatial attribute e.g. no. of salespersons in a store

• Spatially related attribute with non-spatial values, like population living within 1km from store

• Spatial predicates, like distance_less_than_10km(X, a)

• Spatial function, like driving_distance(X,beach)

Decision Tree

Description of classified objects

Description of census block group

Buffers are defined for the trade area

High_profit=N

High_profit=Y

The classification is developed using the ID3 algorithm.
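ID3 grows a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that selection step, not the full tree construction. The store records and the attribute names `near_highway` and `dense_population` are invented for illustration; only the target `high_profit` with values Y/N comes from the slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on the given attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical store records with spatially derived attributes.
stores = [
    {"near_highway": "Y", "dense_population": "Y", "high_profit": "Y"},
    {"near_highway": "Y", "dense_population": "N", "high_profit": "Y"},
    {"near_highway": "N", "dense_population": "Y", "high_profit": "N"},
    {"near_highway": "N", "dense_population": "N", "high_profit": "N"},
]
gains = {
    attr: information_gain(stores, attr, "high_profit")
    for attr in ("near_highway", "dense_population")
}
best_attribute = max(gains, key=gains.get)  # ID3 splits on this attribute first
print(gains, best_attribute)
```

In this toy data `near_highway` separates the classes perfectly, so it would become the root of the tree.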

Spatial Trend Detection

• Trend: a temporal pattern
– network alarms
– recurrent illness

• The algorithm computes the local changes of the specified attribute when moving to the neighbors, as well as the distance to the neighbors
– Linear regression is used to generate the trend
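The regression step can be sketched with ordinary least squares on (distance, attribute change) pairs collected while moving away from a start object. The observations below, the attribute (a rent value) and the helper name `fit_line` are hypothetical; the slide does not specify how the neighborhood paths are generated, so that part is left out.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations along neighborhood paths from a city centre:
# (distance from the start object in km, attribute value at the neighbor).
start_value = 100.0
observations = [(1.0, 90.0), (2.0, 80.0), (3.0, 70.0), (4.0, 60.0)]

distances = [d for d, _ in observations]
changes = [value - start_value for _, value in observations]

slope, intercept = fit_line(distances, changes)
print(slope, intercept)
```

A clearly negative slope would indicate that the attribute decreases with distance from the start object; a slope near zero would indicate no spatial trend.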

Spatial Predictions

• Location predictions
– Logistic Spatial Autoregressive Model (SAR)
– y = dWy + Xb + e
– W: contiguity matrix

• Spatial outlier detection techniques
– (use of neighborhood graphs, paths and indices)
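The autoregressive equation y = dWy + Xb + e says that the value at each location depends on the values of its neighbors through the contiguity matrix W. The sketch below treats only the linear form written in the equation (not the logistic variant), assumes the error term e is zero, and solves for y by simple fixed-point iteration. The three-cell layout, d = 0.5 and the values of Xb are made up.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1 (rows of zeros are left as zeros)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result


def solve_sar(W, xb, d, iterations=200):
    """Iterate y <- d * W y + xb; converges for |d| < 1 with row-normalized W."""
    y = list(xb)
    for _ in range(iterations):
        y = [d * sum(w * yj for w, yj in zip(row, y)) + xb_i
             for row, xb_i in zip(W, xb)]
    return y


# Hypothetical contiguity of three cells in a line: 0 - 1 - 2.
contiguity = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
W = row_normalize(contiguity)
xb = [1.0, 2.0, 3.0]  # assumed value of Xb per cell; e is taken as 0
d = 0.5               # assumed spatial dependence strength

y = solve_sar(W, xb, d)
print(y)
```

With d = 0 the model collapses to ordinary regression (y = Xb + e); larger d pulls each cell's value toward the average of its neighbors.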


GeoMiner Architecture

SPIN ARCHITECTURE

Weka 3: Machine Learning Software in Java

machine learning algorithms for data mining problems

Weka contains tools for
• data pre-processing
• classification
• regression
• clustering
• association rules
• visualization

SDAM Architecture

Use of the MLC++ Library for Implementing DM Techniques

MLC++ extends
• supervised machine learning
• classification
• accuracy estimation
• cross-validation
• bootstrap
• decision trees
• ID3
• decision graphs
• naive-bayes
• decision tables
• majority
• induction algorithms
• classifiers
• categorizers
• general logic diagrams
• instance-based algorithms
• discretization
• lazy learning

Bibliography

[1] Hand, D., Mannila, H., and Smyth, P.: 'Principles of Data Mining', MIT Press, London, 2001, ISBN 0-262-08290-X

[2] Adriaans, P., and Zantinge, D.: 'Data Mining', Addison-Wesley, Harlow, UK, 1996

[3] Berry, M.J.A., and Linoff, G.: 'Mastering Data Mining', Wiley, New York, 2000

[4] Witten, I.H., and Frank, E.: 'Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations', Morgan Kaufmann, San Francisco, CA, 2000, ISBN 1-55860-552-5

[5] Guting, R.: 'An Introduction to Spatial Database Systems', Very Large Data Bases Journal, Springer Verlag, 1994

[6] Han, J., and Koperski, K.: 'Discovery of Spatial Association Rules in Geographic Information Databases', Proc. Fourth International Symposium on Large Spatial Databases, Maine, 47-66, 1995

[7] Shekhar, S., and Huang, Y.: 'Co-location Rules Mining: A Summary of Results', Proc. Spatio-Temporal Symposium on Databases, http://www.cs.umn.edu/research/shashi-group/paper_list.html

[8] Barnett, V., and Lewis, T.: 'Outliers in Statistical Data', John Wiley (3rd Ed.), 1994

[9] Hawkins, D.: 'Identification of Outliers', Chapman and Hall, 1980

Issues in Building a Spatial Data Mining Environment

• Size of the database
• Static or dynamic database
• Testing the present spatial data structure for finding the implicit relationships between spatial objects for mining tasks
• Building a spatial data warehouse and spatial OLAP
• Which data mining task?
• Choosing the mining algorithms for a specific task, e.g. in the 10-year span since the concept of associative data mining was introduced, various algorithms have been developed
• Which platform for implementing the mining algorithms: MLC++ on VC++ or Weka on Java?
