Geographic Dimension in Data Mining
Konrad Dramowicz
Centre of Geographic Sciences
Lawrencetown, Nova Scotia, Canada
ESRI Business GeoInfo Summit
Chicago, April 18-19, 2005
What is data mining?
• Data mining (also known as Knowledge Discovery) is a technology, a science, and an art. It helps extract significant, previously unknown information from databases.
• Data mining automatically detects relevant patterns in a database. For many years, however, statisticians have mined databases manually, looking for statistically significant patterns.
• Data mining is also a tool for predicting future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Experts usually miss this predictive information because it lies outside their expectations.
GIS as a synergetic technology
• GIS alone is a very powerful technology dealing with spatial aspects of the real world.
• However, GIS combined with other technologies such as data mining, CRM, or ERP can be seen as a synergetic technology.
What is special about spatial data?
• Heavy use of computational geometry algorithms such as
  – Polygon intersection
  – Topological operations
• Large and complex objects such as
  – High fractal dimension polygons
  – Polygons with attached topological information
  – Networks and their attributes (for example, addresses)
• Large index tables
  – Each spatial feature can be indexed by many z-values
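As a rough illustration of z-value indexing, the sketch below interleaves the bits of integer grid coordinates into a Z-order (Morton) code. The function names, the 16-bit cell coordinates, and the choice of putting column bits in even positions are assumptions made for this example; the slides do not specify a particular encoding.

```python
def z_value(col: int, row: int, bits: int = 16) -> int:
    """Interleave the bits of (col, row) into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)       # column bits go to even positions
        code |= ((row >> i) & 1) << (2 * i + 1)   # row bits go to odd positions
    return code


def z_values_for_box(col_min, row_min, col_max, row_max):
    """A feature covering several grid cells gets one z-value per covered cell."""
    return sorted(
        z_value(c, r)
        for r in range(row_min, row_max + 1)
        for c in range(col_min, col_max + 1)
    )
```

A feature whose bounding box covers cells (1,1) to (2,2) is indexed by four z-values that are not contiguous, which is one reason spatial index tables grow large.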
Spatial data structures
• Spatial data structures are designed for indexing or storing spatial data.
• Usually they are raster-based structures indexed with a B-tree (a balanced multiway search tree).
Reasons for the emergence of GIS and data mining
• Abundance of data
• Inefficiency of traditional technology in processing information
• Progress in computer technology, including data structures, database management, computer graphics, artificial intelligence, etc.
• Growing user awareness and demand
• Interdisciplinary approach, such as:
  – GIS: geography, computer science, forestry, land surveying, military applications
  – Data mining: statistics, computer science, marketing, quality control, medicine
Major steps in developing new technology (1970s)
• GIS
  – Data collection
  – Question example: “What is the forest stand type in a given polygon?”
  – Data delivery:
    • Retrospective and static
  – Enabling technologies:
    • Mainframe computers
    • Digitizing tables
  – Major users:
    • Forestry
    • Military
    • Land registry
• Data mining
  – Data collection
  – Question example: “What was the total revenue in the last three years?”
  – Data delivery:
    • Retrospective and static
  – Enabling technologies:
    • Mainframe computers
    • Tapes, disks
Major steps in developing new technology (1980s)
• GIS
  – Data access
  – Question example: “Where is the most suitable animal habitat?”
  – Data delivery:
    • Retrospective and dynamic at feature level
  – Enabling issues:
    • Vector topology
    • Raster data structure
    • DBMS
  – Major users:
    • Geology
    • Environment
    • Government
• Data mining
  – Data access
  – Question example: “What were unit sales in Maritimes last April?”
  – Data delivery:
    • Retrospective and dynamic at record level
  – Enabling technologies:
    • RDBMS
    • SQL
    • ODBC
Major steps in developing new technology (1990s)
• GIS
  – Data modeling and analysis
  – Question example: “What are changes in the forest cover in a given area?”
  – Data delivery:
    • Retrospective and dynamic at multiple levels
  – Enabling issues:
    • Vector/raster integration
    • GPS
    • SQL
    • Interoperability
    • Portable computers
  – Major users:
    • Corporations
    • Municipalities
    • Education
• Data mining
  – Data warehousing and decision support
  – Question example: “What were unit sales in Maritimes last April? Drill down to Halifax”
  – Data delivery:
    • Retrospective and dynamic at multiple levels
  – Enabling technologies:
    • OLAP
    • Data warehouses
    • Portable computers
Major steps in developing new technology (emerging today)
• GIS
  – Deployment of geographical information
  – Question example: “How to get to the closest restaurant?”
  – Data delivery:
    • Prospective and proactive
  – Enabling issues:
    • LBS
    • Internet mapping
    • Geodatabases
  – Major users:
    • Communication
    • Business
    • General public
• Data mining
  – Data mining
  – Question example: “What is likely to happen to Halifax unit sales next month and why?”
  – Data delivery:
    • Prospective and proactive
  – Enabling technologies:
    • Distributive algorithms and databases
    • Multiprocessor computers
    • Massive databases
Ten hottest jobs and jobs that will disappear
(Time magazine, May 22, 2005)
1. Tissue engineers
2. Gene programmers
3. Frankenfood monitors
4. Pharmers
5. Data miners
“Research gurus will be on hand to extract useful tidbits from mountains of data, pinpointing behavior patterns for marketers and epidemiologists alike”
1. Stockbrokers, auto dealers, mail carriers, insurance and real estate agents
2. Teachers
3. Printers
4. Stenographers
5. CEOs
6. Orthodontists
7. Prison guards
8. Truckers
9. Housekeepers
10. Fathers (?)
Broad and narrow definition of data mining
• The broad definition of data mining refers to traditional statistical methods (“we are all data miners”).
• The narrow definition of data mining refers to automated methods, artificial intelligence, and computer learning techniques.
Data mining and GIS as a technology, science, and art
• GIS and data mining as technologies:
  – Originated and stimulated by computer technology
  – Dealing with massive databases
  – Employing graphics
• GIS and data mining as sciences:
  – Having specific methods
  – Using specialized tools
  – Trying to develop own methodology
  – Having interdisciplinary character
• GIS and data mining can be seen as arts:
  – Requiring technical experience
  – Requiring experience in content domain area
Domains and scales of GIS and data mining applications
• The list of GIS and data mining applications is very extensive. Both technologies can be applied to practically any domain. Because it is not appropriate to define GIS and data mining by listing their most typical applications, both technologies can be considered domain-free.
• GIS and data mining are also scale-free, since they can be applied at many different scales. There are examples of using GIS for mapping a human eye and for analyzing changes at a global or even cosmic scale. Data mining is used for diagnosing a single patient and for international analyses.
Too much information?
• Can GIS and data mining help to handle the problem of information overload?
  – 61% of managers believe that information overload is present in their own workplace
  – 80% believe the situation will get worse
  – Over 50% of managers ignore data in the current decision-making process because of information overload
  – 84% of managers store this information for the future; it is not used for current analysis
  – 60% believe that the cost of gathering information outweighs its value
(Kantardzic, 2003)
CRISP
• CRISP (Cross-Industry Standard Process for Data Mining) is a general data mining protocol developed in the late 1990s.
• CRISP is similar to the product life cycle methodology developed in software engineering and used in managing GIS projects.
• CRISP consists of six phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
Mining geographical information
• Components of Geographic Information Systems:
  – Data input
  – Data manipulation
  – Analysis and modeling
  – Data output
• Steps in data mining:
  – Problem understanding
  – Data pre-processing
  – Modeling
  – Evaluation
  – Deployment
Comparing data mining and GIS
• Data mining
  – Operates in a multidimensional abstract space
  – Hypotheses are generated by machine learning
  – Results go beyond the content of database
• GIS
  – Operates in geographical space
  – Hypotheses are generated by users
  – Difficulties in mapping multivariate dependencies
Business understanding
• Determine business objectives
  – Background
  – Business objectives
  – Business success criteria
• Assess situation
  – Inventory of resources
  – Requirements, assumptions and constraints
  – Risk and contingencies
  – Terminology
  – Costs and benefits
• Determine data mining goals
  – Data mining goals
  – Data mining success criteria
• Produce project plan
  – Project plan
  – Initial assessment of tools and techniques
Data understanding
• Collect initial data
  – Initial data collection report
• Describe data
  – Data description report
• Explore data
  – Data exploration report
• Verify data quality
  – Data quality report
Data preparation
  – Data set
  – Data set description
• Select data
  – Rationale for inclusion / exclusion
• Clean data
  – Data cleaning report
• Construct data
  – Derived attributes
  – Generated records
• Integrate data
  – Merged data
• Format data
  – Reformatted data
Modeling
• Select modeling techniques
  – Modeling technique
  – Modeling assumptions
• Generate test design
  – Test design
• Build model
  – Parameter setting
  – Models
  – Model description
• Assess model
  – Model assessment
  – Revised parameter settings
Evaluation
• Evaluate results
  – Assessment of data mining results with respect to business success criteria
  – Approved models
• Review process
  – Review of process
• Determine next steps
  – List of possible actions
  – Decision
Deployment
• Plan deployment
  – Deployment plan
• Plan monitoring and maintenance
  – Monitoring and maintenance plan
• Produce final report
  – Final report
  – Final presentation
• Review project
  – Experience documentation
Integrating GIS and data mining
• There are numerous areas where GIS and data mining already overlap.
• Both GIS and data mining are synergetic, powerful, dynamic, and rapidly developing technologies.
• The integration of GIS and data mining has already begun.
• Further integration can significantly benefit both technologies.
Integration benefits to GIS
• GIS can benefit from being integrated with data mining by using:
  – More efficient data manipulation tools
  – Specialized Exploratory Data Analysis tools
  – Powerful new modeling tools
  – Better visualization tools
More efficient data manipulation tools
• Data manipulation tools represent the primary area of data mining. These tools are important but not critical in GIS, since non-spatial attributes can always be manipulated outside GIS.
• Data cleansing is one of the most time-consuming and costly operations within GIS projects.
• Below are some examples of typical data mining operations that can be done with specialized data manipulation tools:
  – Detecting and replacing missing data
  – Improving attribute accuracy
  – Handling inconsistency in databases
  – Data reclassification
  – Merging attributes and appending records
  – Filtering data
Exploratory Data Analysis tools
• Exploratory Data Analysis (EDA) tools have been very commonly used in GIS analysis in the last ten years.
• Exploratory Data Analysis is usually the very first step in any spatial analysis.
• Data mining provides very specialized EDA tools for such operations as:
  – Outlier analysis
  – Testing normality
  – Analyzing distribution with boxplots and Q-Q plots
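One common form of the outlier analysis listed above is the boxplot rule, sketched below with Python's standard library only. The function name `boxplot_outliers` and the conventional 1.5 multiplier are assumptions for this example; data mining packages may use other quartile methods or thresholds.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return values outside the boxplot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical attribute values for seven map features
print(boxplot_outliers([10, 12, 11, 13, 12, 11, 95]))
```

In a GIS setting the flagged records could then be mapped to check whether the outliers cluster in space.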
Exploratory Data Analysis in data mining
Powerful new modeling tools
• Data mining offers numerous powerful modeling tools that are not implemented yet in GIS, such as:
  – Decision trees and decision rules
  – Association rules
  – Artificial neural networks
  – Genetic algorithms
• Some data mining tools are partially implemented in some GIS, including:
  – Fuzzy logic
  – Cluster analysis
New visualization tools
• Visualization tools play a critical role in GIS, and they are also very important in data mining.
• GIS focuses primarily on mapping tools using spatial attributes and employing the art of cartography.
• Data mining focuses on charting and graphing non-spatial attributes using statistical methods.
Visualization tool in data mining: cluster viewer
Integration benefits to data mining
• Data mining can especially benefit from being integrated with GIS in the following phases of the CRISP methodology:
  – Data preparation
  – Analysis
  – Evaluation
  – Deployment
Data preparation
• Data preparation represents a critical component in both technologies.
• Spatial (geographically referenced) attributes are very common within databases being analyzed with data mining.
• However, with data mining alone, many typical operations on spatial attributes cannot be performed at all.
• GIS can provide tools for such operations as, for example:
  – Spatial referencing
  – Geocoding
  – Building topological relationships among objects
Deriving new attributes
• GIS can be very useful in expanding the number of attributes for further analysis by deriving new attributes.
• These attributes can be derived based on
  – Geographical (metric) information
  – Topological information
Deriving geographical (metric) attributes
• Length of lines
• Areas of polygons
• Distance to a closest object
• Directions
• Density of features
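The first three metric attributes above can be sketched in a few lines, assuming planar (projected) coordinates given as (x, y) tuples. The function names are illustrative only; a GIS would also handle map projections, holes in polygons, and distances to lines or polygons rather than just points.

```python
import math


def line_length(points):
    """Sum of segment lengths along a polyline."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def polygon_area(ring):
    """Shoelace formula for a simple polygon given as a list of vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def distance_to_closest(point, others):
    """Distance from a point to the nearest of a set of point objects."""
    return min(math.dist(point, o) for o in others)
```

Each result can be written back to the attribute table as a new column, which is how the derived attribute becomes available to a data mining model.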
Deriving topological attributes
• Connectivity of nodes
• Adjacency of polygons
• Information resulting from such topological operations as, for example:
  – Inside
  – Within
  – Intersects
  – Contains
  – Covers
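Two of the topological attributes above, "inside" and polygon adjacency, can be derived as simple true/false flags. The sketch below assumes planar coordinates, simple polygons without holes, and neighbouring polygons that share identical vertices along their common edge; the function names are hypothetical, and points exactly on a boundary are not treated specially.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: count crossings of a ray running to the right of (x, y)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):                      # edge spans the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def shares_edge(ring_a, ring_b):
    """Adjacency test: True when two polygons have an identical edge in common."""
    def edges(ring):
        return {frozenset(e) for e in zip(ring, ring[1:] + ring[:1])}
    return bool(edges(ring_a) & edges(ring_b))
```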
Analysis
• Modeling and analysis are the most powerful components of both data mining and GIS.
• However, the two technologies are complementary in their approach to modeling. GIS provides more specialized spatial analysis tools, whereas data mining provides mainly statistical analysis tools.
• Data mining lacks numerous tools from the domain of GIS.
Missing geographical analytical tools in data mining
• Spatial statistics
  – For example: spatial multiple linear regression
• Spatial analysis
  – For example: spatial autocorrelation
• Geostatistics
  – For example: kriging or trend surface analysis
• Network analysis
  – For example: optimal path or minimal tour
• Surface analysis
  – For example: visibility analysis
• Location-allocation modeling
  – For example: allocating demand to a given center
• Regionalization (spatial clustering)
Evaluation
• Evaluation is a required step in data mining, whereas in GIS evaluation is recommended rather than strictly required.
• GIS offers invaluable tools for evaluating residuals (the difference between actual and predicted values), especially for
  – Mapping residuals
  – Analyzing the spatial autocorrelation of residuals
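One widely used measure of spatial autocorrelation is Moran's I. The sketch below computes it for model residuals, assuming the spatial weights are supplied as a dictionary keyed by pairs of feature indices (for example, 1 for adjacent features). The slides do not name a specific statistic, so the choice of Moran's I and the data layout are assumptions of this example.

```python
def residuals(actual, predicted):
    """Residual = actual value minus predicted value, one per map feature."""
    return [a - p for a, p in zip(actual, predicted)]


def morans_i(values, weights):
    """Moran's I for values at n locations; weights maps (i, j) to w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(weights.values())
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Four features in a row; each is a neighbour of the next one (hypothetical data)
w = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
print(morans_i([2, 2, -2, -2], w))   # similar residuals sit side by side
print(morans_i([2, -2, 2, -2], w))   # residuals alternate between neighbours
```

A clearly positive value suggests that the model misses a spatially structured factor, which is exactly what a residual map helps to reveal.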
Deployment
• GIS provides mapping tools that are not available in traditional data mining. Used for mapping results, these tools can enhance the deployment phase in data mining.
Spatial data mining resources
• The leading academic centers developing spatial data mining are:
  – University of Utah (USA)
  – Southern Illinois University (USA)
  – Boston University (USA)
  – Simon Fraser University (Canada)
  – University of Leeds (England)
  – University of Munich (Germany)
  – University of Bari (Italy)
  – Russian Academy of Sciences
Spatial data mining software
• GeoMiner is a prototype of a spatial data mining system, including a spatial database server.
• Spin! (Spatial Mining for Data of Public Interest) represents the Web-based integration of data mining and GIS for such applications as public health, environmental protection, seismology, or marketing. This European product includes live Oracle-based queries and data visualization.
Visual programming and streams
• Visual streams constructed from single operations and linked with visual programming represent a common contemporary user interface.
• Some data mining software packages implemented visual programming at the end of the 1990s.
• Geoprocessing model building streams were implemented in ArcGIS, version 8.0.
• Will these streams ever be linked together?
Basic data mining operations (1)
• Source operations
• Record operations
• Fields operations
Basic data mining operations (2)
• Graphs operations
• Modeling operations
• Output operations
Clustering stream: Kohonen, K-Means, and Two Step algorithms
Data mining modeling tools
• Predictive tools
  – Neural networks
  – Multiple linear regression
  – Logistic regression
  – Prediction using C5.0 rule-based algorithm
• Rule-based tools
  – C5.0
  – CR&T (Classification and Regression Trees)
  – Association rules
    • Apriori
    • GRI (Generalized Rule Induction)
• Classification tools
  – K-Means clustering
  – Kohonen network
  – TwoStep clustering
Neural networks modeling
• Purpose: to predict a numeric or categorical target variable
• Output:
  – Predicted value
  – Residuals (actual minus predicted values)
  – Rules
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
  – Rules
Rule induction modeling
• Purpose: to predict a categorical target variable
• Algorithm: C5.0
• Output:
  – Importance of predictors
  – Predicted value
  – Residuals
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
Multiple linear regression
• Purpose: to predict a numerical target variable using numerical predictors
• Output:
  – Set of predictors
  – Predicted target variable
  – Residuals
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
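To make the three outputs above concrete, here is a minimal ordinary least squares sketch in plain Python. It solves the normal equations with Gauss-Jordan elimination, which is adequate for a small illustration but is not how a production package would do it. The function names and the tiny data set are hypothetical.

```python
def fit_linear_regression(rows, target):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]            # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, target))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]   # intercept, then slopes


def predict(coeffs, row):
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


# Hypothetical records: two numerical predictors and a numerical target
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
actual = [1, 3, 4, 6, 8]
coeffs = fit_linear_regression(rows, actual)
predicted = [predict(coeffs, r) for r in rows]
resid = [a - p for a, p in zip(actual, predicted)]  # actual minus predicted
```

Joining `actual`, `predicted`, and `resid` back to the map features by record identifier gives the three mappable layers listed on the slide.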
Prediction stream: numeric target variable
Predicting with neural networks: numeric target variable
Predicting with neural networks: residuals
Predicting with multiple linear regression
Predicting with multiple linear regression: residuals
Generating rules
• Purpose: to perform rule induction or to discover association rules
• Algorithms:
  – C5.0 (categorical target variables, categorical or numerical predictors)
  – Apriori (categorical target variables and predictors)
  – GRI (Generalized Rule Induction, categorical target variables, categorical or numerical predictors)
• Output:
  – Rules for groups of records, including their frequency and accuracy
• Can be mapped:
  – Geographical distribution of rules
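The frequency and accuracy of a rule are commonly reported as its support and confidence. The sketch below evaluates a single candidate rule over a set of records; it is not an implementation of Apriori or GRI, which also search for candidate rules. Equating "frequency" with support and "accuracy" with confidence is an assumption of this example, as are the function name and the sample records.

```python
def rule_stats(records, antecedent, consequent):
    """Support and confidence of the rule: antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)                    # share of all records
    confidence = n_both / n_ante if n_ante else 0.0    # share of matching records
    return support, confidence


# Hypothetical records, one set of categorical values per area
records = [
    {"urban", "high_income"},
    {"urban", "high_income"},
    {"urban", "low_income"},
    {"rural", "low_income"},
]
support, confidence = rule_stats(records, {"urban"}, {"high_income"})
```

Mapping which areas satisfy a rule's antecedent and consequent shows the geographical distribution of that rule.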
Rule induction modeling: numeric target variable
Examples of rules for the target variable: high GDP
Logistic regression
• Purpose: to predict a categorical target variable using categorical and numerical predictors
• Output:
  – Set of predictors
  – Predicted target variable
  – Residuals
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
Stream with rules: C5, Apriori, and GRI algorithms
Supernodes
Stream with rules inside supernode
Prediction stream: categorical target variable
Predicting with neural networks: categorical target variable
Predicting with logistic regression
Clustering
• Purpose: to group records into clusters
• Algorithms:
  – Kohonen network
  – K-Means
  – TwoStep
• Output:
  – Cluster memberships
  – Cluster description
  – Distance to cluster centroids
• Can be mapped:
  – Cluster memberships
  – Most typical features for each cluster
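A bare-bones K-Means loop shows where the cluster memberships and the distances to centroids come from. This sketch assumes numeric attributes, Euclidean distance, caller-supplied starting centroids, and a fixed number of iterations; commercial packages add attribute scaling, smarter initialization, and convergence checks.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def k_means(points, centroids, iterations=10):
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = assign(points, centroids)
    distances = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    return labels, centroids, distances


# Hypothetical two-attribute records and two starting centroids
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
labels, centroids, distances = k_means(points, [(0, 0), (10, 10)])
```

The `labels` list is what would be joined to the features and mapped as cluster membership; small values in `distances` identify the most typical features of each cluster.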
Clustering: K-Means algorithm
Clustering: similarities to cluster centroids
Factor analysis / PCA (Principal Components Analysis)
• Purpose: to reduce the number of variables by replacing individual variables by factors / components
• Output:
  – Extracted factors
  – Loadings of variables on factors / components
  – Factor / component scores
• Can be mapped:
  – Geographical distribution of scores
Principal Components Analysis: component scores
Principal Components Analysis: scores of all five components
Classification tree modeling
• Purpose: to pick individual predictors one at a time and classify them to optimize (minimize or maximize) a target variable
• Algorithm:
  – Classification and Regression Tree
  – CHAID
  – Exhaustive CHAID
  – QUEST
• Output:
  – Top predictors
  – Groups of records (nodes)
  – Rules
• Can be mapped:
  – Geographical distribution of rules
Classification tree and corresponding map
OLAP cubes
• OLAP stands for On-Line Analytical Processing
• OLAP tools allow the user to
  – query
  – browse
  – and summarize information in a very efficient, interactive, and dynamic way
• OLAP tools represent a vital component of both the Business Intelligence and data mining technology. They provide an aggregated approach to analyzing large amounts of detailed data.
Why cubes?
• OLAP databases are often referred to as “cubes” because of their multidimensional nature. A cube is a visual representation of a multidimensional table and has just three dimensions: rows, columns and layers.
• OLAP cubes are very flexible because they allow the user to move information between these three dimensions.
• Users can have multiple cubes for their business data: one cube for customers, one for sales, one for production, one for geography, etc.
Basic operations with OLAP cubes
• The following are basic operations that can be performed with OLAP cubes:
  – Slice
  – Dice
  – Roll-up
  – Drill down
  – Pivoting
Slicing OLAP cubes
• The slice operation is based on selecting one dimension and focusing on a portion of a cube. For example, the following table presents seven statistics for one variable.
Dicing OLAP cubes
• The dice operation creates a sub-cube by focusing on two or more dimensions.
Rolling-up OLAP cubes
• Roll-up, also called aggregation or dimension reduction, allows the user to move to a higher aggregation level. For example, instead of aggregating data by county, the user can select the whole province level.
Drilling-down OLAP cubes
• The drill-down operation is the reverse of a roll-up and represents the situation when the user moves down the hierarchy of aggregation, applying a more detailed grouping.
Pivoting OLAP cubes
• Pivoting, or rotation, changes the perspective in presenting the data to the user.
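The five cube operations described in the preceding slides can be imitated on a small fact table with plain Python. Real OLAP servers precompute and store aggregates; this sketch only shows what each operation returns. The fact records, the dimension names, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: one record per county, month, and product
FACTS = [
    {"province": "Nova Scotia", "county": "Halifax", "month": "April", "units": 120},
    {"province": "Nova Scotia", "county": "Halifax", "month": "May", "units": 80},
    {"province": "Nova Scotia", "county": "Annapolis", "month": "April", "units": 30},
    {"province": "New Brunswick", "county": "Westmorland", "month": "April", "units": 50},
]


def aggregate(facts, dims, measure="units"):
    """Sum the measure for every combination of the given dimensions."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[measure]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim] == value]


def dice(facts, **allowed):
    """Dice: keep a sub-cube by restricting two or more dimensions."""
    return [f for f in facts if all(f[d] in vals for d, vals in allowed.items())]


roll_up = aggregate(FACTS, ("province",))              # higher aggregation level
drill_down = aggregate(FACTS, ("province", "county"))  # more detailed grouping
pivoted = aggregate(FACTS, ("month", "province"))      # same data, rotated view
```

Because `province` and `county` are geographic dimensions, the roll-up and drill-down results correspond directly to maps at two levels of administrative units.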
OLAP cubes and spatial warehouses
• So far, the integration of OLAP technology and geographical analysis has been very limited. Most of the integration has taken place in building and using so-called spatial warehouses for visualization of results. The methodology based on similarities between OLAP cubes and map cubes is very promising.
• Whereas non-spatial warehouses utilize OLAP cubes, mostly as summary tables or spreadsheets, spatial warehouses provide map cubes (collections of maps).
OLAP cubes vs. map cubes
• The different nature of OLAP cubes and map cubes also results from different types of aggregation operations.
• Such statistics as arithmetic mean, median, mode, standard deviation, minimum, maximum, count, or sum have their equivalents in so-called map algebra for raster data types.
• In addition, spatial queries utilizing geometric operators (area, perimeter, centroid) and topological operators (inside, within, intersect, contains, connects, borders) significantly supplement non-spatial SQL queries.
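The map algebra equivalents of these statistics are cell-by-cell ("local") operations over a stack of aligned rasters. The sketch below assumes each raster is a list of rows with identical dimensions and no missing cells; the function name `local_statistic` and the sample grids are illustrative only.

```python
import statistics


def local_statistic(layers, func):
    """Apply func to the stack of values found at each cell position."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [[func([layer[r][c] for layer in layers]) for c in range(cols)]
            for r in range(rows)]


# Two hypothetical 2 x 2 rasters covering the same area
layers = [
    [[1, 2], [3, 4]],
    [[3, 2], [1, 8]],
]
cell_sum = local_statistic(layers, sum)
cell_max = local_statistic(layers, max)
cell_mean = local_statistic(layers, statistics.mean)
```

Where an OLAP cube aggregates records into a summary table, the same statistic applied through map algebra produces a new map layer.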
Integrating OLAP with GIS
• A bridge between OLAP and GIS should allow OLAP users, who deal with geographical data, to display OLAP cubes as maps using GIS.
• The foundation for going in the opposite direction (from GIS to OLAP) should allow GIS users to create SQL queries with geometric and topological operators and then pass this information to OLAP cubes.
• The final step of the integration would be allowing the user to interactively browse OLAP cubes and to simultaneously view the results on maps. Also, the user should be able to query a map and view corresponding data within OLAP cubes.
Two disjoint technologies
• Today, GIS and data mining are still used as separate technologies.
• If GIS and data mining software packages operate under the same operating system, data can be passed easily but still indirectly between such packages.
• The recent emphasis on interoperability in GIS should be extended beyond GIS technology.
The most typical sequence of operations
• The most typical sequence of operations using GIS and data mining is as follows:
1. Data preparation including data cleansing (DM)
2. Data preparation including deriving new geographical attributes (GIS)
3. Spatial analysis (GIS)
4. Modeling (DM)
5. Validation (DM)
6. Mapping initial results and spatial validation (GIS)
7. Charting and interpreting results (DM)
8. Mapping final results (GIS)
Future directions
• The future directions in spatial data mining can be summarized as follows (after Koperski, 1997):
  – Integrating artificial intelligence and GIS
  – Data mining using spatial object-oriented databases
  – Query language for spatial data mining
  – Creating multidimensional spatial rules
  – Mining under uncertainty
  – Spatial clustering
  – Visualization using multivariate thematic maps
  – Parallel data mining
  – Mining the presence of topological and geometric errors
  – Mining the remote sensing data
  – Mining spatiotemporal databases
Selected bibliography
• CRISP-DM 1.0, 1999. SPSS.
• Dramowicz K., 2002. Adding Geography to Data Mining. Data Mining Summit, Reston, VA.
• Dunham M.H., 2003. Data Mining: Introductory and Advanced Topics. Prentice Hall.
• Eklund P.W. et al., 1998. Data Mining and Soil Salinity Analysis. International Journal of Geographical Information Science, 12, pp. 247-268.
• Ester M. et al., 1998. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data Mining and Knowledge Discovery, 4, 2/3, pp. 193-216.
• Gahegan M., 2000. On the Application of Inductive Machine Learning Tools to Geographical Analysis. Geographical Analysis, 2, pp. 113-139.
• Kantardzic M., 2003. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley.
• Koperski K. et al., 1997. Spatial Data Mining: Progress and Challenge. http://db.cs.sfu.ca/GeoMiner/survey/html/survy.html
• Koperski K., J. Han, 1995. Discovery of Spatial Association Rules in Geographic Information Databases. [In:] Egenhofer M., J. Herring (eds.) Advances in Spatial Databases. Springer-Verlag, pp. 47-66.
• Miller H.J. and J. Han (eds.), 2001. Geographic Data Mining and Knowledge Discovery. Taylor and Francis.
• Openshaw S., 1999. Geographic Data Mining: Key Design Issues. 4th International Conference on GeoComputation.