Finding Regional Co-Location Patterns for Sets of Continuous Variables in Spatial Datasets

Christoph Eick (University of Houston, USA), Rachana Parmar (University of Houston, USA), Wei Ding (University of Massachusetts at Boston, USA), Tomasz Stepinski (Lunar and Planetary Institute, Houston, USA), Jean-Phillippe Nicot (Bureau of Economic Geology, University of Texas, Austin, USA)

Data Mining & Machine Learning Group, CS@UH

ACM-GIS08, Irvine (CA), November 6, 2008



Talk Outline

1. Introduction: Co-location Mining
2. Clustering with Plug-in Fitness Functions
3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
4. Case Study: Arsenic Pollution in the Texas Water Wells
5. CLEVER: A Representative-based Clustering Algorithm
6. Conclusion


1. Introduction

“Spatial co-locations represent the subsets of features which are frequently located together in geographic space” [Shekhar]

Most of the past research centers on finding categorical co-location patterns which are global.

However, many real-world datasets contain continuous variables, and global knowledge may be inconsistent with regional knowledge.


Regional Co-location Mining

Goal: To discover regional co-location patterns involving continuous variables in which the continuous variables take values from the wings of their statistical distribution.

A novel framework that operates in the continuous domain is proposed to accomplish this goal.

Dataset: (longitude, latitude, <concentrations>+)


Why is Regional Knowledge Important in Spatial Data Mining?

A special challenge in spatial data mining is that information is usually not uniformly distributed in spatial datasets.

It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99].

Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional or local scale rather than a global scale.


Related Work

Shekhar et al. discuss several interesting approaches to mine spatial co-location patterns of categorical features.

Huang et al. address the problem of mining co-location patterns with rare features.

Srikant and Agrawal use discretization of continuous variables to form categorical variables on which classical association rule mining is applied.

Calders et al. introduce an approach that uses rank correlation to mine quantitative association rules.

Achtert et al. give a method to derive quantitative, non-spatial models to describe correlation clusters.


2. Clustering with Plug-in Fitness Functions

Motivation:

Finding subgroups in geo-referenced datasets has many applications.

However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation.

Domain knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup.

Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions that allow domain experts to express desirable characteristics of subgroups they are looking for.

Only very few clustering algorithms published in the literature provide plug-in fitness functions; consequently existing clustering paradigms have to be modified and extended by our research to provide such capabilities.


Current Suite of Spatial Clustering Algorithms

Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
Grid-based: SCMRG
Agglomerative: MOSAIC
Density-based: SCDE, DCONTOUR (not really plug-in, but some fitness functions can be simulated)

[Figure: families of clustering algorithms: density-based, agglomerative, representative-based, grid-based]

Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.


Spatial Clustering Algorithms (Continued)

Datasets are assumed to have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude; <chemical concentrations>+)

Clusters are found in the subspace of the spatial attributes and are called regions in the following.

The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.

Clustering algorithms are assumed to maximize reward-based fitness functions that have the following structure:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{\beta}$$

where i(c) is the interestingness of cluster c, |c| is its size, and β is a parameter that determines the premium put on cluster size (larger values lead to fewer, larger clusters).
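The reward-based fitness function q(X) described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the tuple layout (longitude, latitude, value), and the toy interestingness function `mean_value` are assumptions made for the example and do not come from the talk.

```python
def clustering_fitness(clusters, interestingness, beta=1.3):
    """q(X): sum over all clusters c of interestingness(c) * |c| ** beta."""
    return sum(interestingness(c) * len(c) ** beta for c in clusters)


def mean_value(cluster):
    """Hypothetical plug-in interestingness: mean of the non-spatial value."""
    return sum(obj[2] for obj in cluster) / len(cluster)


# Each object is (longitude, latitude, value); the numbers are made up.
clusters = [
    [(0.0, 0.0, 2.0), (0.1, 0.0, 4.0)],  # size 2, mean value 3.0
    [(5.0, 5.0, 1.0)],                   # size 1, mean value 1.0
]

q_beta_1 = clustering_fitness(clusters, mean_value, beta=1.0)  # 3*2 + 1*1
q_beta_2 = clustering_fitness(clusters, mean_value, beta=2.0)  # 3*4 + 1*1
print(q_beta_1, q_beta_2)
```

Raising beta increases the weight of the larger cluster, which is the sense in which larger values of β favor fewer, larger clusters.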


3. An Interestingness Measure for Co-location Mining Involving Continuous Variables

The goal is to discover interesting regions with interesting co-location patterns.

Clustering algorithms that maximize fitness functions of the following form already exist:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{\beta}$$

To use those algorithms for this task, an interestingness measure i(c) has to be designed.


Co-location Measure for Continuous Variables

Products of z-scores of continuous variables are used to measure the interestingness of co-location patterns.

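Pattern A↑: attribute A has high values. Pattern A↓: attribute A has low values.

$$z(A{\uparrow}, o) = \begin{cases} \text{z-score}(A, o) & \text{if } \text{z-score}(A, o) > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$z(A{\downarrow}, o) = \begin{cases} -\text{z-score}(A, o) & \text{if } \text{z-score}(A, o) < 0 \\ 0 & \text{otherwise} \end{cases}$$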
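The cut-off z-scores for the "high" and "low" patterns of an attribute can be sketched as follows. This is a minimal illustration, assuming z-scores are computed with the population standard deviation (the slides do not say which estimator is used) and that the attribute is not constant; the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values; assumes a non-constant attribute (sigma > 0)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def z_high(z):
    """Cut-off z-score for the pattern 'attribute has high values'."""
    return z if z > 0 else 0.0


def z_low(z):
    """Cut-off z-score for the pattern 'attribute has low values'."""
    return -z if z < 0 else 0.0


concentrations = [1.0, 2.0, 3.0]  # made-up values for one attribute
zs = z_scores(concentrations)
print([z_high(z) for z in zs])    # only the above-average object is non-zero
print([z_low(z) for z in zs])     # only the below-average object is non-zero
```

Both cut-off scores are non-negative, so a product of several of them is positive only for objects that exhibit every pattern in the set.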


Interestingness of a Pattern

Interestingness of a pattern B (e.g. B = {C↑, D↓, E↑}) for an object o:

$$i(B, o) = \prod_{p \in B} z(p, o)$$

Interestingness of a pattern B for a region c:

$$i(B, c) = \frac{\sum_{o \in c} i(B, o)}{|c|} \cdot \mathrm{purity}(B, c)^{\theta}$$

Remark: Purity (i(B,o)>0) measures the percentage of objects that exhibit pattern B in region c.
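The object-level and region-level interestingness of a pattern can be sketched as below. The data layout is an assumption made for illustration: each object is a dict mapping attribute names to z-scores, and a pattern is a tuple of (attribute, direction) pairs with direction "up" or "down". The exponent `theta` on purity follows the parameter θ mentioned in the summary.

```python
from math import prod


def z_cut(z, direction):
    """Cut-off z-score: keep only the side of the distribution asked for."""
    if direction == "up":
        return z if z > 0 else 0.0
    return -z if z < 0 else 0.0


def object_interestingness(pattern, obj):
    """i(B, o): product of the cut-off z-scores of all patterns in B."""
    return prod(z_cut(obj[attr], direction) for attr, direction in pattern)


def purity(pattern, region):
    """Fraction of objects in the region with i(B, o) > 0."""
    hits = sum(1 for o in region if object_interestingness(pattern, o) > 0)
    return hits / len(region)


def pattern_interestingness(pattern, region, theta=1.0):
    """i(B, c): average product, weighted by purity raised to theta."""
    avg = sum(object_interestingness(pattern, o) for o in region) / len(region)
    return avg * purity(pattern, region) ** theta


# Made-up z-scores for two wells.
region = [{"As": 2.0, "Mo": 1.0}, {"As": 1.0, "Mo": -0.5}]
pattern = (("As", "up"), ("Mo", "up"))
print(pattern_interestingness(pattern, region, theta=1.0))
print(pattern_interestingness(pattern, region, theta=2.0))
```

Only the first well exhibits the pattern, so purity is 0.5; a larger θ penalizes this impure region more strongly.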


Region Interestingness

Region interestingness is assessed by computing the most prevalent pattern:

$$i(c) = \max_{B \subseteq S \;\&\; |B| > 1 \;\&\; P(B)} i(B, c)$$

Region interestingness solely depends on the most interesting co-location set for the region.
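A brute-force way to evaluate region interestingness is to enumerate candidate pattern sets and keep the best one. The sketch below is an assumption-laden illustration, not the authors' implementation: it enumerates all sets of 2 to `max_size` base patterns, filters them with a user-supplied predicate (here, "the set mentions arsenic"), and uses a hypothetical dict-of-z-scores layout for objects.

```python
from itertools import combinations
from math import prod


def z_cut(z, direction):
    """Cut-off z-score for an 'up' (high) or 'down' (low) pattern."""
    if direction == "up":
        return z if z > 0 else 0.0
    return -z if z < 0 else 0.0


def pattern_interestingness(pattern, region, theta=1.0):
    """Average product of cut-off z-scores times purity ** theta."""
    products = [prod(z_cut(o[a], d) for a, d in pattern) for o in region]
    purity = sum(p > 0 for p in products) / len(region)
    return (sum(products) / len(region)) * purity ** theta


def region_interestingness(region, base_patterns, predicate,
                           max_size=4, theta=1.0):
    """Return (best value, best pattern set) over all admissible sets."""
    best_value, best_pattern = 0.0, None
    for size in range(2, max_size + 1):
        for pattern in combinations(base_patterns, size):
            if not predicate(pattern):
                continue
            value = pattern_interestingness(pattern, region, theta)
            if value > best_value:
                best_value, best_pattern = value, pattern
    return best_value, best_pattern


attributes = ["As", "Mo", "V"]
base_patterns = [(a, d) for a in attributes for d in ("up", "down")]


def mentions_arsenic(pattern):
    return any(attr == "As" for attr, _ in pattern)


# Made-up z-scores for two wells.
region = [
    {"As": 2.0, "Mo": 1.0, "V": 1.0},
    {"As": 1.0, "Mo": 2.0, "V": -1.0},
]
value, pattern = region_interestingness(region, base_patterns, mentions_arsenic)
print(value, pattern)
```

Exhaustive enumeration grows quickly with the number of attributes, which is one reason the size of the pattern sets is bounded in the experiments.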


Example of a Result

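| Exp. No. | Rank (Top 5 Regions) | Region Size | Region Reward | Maximum Valued Pattern in the Region | Purity | Average Product for Maximum Valued Pattern |
|---|---|---|---|---|---|---|
| Exp. 1 | 1 | 23 | 174.3191 | As, Mo, V, F- | 0.83 | 211.0179 |
| Exp. 1 | 2 | 40 | 104.8576 | As, Mo, V | 0.65 | 161.3194 |
| Exp. 1 | 3 | 11 | 92.9385 | As, Mo, V, SO4^2- | 0.55 | 170.3873 |
| Exp. 1 | 4 | 36 | 89.4068 | As, B, Cl-, TDS | 0.58 | 153.2687 |
| Exp. 1 | 5 | 7 | 30.5775 | As, Mo, Cl-, TDS | 0.57 | 53.5107 |

All experiments: P(B) = (As↑ ∈ B or As↓ ∈ B) and |B| < 5.

Experiment 1: β = 1.3, θ = 1.0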


Summary

Pattern interestingness in a region is evaluated using products of (cut-off) z-scores. In general, products of z-scores measure correlation.

Additionally, purity is taken into account; its influence is controlled by a parameter θ.

Finally, the parameter β determines how much premium is put on the size of a region when computing region rewards.


4. Case Study


Arsenic Water Pollution Problem

Arsenic pollution is a serious problem in the Texas water supply.

It is hard to explain what causes arsenic pollution to occur.

Several datasets were created using the Ground Water Database (GWDB) of the Texas Water Development Board (TWDB), which tests water wells regularly; one of these datasets was used in the experimental evaluation in the paper:
All wells have non-null samples for arsenic.
Multiple sample values are aggregated using avg/max functions.
Other chemicals may have null values.

Format: (Longitude, Latitude, <z-values of chemical concentrations>)
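The preprocessing described on this slide (aggregating multiple samples per well, then converting concentrations to z-values) might look roughly like the following. The well identifiers, sample values, and function names are invented for illustration, and the use of the population standard deviation is an assumption; the actual GWDB processing is not described in the slides.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_samples(samples, how=max):
    """Collapse repeated (well_id, value) samples to one value per well."""
    per_well = defaultdict(list)
    for well_id, value in samples:
        per_well[well_id].append(value)
    return {well_id: how(values) for well_id, values in per_well.items()}


def to_z_values(per_well):
    """Replace each well's concentration by its z-value across all wells."""
    values = list(per_well.values())
    mu, sigma = mean(values), pstdev(values)
    return {well_id: (v - mu) / sigma for well_id, v in per_well.items()}


# Hypothetical arsenic samples: (well id, concentration).
samples = [("w1", 10.0), ("w1", 30.0), ("w2", 20.0), ("w3", 40.0)]

by_max = aggregate_samples(samples, how=max)
by_avg = aggregate_samples(samples, how=mean)
z_values = to_z_values(by_max)
print(by_max, by_avg, z_values)
```

Choosing `max` versus the average changes which wells stand out, so the aggregation function is itself a modeling decision.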


Interesting Observations

High arsenic is a well-known problem in the Southern Ogallala aquifer in the Texas Panhandle and in the Southern Gulf Coast aquifer.

The co-location mining framework was able to identify regions in these areas; for example, for β=1.3, θ=1.0:
The rank 1, 2 and 3 regions are in the Ogallala aquifer.
The rank 4 region is in the Gulf Coast aquifer.

The approach not only identified that high arsenic is associated with high vanadium and molybdenum but was also able to discriminate against companion elements like sulfate and fluoride.


Interesting Observations (Continued)

For β=1.5, the extent of arsenic contamination in Texas (Ogallala Aquifer, Southern Gulf Coast, and West Texas basins) could be recognized.

For β=2.0, the loosening of the cluster definition results in a display of the known, often described as sharp, boundaries between high and low arsenic areas in the Ogallala (Ranks 2 and 4) and the Gulf Coast (Ranks 1 and 3) aquifers.

In general, for β=1.3 and β=1.5 the discovered regions tend to lie inside Texas aquifers, which is expected, because wells inside the same aquifer are connected by water flow.

The algorithm also finds some inconsistent co-location sets. For example, for β=1.5, the rank 3 region in West Texas has high arsenic co-located with high chloride, while the rank 4 region in South Texas has low arsenic co-located with high chloride, which can be attributed to geographical differences between the regions.

When θ is increased to 5, not surprisingly all top regions have purities of 90% or above.


Example: Differences in Results, Medium/High Rewards for Purity

Table 5. Top 5 regions ranked by reward (as per formula 8).

| Exp. No. | Rank (Top 5 Regions) | Region Size | Region Reward | Maximum Valued Pattern in the Region | Purity | Average Product for Maximum Valued Pattern |
|---|---|---|---|---|---|---|
| Exp. 2 | 1 | 181 | 61684.5323 | As, Mo, V, F- | 0.49 | 52.1019 |
| Exp. 2 | 2 | 80 | 24040.6315 | As, B, Cl-, TDS | 0.48 | 70.7322 |
| Exp. 2 | 3 | 467 | 1884.8856 | As, TDS | 0.91 | 0.2047 |
| Exp. 2 | 4 | 23 | 701.7072 | As, Cl-, SO4^2-, TDS | 0.78 | 8.1287 |
| Exp. 2 | 5 | 189 | 587.9790 | As, F- | 0.78 | 0.2909 |
| Exp. 4 | 1 | 7 | 11669.7965 | As, B, Cl-, TDS | 1.0 | 630.1097 |
| Exp. 4 | 2 | 117 | 10407.3250 | As, V, F- | 0.91 | 12.8550 |
| Exp. 4 | 3 | 4 | 2203.2526 | As, V, SO4^2-, TDS | 1.0 | 275.4066 |
| Exp. 4 | 4 | 2 | 1531.4887 | As, Mo, V, B | 1.0 | 541.4630 |
| Exp. 4 | 5 | 530 | 1426.9140 | As, TDS | 0.90 | 0.1939 |

All: (As↑ ∈ B or As↓ ∈ B) and |B| < 5

Experiment 2: β = 1.5, θ = 1.0

Experiment 4: β = 1.5, θ = 1.0


High Reward Regions for θ=1 and θ=5

[Figure: two panels showing high-reward regions, one for θ=1 and one for θ=5]


Challenges

This is a kind of “seeking a needle in a haystack” problem, because we search for both interesting places and interesting patterns.

The interestingness measure is not anti-monotone: a superset of a co-location set might be more interesting.

Only considering the maximum valued pattern when evaluating regions is somewhat crude (employed solution: use seeded patterns and run the algorithm multiple times).

Observation: different fitness function parameter settings lead to quite different results, many of which are valuable to domain experts.

New challenge: the results of many runs have to be analyzed, which is a lot of manual labor; a tool is needed for that.


Representative-based Clustering

[Figure: objects plotted over Attribute1 and Attribute2, with labels 1 to 4]

Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).

Properties: Cluster shapes are convex polygons.

Popular Algorithms: K-means, K-medoids
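How a set of representatives induces a clustering can be sketched as follows: every object joins the cluster of its nearest representative. The function name and the 2-D toy points are assumptions for illustration; Euclidean distance is assumed as the distance function.

```python
from math import dist


def assign_to_representatives(objects, representatives):
    """Build one cluster per representative from nearest-neighbor assignment."""
    clusters = [[] for _ in representatives]
    for obj in objects:
        nearest = min(
            range(len(representatives)),
            key=lambda i: dist(obj, representatives[i]),
        )
        clusters[nearest].append(obj)
    return clusters


# Made-up points; two of them serve as representatives.
objects = [(0, 0), (1, 0), (9, 9), (10, 10)]
representatives = [(0, 0), (10, 10)]
clusters = assign_to_representatives(objects, representatives)
print(clusters)
```

Because each object goes to its closest representative, the clusters correspond to the cells of a Voronoi partition, which is why cluster shapes are convex polygons.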


5. CLEVER (ClustEring using representatiVEs and Randomized hill climbing)

Is a representative-based, sometimes called prototype-based clustering algorithm

Uses a variable number of clusters and larger neighborhood sizes to battle premature termination, and uses randomized hill climbing and adaptive sampling to reduce complexity.

Searches for optimal number of clusters
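The general idea of randomized hill climbing over sets of representatives can be sketched as below. This is not CLEVER itself: the neighborhood operators (insert, delete, replace one representative), the fixed number of sampled neighbors, the stopping rule, and the toy fitness function are all simplifying assumptions made for this example, and adaptive sampling is omitted.

```python
import random
from math import dist


def assign(objects, reps):
    """Assign each (x, y, value) object to its spatially nearest representative."""
    clusters = [[] for _ in reps]
    for o in objects:
        i = min(range(len(reps)), key=lambda j: dist(o[:2], reps[j][:2]))
        clusters[i].append(o)
    return clusters


def reward_fitness(clusters, beta=1.5):
    """Toy q(X): interestingness is the mean value, clipped at zero."""
    total = 0.0
    for c in clusters:
        if c:
            avg = sum(o[2] for o in c) / len(c)
            total += max(avg, 0.0) * len(c) ** beta
    return total


def neighbor(reps, objects, rng):
    """Randomly insert, delete, or replace one representative."""
    reps = list(reps)
    candidates = [o for o in objects if o not in reps]
    ops = (["insert", "replace"] if candidates else []) + \
          (["delete"] if len(reps) > 1 else [])
    if not ops:
        return reps
    op = rng.choice(ops)
    if op == "insert":
        reps.append(rng.choice(candidates))
    elif op == "delete":
        reps.remove(rng.choice(reps))
    else:
        reps[rng.randrange(len(reps))] = rng.choice(candidates)
    return reps


def hill_climb(objects, fitness, k0=2, samples=20, seed=0):
    """Sample neighbors; move on the first improvement; stop when none is found."""
    rng = random.Random(seed)
    current = rng.sample(objects, k0)
    best_q = fitness(assign(objects, current))
    improved = True
    while improved:
        improved = False
        for _ in range(samples):
            candidate = neighbor(current, objects, rng)
            q = fitness(assign(objects, candidate))
            if q > best_q:
                current, best_q, improved = candidate, q, True
                break
    return current, best_q


objects = [(0.0, 0.0, 2.0), (1.0, 0.0, 3.0), (0.0, 1.0, 2.5),
           (10.0, 10.0, -1.0), (11.0, 10.0, -2.0), (10.0, 11.0, 4.0)]
reps, q = hill_climb(objects, reward_fitness)
print(len(reps), q)
```

Because insert and delete moves change the number of representatives, the search is not tied to a fixed number of clusters.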


6. Summary

A novel framework for mining co-location patterns involving multiple continuous variables with values from the wings of their statistical distribution is proposed.

Regional co-location mining is approached as a clustering problem in which a reward-based fitness function has to be maximized.

The approach was successfully applied in a real world case study involving arsenic contamination. The case study revealed known areas of arsenic contamination and also some unknown areas with interesting features. Different parameters lead to characterization of arsenic patterns at different scales.

In general, the regional co-location mining framework has been valuable to domain experts in that it provided a data-driven approach that suggests promising hypotheses for future research.

A novel prototype-based clustering algorithm named CLEVER was also introduced.


References

S. Shekhar and Y. Huang, “Discovering spatial co-location patterns: A summary of results,” Lecture Notes in Computer Science, vol. 2121, pp. 236+, 2001.

Y. Huang, J. Pei, and H. Xiong, “Mining co-location patterns with rare events from spatial data sets,” Geoinformatica, vol. 10, no. 3, pp. 239–260, 2006.

R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data. New York, NY, USA: ACM, 1996, pp. 1–12.

T. Calders, B. Goethals, and S. Jaroszewicz, “Mining rank-correlated sets of numerical attributes,” in KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2006, pp. 96–105.

E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, and A. Zimek, “Deriving quantitative models for correlation clusters,” in KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2006, pp. 4–13.

C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, “Discovery of interesting regions in spatial datasets using supervised clustering,” in Proceedings of the 10th European conference on Principles of Data Mining and Knowledge discovery, 2006.


Region Discovery Framework

Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets. Treats region discovery as a clustering problem.


Region Discovery Framework Continued

The clustering algorithms we currently investigate solve the following problem:

Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates a clustering X = {c_1, …, c_k} as follows:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{\beta} \quad \text{with } \beta > 1$$

Objective: Find c_1, …, c_k ⊆ O such that:
1. c_i ∩ c_j = ∅ if i ≠ j
2. X = {c_1, …, c_k} maximizes q(X)
3. All clusters c_i ∈ X are contiguous in the spatial subspace
4. c_1 ∪ … ∪ c_k ⊆ O
5. c_1, …, c_k are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
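Conditions 1 and 4 (disjoint clusters drawn from O) and the ranking step in condition 5 can be checked with a short sketch. The object identifiers, the lookup table of values, and the `min_reward` threshold are hypothetical; spatial contiguity (condition 3) is not checked here.

```python
def is_valid_clustering(clusters, dataset):
    """True if clusters are pairwise disjoint and contain only dataset objects."""
    seen = set()
    for cluster in clusters:
        for obj in cluster:
            if obj in seen or obj not in dataset:
                return False
            seen.add(obj)
    return True


def rank_by_reward(clusters, interestingness, beta=1.3, min_reward=0.0):
    """Return (reward, cluster) pairs above min_reward, highest reward first."""
    rewarded = [(interestingness(c) * len(c) ** beta, c) for c in clusters]
    kept = [rc for rc in rewarded if rc[0] > min_reward]
    return sorted(kept, key=lambda rc: rc[0], reverse=True)


# Hypothetical dataset of object ids with one made-up value each.
values = {1: 1.0, 2: 3.0, 3: 0.0, 4: 5.0, 5: 5.0}
dataset = set(values)


def mean_value(cluster):
    return sum(values[o] for o in cluster) / len(cluster)


clusters = [(1, 2), (3,), (4, 5)]
valid = is_valid_clustering(clusters, dataset)
ranking = rank_by_reward(clusters, mean_value, beta=1.0)
print(valid, ranking)
```

Note that the union of the clusters may be a proper subset of O, so objects that belong to no interesting region can simply be left out.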