
Data Mining on a bird habitat dataset

Pusheng Zhang
Department of Computer Science & Engineering,

University of Minnesota, MN 55455 [email protected]

1. Introduction

Widespread use of spatial databases is leading to an increasing interest in mining interesting and useful but implicit spatial patterns. In addition to non-spatial attributes, a spatial dataset is more complex and includes extended objects such as coordinates, points, polygons, and cells in a grid. The values of attributes referenced by spatial locations tend to vary gradually over space, while classical data mining techniques, either explicitly or implicitly, assume that the data is independently generated. Moreover, the spatial distributions of attributes sometimes have distinct local trends that contradict the global trends. This is most vivid in Figure 1(b), where the spatial distribution of vegetation durability is jagged in the western section of the wetland compared to the overall impression of uniformity across the wetland. Thus spatial data is not only not independent, it is also not identically distributed.

Figure 1: (a) Training dataset: The geometry of the marshland and the locations of nests, (b) The spatial distribution of vegetation durability over the marshland, (c) The spatial distribution of water depth, and (d) The spatial distribution of distance to open water

We are given data about two wetlands on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird. The wetlands are named Darr and Stubble, and the data was collected from April to June in two successive years, 1995 and 1996. For the purpose of data collection, a grid of roughly 5000 cells was superimposed on each wetland. The cells of the grid had square geometries of size 5 meters by 5 meters. In each cell the values of several structural and environmental factors were


recorded, including vegetation durability, stem density, stem height, distance to open water, distance to edge, and water depth. For each cell it was also recorded whether a red-winged blackbird nest was present. These attributes are shown in Table 1.

Attribute | Type | Description
Vegetation Durability (VD) | Numeric | Scale from 10 to 100
Stem Density (SD) | Numeric | In number of stems per square meter
Stem Height (SH) | Numeric | In centimeters above water
Distance to Open Water (DOP) | Numeric | In meters
Distance to Edge (DOE) | Numeric | In meters
Water Depth (WD) | Numeric | In centimeters
Nest | Binary | Records the presence/absence of a bird nest in the cell

Table 1: The non-spatial attributes in the dataset

For the bird habitat dataset, our collaborators have put forward solutions [1,2] for some of these problems. Here we propose techniques to improve the performance of data mining on this dataset, and we expect these techniques to carry over to many domains, including ecology and environment management, public safety, transportation, business logistics, and tourism.

2. Challenges in the dataset

There are several challenges in this dataset; I will categorize them into spatial issues and non-spatial issues.

For the spatial nature of this dataset, the challenges are the following:

1. Neighboring regions tend to have the same or similar properties.
2. Nests are not distributed everywhere, i.e., they are not identically distributed.
3. Nests tend to be close together, yet not too crowded. Thus even though some locations near nests are nest worthy, they are mislabeled as "Non-Nest".
4. Measures of prediction accuracy.

First, classical data mining deals with numbers and categories. In contrast, a spatial dataset is more complex and includes extended objects such as coordinates, points, and cells in a grid. Second, classical data mining algorithms often make assumptions (e.g., independent, identical distributions) and treat each input as independent of other inputs, which violates the first law of geography: everything is related to everything else, but nearby things are more related than distant things [3]. In other words, the values of attributes of nearby spatial objects tend to systematically affect each other. In spatial statistics, an area within statistics devoted to the analysis of spatial data, this is called spatial autocorrelation [4], and spatial patterns often exhibit continuity and high autocorrelation among nearby features. However, some instances are mislabeled in this dataset. For example, some locations near nests may be nest worthy, yet they are labeled "Non-Nest" because no real nests were present there during data collection, perhaps simply because the birds avoid overcrowding when constructing nests even though these locations are also very suitable. We can observe in Figure 2 that some nest worthy locations around the actual nests are mislabeled as non-nest.


Figure 2: Mislabeling of some nest worthy locations as the non-nest class in the dataset

Moreover, the measure of spatial accuracy may be substantially different from classical measures. For a binary class problem, the standard way to measure classification accuracy is to calculate the percentage of correctly classified objects. This measure may not be the most suitable for spatial data: e.g., using the classical accuracy measure in Figure 3, the accuracies for the two models are the same. However, domain experts prefer (c) over (b), since the predicted nest locations are closer on average to actual nest locations. The classical accuracy measure cannot distinguish between Figure 3(b) and 3(c), so we need a measure of spatial accuracy to capture this preference.

Figure 3: (a) The pixels with actual nests, (b) Locations predicted by a model, (c) Locations predicted by another model. Prediction (c) is spatially more accurate than (b)

In addition to the challenges due to the spatial nature of the dataset, we also have some general challenges for the non-spatial issues:

5. Different class sizes.
6. Outliers.
7. Relative thresholds across different regions.
8. Temporal nature.

In this dataset, nest instances make up only a small proportion; the majority of the records are non-nest locations. Since we are interested precisely in the nest pattern, it is very difficult to capture the features of the nests. Furthermore, since we use C4.5 classification, whose performance suffers under such unequal class sizes, we have to develop techniques to balance the numbers of the two classes.

Furthermore, there are outliers and noise in the dataset. By noise we mean contradictory records: those with the same (or very similar) values for the training attributes but belonging to different classes. This is similar to the mislabeling scenario associated with the spatial nature. By outliers we mean records whose attribute values are far from those of nearby points; e.g., there are some outliers on the upper margin in Figure 4 below.


Figure 4: Outliers in the attribute "Distance to Edge" distribution for the dataset Darr96

Different regions have different geometries and environmental conditions, and birds choose the local maxima of these attributes for their nests. For example, vegetation durability is one of the key attributes in the dataset. Consider two regions with different distributions: if we build a learning model on the region in Figure 5(a), we may derive a rule like "vegetation durability >= 90" for nests, but that model will fail to predict nest locations in the region of Figure 5(b), even though the same qualitative pattern of vegetation durability holds for nests in both datasets. We have to capture the different thresholds across the different regions and datasets.

Figure 5: (a) Local maximum in one region, (b) Local maximum in another region

This dataset also raises temporal issues: the attribute distributions and nest distributions differ for the same wetland in the two successive years. In Figure 6 we can observe that the vegetation durability distributions in the two successive years are different, and Figure 7 shows that even the nest location distributions differ. It is very difficult to capture the temporal pattern.


Figure 6: (a) Vegetation durability distribution for the Darr wetland in 1995, (b) Vegetation durability distribution for the Darr wetland in 1996

Figure 7: Different nest locations distributions for the Darr wetland in 1995 and 1996

3. Related Work

Ozesmi [5,6] applied classical data mining techniques such as logistic regression and neural networks to build spatial habitat models. Logistic regression was used because the dependent variable is binary (nest/non-nest) and the logistic function "squashes" the real line into the unit interval; the values in the unit interval can then be interpreted as probabilities. They concluded that using logistic regression the nests could be classified at a rate 24% better than random. The use of neural networks actually decreased the classification accuracy but led to a better understanding of the interactions between the explanatory and the dependent variables. There are two important reasons why, despite extensive domain knowledge, the results of classical data mining are not satisfactory. First, classical techniques, e.g., logistic regression, assume independent distributions for the properties of each pixel, ignoring spatial autocorrelation. Second, a more subtle but equally important reason is the objective function: these studies still used the classical accuracy measure, which may not be suitable for this spatial dataset.


In the spatial autoregressive model, the spatial dependencies of the dependent variable are directly modeled in the regression equation [10]. Assuming that the dependent values are related to each other, i.e., $y_i = f(y_j)$, $i \neq j$, the regression formula can be defined as

$$y = \rho W y + X \beta + \epsilon$$

Here $W$ is the neighborhood relationship contiguity matrix and $\rho$ is the parameter that reflects the strength of the spatial dependencies between the elements of the dependent variable. Spatial autocorrelation measures depend crucially on the choice and design of the contiguity matrix W. The design of the matrix itself is predicated on determining what constitutes a neighborhood of influence. Two common choices are the four-neighborhood and the eight-neighborhood. Given a lattice structure and a point S in the lattice, the four-neighborhood assumes that S influences all cells that share an edge with S; in the eight-neighborhood, S influences all cells that share an edge or a vertex with S. A contiguity matrix is shown in Figure 8: the contiguity matrix of the uneven lattice (left) is shown on the right-hand side. The contiguity matrix plays a crucial role in the spatial extension of the regression model.

Figure 8: A spatial neighborhood and its contiguity matrix
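To make the neighborhood definition concrete, the following sketch builds a contiguity matrix W for a regular grid under the four- or eight-neighborhood assumption (a minimal illustration in Python with NumPy; the function name, the row-major cell indexing, and the row normalization convention are choices of this sketch, not prescribed by [10]):

import numpy as np

def contiguity_matrix(nrows, ncols, scheme="four"):
    # Build a row-normalized contiguity matrix W for an nrows x ncols grid.
    # Cells are indexed row-major: cell (r, c) -> r * ncols + c.
    # scheme="four" uses edge-sharing neighbors; "eight" adds vertex-sharing ones.
    if scheme == "four":
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    W[r * ncols + c, rr * ncols + cc] = 1.0
    # Row-normalize so each row sums to 1, a common convention in SAR models.
    W /= W.sum(axis=1, keepdims=True)
    return W

# Example: the eight-neighborhood contiguity matrix for a 3 x 3 lattice.
W = contiguity_matrix(3, 3, scheme="eight")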

We will refer to the regression formula above as the spatial autoregressive model (SAM). Notice that when $\rho = 0$ this equation collapses to the classical regression model. The benefits of modeling spatial autocorrelation are many: (1) the residual error will have much lower spatial autocorrelation; with the proper choice of W, the residual error should, at least theoretically, have no systematic variation. (2) If the spatial autocorrelation coefficient is statistically significant, it quantifies the presence of spatial autocorrelation, indicating the extent to which variation in the dependent variable (y) is explained by the average of neighboring observation values. (3) Finally, the model will have a better fit. SAM can deal with spatial challenges 1 and 2 and capture the spatial information in the regression. In [1,2], PLUMS, a framework for spatial data mining on this dataset, was proposed. PLUMS also uses the spatial autoregressive model with a proper contiguity matrix W. Moreover, it introduces a new spatial accuracy measure, ADNP (Average Distance to Nearest Prediction), which is defined as

$$ADNP(A, P) = \frac{1}{K} \sum_{k=1}^{K} d(A_k,\, A_k.nearest(P))$$

where $A$ is the set of actual nest locations, $P$ is the set of predicted nest locations, $A_k.nearest(P)$ denotes the predicted location nearest to the actual nest $A_k$, $d$ is distance, and $K$ is the number of actual nests.

In addition to challenges 1 and 2, PLUMS can deal with challenge 4 through this spatial accuracy measure.
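As an illustration of how ADNP could be computed, here is a minimal sketch, assuming actual and predicted nest locations are given as (row, column) cell coordinates and that d is Euclidean distance in cell units (both assumptions of this sketch):

import numpy as np

def adnp(actual, predicted):
    # Average Distance to Nearest Prediction.
    # actual: (K, 2) array of actual nest cells; predicted: (M, 2) array of
    # predicted nest cells. Returns the mean, over actual nests, of the
    # distance to the nearest predicted location.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Pairwise Euclidean distances between every actual nest and every prediction.
    d = np.linalg.norm(actual[:, None, :] - predicted[None, :, :], axis=2)
    return d.min(axis=1).mean()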

However, some challenges remain unsolved, so we propose the following techniques for this dataset:

- Smoothing using eight/four neighbors, the Gaussian distribution, and iterative relabeling algorithms to address challenges 1, 2, and 3 (mislabeling).
- Balancing the numbers of the two classes to address challenge 5 (different class sizes).
- Relabeling using iterative rule-based/clustering/regression algorithms to address challenge 3.
- For relative thresholds across the different regions (challenge 7), using peak selection within a given window and choosing different thresholds for the different regions in the raster. We can also add one more attribute to the dataset to capture the local ranking of each attribute and include it in the classification or regression.
- Eliminating the detected outliers for challenge 6.
- Temporal data mining for challenge 8.

In the following sections, I will elaborate on these techniques and show the experimental results.

4. Techniques used

4.1 Balancing the numbers of the two classes

In this dataset we are more interested in the nest patterns, but nest records make up only a small proportion; e.g., the dataset Darr95 has 5372 records in total, of which just 85 are nests. Given the unbalanced numbers of nest and non-nest records, we can give more weight to the nest records to help capture the nest patterns. One possible solution is to randomly sample the same number of records from the non-nest class and mine the sample plus the nest records; the problem is that we lose information by choosing only a fraction of the non-nest records. Another is to simply replicate the nest records according to the ratio between the number of non-nest records and nest records, which yields roughly equal numbers for the two classes. Here I use the second method.
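A minimal sketch of the replication approach (the helper name and NumPy representation are illustrative; the report itself does not specify an implementation):

import numpy as np

def balance_by_replication(X, y, minority_label=1):
    # Replicate minority-class records so the two classes are roughly equal.
    # X: (n, d) attribute matrix; y: (n,) binary labels (1 = nest).
    # The minority records are tiled by the majority/minority ratio, e.g.
    # ratio = 5287 // 85 = 62 for Darr95, giving 85 * 62 = 5270 nest records.
    minority = y == minority_label
    ratio = int((~minority).sum() // minority.sum())
    X_bal = np.vstack([X] + [X[minority]] * (ratio - 1))
    y_bal = np.concatenate([y] + [y[minority]] * (ratio - 1))
    return X_bal, y_bal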

Here I use the data from the Darr wetland in 1995 (Darr95) as the training dataset, which includes 85 nest locations out of 5372 records in total, and the data from the same year but the other wetland (Stubble95) as the testing dataset, which includes 30 nest locations out of 1818 records in total.

Using the original dataset Darr95 with C4.5 classification, we cannot capture any nest features or predict any nest locations: the model always predicts non-nest everywhere. The decision tree generated by C4.5, shown in Figure 9, is just one node classified as non-nest. Even though the accuracy of this model is not bad, 5287/5372 = 0.984, its recall is 0. Hence the model is useless, since it cannot predict any nest location.

Non-Nest

Figure 9: Decision tree for the classification on the original dataset Darr95

Actual \ Predicted | Nest | Non-Nest
Nest | 0 | 85
Non-Nest | 0 | 5287

Table 2: Classification matrix on the original dataset Darr95

Here I apply the balancing method to Darr95 with ratio = 62, which gives the balanced dataset shown in Table 3.

 | Nest Number | Non-Nest Number
Before Balancing | 85 | 5287
After Balancing | 5270 | 5287

Table 3: Datasets before and after balancing


Then we build the learning model on the balanced Darr95 dataset. The training results are very good, as shown in Table 4.

Actual \ Predicted | Nest | Non-Nest
Nest* | 5270 | 0
Non-Nest | 508 | 4779

Table 4: Training results for the balanced dataset of Darr95 (Precision = 0.91, Recall = 1)
* Note: Here the nest number is the number after balancing

Since we are mainly interested in the nest records, we prefer to keep recall high, and for the training data we get perfect recall. As for the testing results, we can see from Table 5 that the model predicts 6 out of 30 nest locations.

Actual \ Predicted | Nest | Non-Nest
Nest | 6 | 24
Non-Nest | 119 | 1669

Table 5: Testing results for dataset Stubble95 (Precision = 0.05, Recall = 0.2)

Compared to the poor classification performance on the original dataset, balancing the dataset achieves a large gain, especially for decision tree based classification methods. It is a very useful technique, and I will combine it with the other techniques to address the challenges.

4.2 Relabeling using neighborhood information

4.2.1 Algorithm descriptions

Since we know that some nest worthy locations near nests are mislabeled as the non-nest class, we want to relabel them as nest using neighborhood information. For the neighborhood, we can adopt the immediate four-neighbor or eight-neighbor scheme. We assign each nest cell a weight of 1 and assume that it affects its immediate four/eight neighbors, so we assign the neighbors proper weights according to their distance, as shown in Figure 10.

Figure 10: Weight distributions for a nest cell: (a) four-neighbor, (b) eight-neighbor

We can then calculate the cumulative weight for each non-nest cell and choose a threshold to discretize the results into the two binary classes, thereby claiming more nest locations. We can see from Figure 11 that the locations relabeled from non-nest to nest are always near several nests.
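A minimal sketch of this relabeling step, assuming a 0/1 nest raster; the specific neighbor weights (1/2 for edge neighbors, 1/(2*sqrt(2)) for diagonal ones) are one plausible reading of Figure 10, not the report's stated values:

import numpy as np

def relabel_by_neighbors(nest_grid, threshold=0.5, scheme="eight"):
    # Relabel non-nest cells whose cumulative neighbor weight reaches the threshold.
    # nest_grid: 2-D 0/1 array (1 = nest); each nest spreads weight to its
    # immediate four or eight neighbors.
    if scheme == "four":
        offsets = {(-1, 0): 0.5, (1, 0): 0.5, (0, -1): 0.5, (0, 1): 0.5}
    else:
        offsets = {(dr, dc): (0.5 if dr == 0 or dc == 0 else 0.5 / np.sqrt(2))
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)}
    weight = np.zeros(nest_grid.shape)
    for r, c in zip(*np.nonzero(nest_grid)):
        for (dr, dc), w in offsets.items():
            rr, cc = r + dr, c + dc
            if 0 <= rr < nest_grid.shape[0] and 0 <= cc < nest_grid.shape[1]:
                weight[rr, cc] += w
    relabeled = (nest_grid == 0) & (weight >= threshold)
    return np.where(relabeled, 1, nest_grid)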



Figure 11: After relabeling based on the eight-neighbor scheme, 22 non-nest locations near nests are relabeled as nest locations (threshold for discretization is 0.5) in dataset Darr95 (red are nest locations and blue are non-nest locations)

4.2.2 Experiment Results

Based on the relabeled dataset, we run C4.5 classification and obtain the following training and testing results, respectively.

Actual \ Predicted | Nest | Non-Nest
Nest* | 15 | 102
Non-Nest | 2 | 5253

Table 6: Training results for the relabeled dataset of Darr95 (Precision = 0.88, Recall = 0.128)
* Note: Here the nest number is the number after relabeling

Actual \ Predicted | Nest | Non-Nest
Nest | 5 | 25
Non-Nest | 154 | 1634

Table 7: Testing results for dataset Stubble95 (Precision = 0.03, Recall = 0.17)

However, the weight assignment is somewhat arbitrary and the performance is not good enough, so we want a more precise weight assignment function. In the next section I introduce the Gaussian distribution scheme for smoothing.

4.3 Relabeling using the Gaussian distribution

4.3.1 Algorithm descriptions



Similar to the four/eight-neighborhood scheme, we use a Gaussian distribution as the nest influence function for assigning weights to nearby locations. We distribute the weight of each nest location over a 7×7 window, as shown in Figure 12, and calculate the cumulative weight for each non-nest location.

Figure 12: The 7×7 window for the nest influence using the Gaussian distribution
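A minimal sketch of the Gaussian influence window and the cumulative weights; the standard deviation sigma is an illustrative choice, since the report does not state the value used:

import numpy as np

def gaussian_window(size=7, sigma=1.5):
    # A size x size Gaussian weight window centered on a nest cell, with the
    # center normalized to 1 so a nest contributes full weight at its own cell.
    half = size // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return w / w.max()

def cumulative_weights(nest_grid, window):
    # Sum the window contribution of every nest cell over the whole grid.
    half = window.shape[0] // 2
    padded = np.zeros((nest_grid.shape[0] + 2 * half,
                       nest_grid.shape[1] + 2 * half))
    for r, c in zip(*np.nonzero(nest_grid)):
        padded[r:r + window.shape[0], c:c + window.shape[1]] += window
    return padded[half:-half, half:-half]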

After calculating the cumulative weights for every non-nest location, we can observe the weight distribution shown in Figure 13.

Figure 13: Weight Distribution after calculating cumulative weight for every non-nest location

Then we discretize the weight values into class labels based on a certain threshold; Figure 14 shows the nest distribution, including the relabeled locations, after discretization.


Figure 14: Relabeling based on the threshold

4.3.2 Experiment Results

Based on the relabeled dataset, we first balance it using the technique discussed in Section 4.1, then run C4.5 classification on the balanced dataset, obtaining the following training and testing results, respectively.

Actual \ Predicted | Nest | Non-Nest
Nest* | 4130 | 490
Non-Nest | 921 | 3791

Table 8: Training results for the relabeled and balanced dataset of Darr95 (Precision = 0.818, Recall = 0.894)
* Note: Here the nest number is the number after relabeling and balancing

Actual \ Predicted | Nest | Non-Nest
Nest | 10 | 20
Non-Nest | 497 | 1291

Table 9: Testing results for dataset Stubble95 (Precision = 0.02, Recall = 0.33)

We can see that Gaussian smoothing achieves some gain compared to the performance obtained using the balancing technique alone.

4.4 Relabeling based on attribute similarity

Mislabeled records near nests hold the same or similar attribute values to those of the nearby actual nest locations, so we can relabel the non-nest locations based on attribute similarity. The basic process is shown in Figure 15.

For each non-nest record
    For each nest record
        Compare the corresponding attribute values of the nest record and the non-nest record;
        If all attribute values of the non-nest record are similar to the nest record
            Then relabel this non-nest location as nest; break;
    End
End

Figure 15: Algorithm for the relabeling based on attribute similarity



Here we can set a threshold on the similarity of attribute values between two records, e.g., 2% or 5%. If we know the relative significance of each attribute, we can assign a different threshold to each attribute accordingly. Since the spatial data exhibits high autocorrelation and contiguity, we can expect the relabeled records to be located near the actual nests.
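A minimal sketch of the Figure 15 algorithm; here "similar" means within the threshold fraction of each attribute's observed value range, which is one way to make the 2%/5% thresholds concrete (the helper name and this normalization are assumptions of the sketch):

import numpy as np

def relabel_by_similarity(X, y, threshold=0.05):
    # Relabel non-nest records whose attribute values are all similar to
    # those of some nest record. X: (n, d) attributes; y: (n,) 0/1 labels.
    y = y.copy()
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # guard against constant attributes
    nests = X[y == 1]
    for i in np.nonzero(y == 0)[0]:
        # Similar to at least one nest record on every attribute?
        close = np.abs(nests - X[i]) <= threshold * span
        if close.all(axis=1).any():
            y[i] = 1
    return y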

4.5 Iterative Relabeling using C4.5

4.5.1 Algorithm Description

Since we are mainly interested in the nest records, we put the focus on the desirable class. Before starting the relabeling process, we first balance the original dataset to help capture the nest pattern. Then we construct a learning model using classification, obtain the rule set for the nest class, and apply the rules with high confidence to the original dataset to relabel non-nest records. In fact, since we enforce the support threshold while constructing the decision tree, the derived rules also satisfy the support threshold. We then apply the rules to the original dataset, relabel the non-nest records, and iterate the relabeling as shown in Figure 16.

Figure 16: Process graph of the iterative relabeling algorithm. Original dataset → balance the dataset with equal numbers for the two classes → generate the rule set using C4.5 → apply the rules with high confidence for the desirable class to the dataset and re-label the records of the non-desirable class → Dataset-1 → (repeat) → … → Final dataset. The iteration converges to a stable dataset, i.e., one where we can't re-label more records according to the stopping criteria.


The algorithm can be described as follows (D_i denotes the dataset in iteration i; initially i = 1):

1. Balance the dataset D_i to obtain a balanced dataset B_i.
2. Construct the learning model on B_i using C4.5 rule classification and get the rule set.
3. Choose the rules for the desirable class with high confidence and support.
4. Apply the rules to the dataset D_i and re-label records of the non-desirable class to get a new dataset D_{i+1}.
5. Monitor the classification performance based on the new dataset D_{i+1}.
6. Repeat the above steps (with i = i + 1) until we can't re-label more records or the recall on the dataset begins to decrease.

Figure 17: Algorithm for iterative relabeling using the classification technique
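A minimal sketch of this loop. Since C4.5 itself is not readily available in Python, scikit-learn's DecisionTreeClassifier stands in for it, and the high-confidence "rules" are approximated by tree leaves whose predicted nest probability reaches the confidence threshold; all of this, and the reuse of the balance_by_replication sketch from Section 4.1, are assumptions of the sketch rather than the report's exact procedure:

import numpy as np
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

def iterative_relabel(X, y, confidence=0.9, max_iter=10):
    # Iteratively balance, learn, and relabel (Figure 17). Stops when training
    # recall begins to decrease or no more records can be relabeled.
    y = y.copy()
    best_recall = 0.0
    for _ in range(max_iter):
        Xb, yb = balance_by_replication(X, y)  # the Section 4.1 sketch
        tree = DecisionTreeClassifier(min_samples_leaf=20).fit(Xb, yb)
        recall = recall_score(yb, tree.predict(Xb))
        if recall < best_recall:
            break  # stopping criterion: recall decreased
        best_recall = recall
        # Relabel non-nest records that fall into high-confidence nest leaves.
        p_nest = tree.predict_proba(X)[:, list(tree.classes_).index(1)]
        candidates = (y == 0) & (p_nest >= confidence)
        if not candidates.any():
            break  # stopping criterion: nothing left to relabel
        y[candidates] = 1
    return y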

4.5.2 Experiment Results

Here I also use the data from the Darr wetland in 1995 (Darr95) as the training dataset (85 nest locations out of 5372 records in total) and the data from the same year but the other wetland (Stubble95) as the testing dataset (30 nest locations out of 1818 in total).

For the first iteration, we balance the dataset using the technique discussed in Section 4.1, then run C4.5 classification on the balanced dataset, obtaining the following training and testing results, respectively.

Actual \ Predicted | Nest | Non-Nest
Nest* | 4836 | 434
Non-Nest | 1778 | 3509

Table 10: Training results for the balanced dataset of Darr95 in the first iteration (Precision = 0.731, Recall = 0.918)
* Note: Here the nest number is the number after balancing

Actual \ Predicted | Nest | Non-Nest
Nest | 8 | 22
Non-Nest | 387 | 1401

Table 11: Testing results for dataset Stubble95 based on the model above (Precision = 0.02, Recall = 0.267)

We choose the rules derived from the training set with high confidence (threshold = 0.9) and apply them to the original dataset to relabel non-nest records. This claims 228 more nests from the non-nest class, bringing the number of nest locations to 313 (85 + 228). We then balance the relabeled dataset, build the learning model on it, derive the rule set again, and obtain the following training and testing results for the second iteration.

Actual \ Predicted | Nest | Non-Nest
Nest* | 4624 | 384
Non-Nest | 347 | 4712

Table 12: Training results for the balanced dataset of Darr95 in the second iteration (Precision = 0.93, Recall = 0.92)
* Note: Here the nest number is the number after relabeling and balancing



Actual \ Predicted | Nest | Non-Nest
Nest | 10 | 20
Non-Nest | 293 | 1495

Table 13: Testing results for dataset Stubble95 based on the model above (Precision = 0.03, Recall = 0.33)

We can observe that the classification precision and recall on the testing dataset are both better than in the first iteration, i.e., the learning model has improved a little. So we iterate the process again: balance the dataset into a new balanced dataset and derive the rule set for the nest locations. We choose the rules with high confidence and apply them to the dataset after the first re-labeling to do the second relabeling. After relabeling, we balance the dataset, run the classification, and monitor the classification performance on the testing dataset.

Actual \ Predicted | Nest | Non-Nest
Nest* | 4720 | 384
Non-Nest | 341 | 4712

Table 14: Training results for the balanced dataset of Darr95 in the third iteration (Precision = 0.93, Recall = 0.93)
* Note: Here the nest number is the number after relabeling twice and balancing

Actual \ Predicted | Nest | Non-Nest
Nest | 7 | 23
Non-Nest | 261 | 1527

Table 15: Testing results for dataset Stubble95 based on the model above (Precision = 0.03, Recall = 0.23)

We can observe that the performance of the third iteration on the testing dataset is not as good as that of the second iteration; a possible reason is that the relabeling has taken a wrong branch in the current iteration. So we stop at this point in this experiment and keep the previous iteration's result as the final dataset. Since we are more interested in the nest patterns, I use recall as the key factor in the stopping criteria; more subtle and precise stopping criteria could combine recall with other factors.

4.5.3 Discussion on the classification measure

For iterative algorithms, the stopping criteria are a crucial issue, and for the measure we cannot just consider traditional precision and recall. The traditional precision and recall scheme, shown in Table 16, considers the two classes separately.

Actual \ Predicted | Nest | Non-Nest
Nest | a | b
Non-Nest | c | d

Table 16: Traditional precision and recall scheme



Since we relabel some non-nest records into nest records, we have to develop measures that capture this. Here we introduce some new measures.

Actual \ Predicted | Nest | Non-Nest
Nest | a | b
Relabeled | c | d
Non-Nest | e | f

Table 17: Adjusted performance matrix for the relabeling case

Here we consider recall and precision under the circumstance that some records have been re-labeled.

One measure like recall, counting both the original and the relabeled nest rows of Table 17, is

$$R' = \frac{a + c}{(a + b) + (c + d)}$$

Considering the fraction of re-labeled instances added, we define the coefficient

$$\alpha = \frac{c + d}{(a + b) + (c + d)}$$

This coefficient measures the proportion of relabeled records among the nest class after relabeling. We want to relabel the potentially nest worthy locations into the nest class, but we also track this coefficient, aiming for better performance while relabeling the minimum number of non-nest records.

One measure like precision is

$$P' = \frac{a + c}{a + c + e}$$

We could also combine the measures above into a comprehensive X measure.

4.6 Iterative Relabeling using clustering

4.6.1 Algorithm descriptions

In relabeling using classification techniques, we apply the rules for the nest class to the original dataset. We can also use clustering techniques: cluster the nest records, find the attribute boundaries, and apply these boundaries to the original dataset to relabel it. To apply clustering techniques, we make the following assumptions:

1. Binary class problem (there are just 2 classes).
2. The records in the desirable class are precise, and only records in the other class are mislabeled.
3. The distribution of mislabeled instances is random.
4. There are no overlaps in the attribute space between the instances of the two classes, i.e., they are linearly separable, so we can find the boundary for each class.
5. There is no hole in the sample distribution.
6. There are some mislabeled instances in the dataset.
7. The percentage of mislabeled instances does not exceed 50%.


For the process of relabeling using clustering techniques, the algorithm is as follows:

1. Choose the instances of the desired class.
2. Cluster the instances chosen in step 1 using some clustering algorithm (e.g., K-Means, although it has difficulty with the cluster shapes and sizes in this dataset; it can still be used if the mislabeling percentage is about 20-40%, and a better algorithm such as Chameleon is under consideration).
3. Shrink the boundary of each cluster.
4. Iterate step 3 until we get a stable cluster boundary.
5. Apply the boundaries to the dataset to re-label the instances of the non-desired class.

For the clustering algorithm we can use classical algorithms such as K-Means, especially when the mislabeling rate is not too high. However, K-Means has difficulty dealing with the cluster shapes and sizes in this dataset, so it is not good enough. We can try Chameleon combined with a boundary-shrinking technique, as in DBSCAN, to get a more precise boundary for the nest records and apply it to the original dataset.
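A minimal sketch of steps 1-5 using K-Means, with step 3 realized as a simple contraction of each cluster's radius around its center; k and the shrink factor are illustrative choices, and Chameleon is not implemented here:

import numpy as np
from sklearn.cluster import KMeans

def relabel_by_clusters(X, y, k=5, shrink=0.8):
    # Cluster the desired-class (nest) instances, shrink each cluster boundary,
    # and relabel non-nest instances falling inside a shrunken boundary.
    nests = X[y == 1]
    km = KMeans(n_clusters=k, n_init=10).fit(nests)
    # Radius of each cluster = max distance of its members to the center.
    radii = np.array([
        np.linalg.norm(nests[km.labels_ == j] - km.cluster_centers_[j],
                       axis=1).max()
        for j in range(k)
    ])
    y = y.copy()
    for i in np.nonzero(y == 0)[0]:
        d = np.linalg.norm(km.cluster_centers_ - X[i], axis=1)
        if (d <= shrink * radii).any():
            y[i] = 1
    return y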

I have only tried this technique on a synthetic dataset so far, where it works well. Later I will try it on real datasets, such as UCI datasets satisfying the assumptions.

4.7 Relabeling using Regression

In this dataset all the explanatory attributes are numerical, so regression is a natural way to fit a model. Since this is a binary response problem, we can use a logit/probit model to build the regression.

Here we first ignore the spatial autocorrelation and build a probit regression model; the resulting regression value distributions are shown in Figures 18 and 19.
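A minimal sketch of the probit fit, using the statsmodels library (the choice of library is an assumption; the report does not name its tooling):

import statsmodels.api as sm

def fit_probit(X, y):
    # Fit a probit model for the binary nest response.
    # X: (n, d) matrix of the habitat attributes; y: (n,) 0/1 nest labels.
    Xc = sm.add_constant(X)                # add the intercept term
    result = sm.Probit(y, Xc).fit(disp=0)  # maximum-likelihood fit
    return result

# Usage sketch: regression values (cumulative probabilities) for each cell.
# result = fit_probit(X_darr95, y_darr95)
# p = result.predict(sm.add_constant(X_darr95))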


Figure 18: Probit regression for Darr95

Figure 19: Probit regression for Stubble95

Moreover, our collaborators [2] have extended this to spatial regression by adding neighborhood information, via the contiguity matrix, to the regression formula as in SAM. I will use SAM for relabeling: we will relabel non-nest records based on the regression value, which is the cumulative probability of nest presence in the spatial autoregressive regression.


5. Summary

There are many interesting research issues in this dataset, and we have applied several techniques to deal with its challenges. These techniques solve some of the challenges, and we are working on new techniques to deal with the others.

Challenge for spatial issues | Not Independent Distribution | Not Identical Distribution | Mislabel Nest-Worthy Locations | Measure of Accuracy
SAM | x | x | |
PLUMS | x | x | | x
Our techniques | 4.2, 4.3 | 4.2, 4.3 | 4.2, 4.3, 4.4, 4.5, 4.6, 4.7 |

Table 18: Comparison of approaches on the challenges of a spatial nature

Challenge for non-spatial issues | Different class size | Outliers | Relative Thresholds | Temporal
SAM | | | |
PLUMS | | | |
Our techniques | 4.1 | Detect and eliminate outliers | Proposed in future work | Proposed in future work

Table 19: Comparison of approaches on the challenges of a non-spatial nature

We will apply these techniques to other datasets and expect to solve similar problems across domains.

6. Future Work

I will try the iterative relabeling algorithms using clustering techniques and look for a better clustering algorithm that finds more precise boundaries for the nest patterns.

I will try the relabeling algorithm using the spatial autoregressive model with a proper contiguity matrix W, which captures the neighborhood information.

For the relative thresholds across the different regions and datasets, I will add additional attributes, e.g., the local ranking of each attribute for each cell within a certain window size, to capture the local information. We then build the learning model on the extended dataset including the attribute rankings; this experiment is in progress.

For the prediction accuracy measure, ADNP in PLUMS is actually not good enough: e.g., if we predict nest everywhere, ADNP is perfect, yet the quality of prediction is poor. So we should also keep an eye on the number of predictions; we can use the average distance to the nearest prediction together with the average distance to the nearest actual location. We will work on other good measures for classification accuracy and relabeling properties.
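As a sketch of using the two distances together, reusing the hypothetical adnp helper from Section 3; averaging the two directions is an illustrative combination, not the report's definition:

def symmetric_adnp(actual, predicted):
    # adnp(actual, predicted): small when every actual nest is near a prediction.
    # adnp(predicted, actual): blows up when we predict nest everywhere, so the
    # pair penalizes indiscriminate prediction.
    return 0.5 * (adnp(actual, predicted) + adnp(predicted, actual))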


References

[1] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi. Predicting Locations Using Map Similarity (PLUMS): A Framework for Spatial Data Mining.
[2] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi. Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations.
[3] P. Gould. The Geographer at Work. Routledge and Kegan Paul, London, 1985.
[4] N. A. Cressie. Statistics for Spatial Data (Revised Edition). Wiley, New York, 1993.
[5] S. Ozesmi and U. Ozesmi. An Artificial Neural Network Approach to Spatial Habitat Modelling with Interspecific Interaction. Ecological Modelling, Elsevier Science B.V., (116): 15-31, 1999.
[6] U. Ozesmi and W. Mitsch. A Spatial Habitat Model for the Marsh-breeding Red-winged Blackbird (Agelaius phoeniceus L.) in Coastal Lake Erie Wetlands. Ecological Modelling, Elsevier Science B.V., (101): 139-152, 1997.
[7] J. I. Maletic and A. Marcus. Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000), Massachusetts Institute of Technology, October 20-22, 2000.
[8] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.
[9] K. Koperski, J. Han and N. Stefanovic. An Efficient Two-Step Method for Classification of Spatial Data.
[10] L. Anselin. Spatial Econometrics: Methods and Models. Kluwer, Dordrecht, Netherlands, 1988.
[11] B. Flury. A First Course in Multivariate Statistics. Springer, 1997.