Clustering and Regression using WEKA

VGSOM

WEKA – Data Mining Techniques

Clustering and Regression

BY M.P.Vijaya Prabhu

10BM60097

Contents

1. INTRODUCTION

2. CLUSTERING

2.1 Data Visualization

3. Regression Analysis

3.1 Pricing the house

4. References

1. INTRODUCTION

Weka is “Data Mining Software in Java”. The name is an acronym for Waikato Environment for Knowledge Analysis. It is a collection of state-of-the-art machine learning algorithms and data preprocessing tools written in Java, developed at the University of Waikato, New Zealand. It is free software that runs on almost any platform and is available under the GNU General Public License.

Weka lets the user carry out complex analyses interactively and visualize the results effectively.

[Figure: the WEKA GUI]

Advantages of using WEKA

1) Built-in advanced algorithms

2) Effective visualization of results

3) Easy-to-use GUI

WEKA – DATA MINING TECHNIQUES

Let us demonstrate the use of WEKA with two examples, one on clustering (k-means) and one on regression.

2. CLUSTERING

The data is a sample bank data set taken from an online source. It contains the following attributes:

1) age numeric

2) sex {FEMALE,MALE}

3) region {INNER_CITY,TOWN,RURAL,SUBURBAN}

4) income numeric

5) married {NO,YES}

6) children {0,1,2,3}

7) car {NO,YES}

8) save_act {NO,YES}

9) current_act {NO,YES}

10) mortgage {NO,YES}

11) pep {YES,NO}
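WEKA reads data in its ARFF format, in which each attribute is declared either as numeric or as a set of nominal values. The following is a minimal sketch of how a header for the attributes listed above could be written. The relation name bank-data and the attribute name sex are taken from the WEKA output, the helper arff_header is hypothetical, and the id column that appears in the WEKA output is left out because it is not part of the list above.

```python
# Attribute declarations mirroring the list above: None means numeric,
# a list gives the allowed nominal values.
ATTRIBUTES = [
    ("age", None),
    ("sex", ["FEMALE", "MALE"]),
    ("region", ["INNER_CITY", "TOWN", "RURAL", "SUBURBAN"]),
    ("income", None),
    ("married", ["NO", "YES"]),
    ("children", ["0", "1", "2", "3"]),
    ("car", ["NO", "YES"]),
    ("save_act", ["NO", "YES"]),
    ("current_act", ["NO", "YES"]),
    ("mortgage", ["NO", "YES"]),
    ("pep", ["YES", "NO"]),
]


def arff_header(relation, attributes):
    """Return the header lines of an ARFF file (hypothetical helper)."""
    lines = ["@relation " + relation, ""]
    for name, values in attributes:
        if values is None:
            kind = "numeric"
        else:
            kind = "{" + ",".join(values) + "}"
        lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"]
    return lines


header = arff_header("bank-data", ATTRIBUTES)
print("\n".join(header))
```

The data rows would follow the @data line, one comma-separated instance per line.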

Based on these data, we need to cluster the customers into 6 groups and find out the characteristics of each group.

The sample data contains 600 instances. The objective is to cluster them using the k-means algorithm.

First, the data is loaded into WEKA and preprocessed. Once the preprocessing is done, we can start clustering the data.

WEKA's SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. When computing distances, as in our case, the built-in algorithm automatically normalizes the numerical attributes. Euclidean distance is used as the measure of distance between instances and cluster centroids.
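The distance computation can be sketched in a few lines. This is a minimal illustration, assuming that numeric attributes are rescaled to [0, 1] by their observed range and that a nominal attribute contributes 0 when the two values match and 1 when they differ. The function name, the two sample customers and the age and income ranges are made up for illustration, and this is not WEKA's own code.

```python
import math


def mixed_distance(a, b, ranges):
    """Euclidean distance over mixed numeric/nominal attributes.

    ranges maps each numeric attribute to its (min, max); attributes
    missing from ranges are treated as nominal.
    """
    total = 0.0
    for name in a:
        if name in ranges:
            lo, hi = ranges[name]
            span = hi - lo
            # Rescale the numeric difference by the attribute's range.
            diff = (a[name] - b[name]) / span if span else 0.0
        else:
            # Nominal attribute: 0 if the values match, 1 otherwise.
            diff = 0.0 if a[name] == b[name] else 1.0
        total += diff * diff
    return math.sqrt(total)


# Hypothetical ranges and customers, for illustration only.
ranges = {"age": (18, 68), "income": (5000.0, 65000.0)}
x = {"age": 43, "income": 29000.0, "sex": "FEMALE", "region": "RURAL"}
y = {"age": 38, "income": 26000.0, "sex": "MALE", "region": "INNER_CITY"}

distance = mixed_distance(x, y, ranges)
print(distance)
```

Without the rescaling step, the income difference (3000) would dominate every other attribute, which is why normalization matters for data like this.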

After selecting k-means, we can open the advanced settings of the algorithm. We have changed the number of clusters from 2 to 6, to get 6 different clusters from the given data.

After the required details are given, “Use training set” is checked. Then we can click “Start”.
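To see what the algorithm does once Start is clicked, here is a minimal k-means sketch on a single numeric attribute. It assumes plain Lloyd iterations with fixed starting centroids. The function name kmeans_1d and the income values are hypothetical, and WEKA's SimpleKMeans differs in how it picks its starting centroids and in its handling of nominal attributes.

```python
def kmeans_1d(values, centroids, max_iterations=500):
    """Lloyd's algorithm on a list of numbers (illustrative sketch)."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its members.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # centroids are stable, so the algorithm has converged
        centroids = updated
    # Within-cluster sum of squared errors, the quantity WEKA reports.
    sse = sum((v - centroids[i]) ** 2
              for i, c in enumerate(clusters) for v in c)
    return centroids, clusters, sse


# Hypothetical incomes (in thousands), grouped into k = 2 clusters.
incomes = [20.0, 21.0, 22.0, 33.0, 34.0, 35.0]
centroids, clusters, sse = kmeans_1d(incomes, centroids=[20.0, 35.0])
print(centroids, sse)
```

In the run reported here the same idea is applied with 6 clusters over all attributes at once.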

The result is given below.

=========================================================================================
OUTPUT:

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     bank-data
Instances:    600
Attributes:   12
              id
              age
              sex
              region
              income
              married
              children
              car
              save_act
              current_act
              mortgage
              pep
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 18
Within cluster sum of squared errors: 1955.4146634784236
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute     Full Data          0          1          2          3          4          5
                  (600)       (74)      (164)       (71)       (58)       (99)      (134)
=========================================================================================
id              ID12101    ID12107    ID12103    ID12101    ID12104    ID12102    ID12108
age              42.395    42.9324    43.7744    39.0282    37.3103     38.404    47.3433
sex              FEMALE     FEMALE     FEMALE     FEMALE     FEMALE       MALE       MALE
region       INNER_CITY      RURAL INNER_CITY INNER_CITY       TOWN INNER_CITY       TOWN
income       27524.0312 28838.7605 28586.4063 20463.1273 20600.8528  25720.037 33568.3929
married             YES         NO        YES        YES        YES        YES         NO
children         1.0117      1.973      0.628     0.6901     1.6207      0.899     0.9403
car                  NO         NO         NO         NO         NO        YES        YES
save_act            YES        YES        YES         NO         NO         NO        YES
current_act         YES        YES        YES        YES        YES        YES        YES
mortgage             NO         NO         NO         NO         NO        YES         NO
pep                  NO         NO         NO        YES         NO        YES        YES

Time taken to build model (full training data) : 0.16 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       74 ( 12%)
1      164 ( 27%)
2       71 ( 12%)
3       58 ( 10%)
4       99 ( 17%)
5      134 ( 22%)
=========================================================================================

The result window shows the centroid of each cluster as well as statistics on the number and

percentage of instances assigned to different clusters.

0       74 ( 12%)
1      164 ( 27%)
2       71 ( 12%)
3       58 ( 10%)
4       99 ( 17%)
5      134 ( 22%)
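The percentages follow directly from the cluster sizes and the 600 instances. A small sketch of that arithmetic, assuming halves are rounded up (an assumption that reproduces the 17% shown for cluster 4, where 99/600 is exactly 16.5%); the helper name is hypothetical.

```python
# Cluster sizes as reported in the WEKA output above.
cluster_sizes = {0: 74, 1: 164, 2: 71, 3: 58, 4: 99, 5: 134}
total = sum(cluster_sizes.values())


def percent_half_up(count, total):
    """Whole-number percentage, rounding halves up, in integer arithmetic."""
    return (200 * count + total) // (2 * total)


percentages = {c: percent_half_up(n, total) for c, n in cluster_sizes.items()}
for cluster, n in cluster_sizes.items():
    print(cluster, n, str(percentages[cluster]) + "%")
```

Integer arithmetic is used so that the 16.5% case does not depend on floating-point rounding.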

The output of this clustering takes the form of the cluster centroids. The first column gives the values for the full data set, followed by one column for each of clusters 0 to 5.

Attribute     Full Data          0          1          2          3          4          5
age              42.395    42.9324    43.7744    39.0282    37.3103     38.404    47.3433
sex              FEMALE     FEMALE     FEMALE     FEMALE     FEMALE       MALE       MALE
region       INNER_CITY      RURAL INNER_CITY INNER_CITY       TOWN INNER_CITY       TOWN
income       27524.0312 28838.7605 28586.4063 20463.1273 20600.8528  25720.037 33568.3929
married             YES         NO        YES        YES        YES        YES         NO
children         1.0117      1.973      0.628     0.6901     1.6207      0.899     0.9403
car                  NO         NO         NO         NO         NO        YES        YES
save_act            YES        YES        YES         NO         NO         NO        YES
current_act         YES        YES        YES        YES        YES        YES        YES
mortgage             NO         NO         NO         NO         NO        YES         NO
pep                  NO         NO         NO        YES         NO        YES        YES

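For example, the Full Data column in the WEKA output shows that the average customer is a middle-aged (approx. 42) female living in the inner city with an income of approx. $27,500, who is married with one child, etc. The centroid for cluster 0, by contrast, represents a segment of middle-aged (approx. 43) females living in rural areas with an average income of approx. $28,800, who are not married and have about two children. Furthermore, this group has on average said NO to the PEP product.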
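As a consistency check on the centroid table: the Full Data column in the WEKA output is the centroid of all 600 instances, so for a numeric attribute it should equal the size-weighted average of the six cluster centroids. The sketch below recomputes it for age and income using only the numbers reported above; the function name is hypothetical.

```python
# Cluster sizes and centroid values copied from the WEKA output above.
sizes = [74, 164, 71, 58, 99, 134]
age = [42.9324, 43.7744, 39.0282, 37.3103, 38.404, 47.3433]
income = [28838.7605, 28586.4063, 20463.1273,
          20600.8528, 25720.037, 33568.3929]


def weighted_mean(values, weights):
    """Mean of per-cluster averages, weighted by cluster size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


overall_age = weighted_mean(age, sizes)
overall_income = weighted_mean(income, sizes)
print(round(overall_age, 3), round(overall_income, 2))
```

Both results agree with the Full Data values (42.395 and 27524.0312) up to the rounding of the printed centroids.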

2.1 Data Visualization

The result can be viewed more intuitively using the visualization built into WEKA.

The distribution of males and females in each cluster can be visualized with the following steps.

Step 1: Right-click on the result in the result list and select “Visualize cluster assignments”.

Step 2: Select Cluster as the X axis.

Step 3: Select Instance_number as the Y axis.

Step 4: Select sex as the colour, so that the points are coloured by sex.

This results in a visualization of the distribution of males and females in each cluster.
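The same information that the plot conveys can be written down as a cross-tabulation of cluster against sex. A minimal sketch, using a handful of hypothetical (cluster, sex) pairs in place of the 600 clustered instances, which are not reproduced in this document:

```python
from collections import Counter

# Hypothetical (cluster, sex) pairs standing in for the clustered instances;
# the real pairs would come from the 600 rows after clustering.
assignments = [
    (0, "FEMALE"), (0, "FEMALE"), (0, "MALE"),
    (4, "MALE"), (4, "MALE"), (4, "FEMALE"),
    (5, "MALE"), (5, "MALE"),
]

counts = Counter(assignments)
clusters = sorted({cluster for cluster, _ in assignments})
for cluster in clusters:
    females = counts[(cluster, "FEMALE")]
    males = counts[(cluster, "MALE")]
    print("cluster", cluster, "FEMALE", females, "MALE", males)
```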

3. Regression Analysis

WEKA also offers a range of options for regression. Let us explain it using a simple “LinearRegression” example.

3.1 Pricing the house

The data is taken from an online source. The selling price of a house needs to be determined based on the data given. The data contains the following attributes.

1) houseSize NUMERIC

2) lotSize NUMERIC

3) bedrooms NUMERIC

4) granite NUMERIC

5) bathroom NUMERIC

6) sellingPrice NUMERIC

So, based on the size of the house, the lot size, the number of bedrooms, whether it is furnished with granite, and the number of bathrooms, we need to predict the dependent variable, i.e. the selling price.

First, the data is loaded into WEKA and any necessary preprocessing is done. Since our data is already processed, we proceed to selecting the type of regression.

Select “LinearRegression”, which is listed under functions in the Classify tab. Then select “Use training set” in the Test options.

Three other test options are available:

Supplied test set: the model is evaluated on a separate data set supplied by the user.

Cross-validation: WEKA builds models on subsets of the supplied data, tests each on the part left out and averages the results.

Percentage split: WEKA builds the model on a given percentage of the data and tests it on the remainder.
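How the last two options divide the 700 instances can be sketched as follows. This is a simplified illustration that keeps the instances in their original order, whereas WEKA randomizes the data first. The function names are hypothetical, and 66% and 10 folds are used only as example settings.

```python
def percentage_split(n_instances, train_percent):
    """Indices for a train/test split that keeps the original order."""
    cut = n_instances * train_percent // 100
    indices = list(range(n_instances))
    return indices[:cut], indices[cut:]


def cross_validation_folds(n_instances, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]  # every k-th instance goes to this fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


train, test = percentage_split(700, 66)
folds = list(cross_validation_folds(700, 10))
print(len(train), len(test), len(folds))
```

Every instance is used for testing exactly once across the folds, which is what makes the cross-validation estimate less dependent on one particular split.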

Here the attribute sellingPrice is chosen as the attribute to predict. This means that we are going to predict the dependent variable (selling price) from the available data.

Then click on the “Start” button to build a model using WEKA.

OUTPUT:

=========================================================================================
=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     house
Instances:    700
Attributes:   6
              houseSize
              lotSize
              bedrooms
              granite
              bathroom
              sellingPrice
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Linear Regression Model

sellingPrice =

     22.6582 * houseSize +
      9.1242 * lotSize +
  42145.0767 * bedrooms +
  42562.0901 * bathroom +
 -20981.3142

Time taken to build model: 0.04 seconds

=== Evaluation on training set ===
=== Summary ===

Correlation coefficient                  0.9945
Mean absolute error                   4790.821
Root mean squared error               4245.4125
Relative absolute error                 11.9082 %
Root relative squared error             11.21   %
Total Number of Instances              700
=========================================================================================

The output predicts that the selling price will be

sellingPrice = (22.6582 * houseSize) + (9.1242 * lotSize) + (42145.0767 * bedrooms) + (42562.0901 * bathroom) - 20981.3142

To determine the selling price of a house from given data, just plug the values into this equation.
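Plugging in the values can be written as a short function. The coefficients are copied from the WEKA output above; the function name and the example house are hypothetical and are not taken from the data set.

```python
# Coefficients copied from the WEKA LinearRegression output above.
COEFFICIENTS = {
    "houseSize": 22.6582,
    "lotSize": 9.1242,
    "bedrooms": 42145.0767,
    "bathroom": 42562.0901,
}
INTERCEPT = -20981.3142


def predict_selling_price(house):
    """Apply the fitted linear model to one house (a dict of attributes)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name]
                           for name in COEFFICIENTS)


# Hypothetical house, not taken from the data set.
example = {"houseSize": 2000, "lotSize": 5000, "bedrooms": 3,
           "bathroom": 2, "granite": 1}
price = predict_selling_price(example)
print(round(price, 2))
```

Because granite has no coefficient in the model, changing its value in the example leaves the predicted price unchanged.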

The granite attribute does not appear in the model, which indicates that it does not matter much for the selling price of the house.

4. References

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

www.cs.waikato.ac.nz/ml/weka/

http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/

http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html

http://www.cs.utexas.edu/users/ml/tutorials/Weka-tut/
