Project1-15 (1) (3).doc


Transcript of Project1-15 (1) (3).doc

Page 1: Project1-15 (1) (3).doc

Dr. Eick
COSC 4335 “Data Mining” Fall 2015

Draft Project1: (Exploratory) Data Analysis
Group Project (Groups of 2 or 3)

Due: Monday, February 23, 11p (electronic Submission)
Last Updated: January 26, 2015; 5p

Download the Statlog (Vehicle Silhouettes) Data Set from http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes), limiting yourself to analyzing the following subset of the dataset involving just 5 attributes; use all examples to create the subset:

COMPACTNESS (average perim)**2/area (1st attribute)
CIRCULARITY (average radius)**2/area (2nd attribute)
SCATTER RATIO (inertia about minor axis)/(inertia about major axis) (7th attribute)
ELONGATEDNESS area/(shrink width)**2 (8th attribute)
HOLLOWS RATIO (area of hollows)/(area of bounding polygon) (18th attribute)
Class: OPEL, SAAB, BUS, VAN

Apply the following exploratory data analysis techniques using R to your dataset:

1. Compute the mean value and standard deviation of the 5 numerical attributes [1]. 1 point

                 Mean        Standard Deviation
COMPACTNESS      93.67849    8.234474
CIRCULARITY      44.8617     6.169866
SCATTER RATIO    168.8392    33.24498
ELONGATEDNESS    40.93381    7.81156
HOLLOWS RATIO    195.6324    7.438797

2. Compute the covariance matrix for the five numerical attributes you are analyzing; also compute the correlation for each of the three pairs of attributes. Interpret the statistical findings! 2 points

Covariance matrix
                 COMPACTNESS   CIRCULARITY   SCATTER_RATIO   ELONGATEDNESS   HOLLOWS_RATIO
COMPACTNESS      67.80657      35.201637     222.56364       -50.72900       22.391727
CIRCULARITY      35.20164      38.067242     176.47597       -39.94289       1.775135
SCATTER_RATIO    222.56364     176.475966    1105.22856      -252.78343      29.663911
ELONGATEDNESS    -50.72900     -39.942893    -252.78343      61.02047        -12.593593
HOLLOWS_RATIO    22.39173      1.775135      29.66391        -12.59359       55.335707

[1] This is more a verification that you have the correct dataset!

Compactness, Circularity, Scatter_Ratio, and Hollows_Ratio have positive covariances with one another and negative covariances with Elongatedness. This means that vehicles with a high value for one of Compactness, Circularity, Scatter_Ratio, or Hollows_Ratio will usually have high values for the other three. In contrast, vehicles with a high Elongatedness value will usually have low values of Compactness, Circularity, Scatter_Ratio, and Hollows_Ratio.

Correlation matrix
                 COMPACTNESS   CIRCULARITY   SCATTER RATIO   ELONGATEDNESS   HOLLOWS RATIO
COMPACTNESS      1             0.69286923    0.8130033       -0.788647       0.3655519
CIRCULARITY      0.6928692     1             0.8603671       -0.8287548      0.038677
SCATTER RATIO    0.8130033     0.86036714    1               -0.9733853      0.1199498
ELONGATEDNESS    -0.788647     -0.8287548    -0.9733853      1               -0.216725
HOLLOWS RATIO    0.3655518     0.03867702    0.1199498       -0.2167251      1

The positive linear relationship between Compactness and Hollows_Ratio is fairly weak (correlation about 0.37), whereas the positive relationships among Compactness, Circularity, and Scatter_Ratio are strong (correlations from about 0.69 to 0.86). The negative relationship between Compactness and Elongatedness is also strong (about -0.79).
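The two matrices above are linked by the identity r(x, y) = cov(x, y) / (sd(x) * sd(y)). The following Python sketch (the project itself was done in R; Python is used here only for illustration) re-derives two of the reported correlations from the reported covariances and standard deviations. The helper name `correlation` is hypothetical; the numbers are copied from the tables in this report.

```python
# Standard deviations and covariances copied from the tables in this report.
std = {
    "COMPACTNESS": 8.234474,
    "CIRCULARITY": 6.169866,
    "SCATTER_RATIO": 33.24498,
    "ELONGATEDNESS": 7.81156,
}
cov = {
    ("COMPACTNESS", "CIRCULARITY"): 35.201637,
    ("SCATTER_RATIO", "ELONGATEDNESS"): -252.78343,
}


def correlation(x, y):
    """Pearson correlation obtained by rescaling the covariance of x and y."""
    return cov[(x, y)] / (std[x] * std[y])


r_comp_circ = correlation("COMPACTNESS", "CIRCULARITY")
r_scat_elon = correlation("SCATTER_RATIO", "ELONGATEDNESS")

# Compare these with the corresponding entries of the correlation matrix.
print(round(r_comp_circ, 4), round(r_scat_elon, 4))
```

If the three tables are consistent, the printed values should agree with the correlation matrix entries to within rounding.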

3. Create a scatter plot for the last two numerical attributes of your dataset. Interpret the scatter plot! 2 points


The relationship between Elongatedness and Hollows_Ratio appears to be positive for Elongatedness in the range [25, 35] and negative for Elongatedness in the range (35, 45]. For Elongatedness > 45 there appears to be no linear relationship between the two attributes.

In general, the scatter plot shows that the linear relationship between the two attributes is very weak, which agrees with their correlation of about -0.22.

4. Create histograms for each of the 5 numerical attributes. Then create a histogram for the ELONGATEDNESS attribute for instances of OPEL, instances of SAAB, instances of BUS, and instances of VAN (4 Histograms); interpret the 5 histograms you generated for the ELONGATEDNESS attribute. 6 points


For all instances, Elongatedness appears to have two modes. The distribution is not symmetric; the histogram looks somewhat right-skewed.

For the Opel class, Elongatedness appears to have only one mode, and the histogram is clearly right-skewed.


Elongatedness for Saab appears to have only one mode, and the histogram is right-skewed.

Elongatedness for Bus seems to have two modes. If the leftmost bar is ignored, the histogram appears to be left-skewed.


Elongatedness for Van fluctuates considerably from bar to bar, which suggests that the distribution is multi-modal. Overall, however, the histogram looks right-skewed.

5. Create box plots for the COMPACTNESS attribute for the instances of each class and a fifth box plot for all instances in the dataset. Do the same for the HOLLOWS RATIO attribute. Interpret and compare the 5 box plots for each attribute! 5 points

For Compactness, the box plots for opel, saab, van, and all instances have the median (50th percentile) roughly midway between the 25th and 75th percentiles, whereas the median of the box plot for bus seems to be closer to the 25th percentile. The data for van is concentrated between values 87 and 93, while the data in the other box plots seems to have a wider spread.


For Hollows_Ratio, the box plots show that the median (50th percentile) of opel, saab, and van lies roughly midway between the 25th and 75th percentiles, whereas the median for bus looks closer to the 25th percentile and the median for all instances looks closer to the 75th percentile. Moreover, the data for opel, saab, van, and all instances seems to be concentrated between values 192 and 202, while the data for bus is spread more widely.
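The box plot readings above compare where the median falls between the 25th and 75th percentiles. As a rough illustration, the Python sketch below (the project itself used R) computes that position numerically. The function name `median_position` and both sample lists are hypothetical and are not taken from the Statlog data; the quartile rule used by `statistics.quantiles` may also differ slightly from the hinges drawn by a particular box plot routine.

```python
import statistics


def median_position(values):
    """Return where the median sits inside the box.

    0.0 means the median equals the 25th percentile, 0.5 means it is
    centered, and 1.0 means it equals the 75th percentile.
    Assumes the 25th and 75th percentiles differ.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (median - q1) / (q3 - q1)


# Hypothetical samples for illustration only.
centered = [1, 2, 3, 4, 5]
low_median = [1, 1, 1, 2, 2, 5, 9, 10, 12]

print(median_position(centered))    # median in the middle of the box
print(median_position(low_median))  # median close to the 25th percentile
```

A value well below 0.5 corresponds to the "closer to the 25th percentile" reading, and a value well above 0.5 to the "closer to the 75th percentile" reading.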

6. Create supervised scatter plots/supervised density plots for 4 pairs [2] of the 5 attributes (for each pair of attributes visualize it using a traditional scatter plot and a density plot) and the class variable; use different colors for the class variable. Interpret the scatter plots! 5 points

[2] In general, there are 10 pairs, but you only need to visualize 4 of them!


+saab: it appears that circularity and compactness have a positive linear relationship.
+van: it appears that circularity and compactness do not have a strong linear relationship, because compactness seems to stay unchanged as circularity changes.
+bus: it appears that circularity and compactness have a negative linear relationship on the interval [35, 42] on the x-axis, and a positive linear relationship on the interval [42, 60] on the x-axis.
+opel: it appears that circularity and compactness have a positive linear relationship.


+saab: it appears that scatter ratio and compactness have a fairly strong positive linear relationship.
+van: it appears that scatter ratio and compactness do not have a strong linear relationship, because the data points look concentrated in a rectangular shape.
+bus: it appears that scatter ratio and compactness do not have a linear relationship around the value 150 on the x-axis. However, from 150 to 250 on the x-axis, a positive linear relationship seems to exist.
+opel: it appears that scatter ratio and compactness have a fairly strong positive linear relationship.


+It appears that linear relationships exist between elongatedness and compactness for the bus, opel, and saab classes. However, there seems to be no linear relationship for the van class.


+saab: it looks like there is a fairly weak linear relationship between Hollows_Ratio and Compactness.
+van: it seems that there is no linear relationship between Hollows_Ratio and Compactness.
+bus: it seems that there is a negative linear relationship between Hollows_Ratio and Compactness on the interval (95, 100] on the y-axis, but a positive linear relationship on the interval [80, 95] on the y-axis.


7. Create a Star plot for the first 10 instances of class BUS and the first 20 instances of SAAB (based on the order in the file); interpret the 20 star plots; star plots should be constructed for the 4 numerical attributes! 3 points

Among the 20 instances of saab, the dominant shapes appear to be those of instances number 3, 19, 25, 28, 39, 45, 91, and 93. The shapes of instances number 10, 25, 44, 50, 52, 57, 77, and 78 are also dominant. In general, it seems that these 20 instances could be divided into two major groups, with similar shapes grouped together.


8. Fit a linear model that predicts the class attribute (treat it as a numerical attribute that takes values 0, 1, 2 and 3 with OPEL=0, SAAB=1, BUS=2, and VAN=3) using the 5 attributes as the independent variables. Report the R² of the linear model and the coefficients of each attribute in the obtained regression function. Do the coefficients tell you anything about the importance of the attribute in predicting the class variable; if yes, what? Repeat the experiment using OPEL=0, SAAB=0, BUS=2, and VAN=0, and answer the same questions! 6 points

OPEL=0, SAAB=1, BUS=2, and VAN=3

Model                  Coefficient   y-intercept
class~compactness      -0.02352      3.70878
class~circularity      -0.01936      2.40081
class~scatter_ratio    -0.005386     2.461936
class~elongatedness    0.02157       0.67257
class~hollow_ratio     -0.05095      11.40752

Interpretation:
class~compactness: The coefficient indicates that there is a weak negative relationship between class and compactness. It means it is not appropriate to use compactness to predict class.
class~circularity: The coefficient indicates that there is a weak negative relationship between class and circularity. It means it is not appropriate to use circularity to predict class.
class~scatter_ratio: The coefficient indicates that there is a very weak negative relationship between class and scatter_ratio. It means it is not appropriate to use scatter_ratio to predict class.
class~elongatedness: The coefficient indicates that there is a weak positive relationship between class and elongatedness. It means it is not appropriate to use elongatedness to predict class.
class~hollow_ratio: The coefficient indicates that there is a weak negative relationship between class and hollow_ratio. It means it is not appropriate to use hollow_ratio to predict class.

OPEL=0, SAAB=0, BUS=2, and VAN=0

Model                  Coefficient   y-intercept
class~compactness      -0.01588      2.0029
class~circularity      0.002807      0.389432
class~scatter_ratio    0.0005526     0.4220647
class~elongatedness    -0.006926     0.798889
class~hollow_ratio     -0.04016      8.37151

Interpretation:
class~compactness: The negative linear relationship is even weaker.
class~circularity: The linear relationship is now positive, but very weak.
class~scatter_ratio: The linear relationship is now positive, but very weak.
class~elongatedness: The linear relationship is now negative, but very weak.
class~hollow_ratio: The negative linear relationship is weaker.
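One caveat when reading slopes like those listed above: the size of a regression coefficient depends on the units of the attribute, while R² does not. The Python sketch below (the project itself used R) fits a one-variable least-squares line and shows this on a small data set. The data values and the helper `fit_line` are hypothetical, chosen only for illustration, and are not results from the Statlog data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical attribute values and numeric class codes.
xs = [35.0, 38.0, 41.0, 44.0, 47.0, 50.0]
ys = [3, 3, 2, 1, 1, 0]

slope, intercept, r2 = fit_line(xs, ys)

# Same data with the attribute expressed in units ten times smaller.
xs_scaled = [10 * x for x in xs]
slope_s, intercept_s, r2_s = fit_line(xs_scaled, ys)

print(slope, r2)
print(slope_s, r2_s)  # slope shrinks by a factor of 10, R^2 is unchanged
```

Under this sketch, a small coefficient by itself does not show that an attribute is a poor predictor; comparing R² values, or coefficients on standardized attributes, is a scale-free way to make that judgment.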

9. Create 3 decision tree models with 20 or fewer nodes for the dataset (leaf nodes count; do not submit models with more than 20 nodes!). Explain how the 3 decision tree models were obtained. Report the training accuracy and the testing accuracy of each decision tree; interpret the learnt decision trees. What do they tell you about the importance of the 5 attributes for the classification problem? 6 points


10. Write a conclusion (at most 18 sentences!) that assesses the difficulty of predicting the class attribute using the selected 5 attributes and assesses which of the 5 attributes is more important/less important for the classification task based on the findings you obtained answering questions 1-9! Moreover, if you discovered “something else interesting” about the dataset, also mention it in your conclusion. 7 points total (and possibly up to 5 extra points)

Based on the answers to the questions above, we can assess the difficulty of predicting the class attribute using the five selected attributes. In question 6 in particular, we chose pairs from the 5 attributes and analyzed their relationship within the classes saab, van, bus, and opel. Each pair of attributes shows a different degree of linear relationship depending on the class. However, fitting a linear model that predicts the class attribute using the five attributes as independent variables, as in question 8, is not an appropriate approach. The results of question 8 show only a weak relationship between the class and each individual attribute. Thus, using the attributes as independent variables in a linear model to predict the class is inappropriate. Moreover, the individual attributes do not play equal roles, and they depend on each other.

11) Similarity Assessment (7 points!)

Design a distance function to assess the similarity of customers of a supermarket; each customer in a supermarket is characterized by the following attributes [3]:

a) Ssn
b) Items_Bought (the set of items they bought last month; this is a set)
c) Age (an ordinal attribute having values: old, medium, young, and teenager)
d) Amount_spend (average amount spent per purchase in dollars and cents; it has a mean of 50.00, a standard deviation of 25, a minimum of 0.02, and a maximum of 398)

[3] E.g. (111234232, {Coke, 2%-milk, apple}, old, 33.39) is an example of a customer description.

Assume that Items_Bought and Amount_Spend are of major importance and Age is of a minor importance when assessing the similarity of the customers. Assess the distance between the following 3 customers:

a1 = (111111111, {A,B,C}, old, 12.20)
a2 = (22222222, {B,C,D,E,F,G}, medium, 50.20)
a3 = (333333333, {C,D,E,F,H}, young, 28.00)

We assume that the values of attribute age are converted to numbers as ‘old’ =3, ‘medium’ = 2, ‘young’ =1, ‘teenager’ =0.

Let a, b be 2 customers,

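D_Items_Bought(a,b) = 1 * (1 - |a.Items_Bought ∩ b.Items_Bought| / |a.Items_Bought ∪ b.Items_Bought|)

D_Age(a,b) = 0.2 * |a.Age - b.Age| / 3

D_Amount_spend(a,b) = 1 * |(a.Amount_spend - 50)/25 - (b.Amount_spend - 50)/25|

The weights 1, 0.2, and 1 reflect the major importance of Items_Bought and Amount_spend and the minor importance of Age. The distance of a and b is the weighted sum divided by the total weight 1 + 0.2 + 1 = 2.2:

D(a,b) = (D_Items_Bought(a,b) + D_Age(a,b) + D_Amount_spend(a,b)) / 2.2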

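D(a1,a2) = ((1-2/7) + 0.2*(1/3) + |-37.8/25 - 0.2/25|)/2.2 = (5/7 + 0.2/3 + 38/25)/2.2 ≈ 1.046
D(a1,a3) = ((1-1/7) + 0.2*(2/3) + |-37.8/25 + 22/25|)/2.2 = (6/7 + 0.4/3 + 15.8/25)/2.2 ≈ 0.737
D(a2,a3) = ((1-4/7) + 0.2*(1/3) + |0.2/25 + 22/25|)/2.2 = (3/7 + 0.2/3 + 22.2/25)/2.2 ≈ 0.629

For D(a1,a3), the two item sets share only C, and their union {A,B,C,D,E,F,H} has 7 items, so the set term is 1-1/7.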
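The distance function defined above can be cross-checked mechanically. The following Python sketch is one possible transcription of it; the constant names, the tuple layout, and the function name `distance` are assumptions made for illustration, and the sketch assumes that at least one of the two item sets is non-empty.

```python
# Ordinal encoding of Age as stated in the answer above.
AGE_RANK = {"teenager": 0, "young": 1, "medium": 2, "old": 3}

# Weights: Items_Bought and Amount_spend are major, Age is minor.
W_ITEMS, W_AGE, W_AMOUNT = 1.0, 0.2, 1.0

# Mean and standard deviation of Amount_spend given in the task.
AMOUNT_MEAN, AMOUNT_SD = 50.0, 25.0


def distance(a, b):
    """Weighted distance between two customers (ssn, items, age, amount).

    The ssn is ignored because it carries no similarity information.
    """
    _, items_a, age_a, amount_a = a
    _, items_b, age_b, amount_b = b
    # Jaccard distance between the two item sets.
    d_items = 1 - len(items_a & items_b) / len(items_a | items_b)
    # Rank difference scaled by the largest possible difference (3).
    d_age = abs(AGE_RANK[age_a] - AGE_RANK[age_b]) / 3
    # Difference of z-scores of the amounts spent.
    z_a = (amount_a - AMOUNT_MEAN) / AMOUNT_SD
    z_b = (amount_b - AMOUNT_MEAN) / AMOUNT_SD
    d_amount = abs(z_a - z_b)
    total_weight = W_ITEMS + W_AGE + W_AMOUNT
    return (W_ITEMS * d_items + W_AGE * d_age + W_AMOUNT * d_amount) / total_weight


a1 = (111111111, {"A", "B", "C"}, "old", 12.20)
a2 = (22222222, {"B", "C", "D", "E", "F", "G"}, "medium", 50.20)
a3 = (333333333, {"C", "D", "E", "F", "H"}, "young", 28.00)

for name, (x, y) in {"D(a1,a2)": (a1, a2), "D(a1,a3)": (a1, a3), "D(a2,a3)": (a2, a3)}.items():
    print(name, round(distance(x, y), 3))
```

Because Amount_spend is compared through z-scores and is not capped, this distance can exceed 1 for customers whose spending differs by more than about one standard deviation.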

12) Data Analysis (4 POINTS)
a) What is the role and purpose of exploratory data analysis in a data mining project?
b) Interpret the following 2 histograms and analyze their relationships which describe the male and female age distribution in the US, based on Census Data.


The role and purpose of exploratory data analysis in a data mining project:

+ Getting the necessary background information for the task
+ Providing knowledge to help in tool selection
+ Assessing the difficulty of the task to be solved
+ Validating data
+ Forming hypotheses
+ Finding potential issues, errors, and patterns in the data

Interpreting 2 histograms and their relationship:

The two histograms above are continuous, with no gaps in them. They are bimodal (two peaks, at ages 5-9 and 35-39). The values in both histograms start to drop significantly beyond ages 55-59 (a skewed distribution). In general, the two histograms are similar up to ages 55-59. After this point, the male curve declines significantly more steeply than the female curve. From this we can conclude that females tend to live longer than males.
