1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed...

34
1 Using Weka

Transcript of 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed...

Page 1: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

1 1 Slide

Slide

Using Weka

Page 2: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

2 2 Slide

Slide

Data Mining Using Weka What’s Data Mining?

• We are overwhelmed with data

• Data mining is about going from data to

information, information that can give you

useful predictions

Examples??

• You’re at the supermarket checkout.

• You’re happy with your bargains … and … the

supermarket is happy you’ve bought some

more stuff

• Say you want a child, but you and your partner

can’t have one. Can data mining help?

Data mining vs. machine learning

Page 3: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

3 3 Slide

Slide

Data Mining Using Weka What’s Weka?

• A bird found only in New Zealand?

Data mining workbench

• Waikato Environment for Knowledge Analysis

Machine learning algorithms for data mining tasks

• 100+ algorithms for classification

• 75 for data preprocessing

• 25 to assist with feature selection

• 20 for clustering, finding association rules, etc

Page 4: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

4 4 Slide

Slide

Data Mining Using Weka What will you learn?

• Load data into Weka and look at it

• Use filters to preprocess it

• Explore it using interactive visualization

• Apply classification algorithms

• Interpret the output

• Understand evaluation methods and their

implications

• Understand various representations for models

• Explain how popular machine learning

algorithms work

Page 5: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

5 5 Slide

Slide

Data Mining Using Weka What will you learn? (cont.)

• Be aware of common pitfalls with data mining

• Use Weka on your own data … and understand

what you are doing!

Page 6: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

6 6 Slide

Slide

Data Mining Using Weka Getting started with Weka

• Install Weka

• Explore the “Explorer” interface

• Explore some datasets

• Build a classifier

• Interpret the output

• Use filters

• Visualize your data set

Page 7: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

7 7 Slide

Slide

Data Mining Using Weka Install Weka

• Download links available on Course Page

• http://chouc.people.cofc.edu/SCU/DM/

index.html

Platform:

• Windows X86

• Windows X64

• Mac OSX

Version: 3.6.10

• the latest stable version of Weka

• datasets for the course

Page 8: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

8 8 Slide

Slide

Data Mining Using Weka Exercise

• Install Weka

• Get datasets along with the installation

• Load the Weka program

• Open Explorer

• Open a dataset (weather.nominal.arff)

• Look at attributes

• Edit the dataset

• Save it if you need to make changes to the

dataset

Page 9: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

9 9 Slide

Slide

Command line ‐interface

Graphical interface

Performance comparisons

Exploring the Explorer

Page 10: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

10 10 Slide

Slide

Exploring the Explorer

Page 11: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

11 11 Slide

Slide

attributes

1

2

3

4

5

6

7

8

9

10

11

12

13

14

instances

Outlook Temp Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes

Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No

Exploring the Explorer

Open a dataset (weather.nominal.arff)

Page 12: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

12 12 Slide

Slide

19

open file weather.nominal.arff

Exploring the Explorer

Page 13: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

13 13 Slide

Slide

attributes

attribute values

Exploring the Explorer

Page 14: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

14 14 Slide

Slide

attributes

1

2

3

4

5

6

7

8

9

10

11

12

13

14

instances

Outlook Temp Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes

Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No

Exploring the Explorer

Page 15: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

15 15 Slide

Slide

attributes

class

attribute values

open file weather.nominal.arff

Exploring the Explorer

Page 16: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

16 16 Slide

Slide

Classification

Dataset: classified examples

“Model” that classifies new examples

instance:fixed set of features

classifiedexample

class discrete: “classification” problemcontinuous: “regression” problem

discrete (“nominal”)continuous (“numeric”)

attribute 1attribute 2

attribute n

sometimes called “supervised learning”

Exploring the Explorer

Page 17: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

17 17 Slide

Slide

attributes

class

attribute values

open file weather.numeric.arff

Exploring the Explorer

Page 18: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

18 18 Slide

Slide

open file glass.arff

Exploring the Explorer

Page 19: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

19 19 Slide

Slide

Exploring the Explorer Exercise on the classification problem

• Datasets: weather.nominal, weather.numeric

• Nominal vs numeric attributes

• ARFF file format

• Checking attributes

Page 20: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

20 20 Slide

Slide

Exploring the Explorer File format

• ARFF file format

• Native in Weka

• More information

• CSV file format

• Compatible with Excel and Weka

Page 21: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

21 21 Slide

Slide

Exploring the Explorer Excise on File Preparation

• Prepare ARFF file

• Specialized format

• Need to follow ARFF syntax

• CSV file format

• Comma separated format

• Notepad compatible

• Excel compatible

Page 22: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

22 22 Slide

Slide

Exploring the Explorer Excise on File Preparation (cont.)

• ARFFCSV

• Easy

• In Weka Explorer, use Save… feature after

loading the dataset and change file format to

CSV data files

• CSVARFF

• Easy

• In Weka Explorer, use Open File… feature

and change the file format to CSV data files

• Next, use Save… feature and change the file

format to Arff data files

Page 23: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

23 23 Slide

Slide

Building a classifier Use J48 to analyze the glass dataset

• Open file glass.arff

• Check the available classifiers

• Choose the J48 decision tree learner

(trees>J48)

• Run it

• Examine the output

• Look at the correctly classified instances … and

the confusion matrix

Page 24: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

24 24 Slide

Slide

Building a classifier Investigate J48

• Open the configuration panel

• Check the More information

• Examine the options

• Use an unpruned tree

• Look at leaf sizes

• Set minNumObj to 15 to avoid small leaves

• Visualize tree using right‐click menu

Page 25: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

25 25 Slide

Slide

Building a classifier From C4.5 to J48

• ID3 (1979)

• C4.5 (1993)

• C4.8 (1996)

• C5.0 (commercial)

J48

Page 26: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

26 26 Slide

Slide

Building a classifier Investigate J48

• Classifiers in Weka

• Classifying the glass dataset

• Interpreting J48 output

• J48 configuration panel

• … option: pruned vs unpruned trees

• … option: avoid small leaves

Page 27: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

27 27 Slide

Slide

Using a filter Use a filter to remove an attribute (3rd attribute)

• Open weather.nominal.arff

• Check the filters

• supervised vs unsupervised

• attribute vs instance

• Choose the unsupervised attribute filter

Remove

• Check the More information; look at the options

• Set attributeIndices to 3 and click OK (to

remove the 3rd attribute)

• Apply the filter

• Save the result or press Undo to skip the

change

Page 28: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

28 28 Slide

Slide

Using a filter Use Remove button to remove attributes

• Open weather.nominal.arff

• Use check boxes and Remove button

Page 29: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

29 29 Slide

Slide

Using a filter Remove instances where humidity is high

• Open weather.nominal.arff

• Supervised or unsupervised?

• Attribute or instance?

• Look at them

• Select RemoveWithValues

• Set attributeIndex to 3 (3rd attribute)

• Set nominalIndices to 1 (first value: high)

• Apply

• Undo

Page 30: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

30 30 Slide

Slide

Using a filter Fewer attributes, better classification!

• Open glass.arff

• Run J48 (trees>J48)

• Remove Fe

• Remove all attributes except RI and MG

• Look at the decision trees

• Use right‐click menu to visualize decision trees

Page 31: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

31 31 Slide

Slide

Using a filter Summary

• Filters in Weka

• Supervised vs unsupervised, attribute vs

instance

• To find the right one, you need to look

• Filters can be very powerful

• Smartly removing attributes

• improve performance

• increase comprehensibility

Page 32: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

32 32 Slide

Slide

Visualizing your data Using the Visualize panel

• Open iris.arff

• Bring up Visualize panel

• Click one of the plots; examine some instances

• Set x axis to petalwidth and y axis to

petallength

• Click on Class color to change the color

• Bars on the right change correspond to

attributes: click for x axis; right‐click for y axis

• Jitter slider (to see the overlapped instances)

• Show Select Instance: Rectangle option

• Submit, Reset, Clear and Save

Page 33: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

33 33 Slide

Slide

Visualizing your data Visualizing classification errors

• Open iris.arff

• Run J48 (trees>J48)

• Visualize classifier errors (from Results list)

• Plot predictedclass against class

• Identify errors shown by confusion matrix

Page 34: 1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.

34 34 Slide

Slide

Visualizing your data Summary

• Get down and dirty with your data

• Visualize it

• Clean it up by deleting outliers

• Look at classification errors

• (there’s a filter that allows you to add

classifications as a new attribute)