Data Mining (and machine learning)

ROC curves

Rule Induction

CW3

Two classes is a common and special case

Medical applications: cancer, or not?
Computer Vision applications: landmine, or not?
Security applications: terrorist, or not?
Biotech applications: gene, or not?
…

              Predicted Y      Predicted N
Actually Y    True Positive    False Negative
Actually N    False Positive   True Negative
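As a minimal sketch of how the four cells of the table above are filled in, the following counts them from paired lists of actual and predicted labels. The function name `confusion_counts` and the toy labels are illustrative assumptions, not part of the original slides.

```python
def confusion_counts(actual, predicted, positive="Y"):
    """Count TP, FN, FP, TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actually Y, predicted Y
        elif a == positive:
            fn += 1  # actually Y, predicted N
        elif p == positive:
            fp += 1  # actually N, predicted Y
        else:
            tn += 1  # actually N, predicted N
    return tp, fn, fp, tn


# Hypothetical labels, made up for illustration only.
actual = ["Y", "Y", "Y", "N", "N", "N", "N"]
predicted = ["Y", "Y", "N", "Y", "N", "N", "N"]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)
```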

True Positive: these are ideal. E.g. we correctly detect cancer.

False Positive: to be minimised. These cause false alarms. It can be better to be safe than sorry, but false alarms can be very costly.

False Negative: also to be minimised. Missing a landmine or a cancer is very bad in many applications.

True Negative?:

Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task

Sensitivity = TP/(TP+FN) - how many of the real ‘Yes’ cases are detected? How well can it detect the condition?

Specificity = TN/(FP+TN) - how many of the real ‘No’ cases are correctly classified? How well can it rule out the condition?
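The two formulas can be checked with a few lines of code. The counts below are hypothetical: they assume 16 real ‘Yes’ cases and 12 real ‘No’ cases, chosen only because they give 100%/25%, 93.8%/50% and 81.3%/83.3% when rounded. The actual data behind the slides is not given.

```python
def sensitivity(tp, fn):
    """Fraction of the real 'Yes' cases that are detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of the real 'No' cases that are correctly ruled out."""
    return tn / (fp + tn)


# Hypothetical counts for three positions of the decision line
# (assumed: 16 real 'Yes' cases and 12 real 'No' cases).
positions = [
    {"tp": 16, "fn": 0, "fp": 9, "tn": 3},
    {"tp": 15, "fn": 1, "fp": 6, "tn": 6},
    {"tp": 13, "fn": 3, "fp": 2, "tn": 10},
]
results = [
    (sensitivity(c["tp"], c["fn"]), specificity(c["tn"], c["fp"]))
    for c in positions
]
for sens, spec in results:
    print(sens, spec)
```

Moving from the first position to the third trades sensitivity away in exchange for specificity.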

[Figure: YES and NO cases separated by a decision line, shown at three positions]

Sensitivity: 100%, Specificity: 25%

Sensitivity: 93.8%, Specificity: 50%

Sensitivity: 81.3%, Specificity: 83.3%

100% Sensitivity means: detects all cancer cases (or whatever), but possibly with many false positives

100% Specificity means: no false positives, but possibly misses some cancer cases (or whatever)

Sensitivity = TP/(TP+FN) - how many of the real TRUE cases are detected? How sensitive is the classifier to TRUE cases? A highly sensitive test for cancer: if “NO” then you can be sure it’s “NO”.

Specificity = TN/(TN+FP) - how sensitive is the classifier to the negative cases? A highly specific test for cancer: if “Y” then you can be sure it’s “Y”.

With many trained classifiers, you can ‘move the line’ in this way. E.g. with NB (Naive Bayes), we could use a threshold indicating how much higher the log likelihood for Y should be than for N.
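A minimal sketch of ‘moving the line’ with such a threshold is given below. It assumes a classifier that already supplies a log likelihood for Y and for N for each case; the `classify` function and the score pairs are made up for illustration.

```python
def classify(log_lik_yes, log_lik_no, threshold=0.0):
    """Predict 'Y' only if the log likelihood for Y exceeds that for N
    by more than the threshold."""
    return "Y" if log_lik_yes - log_lik_no > threshold else "N"


# Hypothetical (log likelihood for Y, log likelihood for N) pairs.
scores = [(-2.0, -5.0), (-3.0, -3.5), (-4.0, -3.8), (-6.0, -2.0)]

# A lower threshold says Y more readily (more sensitive);
# a higher threshold says Y more reluctantly (more specific).
for threshold in (-1.0, 0.0, 1.0):
    print(threshold, [classify(y, n, threshold) for y, n in scores])
```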

ROC curves
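An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the decision threshold is moved. The sketch below computes those points for a handful of hypothetical scores; the function name `roc_points` and the data are assumptions for illustration, and no plotting is attempted.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs,
    one per distinct threshold, from strictest to most lenient.

    scores: higher means 'more likely Y'; labels: True for an actual Y.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: everything predicted N
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical classifier scores and true classes.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(fpr, tpr)
```

Each point corresponds to one position of the line: the curve starts at (0, 0), where nothing is predicted Y, and ends at (1, 1), where everything is.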

David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Rule Induction

• Rules are useful when you want to learn a clear / interpretable classifier, and are less worried about squeezing out as much accuracy as possible

• There are a number of different ways to ‘learn’ rules or rulesets.

• Before we go there, what is a rule / ruleset?

IF Condition … Then Class Value is …

Rules are Rectangular

[Figure: each rule drawn as a rectangle over the data, on axes running from 0 to 12]

IF (X>0)&(X<5)&(Y>0.5)&(Y<5) THEN YES

IF (X>5)&(X<11)&(Y>4.5)&(Y<5.1) THEN NO
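To make the ‘rectangular’ point concrete, here is a small sketch that stores each rule as its four bounds plus a class and tests whether a point falls inside. The tuple layout and the `covers` helper are illustrative assumptions; the bounds are the ones from the two rules above.

```python
def covers(rule, x, y):
    """True if the point (x, y) lies strictly inside the rule's rectangle."""
    x_min, x_max, y_min, y_max, _label = rule
    return x_min < x < x_max and y_min < y < y_max


# (x_min, x_max, y_min, y_max, class): the two rules from the slides.
yes_rule = (0, 5, 0.5, 5, "YES")
no_rule = (5, 11, 4.5, 5.1, "NO")

print(covers(yes_rule, 2, 3))    # point inside the YES rectangle
print(covers(no_rule, 8, 4.8))   # point inside the NO rectangle
print(covers(no_rule, 8, 3))     # point below the NO rectangle
```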

A Ruleset

IF Condition1 … Then Class = A

IF Condition2 … Then Class = A

IF Condition3 … Then Class = B

IF Condition4 … Then Class = C

What’s wrong with this ruleset? (two things)

What about this ruleset?

Two ways to interpret a ruleset:

As a Decision List

IF Condition1 … Then Class = A

ELSE IF Condition2 … Then Class = A

ELSE IF Condition3 … Then Class = B

ELSE IF Condition4 … Then Class = C

ELSE … predict Background Majority Class
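The decision-list reading can be sketched as follows: rules are tried in order, the first one that fires decides the class, and the background majority class is the fallback. The rectangular rules, the point coordinates and the function names are hypothetical.

```python
def covers(rule, x, y):
    """True if (x, y) lies strictly inside the rule's rectangle."""
    x_min, x_max, y_min, y_max, _label = rule
    return x_min < x < x_max and y_min < y < y_max


def predict_decision_list(rules, x, y, default):
    """Return the class of the first rule that fires, else the default."""
    for rule in rules:
        if covers(rule, x, y):
            return rule[4]
    return default  # background majority class


# Hypothetical overlapping rules: (x_min, x_max, y_min, y_max, class).
rules = [
    (0, 5, 0.5, 5, "A"),
    (4, 8, 3, 9, "B"),
]

print(predict_decision_list(rules, 4.5, 4, default="C"))  # both rules cover this point
print(predict_decision_list(rules, 6, 6, default="C"))
print(predict_decision_list(rules, 11, 11, default="C"))
```

Because order matters, overlapping rules are not ambiguous under this reading: the earlier rule simply wins.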

As an unordered set

IF Condition1 … Then Class = A

IF Condition2 … Then Class = A

IF Condition3 … Then Class = B

IF Condition4 … Then Class = C

Check each rule and gather votes for each class

If no winner, predict background majority class
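A minimal sketch of the unordered-set reading: every rule that fires casts one vote for its class, and the background majority class is used when no rule fires or the vote is tied. Treating a tie as ‘no winner’ is an assumption here, as are the rules and the function name.

```python
from collections import Counter


def predict_by_vote(rules, x, y, default):
    """Gather one vote per firing rule; fall back to default if no clear winner."""
    votes = Counter(
        label
        for (x_min, x_max, y_min, y_max, label) in rules
        if x_min < x < x_max and y_min < y < y_max
    )
    ranked = votes.most_common(2)
    if not ranked:
        return default  # no rule fired
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return default  # tie between the top two classes (assumed 'no winner')
    return ranked[0][0]


# Hypothetical rules: (x_min, x_max, y_min, y_max, class).
rules = [
    (0, 5, 0.5, 5, "A"),
    (1, 6, 1, 6, "A"),
    (4, 8, 3, 9, "B"),
]

print(predict_by_vote(rules, 4.5, 4, default="C"))    # two votes for A, one for B
print(predict_by_vote(rules, 5.5, 5.5, default="C"))  # one vote each for A and B
print(predict_by_vote(rules, 11, 11, default="C"))    # no votes
```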

Three broad ways to learn rulesets

1. Just build a decision tree with ID3 (or something else) and you can translate the tree into rules!

2. Use any good search/optimisation algorithm. Evolutionary (genetic) algorithms are the most common. You will do this in coursework 3. This means simply guessing a ruleset at random, and then trying mutations and variants, gradually improving them over time.

3. A number of ‘old’ AI algorithms exist that still work well, and/or can be engineered to work with an evolutionary algorithm. The basic idea is: iterated coverage
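The second approach in the list above (guess a ruleset at random, then keep mutations that do not make it worse) can be sketched as a very simple hill-climbing loop. This is not the coursework program: the rule representation, the mutation operator, the toy data and every function name are assumptions, and a real evolutionary algorithm would normally keep a population rather than a single ruleset.

```python
import random


def covers(rule, x, y):
    x_min, x_max, y_min, y_max, _label = rule
    return x_min < x < x_max and y_min < y < y_max


def predict(ruleset, x, y, default):
    """Decision-list interpretation: first firing rule wins."""
    for rule in ruleset:
        if covers(rule, x, y):
            return rule[4]
    return default


def accuracy(ruleset, data, default):
    hits = sum(predict(ruleset, x, y, default) == label for x, y, label in data)
    return hits / len(data)


def random_rule(rng, classes):
    """Guess a rectangle at random inside the 0..12 square."""
    x1, x2 = sorted(rng.uniform(0, 12) for _ in range(2))
    y1, y2 = sorted(rng.uniform(0, 12) for _ in range(2))
    return (x1, x2, y1, y2, rng.choice(classes))


def mutate(ruleset, rng):
    """Copy the ruleset and nudge one bound of one randomly chosen rule."""
    child = [list(rule) for rule in ruleset]
    rule = rng.choice(child)
    rule[rng.randrange(4)] += rng.gauss(0, 1.0)
    rule[0], rule[1] = sorted(rule[0:2])  # keep x_min <= x_max
    rule[2], rule[3] = sorted(rule[2:4])  # keep y_min <= y_max
    return [tuple(r) for r in child]


def evolve(data, classes, default, n_rules=3, steps=500, seed=0):
    rng = random.Random(seed)
    best = [random_rule(rng, classes) for _ in range(n_rules)]
    best_acc = accuracy(best, data, default)
    for _ in range(steps):
        child = mutate(best, rng)
        child_acc = accuracy(child, data, default)
        if child_acc >= best_acc:  # keep the variant if it is no worse
            best, best_acc = child, child_acc
    return best, best_acc


# Hypothetical training set: (x, y, class).
data = [(1, 1, "YES"), (2, 3, "YES"), (3, 2, "YES"), (4, 4, "YES"),
        (7, 8, "NO"), (8, 6, "NO"), (9, 9, "NO"), (10, 7, "NO")]
best, best_acc = evolve(data, ["YES", "NO"], default="NO")
print(best_acc)
```

Accepting variants that are equally good, not only strictly better ones, lets the search drift across flat regions where a single nudge does not change any prediction.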

[Figure sequence: iterated coverage shown step by step on the same data]

Take each class in turn ..

Pick a random member of that class in the training set

Extend it as much as possible without including another class

Next class

And so on…
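The steps above can be sketched in code as follows. This is one possible reading of iterated coverage, not a specific published algorithm: the greedy way of growing a rectangle, the function names and the toy data are all assumptions.

```python
import random


def in_box(box, x, y):
    x_min, x_max, y_min, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grow_box(start, data, label):
    """Stretch a box from the start point over same-class points, rejecting
    any stretch that would take in a point of another class."""
    box = (start[0], start[0], start[1], start[1])
    for x, y, lab in data:
        if lab != label:
            continue
        trial = (min(box[0], x), max(box[1], x), min(box[2], y), max(box[3], y))
        if not any(in_box(trial, ox, oy) for ox, oy, olab in data if olab != label):
            box = trial
    return box


def iterated_coverage(data, seed=0):
    rng = random.Random(seed)
    uncovered = list(data)
    rules = []
    while uncovered:
        # Take each class that still has uncovered members in turn.
        for label in sorted({lab for _, _, lab in uncovered}):
            members = [p for p in uncovered if p[2] == label]
            start = rng.choice(members)  # pick a random member of that class
            box = grow_box(start, data, label)
            rules.append((box, label))
            uncovered = [p for p in uncovered
                         if not (p[2] == label and in_box(box, p[0], p[1]))]
    return rules


# Hypothetical training set: (x, y, class).
data = [(1, 1, "A"), (2, 2, "A"), (1.5, 3, "A"), (10, 1, "A"),
        (8, 8, "B"), (9, 7, "B"), (5, 5, "B")]
rules = iterated_coverage(data)
for box, label in rules:
    print(box, label)
```

Every training point ends up inside at least one rectangle of its own class, and no rectangle contains a training point of another class; how well the rectangles generalise to unseen data is a separate question.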

CW3

• Run experiments with a program that evolves a ruleset

• Try different sizes of training and test set

• Observe ‘overfitting’ and report
