Introduction to Machine Learning & Classification

Machine Learning, Chris Sharkey (@shark2900)

Transcript of Introduction to Machine Learning & Classification

Page 1: Introduction to Machine Learning & Classification


Page 2: Introduction to Machine Learning & Classification

What do you think of when we say machine learning?

Page 3: Introduction to Machine Learning & Classification
Page 4: Introduction to Machine Learning & Classification

Big words:
• Hadoop
• Terabyte
• Petabyte
• NoSQL
• Data Science
• D3
• Visualization
• Machine learning

Page 5: Introduction to Machine Learning & Classification

What is machine learning?

Page 6: Introduction to Machine Learning & Classification

“Predictive or descriptive modeling which learns from past experience or data to build models which can predict the future”

Page 7: Introduction to Machine Learning & Classification

Past data (known outcome) → machine learning → model

New data (unknown outcome) → model → predicted outcome

Page 8: Introduction to Machine Learning & Classification

Will John play golf?

Date    | Weather | Temperature | Sally going? | Did John golf?
Sept 1  | Sunny   | 92 °F       | Yes          | Yes
Sept 2  | Cloudy  | 84 °F       | No           | No
Sept 3  | Raining | 84 °F       | No           | Yes
Sept 4  | Sunny   | 95 °F       | Yes          | Yes

Date    | Weather | Temperature | Sally going? | Will John golf?
Sept 5  | Cloudy  | 87 °F       | No           | ?

We want a model based on John’s past behavior to predict what he will do in the future. Can we use ML?
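The talk shows no code, but the toy table could be encoded as plain data, splitting the known attributes from the outcome we want to predict (a sketch; the variable names are illustrative, not from the talk):

```python
# Hypothetical encoding of the slide's toy data set.
past_data = [
    # (weather, temperature_F, sally_going) -> did_john_golf
    ("Sunny",   92, "Yes", "Yes"),
    ("Cloudy",  84, "No",  "No"),
    ("Raining", 84, "No",  "Yes"),
    ("Sunny",   95, "Yes", "Yes"),
]
new_instance = ("Cloudy", 87, "No")  # Sept 5: outcome unknown

# Split known data into attributes (X) and the class to predict (y)
X = [row[:3] for row in past_data]
y = [row[3] for row in past_data]
print(y)  # the known outcomes a model would learn from
```

A model built from X and y would then be asked to classify `new_instance`.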

Page 9: Introduction to Machine Learning & Classification

Yes. This is a classification problem

Page 10: Introduction to Machine Learning & Classification

• ZeroR: establishes a baseline
• Naïve Bayes: probabilistic model
• OneR: single rule
• J4.5 / C4.5: decision tree

Page 11: Introduction to Machine Learning & Classification

Upgrade our example

Attributes: age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, potassium, blood glucose, blood urea, serum creatinine, sodium, hemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease (heart disease), appetite, pedal edema, anemia, stage

Data Set
• 319 instances or people
• 25 attributes or variables

Machine Learning
• ZeroR
• OneR
• Naïve Bayes
• J4.5 / C4.5

Model: blood test data for new individuals with unknown disease status

Goal: predict whether an individual has CKD and, if so, the stage of their disease

Page 12: Introduction to Machine Learning & Classification

ZeroR

Past data (known outcome) → new instance → classified

• Build a frequency table of the classes
• Choose the 'most popular', i.e. most frequent, class
• Classify all new data as that most popular class
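ZeroR is simple enough to sketch in a few lines of Python (a minimal sketch with invented names; the talk itself most likely used an existing toolkit such as Weka):

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: ignore every attribute and always predict the most
    frequent class seen in the past data."""
    return Counter(train_labels).most_common(1)[0][0]

# Toy labels: what John did on four past days
baseline_class = zero_r(["Yes", "No", "Yes", "Yes"])
print(baseline_class)  # every new instance gets classified as "Yes"
```

Whatever a smarter model achieves should be compared against this majority-class guess.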

Page 13: Introduction to Machine Learning & Classification

How did ZeroR do?
• Correctly classified 28.2% of the time
• Rule: always guess that a new instance (person) has stage 3 kidney disease
• The 28.2% correct classification rate is our baseline
• Correct classification rates above 28.2% are better than guessing

Page 14: Introduction to Machine Learning & Classification

OneR

Past data (known outcome) → new instance → classified

• Build a frequency table for each attribute; this generates a rule for each value of each attribute
• Choose the attribute whose rule has the highest correct classification rate
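The two steps above can be sketched directly (a hedged, stdlib-only sketch; function and variable names are my own, not from the talk):

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """OneR sketch: for each attribute, build a frequency table and a
    value -> majority-class rule; keep the attribute whose rule gets
    the most training instances right."""
    best_attr, best_rule, best_correct = None, None, -1
    for a in range(len(rows[0])):
        table = defaultdict(Counter)          # attribute value -> class counts
        for row, label in zip(rows, labels):
            table[row[a]][label] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in table.items()}
        correct = sum(counts[rule[v]] for v, counts in table.items())
        if correct > best_correct:
            best_attr, best_rule, best_correct = a, rule, correct
    return best_attr, best_rule, best_correct

# Toy data: (weather, sally_going) -> did John golf
rows = [("Sunny", "Yes"), ("Cloudy", "No"), ("Raining", "No"), ("Sunny", "Yes")]
labels = ["Yes", "No", "Yes", "Yes"]
attr, rule, correct = one_r(rows, labels)
print(attr, rule, correct)  # weather (attribute 0) wins with 4/4 correct
```

Note the whole classifier is just one lookup table over a single attribute, which is why a high OneR score signals one highly influential variable.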

Page 15: Introduction to Machine Learning & Classification

How did OneR do?
• Correctly classified 80.2% of the time
• Rule based on serum creatinine:
  • < 0.85: healthy
  • < 1.15: stage 2
  • < 2.25: stage 3
  • ≥ 2.25: stage 5
• A single rule is created and is responsible for all classification
• A high classification rate indicates that a single attribute has high influence in predicting the class

Page 16: Introduction to Machine Learning & Classification

Naïve Bayes

Past data (known outcome) → new instance → classified

• Build a frequency table for each attribute
• Determine probabilities for the values of each attribute
• Determine conditional probabilities for the values of each attribute
• For each class, multiply the conditional probabilities of the new instance's attribute values by the probability of the class
• Choose the most probable class
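The recipe above can be sketched in plain Python (an illustrative sketch, not the talk's implementation; note it omits the smoothing a real implementation would add for unseen values):

```python
from collections import Counter, defaultdict

def naive_bayes(rows, labels, new_row):
    """Naive Bayes sketch: class prior times the product of conditional
    probabilities P(value | class); pick the most probable class.
    (No smoothing: unseen value/class pairs zero out a class's score.)"""
    priors = Counter(labels)
    n = len(labels)
    cond = defaultdict(lambda: defaultdict(Counter))  # attr -> class -> value counts
    for row, label in zip(rows, labels):
        for a, v in enumerate(row):
            cond[a][label][v] += 1
    scores = {}
    for c, c_count in priors.items():
        p = c_count / n                      # prior probability of the class
        for a, v in enumerate(new_row):
            p *= cond[a][c][v] / c_count     # conditional probability of the value
        scores[c] = p
    return max(scores, key=scores.get)

# Toy data: (weather, sally_going) -> did John golf
rows = [("Sunny", "Yes"), ("Cloudy", "No"), ("Raining", "No"), ("Sunny", "Yes")]
labels = ["Yes", "No", "Yes", "Yes"]
print(naive_bayes(rows, labels, ("Cloudy", "No")))
```

Because every attribute contributes one factor to the product, no single attribute can dominate the way it does in OneR.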

Page 17: Introduction to Machine Learning & Classification

How did Naïve Bayes do?
• Correctly classified 56.6% of the time
• The conditional and overall probabilities together constitute the rule
• A high classification rate indicates the attributes have roughly equal influence
• No iterative process, so it is faster on larger data sets

Page 18: Introduction to Machine Learning & Classification

J4.5 / C4.5

Past data (known outcome) → new instance → classified

• Top-down recursive algorithm that determines splitting points based on information gain
• Follow the decision tree to a leaf, which gives the class
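The splitting criterion can be sketched as follows: information gain is the entropy of the class labels before a split minus the weighted entropy of each branch after it (a sketch of the criterion only, not a full tree builder; names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on one attribute: entropy before
    the split minus the weighted entropy of each resulting branch.
    C4.5-style trees choose each split this way, top down."""
    n = len(labels)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr], []).append(label)
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - remainder

# Toy data: (weather, sally_going) -> did John golf
rows = [("Sunny", "Yes"), ("Cloudy", "No"), ("Raining", "No"), ("Sunny", "Yes")]
labels = ["Yes", "No", "Yes", "Yes"]
print(info_gain(rows, labels, 0))  # weather splits the toy classes perfectly
```

The algorithm picks the highest-gain attribute at the root, then recurses on each branch until the leaves are (nearly) pure.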

Page 19: Introduction to Machine Learning & Classification
Page 20: Introduction to Machine Learning & Classification

How did J4.5 do?
• Correctly classified 88.4% of the time
• A decision tree is generated
• Balances the discrimination of OneR against the fairness of Naïve Bayes
• Decision trees are popular, intuitive, easy to create and easy to interpret
• People like decision trees; they tell a nice story

Page 21: Introduction to Machine Learning & Classification

ZeroR
• Correct classification rate: 28.2%
• Established baseline accuracy
• Always guess stage 3 CKD

Naïve Bayes
• Correct classification rate: 56.6%
• Uses overall probabilities to pick the most probable class

OneR
• Correct classification rate: 80.2%
• Serum creatinine:
  • < 0.85: healthy
  • < 1.15: stage 2
  • < 2.25: stage 3
  • ≥ 2.25: stage 5

J4.5 / C4.5
• Correct classification rate: 88.4%

Page 22: Introduction to Machine Learning & Classification

Does this make sense?

Page 23: Introduction to Machine Learning & Classification

Other important concepts in machine learning.

Page 24: Introduction to Machine Learning & Classification

Cross Validation
• Split the data into ten slices
• Hold out one slice and build the model on the other nine
• Test on the 'held out' slice
• Hold out a different slice, build the model on the now other nine slices, and test on the new 'held out' slice
• Repeat until every slice has been held out once
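The hold-out rotation above can be sketched as an index generator (a stdlib-only sketch; real toolkits such as scikit-learn or Weka provide this built in):

```python
def k_fold_splits(n_items, k=10):
    """k-fold cross-validation sketch: yield (train, test) index lists
    so each slice is held out exactly once while the model is built
    on the remaining k - 1 slices."""
    folds = [list(range(n_items))[i::k] for i in range(k)]
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

splits = list(k_fold_splits(20, k=10))
print(len(splits))  # 10 train/test pairs, each testing on a different slice
```

Averaging the correct classification rate over the ten held-out slices gives a fairer estimate than testing on the training data itself.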

Page 25: Introduction to Machine Learning & Classification

Overfitting
• A classification rule that is 'overfit' is so specific to the training data set that it does not generalize to the broader population
• Limiting the complexity of rules can help prevent overfitting
• Large, representative data sets can help fight overfitting
• A persistent problem in machine learning
• You must be a suspicious data scientist

Page 26: Introduction to Machine Learning & Classification

Questions?