Introduction to Machine Learning & Classification

Machine Learning
Chris Sharkey (@shark2900)

What do you think of when we say machine learning?

Big words:
• Hadoop
• Terabyte
• Petabyte
• NoSQL
• Data Science
• D3
• Visualization
• Machine learning

What is machine learning?

“Predictive or descriptive modeling which learns from past experience or data to build models which can predict the future”

Past data (known outcome) → Machine Learning → Model
Model + new data (unknown outcome) → Predicted outcome

Will John play golf?

Date     Weather   Temperature   Sally going?   Did John golf?
Sept 1   Sunny     92° F         Yes            Yes
Sept 2   Cloudy    84° F         No             No
Sept 3   Raining   84° F         No             Yes
Sept 4   Sunny     95° F         Yes            Yes

Date     Weather   Temperature   Sally going?   Will John golf?
Sept 5   Cloudy    87° F         No             ?

We want a model based on John’s past behavior to predict what he will do in the future. Can we use ML?

Yes. This is a classification problem.
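The slides show no code, but as a minimal sketch (assuming pandas and scikit-learn, which the slides do not mention, with a decision tree standing in for the classifiers introduced below), John's golf table becomes a classification problem like this:

```python
# A minimal sketch of John's golf problem as a classification task.
# pandas and scikit-learn are assumed libraries, not part of the slides.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Past data (known outcome)
past = pd.DataFrame({
    "weather":     ["Sunny", "Cloudy", "Raining", "Sunny"],
    "temperature": [92, 84, 84, 95],
    "sally_going": ["Yes", "No", "No", "Yes"],
    "golfed":      ["Yes", "No", "Yes", "Yes"],
})

# New data (unknown outcome): Sept 5
new = pd.DataFrame({
    "weather": ["Cloudy"], "temperature": [87], "sally_going": ["No"],
})

# One-hot encode the categorical attributes so the model can use them
X = pd.get_dummies(past.drop(columns="golfed"))
y = past["golfed"]
X_new = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X, y)   # learn from past experience
print(model.predict(X_new))                  # predicted outcome for Sept 5
```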

• ZeroR – establishes a baseline
• Naïve Bayes – probabilistic model
• OneR – single rule
• J4.5 / C4.5 – decision tree

Upgrade our example

Attributes: age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, potassium, blood glucose, blood urea, serum creatinine, sodium, hemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, anemia, stage

Data Set
• 319 instances or people
• 25 attributes or variables

Machine Learning
• ZeroR
• OneR
• Naïve Bayes
• J4.5 / C4.5

Model + blood test data for new individuals with unknown disease status → predict whether the individual has CKD and, if so, the stage of their disease

ZeroR
• Build a frequency table from the past data (known outcome)
• Choose the 'most popular', i.e. most frequent, class
• Classify every new instance as that class

How did ZeroR do?
• Correctly classified 28.2% of the time
• Rule: always guess that a new instance (person) has stage 3 kidney disease
• The 28.2% correct classification rate is our baseline
• Correct classification rates above 28.2% are better than guessing
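As a rough illustration (not part of the slides), ZeroR fits in a few lines of Python: it ignores every attribute and always predicts the most frequent class from the training labels. The labels below are illustrative.

```python
# A minimal sketch of ZeroR: predict the majority class, ignoring attributes.
from collections import Counter

def zero_r(train_labels):
    """Return a classifier that always predicts the most frequent class."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    return lambda instance: majority_class

classify = zero_r(["stage 3", "stage 3", "stage 2", "healthy", "stage 3"])
print(classify({"serum creatinine": 1.4}))   # -> "stage 3", regardless of input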

OneR
• Build a frequency table for each attribute; this generates a rule for each value of each attribute
• Choose the attribute whose rule has the highest correct classification rate
• Classify new instances with that single rule

How did OneR do?
• Correctly classified 80.2% of the time
• Rule based on serum creatinine:
  • < 0.85 is healthy
  • < 1.15 is stage 2
  • < 2.25 is stage 3
  • >= 2.25 is stage 5
• A single rule is created and is responsible for all classification
• A high classification rate indicates a single attribute has high influence in predicting the class
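A minimal Python sketch of the OneR idea, assuming the numeric attributes (such as serum creatinine) have already been discretized into bins; the rows and attribute names below are illustrative, not the CKD data set.

```python
# A minimal sketch of OneR for categorical (already-discretized) attributes.
from collections import Counter, defaultdict

def one_r(rows, attributes, label):
    best = None
    for attr in attributes:
        # frequency table: attribute value -> class counts
        table = defaultdict(Counter)
        for row in rows:
            table[row[attr]][row[label]] += 1
        # rule: each value predicts its most frequent class
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        correct = sum(counts[rule[value]] for value, counts in table.items())
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    return best[0], best[1]   # attribute with the highest correct rate, and its rule

rows = [
    {"creatinine_bin": "<0.85",  "appetite": "good", "stage": "healthy"},
    {"creatinine_bin": "<1.15",  "appetite": "good", "stage": "stage 2"},
    {"creatinine_bin": "<2.25",  "appetite": "poor", "stage": "stage 3"},
    {"creatinine_bin": ">=2.25", "appetite": "poor", "stage": "stage 5"},
]
attr, rule = one_r(rows, ["creatinine_bin", "appetite"], "stage")
print(attr, rule)   # a single rule based on the best single attribute
```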

Naïve Bayes
• Build a frequency table for each attribute
• Determine probabilities for the values of each attribute
• Determine conditional probabilities for the values of each attribute
• For a new instance: for each attribute, multiply the conditional probability of its value with the probability of the value; multiply all of the calculated probabilities together; choose the most probable class

How did Naïve Bayes do? 
• Correctly classified 56.6% of the time 
• The conditional and overall probabilities constitute the rule 
• A high classification rate indicates the attributes have more equal influence 
• No iterative process, so it is faster on larger data sets
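A minimal Python sketch of Naïve Bayes on categorical attributes, following the frequency-table recipe above; the rows and attribute names are illustrative (not the CKD data set) and no smoothing is applied.

```python
# A minimal sketch of Naïve Bayes: frequency tables -> probabilities ->
# multiply -> choose the most probable class.
from collections import Counter, defaultdict

rows = [
    {"hypertension": "yes", "anemia": "no",  "stage": "stage 3"},
    {"hypertension": "yes", "anemia": "yes", "stage": "stage 5"},
    {"hypertension": "no",  "anemia": "no",  "stage": "healthy"},
    {"hypertension": "no",  "anemia": "no",  "stage": "healthy"},
]
attributes, label = ["hypertension", "anemia"], "stage"

class_counts = Counter(row[label] for row in rows)
# conditional frequency tables: (attribute, value) -> class counts
cond = defaultdict(Counter)
for row in rows:
    for attr in attributes:
        cond[(attr, row[attr])][row[label]] += 1

def classify(instance):
    scores = {}
    for cls, count in class_counts.items():
        p = count / len(rows)                                   # class probability
        for attr in attributes:
            p *= cond[(attr, instance[attr])][cls] / count      # conditional probability
        scores[cls] = p
    return max(scores, key=scores.get)                          # most probable class

print(classify({"hypertension": "yes", "anemia": "no"}))
```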

J4.5 / C4.5
• Top-down recursive algorithm that determines splitting points based on information gain
• A new instance is classified by following the decision tree to a leaf, i.e. a class

How did J4.5 do?
• Correctly classified 88.4% of the time
• A decision tree is generated
• A balance between the discrimination of OneR and the fairness of Naïve Bayes
• Decision trees are popular, intuitive, easy to create and easy to interpret
• People like decision trees. They tell a nice story
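A minimal scikit-learn sketch of a decision tree; note this is CART with an entropy (information gain) criterion, not Weka's exact J48/C4.5 implementation, and the data and feature names are illustrative.

```python
# A minimal decision-tree sketch with an information-gain style criterion.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1.3, 10.1], [0.7, 14.2], [2.9, 9.0], [4.1, 8.3]]   # serum creatinine, hemoglobin
y = ["stage 2", "healthy", "stage 3", "stage 5"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Decision trees tell a nice story: print the learned splits
print(export_text(tree, feature_names=["serum_creatinine", "hemoglobin"]))
print(tree.predict([[0.6, 13.8]]))   # follow the tree to a leaf (class)
```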

ZeroR
• Correct classification rate – 28.2%
• Established the baseline accuracy
• Always guess stage 3 CKD

Naïve Bayes
• Correct classification rate – 56.6%
• Uses the overall and conditional probabilities to pick the most probable class

OneR
• Correct classification rate – 80.2%
• Single rule on serum creatinine:
  • < 0.85 – healthy
  • < 1.15 – stage 2
  • < 2.25 – stage 3
  • >= 2.25 – stage 5

J4.5 / C4.5
• Correct classification rate – 88.4%

Does this make sense?

Other important concepts in machine learning.

Cross Validation
• Split the data into ten slices
• Hold out one slice and build the model on the other nine slices
• Test on the 'held out' slice
• Hold out a different slice, build the model on the remaining nine slices, and test on the new 'held out' slice
• Repeat until every slice has been held out once
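A minimal sketch of ten-fold cross-validation with scikit-learn (an assumed library); the iris data set stands in here for the CKD data.

```python
# A minimal 10-fold cross-validation sketch.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris      # placeholder data set

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)

# Each score is the accuracy on one held-out slice; average them for the
# overall correct classification rate.
print(scores.mean())
```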

Overfitting
• A classification rule that is 'over fit' is so specific to the training data set that it does not generalize to the broader population
• Limiting the complexity of rules can help prevent overfitting (as sketched below)
• Large, representative data sets can help fight overfitting
• Overfitting is a persistent problem in machine learning; you must be a suspicious data scientist
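A minimal sketch of limiting rule complexity to fight overfitting: compare an unrestricted decision tree with a depth-capped one on a held-out test set. The data set here is a scikit-learn placeholder, not the CKD data.

```python
# A minimal overfitting sketch: deep tree vs depth-capped tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The deep tree memorizes the training slice; the depth-capped tree usually
# generalizes better to unseen instances.
print("deep:    train", deep.score(X_train, y_train), "test", deep.score(X_test, y_test))
print("shallow: train", shallow.score(X_train, y_train), "test", shallow.score(X_test, y_test))
```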

Questions?