START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.


Transcript of START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Page 1: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

START OF DAY 2 Reading: Chap. 6 & 12

Page 2: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Decision Tree Learning

Page 3: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Decision Tree

• Internal nodes: tests on some property
• Branches from internal nodes: values of the associated property
• Leaf nodes: classifications
• An individual is classified by traversing the tree from its root to a leaf

Page 4: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Sample Decision Tree

Is Your Health at Risk?

Page 5: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Decision Tree Learning

• Learning consists of constructing a decision tree that allows the classification of objects
– A test on attribute A partitions a set of instances into {C1, C2, ..., C|A|}
– Start with the training set and find a good A for the root
– Continue recursively until subsets are unambiguously classified, you run out of attributes, or some stopping criterion is met

Page 6: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3

• Function Induce-Tree(Example-set, Properties)
– If all elements in Example-set are in the same class, then return a leaf node labeled with that class
– Else if Properties is empty, then return a leaf node labeled with the majority class in Example-set
– Else
• Select P from Properties (*)
• Remove P from Properties
• Make P the root of the current tree
• For each value V of P
– Create a branch of the current tree labeled by V
– Partition_V = elements of Example-set with value V for P
– Induce-Tree(Partition_V, Properties)
– Attach the result to branch V
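A minimal Python sketch of Induce-Tree (not code from the reading; the dict-based example representation and the select_property argument, which stands for the starred (*) selection step, are assumptions):

```python
# Minimal ID3 sketch: each example is assumed to be a dict of property -> value
# plus a "class" key; select_property implements the starred (*) step.
from collections import Counter

def induce_tree(example_set, properties, select_property):
    classes = [e["class"] for e in example_set]
    if len(set(classes)) == 1:                        # all in the same class
        return classes[0]
    if not properties:                                # no properties left
        return Counter(classes).most_common(1)[0][0]  # majority class
    p = select_property(example_set, properties)      # (*) e.g., max info gain
    rest = [q for q in properties if q != p]
    tree = {p: {}}
    for v in set(e[p] for e in example_set):          # one branch per value V
        partition_v = [e for e in example_set if e[p] == v]
        tree[p][v] = induce_tree(partition_v, rest, select_property)
    return tree
```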

Page 7: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Illustrative Training Set

Page 8: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3 Example (I)

Page 9: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3 Example (II)

Page 10: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3 Example (III)

Page 11: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Another Example

• Assume A1 is a nominal binary feature (Gender: M/F)

• Assume A2 is a nominal 3-valued feature (Color: R/G/B)

[Figure: two decision trees over A1 and A2: one splits on A1 (M/F) first with A2 (R/G/B) below it, the other splits on A2 first with A1 below it.]

Decision surfaces are axis-aligned hyper-rectangles

Page 12: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Non-Uniqueness

• Decision trees are not unique:
– Given a set of training instances T, there generally exists a number of decision trees that are consistent with T

• The learning problem states that we should seek not only consistency but also generalization. So, …

Page 13: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s Question

Given a training set, which of all of the decision trees consistent with that training set has the greatest likelihood of correctly classifying unseen instances of the population?

Page 14: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s (Approximate) Bias

• ID3 (and family) prefers the simplest decision tree that is consistent with the training set.

• Occam’s Razor Principle:
– “It is vain to do with more what can be done with less... Entities should not be multiplied beyond necessity.”
– i.e., always accept the simplest answer that fits the data / avoid unnecessary constraints.

Page 15: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s Property Selection

• Each property of an instance may be thought of as contributing a certain amount of information to its classification.
– Think Twenty Questions: what are good questions? Ones that, when asked, decrease the information remaining.
– For example, to determine the shape of an object: the number of sides contributes a certain amount of information to the goal; color contributes a different amount of information.

• ID3 measures the information gained by making each property the root of the current subtree, and subsequently chooses the property that produces the greatest information gain.

Page 16: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Entropy (I)

• Let S be a set of examples from c classes. The information (entropy) of S is:

Info(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the proportion of examples of S belonging to class i. (Note: we define 0 log 0 = 0.)

[Figure: plot of Info as a function of p; it is 0 at p = 0 and p = 1 and reaches its maximum, log2(|c|), in between.]
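A small Python sketch of this definition (the label lists are illustrative, not data from the slides):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(info(["+", "+", "-", "-"]))   # 1.0 bit (p1 = p2 = 0.5)
print(info(["+", "+", "+", "+"]))   # 0.0 (pure set)
```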

Page 17: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Entropy (II)

• Intuitively, the smaller the entropy, the purer the partition

• Based on Shannon’s information theory (c=2):
– If p1=1 (resp. p2=1), then the receiver knows the example is positive (resp. negative). No message need be sent.
– If p1=p2=0.5, then the receiver needs to be told the class of the example. A 1-bit message must be sent.
– If 0<p1<1, then the receiver needs less than 1 bit on average to know the class of the example.

Page 18: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Information Gain

• Let p be a property with n outcomes
• The information gained by partitioning a set S according to p is:

Gain(S, p) = Info(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} Info(S_i)

where S_i is the subset of S for which property p has its ith value
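A sketch of Gain using the info() function above and the same dict-of-properties example format assumed in the earlier ID3 sketch:

```python
def gain(example_set, p):
    labels = [e["class"] for e in example_set]
    n = len(example_set)
    remainder = 0.0
    for v in set(e[p] for e in example_set):          # partition S by value of p
        s_i = [e["class"] for e in example_set if e[p] == v]
        remainder += (len(s_i) / n) * info(s_i)
    return info(labels) - remainder

# This could serve as the starred selection step in the ID3 sketch:
# select_property = lambda ex, props: max(props, key=lambda p: gain(ex, p))
```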

Page 19: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s Splitting Criterion

• The objective of ID3 at each split is to increase information gain or, equivalently, to lower entropy. It does so as much as possible
– Pros: Easy to do
– Cons: May lead to overfitting

Page 20: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Practice Exercises

OUTLOOK   TEMPERATURE  HUMIDITY  WIND    PLAY TENNIS
Overcast  Hot          High      Weak    Yes
Overcast  Hot          Normal    Weak    Yes
Sunny     Hot          High      Weak    No
Sunny     Mild         Normal    Strong  Yes
Rain      Cool         Normal    Strong  No
Sunny     Mild         High      Weak    No

What is the ID3 induced tree?

Page 21: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Overfitting

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances

Page 22: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Avoiding Overfitting

• Two alternatives for decision trees:
– Stop growing the tree before it begins to overfit (e.g., when the data split is not statistically significant)
– Grow the tree to full (overfitting) size and post-prune it

• Either way, when do I stop? What is the correct final tree size?

Page 23: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Approaches

• Use only training data and a statistical test to estimate whether expanding/pruning is likely to produce an improvement beyond the training set

• Use MDL to minimize size(tree) + size(misclassifications(tree))

• Use a separate validation set to evaluate utility of pruning

• Use richer node conditions and accuracy

Page 24: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reduced Error Pruning

• Split the dataset into training and validation sets
• Induce a full tree from the training set
• While the accuracy on the validation set increases:
– Evaluate the impact of pruning each subtree, replacing its root by a leaf labeled with the majority class for that subtree
– Remove the subtree that most increases validation set accuracy (greedy approach)

Page 25: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Rule Post-pruning

• Split the dataset into training and validation sets
• Induce a full tree from the training set
• Convert the tree into an equivalent set of rules
• For each rule:
– Remove any preconditions that result in increased rule accuracy on the validation set
• Sort the rules by estimated accuracy
• Classify new examples using the new ordered set of rules

Page 26: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discussion

• Reduced-error pruning produces the smallest version of the most accurate subtree

• Rule post-pruning is more fine-grained and possibly the most used method

• In all cases, pruning based on a validation set is problematic when the amount of available data is limited

Page 27: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Accuracy vs. Entropy (I)

• ID3 uses entropy to build the tree and accuracy to prune it

• Why not use accuracy in the first place?
– How?

Page 28: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Practice Exercises

OUTLOOK   TEMPERATURE  HUMIDITY  WIND    PLAY TENNIS
Overcast  Hot          High      Weak    Yes
Overcast  Hot          Normal    Weak    Yes
Sunny     Hot          High      Weak    No
Sunny     Mild         Normal    Strong  Yes
Rain      Cool         Normal    Strong  No
Sunny     Mild         High      Weak    No

What is the induced tree with accuracy split?

Page 29: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Accuracy vs. Entropy (II)

• How does accuracy compare with entropy?
– What does the accuracy function look like?
– What does it take for a split to cause accuracy to increase?

• Is there a way to make it work?
– See Programming Assignment

Page 30: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discussion (I)

• In terms of learning as search, ID3 works as follows:
– Search space = set of all possible decision trees
– Operations = adding tests to a tree
– Form of hill-climbing: ID3 adds a subtree to the current tree and continues its search (no backtracking, hence local minima)

• It follows that ID3 is very efficient, but its performance depends on the criteria for selecting properties to test (and their form)

Page 31: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discussion (II)

• ID3 handles only discrete attributes. Extensions to numerical attributes have been proposed, the most famous being C4.5 (and its commercial successor, C5.0)

• Experience shows that TDIDT learners tend to produce very good results on many problems

• Trees are most attractive when end users want interpretable knowledge from their data

Page 32: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Data Preparation

Page 33: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Accuracy Depends on the Data

• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• Who are the data experts?

• How much data is available for the task?
– How many instances?
– How many attributes?
– How many targets?

• What is the quality of the data?
– Noise
– Missing values
– Skew

Page 34: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Data Types

• Continuous
• Categorical/Symbolic
– Nominal: no natural ordering
– Ordinal: natural ordering
• Special cases: Time, Date, Addresses, Names, IDs, etc.

Page 35: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Type Conversion

• Some tools can deal with nominal values internally; other methods (neural nets, regression, nearest neighbors) require, or fare better with, numeric inputs

• Some methods require discrete values (most versions of Naïve Bayes)

• Different encodings likely to produce different results

• We only show some of these conversions here

Page 36: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Conversion: Binary to Numeric

• Allows binary attribute to be coded as a number

• Example: attribute “gender”
– Original data: gender = {M, F}
– Transformed data: genderN = {0, 1}
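A one-line version of this conversion with pandas (the tiny DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "F", "M"]})
df["genderN"] = df["gender"].map({"M": 0, "F": 1})   # binary -> numeric
print(df)
```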

Page 37: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Conversion: Ordinal to Boolean

• Allows ordinal attribute with n values to be coded using n–1 boolean attributes

• Example: attribute “temperature”

Original data        Transformed data
Temperature          Temperature > cold   Temperature > medium
Cold                 False                False
Medium               True                 False
Hot                  True                 True
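A sketch of the n-1 boolean ("thermometer") encoding for the temperature example; the helper function name is hypothetical:

```python
ORDER = ["Cold", "Medium", "Hot"]          # natural ordering of the values

def ordinal_to_boolean(value):
    rank = ORDER.index(value)
    # one boolean per threshold: value > Cold, value > Medium
    return {f"> {t}": rank > i for i, t in enumerate(ORDER[:-1])}

for v in ORDER:
    print(v, ordinal_to_boolean(v))   # matches the table above
```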

Page 38: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Conversion: Ordinal to Numeric

• Allows ordinal attribute to be coded as a number, preserving natural order

• Example: attribute “grade”
– Original data: grade = {A, A-, B+, …}
– Transformed data: GPA = {4.0, 3.7, 3.3, …}

• Why preserve natural order?
– To allow comparisons, e.g., grade > 3.5

Page 39: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discretization: Equal-Width

• May produce clumping if data is skewed

[64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

2 2

Count

42 2 20

Page 40: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discretization: Equal-Height

• Gives more intuitive breakpoints– don’t split frequent values across bins– create separate bins for special values (e.g., 0)

[64 .. .. .. .. 69] [70 .. 72] [73 .. .. .. .. .. .. .. .. 81] [83 .. 85]

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

4

Count

4 4

2
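A sketch of both discretizations with pandas on the temperature values above; pd.cut and pd.qcut pick interval edges and closure slightly differently from the hand-drawn bins, so the counts may not match the slides exactly:

```python
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

equal_width = pd.cut(temps, bins=7)     # 7 intervals of equal width
equal_height = pd.qcut(temps, q=4)      # 4 bins with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_height.value_counts().sort_index())
```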

Page 41: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discretization: Class-dependent

• Eibe: a minimum of 3 values per bucket, over the value range 64 to 85

Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:       Yes No  Yes Yes Yes No  No  No  Yes Yes No  Yes Yes No

Page 42: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Other Transformations

• Standardization
– Transform values into the number of standard deviations from the mean
– New value = (current value - average) / standard deviation

• Normalization
– All values are made to fall within a certain range
– Typically: new value = (current value - min value) / range

• Neither one affects ordering!
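A quick numpy check of both transformations, and of the fact that neither affects ordering; the values reuse the temperature example:

```python
import numpy as np

x = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85], float)

standardized = (x - x.mean()) / x.std(ddof=1)      # (value - average) / std dev
normalized = (x - x.min()) / (x.max() - x.min())   # (value - min) / range

# Ordering is preserved by both transformations.
assert (np.argsort(standardized, kind="stable") == np.argsort(x, kind="stable")).all()
assert (np.argsort(normalized, kind="stable") == np.argsort(x, kind="stable")).all()
```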

Page 43: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Missing Values (I)

• Types: unknown, unrecorded, irrelevant
– malfunctioning equipment
– changes in experimental design
– collation of different datasets
– measurement not possible

Name  Age  Sex  Pregnant
Mary  25   F    N
Jane  27   F    -
Joe   30   M    -
Anna  2    F    -

In this medical data, the value of the Pregnant attribute for Jane is missing, while for Joe or Anna it should be considered Not Applicable

Page 44: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Missing Values (II)

• Handling methods:
– Remove records with missing values
– Treat as a separate value
– Treat as don’t know
– Treat as don’t care
– Use imputation techniques
• Mode, Median, Average
• Regression
– Danger: BIAS
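A sketch of simple imputation with pandas (the toy table is made up); filling with the mean or mode is exactly where the bias warning above applies:

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, 27.0, None, 2.0],
                   "sex": ["F", "F", "M", None]})

df["age"] = df["age"].fillna(df["age"].mean())            # mean imputation
df["sex"] = df["sex"].fillna(df["sex"].mode().iloc[0])    # mode imputation
print(df)
```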

Page 45: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Outliers and Errors

• Outliers are values thought to be out of range
• Approaches:
– Do nothing
– Enforce upper and lower bounds
– Let binning handle the problem

Page 46: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Useless Attributes

• Attributes with no or little variability
– Rule of thumb: remove a field where almost all values are the same (e.g., null), except possibly in minp% or less of all records

• Attributes with maximum variability
– E.g., key fields
– Rule of thumb: remove a field where almost all values are different for each instance

Page 47: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Dangerous Attributes

• Highly correlated with another feature
– In this case the attribute may be redundant and only one is needed

• Highly correlated with the target
– Check this case, as the attribute may just be a synonym of the target (data leak) and will thus lead to overfitting (e.g., the output target was bundled with another product so they always occur together)

Page 48: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Class Skew

• If occurrences of certain output classes are rare
– The machine learner might just learn to predict the majority class

• Approaches to deal with skew:
– Undersampling: if you have 100,000 instances and only 1,000 of the minority class, keep all 1,000 of the minority class and sample the majority class to reach your desired distribution (50/50?), but you lose data
– Oversampling: create duplicates of every minority instance and add them to the data to reach your desired distribution, but overfitting is possible (both resamplings are sketched below)
– Have the learning algorithm weigh the minority class higher, or the class with the higher misclassification cost
– Use an ensemble technique (e.g., boosting)
– Use Precision/Recall or ROC rather than just accuracy
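A sketch of random under/oversampling with numpy (class sizes and the 50/50 target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 100_000 + [1] * 1_000)    # heavily skewed classes
majority = np.where(labels == 0)[0]
minority = np.where(labels == 1)[0]

# Undersampling: keep all minority indices, sample the majority down to match.
under = np.concatenate([minority,
                        rng.choice(majority, size=len(minority), replace=False)])

# Oversampling: sample minority indices with replacement up to the majority size.
over = np.concatenate([majority,
                       rng.choice(minority, size=len(majority), replace=True)])

print(len(under), len(over))   # 2,000 and 200,000 indices
```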

Page 49: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Feature Creation

• Transform initial data features into better ones
• Transforms of individual variables
– Use the area code rather than the full phone number
– Determine the vehicle make from a VIN (vehicle id no.)
• Combining/deriving variables
– BMI
– Household income
– Difference of two dates
• Features based on other instances in the set
– This instance is in the top quartile of the price/quality tradeoff
• This approach requires creativity and some domain knowledge, but it can be very effective in improving accuracy

Page 50: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Dimensionality Reduction

• Two typical solutions:
– Feature selection
• Considers only a subset of available features
• Requires some selection function
– Feature extraction/transformation
• Creates new features from existing ones
• Requires some combination function

Page 51: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Feature Selection

• Goal: Find the “best” subset of features
• Two approaches
– Wrapper-based
• Uses the learning algorithm
• Accuracy used as the “goodness” criterion
– Filter-based
• Is independent of the learning algorithm
• A merit heuristic used as the “goodness” criterion

• Problem: can’t try all subsets!

Page 52: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

1-Field Accuracy Feature Selection

• Select top N fields using 1-field predictive accuracy (e.g., using Decision Stump)

• What is a good N?
– Rule of thumb: keep the top 50 fields

• Ignores interactions among features

Page 53: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Wrapper-based Feature Selection

• Split the dataset into training and test sets
• Using the training set only:
– BestF = {} and MaxAcc = 0
– While accuracy improves or the stopping condition is not met:
• Fsub = subset of features [often best-first search]
• Project the training set onto Fsub
• CurAcc = cross-validation estimate of the accuracy of the learner on the transformed training set
• If CurAcc > MaxAcc then BestF = Fsub and MaxAcc = CurAcc
• Project both training and test sets onto BestF
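A greedy forward-selection sketch of the wrapper idea (synthetic data; scikit-learn's DecisionTreeClassifier and cross_val_score stand in for "the learner" and the cross-validation estimate):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # only features 0 and 2 matter

best_f, max_acc, remaining = [], 0.0, list(range(X.shape[1]))
improved = True
while improved and remaining:                # stop when accuracy stops improving
    improved, chosen = False, None
    for f in remaining:
        fsub = best_f + [f]                  # candidate feature subset
        acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[:, fsub], y, cv=5).mean()
        if acc > max_acc:
            max_acc, chosen, improved = acc, f, True
    if improved:
        best_f.append(chosen)
        remaining.remove(chosen)

print("BestF =", best_f, "MaxAcc =", round(max_acc, 3))
```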

Page 54: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Filter-based Feature Selection

• Split the dataset into training and test sets
• Using the training set only:
– BestF = {} and MaxMerit = 0
– While Merit improves or the stopping condition is not met:
• Fsub = subset of features
• CurMerit = heuristic value of the goodness of Fsub
• If CurMerit > MaxMerit then BestF = Fsub and MaxMerit = CurMerit
• Project both training and test sets onto BestF

Page 55: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Feature Extraction

• Goal: Create a smaller set of new features by combining existing ones

• It is better to have a fair modeling method and good variables than to have the best modeling method and poor variables

• We look at one such method here: PCA

Page 56: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Variance

• A measure of the spread of the data in a data set

• Variance is claimed to be the original statistical measure of spread of data.

s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

Page 57: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• Variance – measure of the deviation from the mean for points in one dimension, e.g., heights

• Covariance – a measure of how much each of the dimensions varies from the mean with respect to each other.

• Covariance is measured between 2 dimensions to see if there is a relationship between the 2 dimensions, e.g., number of hours studied & grade obtained.

• The covariance between one dimension and itself is the variance

Page 58: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• So, if you had a 3-dimensional data set (x,y,z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions.

var(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1}

cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}

Page 59: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• What is the interpretation of covariance calculations?

• Say you have a 2-dimensional data set
– X: number of hours studied for a subject
– Y: marks obtained in that subject

• And assume the covariance value (between X and Y) is: 104.53

• What does this value mean?

Page 60: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• The exact value is not as important as its sign.

• A positive value of covariance indicates that both dimensions increase or decrease together, e.g., as the number of hours studied increases, the grades in that subject also increase.

• A negative value indicates while one increases the other decreases, or vice-versa, e.g., active social life at BYU vs. performance in CS Dept.

• If the covariance is zero: the two dimensions are uncorrelated (no linear relationship), e.g., heights of students vs. grades obtained in a subject.

Page 61: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• Why bother with calculating (expensive) covariance when we could just plot the 2 values to see their relationship?

Covariance calculations are used to find relationships between dimensions in high dimensional data sets (usually greater than 3) where visualization is difficult.

Page 62: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance Matrix

• Representing covariance among dimensions as a matrix, e.g., for 3 dimensions:

C = \begin{pmatrix} cov(X,X) & cov(X,Y) & cov(X,Z) \\ cov(Y,X) & cov(Y,Y) & cov(Y,Z) \\ cov(Z,X) & cov(Z,Y) & cov(Z,Z) \end{pmatrix}

• Properties:
– Diagonal: the variances of the variables
– cov(X,Y) = cov(Y,X), hence the matrix is symmetrical about the diagonal (the upper triangle is enough)
– n-dimensional data will result in an n x n covariance matrix
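A small numpy check of these definitions (the hours-studied vs. marks numbers are made up):

```python
import numpy as np

hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
marks = np.array([55.0, 60.0, 72.0, 81.0, 95.0])

print(np.var(hours, ddof=1))        # variance, using the n-1 denominator
print(np.cov(hours, marks)[0, 1])   # cov(X, Y); positive: they rise together
print(np.cov(hours, marks))         # full 2 x 2 covariance matrix
```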

Page 63: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Transformation Matrices

• Consider the following:

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}

• The square (transformation) matrix scales (3,2)
• Now assume we take a multiple of (3,2):

2 \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \end{pmatrix}

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 6 \\ 4 \end{pmatrix} = \begin{pmatrix} 24 \\ 16 \end{pmatrix} = 4 \begin{pmatrix} 6 \\ 4 \end{pmatrix}

Page 64: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Transformation Matrices

• The result is still scaled by 4. WHY?
• A vector consists of both length and direction. Scaling a vector only changes its length, not its direction. This is an important observation about the transformation of matrices, leading to the notions of eigenvectors and eigenvalues.
• Irrespective of how much we scale (3,2) by, the result (under the given transformation matrix) is always that vector multiplied by 4.

Page 65: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Eigenvalue Problem

• The eigenvalue problem is any problem having the following form:

A · v = λ · v

where A is an n x n matrix, v is an n x 1 non-zero vector, and λ is a scalar.

• Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.

Page 66: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Eigenvalue Problem

• Going back to our example:

A · v = λ · v

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}

• Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A

• The question is: Given matrix A, how can we calculate the eigenvectors and eigenvalues of A?

Page 67: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• Simple matrix algebra shows that:

A · v = λ · v
⇒ A · v - λ · I · v = 0
⇒ (A - λ · I) · v = 0

• Finding the roots of |A - λ · I| will give the eigenvalues, and for each of these eigenvalues there will be an eigenvector.

Example …

Page 68: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• Let

A = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix}

• Then:

A - λ · I = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix} - λ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} -λ & 1 \\ -2 & -3-λ \end{pmatrix}

|A - λ · I| = -λ(-3 - λ) + 2 = λ^2 + 3λ + 2 = (λ + 1)(λ + 2)

• And setting the determinant to 0, we obtain 2 eigenvalues:

λ1 = -1 and λ2 = -2

Page 69: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• For λ1 the eigenvector satisfies:

(A - λ1 · I) · v1 = 0

\begin{pmatrix} 1 & 1 \\ -2 & -2 \end{pmatrix} \begin{pmatrix} v_{1:1} \\ v_{1:2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

v_{1:1} + v_{1:2} = 0 and -2 v_{1:1} - 2 v_{1:2} = 0, hence v_{1:1} = -v_{1:2}

• Therefore the first eigenvector is any column vector in which the two elements have equal magnitude and opposite sign.

Page 70: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• Therefore eigenvector v1 is

v_1 = k_1 \begin{pmatrix} 1 \\ -1 \end{pmatrix}

where k1 is some constant.

• Similarly, we find that eigenvector v2 is

v_2 = k_2 \begin{pmatrix} 1 \\ -2 \end{pmatrix}

where k2 is some constant.
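A quick numpy check of the worked example; eig returns unit-length eigenvectors, so each column is some constant k times (1, -1) or (1, -2):

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, v / v[0])   # rescaled so the first element is 1
# prints -1.0 with [ 1. -1.] and -2.0 with [ 1. -2.] (order not guaranteed)
```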

Page 71: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Properties of Eigenvectors and Eigenvalues

• Eigenvectors can only be found for square matrices and not every square matrix has eigenvectors.

• Given an n x n matrix (with eigenvectors), we can find n eigenvectors.

• All eigenvectors of a symmetric* matrix are perpendicular to each other, no matter how many dimensions we have.

• In practice eigenvectors are normalized to have unit length.

*Note: covariance matrices are symmetric!

Page 72: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA

• Principal components analysis (PCA) is a linear transformation that chooses a new coordinate system for the data set such that
– The greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component)
– The second greatest variance on the second axis
– Etc.

• PCA can be used for reducing dimensionality by eliminating the later principal components

Page 73: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA

• By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset

• These are the principal components

Page 74: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 1

• Subtract the mean from each of the dimensions
• This produces a data set whose mean is zero.
• Subtracting the mean makes variance and covariance calculation easier by simplifying their equations.
• The variance and covariance values are not affected by the mean value.

Page 75: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 1

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

Original data:

X    Y
2.5  2.4
0.5  0.7
2.2  2.9
1.9  2.2
3.1  3.0
2.3  2.7
2.0  1.6
1.0  1.1
1.5  1.6
1.1  0.9

Means: X̄ = 1.81, Ȳ = 1.91

Mean-adjusted data:

X      Y
 0.69   0.49
-1.31  -1.21
 0.39   0.99
 0.09   0.29
 1.29   1.09
 0.49   0.79
 0.19  -0.31
-0.81  -0.81
-0.31  -0.31
-0.71  -1.01

Page 76: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 2

• Calculate the covariance matrix:

cov = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix}

• Since the non-diagonal elements in this covariance matrix are positive, we should expect that both the X and Y variables increase together.

• Since it is symmetric, we expect the eigenvectors to be orthogonal.

Page 77: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 3

• Calculate the eigenvectors and eigenvalues of the covariance matrix

eigenvalues = \begin{pmatrix} 0.0490833989 \\ 1.28402771 \end{pmatrix}

eigenvectors = \begin{pmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{pmatrix}

Page 78: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 3

• Eigenvectors are plotted as diagonal dotted lines on the plot (note: they are perpendicular to each other).
• One of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: that all the points follow the main line, but are off to the side of the main line by some amount.

Page 79: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

• Reduce dimensionality and form the feature vector
– The eigenvector with the highest eigenvalue is the principal component of the data set.

In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.

Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

Page 80: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

Now, if you’d like, you can decide to ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much.

• n dimensions in your data
• calculate n eigenvectors and eigenvalues
• choose only the first p eigenvectors
• the final data set has only p dimensions

Page 81: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

• When the λi’s are sorted in descending order, the proportion of variance explained by the p principal components is:

\frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{n} \lambda_i} = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_p}{\lambda_1 + \lambda_2 + \dots + \lambda_p + \dots + \lambda_n}

• If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues and p will be much smaller than n.

• If the dimensions are not correlated, p will be as large as n and PCA does not help.

Page 82: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

• Feature Vector

FeatureVector = (eig_1 eig_2 eig_3 … eig_p)

(take the eigenvectors to keep from the ordered list of eigenvectors, and form a matrix with these eigenvectors in the columns)

We can either form a feature vector with both of the eigenvectors:

\begin{pmatrix} -0.677873399 & -0.735178656 \\ -0.735178656 & 0.677873399 \end{pmatrix}

or, we can choose to leave out the smaller, less significant component and only have a single column:

\begin{pmatrix} -0.677873399 \\ -0.735178656 \end{pmatrix}

Page 83: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

• Derive the new data:

FinalData = RowFeatureVector x RowZeroMeanData

– RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
– RowZeroMeanData is the mean-adjusted data, transposed, i.e., the data items are in the columns, with each row holding a separate dimension
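A numpy sketch of Steps 1-5 on the tutorial data (variable names are my own; results should match the slides up to the sign convention of the eigenvectors):

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

zero_mean = data - data.mean(axis=0)              # Step 1: subtract the mean
cov = np.cov(zero_mean, rowvar=False)             # Step 2: covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # Step 3 (ascending order)

order = np.argsort(eigenvalues)[::-1]             # Step 4: sort, keep top p
p = 1
feature_vector = eigenvectors[:, order[:p]]       # eigenvectors as columns

final_data = feature_vector.T @ zero_mean.T       # Step 5
print(final_data.T)                               # one row per data item
```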

Page 84: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

• FinalData is the final data set, with data items in columns, and dimensions along rows.

• What does this give us? The original data solely in terms of the vectors we chose.

• We have changed our data from being in terms of the axes X and Y, to now be in terms of our 2 eigenvectors.

Page 85: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

FinalData (transpose: dimensions along columns)

newX            newY
-0.827870186    -0.175115307
 1.77758033      0.142857227
-0.992197494     0.384374989
-0.274210416     0.130417207
-1.67580142     -0.209498461
-0.912949103     0.175282444
 0.0991094375   -0.349824698
 1.14457216      0.0464172582
 0.438046137     0.0177646297
 1.22382956     -0.162675287

Page 86: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

Page 87: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reconstruction of Original Data

• Recall that:

FinalData = RowFeatureVector x RowZeroMeanData

• Then:

RowZeroMeanData = RowFeatureVector^-1 x FinalData

• And thus:

RowOriginalData = (RowFeatureVector^-1 x FinalData) + OriginalMean

• If we use unit eigenvectors, the inverse is the same as the transpose (hence, easier).
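Continuing the PCA sketch above, a reconstruction from the p = 1 projection (with unit eigenvectors, the transpose plays the role of the inverse):

```python
row_zero_mean = feature_vector @ final_data            # undo the projection
reconstructed = row_zero_mean.T + data.mean(axis=0)    # add the mean back
print(reconstructed)   # variation along the discarded component is lost
```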

Page 88: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reconstruction of Original Data

• If we reduce the dimensionality (i.e., p<n), obviously, when reconstructing the data we lose those dimensions we chose to discard.

• In our example let us assume that we considered only a single eigenvector.

• The final data is newX only and the reconstruction yields…

Page 89: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reconstruction of Original Data

• The variation along the principal component is preserved.

• The variation along the other component has been lost.

Page 90: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Bias in Data

• Selection/sampling bias
– E.g., collect data from BYU students on college drinking

• Sponsor’s bias
– E.g., PLoS Medicine article: 111 studies of soft drinks, juice, and milk that cited funding sources (22% all industry, 47% no industry, 32% mixed). The proportion with unfavorable [to industry] conclusions was 0% for all-industry funding versus 37% for no industry funding

• Publication bias
– E.g., positive results are more likely to be published

• Data manipulation bias
– E.g., imputation (replacing missing values by the mean in skewed data)
– E.g., record selection (removing records with missing values)

Page 91: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Impact on Learning

• If there is bias in the data collection or handling processes
– You are likely to learn the bias
– Conclusions become useless/tainted

• If there is no bias
– What you learn will be “valid”

Note: Recall that, unlike data, learning should be biased

Page 92: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Take Home Message

• Be thorough
• Ensure you have sufficient, relevant data before you go further
• Consider potential data transformations
• Uncover existing data biases and do your best to remove them (do not add new sources of data bias, maliciously or inadvertently)

Page 93: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Twyman’s Law

Page 94: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Cool Findings

• 5% of our customers were born on the same day (including the year)

• There is a sales decline on April 2nd, 2006 on all US e-commerce sites

• Customers willing to receive emails are also heavy spenders

Page 95: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

What Is Happening?

• 11/11/11 is the easiest way to satisfy the mandatory birth date field!

• Due to daylight saving starting, the hour from 1AM to 2AM does not exist and hence nothing will be sold during that period!

• The default value at registration time is “Accept Emails”!

Page 96: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Take Home Message

• Cautious optimism
• Twyman’s Law: Any statistic that appears interesting is almost certainly a mistake
• Many “amazing” discoveries are the result of some (not always readily apparent) business process

• Validate all discoveries in different ways

Page 97: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Simpson’s Paradox

Page 98: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

“Weird” Findings

• Kidney stone treatment: overall, treatment B is better; when split by stone size (large/small), treatment A is better
• Gender bias at UC Berkeley: overall, a higher percentage of males than females are accepted; when split by department, the situation is reversed
• Purchase channel: overall, multi-channel customers spend more than single-channel customers; when split by number of purchases per customer, the opposite is true
• Email campaign performance: overall, revenue per email is decreasing; when split by subscriber type (engaged/others), productivity per email campaign is increasing
• Presidential election: overall, candidate X’s tally of individual votes is highest; when split by states, candidate Y wins the election

Page 99: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

What Is Happening?

• Kidney stone treatment: neither treatment worked well against large stones, but treatment A was heavily tested on those
• Gender bias at UC Berkeley: departments differed in their acceptance rates, and female students applied more to departments where such rates were lower

• Purchase channel: customers that visited often spent more on average and multi-channel customers visited more

• Email campaign: file mix issue, number of disinterested prospects grows faster than number of engaged customers

• Presidential election: winner-take-all favors large states

Page 100: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Take Home Message

• These effects are due to confounding variables
• Combining segments amounts to taking a weighted average, so it is possible that

\frac{a}{b} < \frac{A}{B} and \frac{c}{d} < \frac{C}{D}, and yet \frac{a+c}{b+d} > \frac{A+C}{B+D}

• Lack of awareness of the phenomenon may lead to mistaken/misleading conclusions
• Must be careful not to infer causality from what are only correlations
• Only sure cure/gold standard (for causality inference): controlled experiments
– Careful with randomization
– Not always desirable/possible (e.g., parachutes)
• Confounding variables may not be among the ones we are collecting (latent/hidden)
• Watch out for them!

Page 101: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Group Project

Page 102: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Intro to Weka/R

Page 103: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

END OF DAY 2. Homework: Data Issues