START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.


Transcript of START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Page 1: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

START OF DAY 2 Reading: Chap. 6 & 12

Page 2: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Decision Tree Learning

Page 3: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Decision Tree

• Internal nodes: tests on some property
• Branches from internal nodes: values of the associated property
• Leaf nodes: classifications
• An individual is classified by traversing the tree from its root to a leaf

Page 4: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Sample Decision Tree

Is Your Health at Risk?

Page 5: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Decision Tree Learning

• Learning consists of constructing a decision tree that allows the classification of objects
– A test on attribute A partitions a set of instances into {C1, C2, ..., C|A|}
– Start with the training set and find a good A for the root
– Continue recursively until subsets are unambiguously classified, you run out of attributes, or some stopping criterion is met

Page 6: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3

• Function Induce-Tree(Example-set, Properties)
– If all elements in Example-set are in the same class, then return a leaf node labeled with that class
– Else if Properties is empty, then return a leaf node labeled with the majority class in Example-set
– Else
• Select P from Properties (*)
• Remove P from Properties
• Make P the root of the current tree
• For each value V of P
– Create a branch of the current tree labeled by V
– Partition_V = elements of Example-set with value V for P
– Induce-Tree(Partition_V, Properties)
– Attach the result to branch V
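A minimal Python sketch of Induce-Tree (not code from the reading; the dict-based example representation and the select_property argument, which stands for the starred (*) selection step, are assumptions):

```python
# Minimal ID3 sketch: each example is assumed to be a dict of property -> value
# plus a "class" key; select_property implements the starred (*) step.
from collections import Counter

def induce_tree(example_set, properties, select_property):
    classes = [e["class"] for e in example_set]
    if len(set(classes)) == 1:                        # all in the same class
        return classes[0]
    if not properties:                                # no properties left
        return Counter(classes).most_common(1)[0][0]  # majority class
    p = select_property(example_set, properties)      # (*) e.g., max info gain
    rest = [q for q in properties if q != p]
    tree = {p: {}}
    for v in set(e[p] for e in example_set):          # one branch per value V
        partition_v = [e for e in example_set if e[p] == v]
        tree[p][v] = induce_tree(partition_v, rest, select_property)
    return tree
```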

Page 7: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Illustrative Training Set

Page 8: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3 Example (I)

Page 9: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3 Example (II)

Page 10: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3 Example (III)

Page 11: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Another Example

• Assume A1 is a nominal binary feature (Gender: M/F)

• Assume A2 is a nominal 3-valued feature (Color: R/G/B)

[Figure: two decision trees over A1 and A2: one splits on A1 (M/F) first with A2 (R/G/B) below it, the other splits on A2 first with A1 below it.]

Decision surfaces are axis-aligned hyper-rectangles

Page 12: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Non-Uniqueness

• Decision trees are not unique:
– Given a set of training instances T, there generally exists a number of decision trees that are consistent with T

• The learning problem states that we should seek not only consistency but also generalization. So, …

Page 13: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s Question

Given a training set, which of all of the decision trees consistent with that training set has the greatest likelihood of correctly classifying unseen instances of the population?

Page 14: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s (Approximate) Bias

• ID3 (and family) prefers the simplest decision tree that is consistent with the training set.

• Occam’s Razor Principle:
– “It is vain to do with more what can be done with less... Entities should not be multiplied beyond necessity.”
– i.e., always accept the simplest answer that fits the data / avoid unnecessary constraints.

Page 15: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s Property Selection

• Each property of an instance may be thought of as contributing a certain amount of information to its classification.
– Think Twenty Questions: what are good questions? Ones that, when asked, decrease the information remaining.
– For example, to determine the shape of an object: the number of sides contributes a certain amount of information to the goal; color contributes a different amount of information.

• ID3 measures the information gained by making each property the root of the current subtree, and subsequently chooses the property that produces the greatest information gain.

Page 16: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Entropy (I)

• Let S be a set of examples from c classes. The information (entropy) of S is:

Info(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the proportion of examples of S belonging to class i. (Note: we define 0 log 0 = 0.)

[Figure: plot of Info as a function of p; it is 0 at p = 0 and p = 1 and reaches its maximum, log2(|c|), in between.]
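A small Python sketch of this definition (the label lists are illustrative, not data from the slides):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(info(["+", "+", "-", "-"]))   # 1.0 bit (p1 = p2 = 0.5)
print(info(["+", "+", "+", "+"]))   # 0.0 (pure set)
```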

Page 17: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Entropy (II)

• Intuitively, the smaller the entropy, the purer the partition

• Based on Shannon’s information theory (c=2):
– If p1=1 (resp. p2=1), then the receiver knows the example is positive (resp. negative). No message need be sent.
– If p1=p2=0.5, then the receiver needs to be told the class of the example. A 1-bit message must be sent.
– If 0<p1<1, then the receiver needs less than 1 bit on average to know the class of the example.

Page 18: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Information Gain

• Let p be a property with n outcomes
• The information gained by partitioning a set S according to p is:

Gain(S, p) = Info(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} Info(S_i)

where S_i is the subset of S for which property p has its ith value
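A sketch of Gain using the info() function above and the same dict-of-properties example format assumed in the earlier ID3 sketch:

```python
def gain(example_set, p):
    labels = [e["class"] for e in example_set]
    n = len(example_set)
    remainder = 0.0
    for v in set(e[p] for e in example_set):          # partition S by value of p
        s_i = [e["class"] for e in example_set if e[p] == v]
        remainder += (len(s_i) / n) * info(s_i)
    return info(labels) - remainder

# This could serve as the starred selection step in the ID3 sketch:
# select_property = lambda ex, props: max(props, key=lambda p: gain(ex, p))
```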

Page 19: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

ID3’s Splitting Criterion

• The objective of ID3 at each split is to increase information gain or, equivalently, to lower entropy. It does so as much as possible
– Pros: Easy to do
– Cons: May lead to overfitting

Page 20: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Practice Exercises

OUTLOOK   TEMPERATURE  HUMIDITY  WIND    PLAY TENNIS
Overcast  Hot          High      Weak    Yes
Overcast  Hot          Normal    Weak    Yes
Sunny     Hot          High      Weak    No
Sunny     Mild         Normal    Strong  Yes
Rain      Cool         Normal    Strong  No
Sunny     Mild         High      Weak    No

What is the ID3 induced tree?

Page 21: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Overfitting

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances

Page 22: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Avoiding Overfitting

• Two alternatives for decision trees:
– Stop growing the tree before it begins to overfit (e.g., when the data split is not statistically significant)
– Grow the tree to full (overfitting) size and post-prune it

• Either way, when do I stop? What is the correct final tree size?

Page 23: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Approaches

• Use only training data and a statistical test to estimate whether expanding/pruning is likely to produce an improvement beyond the training set

• Use MDL to minimize size(tree) + size(misclassifications(tree))

• Use a separate validation set to evaluate utility of pruning

• Use richer node conditions and accuracy

Page 24: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reduced Error Pruning

• Split the dataset into training and validation sets
• Induce a full tree from the training set
• While the accuracy on the validation set increases:
– Evaluate the impact of pruning each subtree, replacing its root by a leaf labeled with the majority class for that subtree
– Remove the subtree that most increases validation set accuracy (greedy approach)

Page 25: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Rule Post-pruning

• Split the dataset into training and validation sets
• Induce a full tree from the training set
• Convert the tree into an equivalent set of rules
• For each rule:
– Remove any preconditions that result in increased rule accuracy on the validation set
• Sort the rules by estimated accuracy
• Classify new examples using the new ordered set of rules

Page 26: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discussion

• Reduced-error pruning produces the smallest version of the most accurate subtree

• Rule post-pruning is more fine-grained and possibly the most used method

• In all cases, pruning based on a validation set is problematic when the amount of available data is limited

Page 27: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Accuracy vs. Entropy (I)

• ID3 uses entropy to build the tree and accuracy to prune it

• Why not use accuracy in the first place?
– How?

Page 28: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Practice Exercises

OUTLOOK   TEMPERATURE  HUMIDITY  WIND    PLAY TENNIS
Overcast  Hot          High      Weak    Yes
Overcast  Hot          Normal    Weak    Yes
Sunny     Hot          High      Weak    No
Sunny     Mild         Normal    Strong  Yes
Rain      Cool         Normal    Strong  No
Sunny     Mild         High      Weak    No

What is the induced tree with accuracy split?

Page 29: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Accuracy vs. Entropy (II)

• How does accuracy compare with entropy?
– What does the accuracy function look like?
– What does it take for a split to cause accuracy to increase?

• Is there a way to make it work?
– See Programming Assignment

Page 30: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discussion (I)

• In terms of learning as search, ID3 works as follows:
– Search space = set of all possible decision trees
– Operations = adding tests to a tree
– Form of hill-climbing: ID3 adds a subtree to the current tree and continues its search (no backtracking, hence local minima)

• It follows that ID3 is very efficient, but its performance depends on the criteria for selecting properties to test (and their form)

Page 31: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discussion (II)

• ID3 handles only discrete attributes. Extensions to numerical attributes have been proposed, the most famous being C4.5 (and its commercial successor, C5.0)

• Experience shows that TDIDT learners tend to produce very good results on many problems

• Trees are most attractive when end users want interpretable knowledge from their data

Page 32: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Data Preparation

Page 33: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Accuracy Depends on the Data

• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• Who are the data experts?

• How much data is available for the task?
– How many instances?
– How many attributes?
– How many targets?

• What is the quality of the data?
– Noise
– Missing values
– Skew

Page 34: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Data Types

• Continuous
• Categorical/Symbolic
– Nominal: no natural ordering
– Ordinal: natural ordering
• Special cases: Time, Date, Addresses, Names, IDs, etc.

Page 35: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Type Conversion

• Some tools can deal with nominal values internally; other methods (neural nets, regression, nearest neighbors) require, or fare better with, numeric inputs

• Some methods require discrete values (most versions of Naïve Bayes)

• Different encodings likely to produce different results

• We only show some of these conversions here

Page 36: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Conversion: Binary to Numeric

• Allows binary attribute to be coded as a number

• Example: attribute “gender”
– Original data: gender = {M, F}
– Transformed data: genderN = {0, 1}
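A one-line version of this conversion with pandas (the tiny DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "F", "M"]})
df["genderN"] = df["gender"].map({"M": 0, "F": 1})   # binary -> numeric
print(df)
```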

Page 37: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Conversion: Ordinal to Boolean

• Allows ordinal attribute with n values to be coded using n–1 boolean attributes

• Example: attribute “temperature”

Original data        Transformed data
Temperature          Temperature > cold   Temperature > medium
Cold                 False                False
Medium               True                 False
Hot                  True                 True
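A sketch of the n-1 boolean ("thermometer") encoding for the temperature example; the helper function name is hypothetical:

```python
ORDER = ["Cold", "Medium", "Hot"]          # natural ordering of the values

def ordinal_to_boolean(value):
    rank = ORDER.index(value)
    # one boolean per threshold: value > Cold, value > Medium
    return {f"> {t}": rank > i for i, t in enumerate(ORDER[:-1])}

for v in ORDER:
    print(v, ordinal_to_boolean(v))   # matches the table above
```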

Page 38: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Conversion: Ordinal to Numeric

• Allows ordinal attribute to be coded as a number, preserving natural order

• Example: attribute “grade”
– Original data: grade = {A, A-, B+, …}
– Transformed data: GPA = {4.0, 3.7, 3.3, …}

• Why preserve natural order?
– To allow comparisons, e.g., grade > 3.5

Page 39: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discretization: Equal-Width

• May produce clumping if data is skewed

[64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

2 2

Count

42 2 20

Page 40: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discretization: Equal-Height

• Gives more intuitive breakpoints– don’t split frequent values across bins– create separate bins for special values (e.g., 0)

[64 .. .. .. .. 69] [70 .. 72] [73 .. .. .. .. .. .. .. .. 81] [83 .. 85]

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

4

Count

4 4

2
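A sketch of both discretizations with pandas on the temperature values above; pd.cut and pd.qcut pick interval edges and closure slightly differently from the hand-drawn bins, so the counts may not match the slides exactly:

```python
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

equal_width = pd.cut(temps, bins=7)     # 7 intervals of equal width
equal_height = pd.qcut(temps, q=4)      # 4 bins with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_height.value_counts().sort_index())
```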

Page 41: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Discretization: Class-dependent

• Eibe: a minimum of 3 values per bucket, over the value range 64 to 85

Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:       Yes No  Yes Yes Yes No  No  No  Yes Yes No  Yes Yes No

Page 42: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Other Transformations

• Standardization
– Transform values into the number of standard deviations from the mean
– New value = (current value - average) / standard deviation

• Normalization
– All values are made to fall within a certain range
– Typically: new value = (current value - min value) / range

• Neither one affects ordering!
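A quick numpy check of both transformations, and of the fact that neither affects ordering; the values reuse the temperature example:

```python
import numpy as np

x = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85], float)

standardized = (x - x.mean()) / x.std(ddof=1)      # (value - average) / std dev
normalized = (x - x.min()) / (x.max() - x.min())   # (value - min) / range

# Ordering is preserved by both transformations.
assert (np.argsort(standardized, kind="stable") == np.argsort(x, kind="stable")).all()
assert (np.argsort(normalized, kind="stable") == np.argsort(x, kind="stable")).all()
```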

Page 43: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Missing Values (I)

• Types: unknown, unrecorded, irrelevant
– malfunctioning equipment
– changes in experimental design
– collation of different datasets
– measurement not possible

Name  Age  Sex  Pregnant
Mary  25   F    N
Jane  27   F    -
Joe   30   M    -
Anna  2    F    -

In this medical data, the value of the Pregnant attribute for Jane is missing, while for Joe or Anna it should be considered Not Applicable

Page 44: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Missing Values (II)

• Handling methods:
– Remove records with missing values
– Treat as a separate value
– Treat as don’t know
– Treat as don’t care
– Use imputation techniques
• Mode, Median, Average
• Regression
– Danger: BIAS
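A sketch of simple imputation with pandas (the toy table is made up); filling with the mean or mode is exactly where the bias warning above applies:

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, 27.0, None, 2.0],
                   "sex": ["F", "F", "M", None]})

df["age"] = df["age"].fillna(df["age"].mean())            # mean imputation
df["sex"] = df["sex"].fillna(df["sex"].mode().iloc[0])    # mode imputation
print(df)
```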

Page 45: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Outliers and Errors

• Outliers are values thought to be out of range
• Approaches:
– Do nothing
– Enforce upper and lower bounds
– Let binning handle the problem

Page 46: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Useless Attributes

• Attributes with no or little variability
– Rule of thumb: remove a field where almost all values are the same (e.g., null), except possibly in minp% or less of all records

• Attributes with maximum variability
– E.g., key fields
– Rule of thumb: remove a field where almost all values are different for each instance

Page 47: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Dangerous Attributes

• Highly correlated with another feature
– In this case the attribute may be redundant and only one is needed

• Highly correlated with the target
– Check this case, as the attribute may just be a synonym of the target (data leak) and will thus lead to overfitting (e.g., the output target was bundled with another product so they always occur together)

Page 48: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Class Skew

• If occurrences of certain output classes are rare
– The machine learner might just learn to predict the majority class

• Approaches to deal with skew:
– Undersampling: if you have 100,000 instances and only 1,000 of the minority class, keep all 1,000 of the minority class and sample the majority class to reach your desired distribution (50/50?), but you lose data
– Oversampling: create duplicates of every minority instance and add them to the data to reach your desired distribution, but overfitting is possible (both resamplings are sketched below)
– Have the learning algorithm weigh the minority class higher, or the class with the higher misclassification cost
– Use an ensemble technique (e.g., boosting)
– Use Precision/Recall or ROC rather than just accuracy
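A sketch of random under/oversampling with numpy (class sizes and the 50/50 target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 100_000 + [1] * 1_000)    # heavily skewed classes
majority = np.where(labels == 0)[0]
minority = np.where(labels == 1)[0]

# Undersampling: keep all minority indices, sample the majority down to match.
under = np.concatenate([minority,
                        rng.choice(majority, size=len(minority), replace=False)])

# Oversampling: sample minority indices with replacement up to the majority size.
over = np.concatenate([majority,
                       rng.choice(minority, size=len(majority), replace=True)])

print(len(under), len(over))   # 2,000 and 200,000 indices
```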

Page 49: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Feature Creation

• Transform initial data features into better ones
• Transforms of individual variables
– Use the area code rather than the full phone number
– Determine the vehicle make from a VIN (vehicle id no.)
• Combining/deriving variables
– BMI
– Household income
– Difference of two dates
• Features based on other instances in the set
– This instance is in the top quartile of the price/quality tradeoff
• This approach requires creativity and some domain knowledge, but it can be very effective in improving accuracy

Page 50: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Dimensionality Reduction

• Two typical solutions:
– Feature selection
• Considers only a subset of available features
• Requires some selection function
– Feature extraction/transformation
• Creates new features from existing ones
• Requires some combination function

Page 51: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Feature Selection

• Goal: Find the “best” subset of features
• Two approaches
– Wrapper-based
• Uses the learning algorithm
• Accuracy used as the “goodness” criterion
– Filter-based
• Is independent of the learning algorithm
• A merit heuristic used as the “goodness” criterion

• Problem: can’t try all subsets!

Page 52: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

1-Field Accuracy Feature Selection

• Select top N fields using 1-field predictive accuracy (e.g., using Decision Stump)

• What is a good N?
– Rule of thumb: keep the top 50 fields

• Ignores interactions among features

Page 53: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Wrapper-based Feature Selection

• Split the dataset into training and test sets
• Using the training set only:
– BestF = {} and MaxAcc = 0
– While accuracy improves or the stopping condition is not met:
• Fsub = subset of features [often best-first search]
• Project the training set onto Fsub
• CurAcc = cross-validation estimate of the accuracy of the learner on the transformed training set
• If CurAcc > MaxAcc then BestF = Fsub and MaxAcc = CurAcc
• Project both training and test sets onto BestF
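A greedy forward-selection sketch of the wrapper idea (synthetic data; scikit-learn's DecisionTreeClassifier and cross_val_score stand in for "the learner" and the cross-validation estimate):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # only features 0 and 2 matter

best_f, max_acc, remaining = [], 0.0, list(range(X.shape[1]))
improved = True
while improved and remaining:                # stop when accuracy stops improving
    improved, chosen = False, None
    for f in remaining:
        fsub = best_f + [f]                  # candidate feature subset
        acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[:, fsub], y, cv=5).mean()
        if acc > max_acc:
            max_acc, chosen, improved = acc, f, True
    if improved:
        best_f.append(chosen)
        remaining.remove(chosen)

print("BestF =", best_f, "MaxAcc =", round(max_acc, 3))
```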

Page 54: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Filter-based Feature Selection

• Split the dataset into training and test sets
• Using the training set only:
– BestF = {} and MaxMerit = 0
– While Merit improves or the stopping condition is not met:
• Fsub = subset of features
• CurMerit = heuristic value of the goodness of Fsub
• If CurMerit > MaxMerit then BestF = Fsub and MaxMerit = CurMerit
• Project both training and test sets onto BestF

Page 55: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Feature Extraction

• Goal: Create a smaller set of new features by combining existing ones

• It is better to have a fair modeling method and good variables than to have the best modeling method and poor variables

• We look at one such method here: PCA

Page 56: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Variance

• A measure of the spread of the data in a data set

• Variance is claimed to be the original statistical measure of spread of data.

s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

Page 57: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• Variance – measure of the deviation from the mean for points in one dimension, e.g., heights

• Covariance – a measure of how much each of the dimensions varies from the mean with respect to each other.

• Covariance is measured between 2 dimensions to see if there is a relationship between the 2 dimensions, e.g., number of hours studied & grade obtained.

• The covariance between one dimension and itself is the variance

Page 58: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• So, if you had a 3-dimensional data set (x,y,z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions.

var(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1}

cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}

Page 59: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• What is the interpretation of covariance calculations?

• Say you have a 2-dimensional data set
– X: number of hours studied for a subject
– Y: marks obtained in that subject

• And assume the covariance value (between X and Y) is: 104.53

• What does this value mean?

Page 60: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• The exact value is not as important as its sign.

• A positive value of covariance indicates that both dimensions increase or decrease together, e.g., as the number of hours studied increases, the grades in that subject also increase.

• A negative value indicates while one increases the other decreases, or vice-versa, e.g., active social life at BYU vs. performance in CS Dept.

• If the covariance is zero: the two dimensions are uncorrelated (no linear relationship), e.g., heights of students vs. grades obtained in a subject.

Page 61: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance

• Why bother with calculating (expensive) covariance when we could just plot the 2 values to see their relationship?

Covariance calculations are used to find relationships between dimensions in high dimensional data sets (usually greater than 3) where visualization is difficult.

Page 62: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Covariance Matrix

• Representing covariance among dimensions as a matrix, e.g., for 3 dimensions:

C = \begin{pmatrix} cov(X,X) & cov(X,Y) & cov(X,Z) \\ cov(Y,X) & cov(Y,Y) & cov(Y,Z) \\ cov(Z,X) & cov(Z,Y) & cov(Z,Z) \end{pmatrix}

• Properties:
– Diagonal: the variances of the variables
– cov(X,Y) = cov(Y,X), hence the matrix is symmetrical about the diagonal (the upper triangle is enough)
– n-dimensional data will result in an n x n covariance matrix
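A small numpy check of these definitions (the hours-studied vs. marks numbers are made up):

```python
import numpy as np

hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
marks = np.array([55.0, 60.0, 72.0, 81.0, 95.0])

print(np.var(hours, ddof=1))        # variance, using the n-1 denominator
print(np.cov(hours, marks)[0, 1])   # cov(X, Y); positive: they rise together
print(np.cov(hours, marks))         # full 2 x 2 covariance matrix
```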

Page 63: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Transformation Matrices

• Consider the following:

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}

• The square (transformation) matrix scales (3,2)
• Now assume we take a multiple of (3,2):

2 \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \end{pmatrix}

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 6 \\ 4 \end{pmatrix} = \begin{pmatrix} 24 \\ 16 \end{pmatrix} = 4 \begin{pmatrix} 6 \\ 4 \end{pmatrix}

Page 64: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Transformation Matrices

• The result is still scaled by 4. WHY?
• A vector consists of both length and direction. Scaling a vector only changes its length, not its direction. This is an important observation about the transformation of matrices, leading to the notions of eigenvectors and eigenvalues.
• Irrespective of how much we scale (3,2) by, the result (under the given transformation matrix) is always that vector multiplied by 4.

Page 65: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Eigenvalue Problem

• The eigenvalue problem is any problem having the following form:

A · v = λ · v

where A is an n x n matrix, v is an n x 1 non-zero vector, and λ is a scalar.

• Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.

Page 66: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Eigenvalue Problem

• Going back to our example:

A · v = λ · v

\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}

• Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A

• The question is: Given matrix A, how can we calculate the eigenvectors and eigenvalues of A?

Page 67: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• Simple matrix algebra shows that:

A · v = λ · v
⇒ A · v - λ · I · v = 0
⇒ (A - λ · I) · v = 0

• Finding the roots of |A - λ · I| will give the eigenvalues, and for each of these eigenvalues there will be an eigenvector.

Example …

Page 68: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• Let

A = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix}

• Then:

A - λ · I = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix} - λ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} -λ & 1 \\ -2 & -3-λ \end{pmatrix}

|A - λ · I| = -λ(-3 - λ) + 2 = λ^2 + 3λ + 2 = (λ + 1)(λ + 2)

• And setting the determinant to 0, we obtain 2 eigenvalues:

λ1 = -1 and λ2 = -2

Page 69: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• For λ1 the eigenvector satisfies:

(A - λ1 · I) · v1 = 0

\begin{pmatrix} 1 & 1 \\ -2 & -2 \end{pmatrix} \begin{pmatrix} v_{1:1} \\ v_{1:2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

v_{1:1} + v_{1:2} = 0 and -2 v_{1:1} - 2 v_{1:2} = 0, hence v_{1:1} = -v_{1:2}

• Therefore the first eigenvector is any column vector in which the two elements have equal magnitude and opposite sign.

Page 70: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Calculating Eigenvectors & Eigenvalues

• Therefore eigenvector v1 is

v_1 = k_1 \begin{pmatrix} 1 \\ -1 \end{pmatrix}

where k1 is some constant.

• Similarly, we find that eigenvector v2 is

v_2 = k_2 \begin{pmatrix} 1 \\ -2 \end{pmatrix}

where k2 is some constant.
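A quick numpy check of the worked example; eig returns unit-length eigenvectors, so each column is some constant k times (1, -1) or (1, -2):

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, v / v[0])   # rescaled so the first element is 1
# prints -1.0 with [ 1. -1.] and -2.0 with [ 1. -2.] (order not guaranteed)
```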

Page 71: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Properties of Eigenvectors and Eigenvalues

• Eigenvectors can only be found for square matrices and not every square matrix has eigenvectors.

• Given an n x n matrix (with eigenvectors), we can find n eigenvectors.

• All eigenvectors of a symmetric* matrix are perpendicular to each other, no matter how many dimensions we have.

• In practice eigenvectors are normalized to have unit length.

*Note: covariance matrices are symmetric!

Page 72: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA

• Principal components analysis (PCA) is a linear transformation that chooses a new coordinate system for the data set such that
– The greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component)
– The second greatest variance on the second axis
– Etc.

• PCA can be used for reducing dimensionality by eliminating the later principal components

Page 73: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA

• By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset

• These are the principal components

Page 74: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 1

• Subtract the mean from each of the dimensions
• This produces a data set whose mean is zero.
• Subtracting the mean makes variance and covariance calculation easier by simplifying their equations.
• The variance and covariance values are not affected by the mean value.

Page 75: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 1

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

Original data:

X    Y
2.5  2.4
0.5  0.7
2.2  2.9
1.9  2.2
3.1  3.0
2.3  2.7
2.0  1.6
1.0  1.1
1.5  1.6
1.1  0.9

Means: X̄ = 1.81, Ȳ = 1.91

Mean-adjusted data:

X      Y
 0.69   0.49
-1.31  -1.21
 0.39   0.99
 0.09   0.29
 1.29   1.09
 0.49   0.79
 0.19  -0.31
-0.81  -0.81
-0.31  -0.31
-0.71  -1.01

Page 76: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 2

• Calculate the covariance matrix:

cov = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix}

• Since the non-diagonal elements in this covariance matrix are positive, we should expect that both the X and Y variables increase together.

• Since it is symmetric, we expect the eigenvectors to be orthogonal.

Page 77: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 3

• Calculate the eigenvectors and eigenvalues of the covariance matrix

eigenvalues = \begin{pmatrix} 0.0490833989 \\ 1.28402771 \end{pmatrix}

eigenvectors = \begin{pmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{pmatrix}

Page 78: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 3

• Eigenvectors are plotted as diagonal dotted lines on the plot (note: they are perpendicular to each other).
• One of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: that all the points follow the main line, but are off to the side of the main line by some amount.

Page 79: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

• Reduce dimensionality and form the feature vector
– The eigenvector with the highest eigenvalue is the principal component of the data set.

In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.

Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

Page 80: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

Now, if you’d like, you can decide to ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much.

• n dimensions in your data
• calculate n eigenvectors and eigenvalues
• choose only the first p eigenvectors
• the final data set has only p dimensions

Page 81: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

• When the λi’s are sorted in descending order, the proportion of variance explained by the p principal components is:

\frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{n} \lambda_i} = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_p}{\lambda_1 + \lambda_2 + \dots + \lambda_p + \dots + \lambda_n}

• If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues and p will be much smaller than n.

• If the dimensions are not correlated, p will be as large as n and PCA does not help.

Page 82: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 4

• Feature Vector

FeatureVector = (eig_1 eig_2 eig_3 … eig_p)

(take the eigenvectors to keep from the ordered list of eigenvectors, and form a matrix with these eigenvectors in the columns)

We can either form a feature vector with both of the eigenvectors:

\begin{pmatrix} -0.677873399 & -0.735178656 \\ -0.735178656 & 0.677873399 \end{pmatrix}

or, we can choose to leave out the smaller, less significant component and only have a single column:

\begin{pmatrix} -0.677873399 \\ -0.735178656 \end{pmatrix}

Page 83: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

• Derive the new data:

FinalData = RowFeatureVector x RowZeroMeanData

– RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
– RowZeroMeanData is the mean-adjusted data, transposed, i.e., the data items are in the columns, with each row holding a separate dimension
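A numpy sketch of Steps 1-5 on the tutorial data (variable names are my own; results should match the slides up to the sign convention of the eigenvectors):

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

zero_mean = data - data.mean(axis=0)              # Step 1: subtract the mean
cov = np.cov(zero_mean, rowvar=False)             # Step 2: covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # Step 3 (ascending order)

order = np.argsort(eigenvalues)[::-1]             # Step 4: sort, keep top p
p = 1
feature_vector = eigenvectors[:, order[:p]]       # eigenvectors as columns

final_data = feature_vector.T @ zero_mean.T       # Step 5
print(final_data.T)                               # one row per data item
```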

Page 84: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

• FinalData is the final data set, with data items in columns, and dimensions along rows.

• What does this give us? The original data solely in terms of the vectors we chose.

• We have changed our data from being in terms of the axes X and Y, to now be in terms of our 2 eigenvectors.

Page 85: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

FinalData (transpose: dimensions along columns)

newX            newY
-0.827870186    -0.175115307
 1.77758033      0.142857227
-0.992197494     0.384374989
-0.274210416     0.130417207
-1.67580142     -0.209498461
-0.912949103     0.175282444
 0.0991094375   -0.349824698
 1.14457216      0.0464172582
 0.438046137     0.0177646297
 1.22382956     -0.162675287

Page 86: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

PCA Process – STEP 5

Page 87: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reconstruction of Original Data

• Recall that:

FinalData = RowFeatureVector x RowZeroMeanData

• Then:

RowZeroMeanData = RowFeatureVector^-1 x FinalData

• And thus:

RowOriginalData = (RowFeatureVector^-1 x FinalData) + OriginalMean

• If we use unit eigenvectors, the inverse is the same as the transpose (hence, easier).
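Continuing the PCA sketch above, a reconstruction from the p = 1 projection (with unit eigenvectors, the transpose plays the role of the inverse):

```python
row_zero_mean = feature_vector @ final_data            # undo the projection
reconstructed = row_zero_mean.T + data.mean(axis=0)    # add the mean back
print(reconstructed)   # variation along the discarded component is lost
```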

Page 88: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reconstruction of Original Data

• If we reduce the dimensionality (i.e., p<n), obviously, when reconstructing the data we lose those dimensions we chose to discard.

• In our example let us assume that we considered only a single eigenvector.

• The final data is newX only and the reconstruction yields…

Page 89: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Reconstruction of Original Data

• The variation along the principal component is preserved.

• The variation along the other component has been lost.

Page 90: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Bias in Data

• Selection/sampling bias
– E.g., collect data from BYU students on college drinking

• Sponsor’s bias
– E.g., PLoS Medicine article: 111 studies of soft drinks, juice, and milk that cited funding sources (22% all industry, 47% no industry, 32% mixed). The proportion with unfavorable [to industry] conclusions was 0% for all-industry funding versus 37% for no industry funding

• Publication bias
– E.g., positive results are more likely to be published

• Data manipulation bias
– E.g., imputation (replacing missing values by the mean in skewed data)
– E.g., record selection (removing records with missing values)

Page 91: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Impact on Learning

• If there is bias in the data collection or handling processes
– You are likely to learn the bias
– Conclusions become useless/tainted

• If there is no bias
– What you learn will be “valid”

Note: Recall that, unlike data, learning should be biased

Page 92: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Take Home Message

• Be thorough
• Ensure you have sufficient, relevant data before you go further
• Consider potential data transformations
• Uncover existing data biases and do your best to remove them (do not add new sources of data bias, maliciously or inadvertently)

Page 93: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Twyman’s Law

Page 94: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Cool Findings

• 5% of our customers were born on the same day (including the year)

• There is a sales decline on April 2nd, 2006 on all US e-commerce sites

• Customers willing to receive emails are also heavy spenders

Page 95: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

What Is Happening?

• 11/11/11 is the easiest way to satisfy the mandatory birth date field!

• Due to daylight saving starting, the hour from 1AM to 2AM does not exist and hence nothing will be sold during that period!

• The default value at registration time is “Accept Emails”!

Page 96: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Take Home Message

• Cautious optimism
• Twyman’s Law: Any statistic that appears interesting is almost certainly a mistake
• Many “amazing” discoveries are the result of some (not always readily apparent) business process

• Validate all discoveries in different ways

Page 97: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Simpson’s Paradox

Page 98: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

“Weird” Findings

• Kidney stone treatment: overall, treatment B is better; when split by stone size (large/small), treatment A is better
• Gender bias at UC Berkeley: overall, a higher percentage of males than females are accepted; when split by department, the situation is reversed
• Purchase channel: overall, multi-channel customers spend more than single-channel customers; when split by number of purchases per customer, the opposite is true
• Email campaign performance: overall, revenue per email is decreasing; when split by subscriber type (engaged/others), productivity per email campaign is increasing
• Presidential election: overall, candidate X’s tally of individual votes is highest; when split by states, candidate Y wins the election

Page 99: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

What Is Happening?

• Kidney stone treatment: neither treatment worked well against large stones, but treatment A was heavily tested on those
• Gender bias at UC Berkeley: departments differed in their acceptance rates, and female students applied more to departments where such rates were lower

• Purchase channel: customers that visited often spent more on average and multi-channel customers visited more

• Email campaign: file mix issue, number of disinterested prospects grows faster than number of engaged customers

• Presidential election: winner-take-all favors large states

Page 100: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Take Home Message

• These effects are due to confounding variables
• Combining segments amounts to taking a weighted average, so it is possible that

\frac{a}{b} < \frac{A}{B} and \frac{c}{d} < \frac{C}{D}, and yet \frac{a+c}{b+d} > \frac{A+C}{B+D}

• Lack of awareness of the phenomenon may lead to mistaken/misleading conclusions
• Must be careful not to infer causality from what are only correlations
• Only sure cure/gold standard (for causality inference): controlled experiments
– Careful with randomization
– Not always desirable/possible (e.g., parachutes)
• Confounding variables may not be among the ones we are collecting (latent/hidden)
• Watch out for them!

Page 101: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Group Project

Page 102: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

Intro to Weka/R

Page 103: START OF DAY 2 Reading: Chap. 6 & 12. Decision Tree Learning.

END OF DAY 2. Homework: Data Issues