CSI5388: Functional Elements of Statistics for Machine Learning, Part II
1

CSI5388: Functional Elements of Statistics for Machine Learning
Part II
2

Contents of the Lecture

Part I (the previous set of lecture notes):
• Definition and Preliminaries
• Hypothesis Testing: Parametric Approaches

Part II (this set of lecture notes):
• Hypothesis Testing: Non-Parametric Approaches
• Power of a Test
• Statistical Tests for Comparing Multiple Classifiers
3

Non-parametric Approaches to Hypothesis Testing

The hypothesis testing procedures discussed in the previous lecture are called parametric.
This means that they are based on assumptions regarding the distribution of the population for which the test was run, and rely on the estimation of parameters of these distributions.
In our case, we assumed that the distributions were either normal or followed a Student t distribution. The parameters we estimated were the mean and the variance.
The problem we now turn to is hypothesis testing that is not based on assumptions regarding the distribution and does not rely on the estimation of parameters.
4

The Different Types of Non-parametric Hypothesis Testing Approaches I

There are two important families of tests that do not involve distributional assumptions and parameter estimation:
• Nonparametric tests, which rely on ranking the data and performing a statistical test on the ranks.
• Resampling statistics, which consist of drawing samples repeatedly from a population and evaluating the distribution of the result. Resampling statistics will be discussed in the next lecture.
5

The Different Types of Non-parametric Hypothesis Testing Approaches II

Nonparametric tests are quite useful for populations in which outliers skew the distribution too much; ranking eliminates the problem. However, they are typically less powerful (see further) than parametric tests.
Resampling statistics are useful when the statistics of interest cannot be derived analytically (e.g., statistics about the median of a population) unless we assume a normal distribution.
6

Non-Parametric Tests

Wilcoxon's Rank-Sum Test
• The case of independent samples
• The case of matched pairs
7

Wilcoxon's Rank-Sum Tests

Wilcoxon's Rank-Sum Tests are equivalent to the t-test, but apply when the normality assumption of the distribution is not met.
As a result of their non-parametric nature, however, power is lost (see further for a formal discussion of power). In particular, the tests are not as specific as their parametric equivalents.
This means that, although we interpret the result of these non-parametric tests to mean one thing of a central nature to the distributions under study, they could mean something else.
8

Wilcoxon's Rank-Sum Test (for two independent samples)
Informal Description I

Given two populations with n1 observations in group 1 and n2 observations in group 2, the null hypothesis we are trying to reject is: "H0: The two samples come from identical populations (not just populations with the same mean)".
We consider two cases:
• Case 1: The null hypothesis is false (to a substantial degree) and the scores from population 1 are generally lower than those of population 2.
• Case 2: The null hypothesis is true. This means that the two samples came from the same population.
9

Wilcoxon's Rank-Sum Test (for two independent samples)
Informal Description II

In both cases, the procedure consists of ranking the scores of the two populations taken together.
• Case 1: In the first case, we expect the ranks from population 1 to be generally lower than those of population 2. Accordingly, we could also expect the sum of the ranks in group 1 to be smaller than the sum of the ranks in group 2.
• Case 2: In the second case, we expect the sum of ranks of the first group to be about equal to the sum of ranks of the second group.
10

Wilcoxon's Rank-Sum Test (for two independent samples)
n1 and n2 ≤ 25:

• Consider the two groups of data sets of size n1 and n2, respectively, where n1 is the smaller sample size.
• Rank their scores together from lowest to highest. In case of an x-way tie just after rank y, assign (y+1 + y+2 + … + y+x)/x to all the tied elements.
• Add the ranks of the group containing the smaller number of samples (n1) (if both groups contain as many samples, choose the smaller sum). Call this sum Ws.
• Find the value V in the Wilcoxon table for n1 and n2 and the significance level s required, where n1 in the table corresponds to the smaller value as well.
• Compare Ws to V and conclude that the difference between the two groups is significant at the chosen level (L1 for a one-tailed test, or 2*L1 for a two-tailed test) only if Ws < V. If Ws ≥ V, the null hypothesis cannot be rejected.
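The ranking and Ws computation above can be sketched in Python. This is a minimal sketch; the helper names (`average_ranks`, `rank_sum_ws`) are illustrative, not from the lecture.

```python
def average_ranks(xs):
    """Rank scores from lowest to highest; tied values all receive the
    average of the ranks they jointly occupy (the (y+1 + ... + y+x)/x rule)."""
    s = sorted(xs)
    # first and last positions of x in the sorted list bound the tied span
    return [(s.index(x) + 1 + len(s) - s[::-1].index(x)) / 2 for x in xs]

def rank_sum_ws(group1, group2):
    """Ws: the sum of ranks of the smaller group, ranking both groups
    together (with equal group sizes, take the smaller of the two sums)."""
    if len(group2) < len(group1):
        group1, group2 = group2, group1
    ranks = average_ranks(list(group1) + list(group2))
    w1 = sum(ranks[:len(group1)])
    w2 = sum(ranks[len(group1):])
    return min(w1, w2) if len(group1) == len(group2) else w1
```

For example, for scores [1, 2, 3] against [4, 5, 6, 7], every score of the first group ranks lower, giving Ws = 1 + 2 + 3 = 6, which would then be compared to the tabulated V.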
11

Wilcoxon's Rank-Sum Test (for two independent samples)
n1 and n2 > 25:

• Compute Ws as before.
• Use the fact that Ws approaches a normal distribution as the sample size increases, with:
  • a mean of m = n1(n1+n2+1)/2, and
  • a standard error of std = sqrt(n1*n2*(n1+n2+1)/12)
• Compute the z statistic: z = (Ws – m)/std
• Use the tables of the normal distribution.
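The large-sample normal approximation for Ws can be computed directly; a sketch with an illustrative function name:

```python
import math

def rank_sum_z(ws, n1, n2):
    """z statistic for the rank-sum Ws under the large-sample
    normal approximation (n1 is the smaller group's size)."""
    m = n1 * (n1 + n2 + 1) / 2                     # mean of Ws under H0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # standard error of Ws
    return (ws - m) / std
```

The resulting z is then looked up in a standard normal table as usual.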
12

Wilcoxon's Matched Pairs Signed Ranks Test (for paired scores)
Informal Description

Logic of the Test: Given the same population tested under different circumstances C1 and C2: if there is improvement in C2, then most of the results recorded in C2 will be greater than those recorded in C1, and those that are not greater will be smaller by only a small amount.
13

Wilcoxon's Matched Pairs Signed Ranks Test (for paired scores)
n ≤ 50:

• We calculate the difference score for each pair of measurements.
• We rank all the difference scores without paying attention to their signs (i.e., we rank their absolute values).
• We assign the algebraic sign of the differences to the ranks.
• We sum the positive and negative ranks separately.
• We choose as test statistic T the smaller of the absolute values of the two sums.
• We compare T to a Wilcoxon T table.
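The steps above can be sketched as follows. The names are illustrative, and zero differences are dropped, which is the conventional treatment although the slide does not address them.

```python
def signed_rank_t(scores_c1, scores_c2):
    """Wilcoxon signed-rank statistic T for paired scores."""
    # difference score for each pair; drop zero differences
    diffs = [b - a for a, b in zip(scores_c1, scores_c2) if b != a]
    # rank the absolute differences, averaging ranks over ties
    s = sorted(abs(d) for d in diffs)
    ranks = [(s.index(abs(d)) + 1 + len(s) - s[::-1].index(abs(d))) / 2
             for d in diffs]
    # sum the positive and negative ranks separately
    pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    # T is the smaller of the two (absolute) sums
    return min(pos, neg)
```

For example, with C1 scores [10, 12, 14] and C2 scores [13, 11, 18], the differences are [3, -1, 4] with ranks [2, 1, 3], so the negative sum is 1, the positive sum is 5, and T = 1.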
14

Wilcoxon's Matched Pairs Signed Ranks Test (for paired scores)
n > 50:

• Compute T as before.
• Use the fact that T approaches a normal distribution as the sample size increases, with:
  • a mean of m = n(n+1)/4, and
  • a standard error of std = sqrt(n(n+1)(2n+1)/24)
• Compute the z statistic: z = (T – m)/std
• Use the tables of the normal distribution.
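The normal approximation for T is again a direct computation; a sketch with an illustrative function name:

```python
import math

def signed_rank_z(t, n):
    """z statistic for T under the large-sample normal approximation,
    where n is the number of pairs."""
    m = n * (n + 1) / 4                              # mean of T under H0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # standard error of T
    return (t - m) / std
```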
15

Power Analysis
16

Type I and Type II Errors

Definition: A Type I error (α) corresponds to the error of rejecting H0, the null hypothesis, when it is, in fact, true. A Type II error (β) corresponds to the error of failing to reject H0 when it is false.
Definition: The power of a test is the probability of rejecting H0 given that it is false. Power = 1 – β.
17

Why does Power Matter? I

All the hypothesis tests described in the previous sections are only concerned with reducing the Type I error.
That is, they try to ascertain the conditions under which we are rejecting a hypothesis rightly.
They are not at all concerned with the case where the null hypothesis is really false, but we do not reject it.
18

Why does Power Matter? II

In the case of Machine Learning, reducing the Type I error means reducing the probability of our saying that there is a difference in the performance of the two classifiers when, in fact, there isn't.
Reducing the Type II error means reducing the probability of our saying that there is no difference in the performance of the two classifiers when, in fact, there is.
Power matters because we do not want to discard a classifier that shouldn't have been discarded. If a test does not have enough power, then this kind of situation can arise.
19

What is the Effect Size?

The effect size measures how strong the relationship between two entities is.
In particular, if we consider a particular procedure, in addition to knowing how statistically significant the effect of that procedure is, we may want to know what the size of this effect is.
There are different measures of effect size, including:
• Pearson's correlation coefficient
• Odds ratio
• Cohen's d statistic
Cohen's d statistic is appropriate in the context of a t-test on means. It is thus the effect size measure we concentrate on here.
[Wikipedia: http://en.wikipedia.org/wiki/Effect_size]
20

Cohen's d Statistic

Cohen's d statistic is expressed as:
d = (X̄1 – X̄2) / sp
where sp^2, the pooled variance estimate, is:
sp^2 = ((n1–1)*s1^2 + (n2–1)*s2^2) / (n1+n2–2)
and sp is its square root.
[Note: this is not exactly Cohen's d measure, which was expressed in terms of parameters. What we show above is an estimate of d.]
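The estimate of d can be computed from two samples as follows; this is a sketch, and `cohens_d` is an illustrative name.

```python
import math

def cohens_d(sample1, sample2):
    """Estimated Cohen's d: difference of sample means over the pooled
    standard deviation sp."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    s1_sq = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)  # sample variance 1
    s2_sq = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)  # sample variance 2
    # pooled variance estimate, weighted by degrees of freedom
    sp = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (m1 - m2) / sp
```

For instance, samples [2, 4, 6] and [1, 3, 5] both have variance 4, so sp = 2 and d = (4 − 3)/2 = 0.5, a medium effect by the guidelines on the next slide.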
21

Usefulness of the d Statistic

d is useful in that it standardizes the difference between the two means: we can talk about deviations in terms of proportions of standard deviation points, which are more useful than actual differences, which are domain dependent.
Cohen came up with a set of guidelines concerning d:
• d = .2 is a small effect, but is probably meaningful;
• d = .5 is a medium effect that is noticeable;
• d = .8 is a large effect.
22

Statistical Tests for Comparing Multiple Classifiers
23

What is the Analysis of Variance (ANOVA)?

The analysis of variance is similar to the t-test in that it deals with differences between sample means.
However, unlike the t-test, which is restricted to the difference between two means, ANOVA allows us to assess whether the differences observed between any number of means are statistically significant.
In addition, ANOVA allows us to deal with more than one independent variable. For example, we could choose, as two independent variables, 1) the learning algorithm and 2) the domain to which the learning algorithm is applied.
24

Why is ANOVA useful?

One may wonder why ANOVA is useful in the context of classifier evaluation.
Very simply, if we want to answer the following common question: "How do various classifiers fare on different data sets?", then we have two independent variables, the learning algorithm and the domain, and a lot of results.
ANOVA makes it easy to tell whether the differences observed are indeed significant.
25

Variations on the ANOVA Theme

There are different implementations of ANOVA:
• One-way ANOVA is a linear model trying to assess whether the difference in the performance measures of classifiers over different datasets is statistically significant, but it does not distinguish between the performance measures' variability within datasets and their variability between datasets.
• Two-way/Multi-way ANOVA can deal with more than one independent variable: for instance, two performance measures over different classifiers over various datasets.
Then there are other related tests as well:
• Friedman's test, post-hoc tests, the Tukey test, etc.
26

How does One-Way ANOVA work? I

It considers various groups of observations and sets as its null hypothesis that all the means are equal.
The opposite hypothesis is that they are not all equal.
The ANOVA model is as follows:
x_ij = μ_i + e_ij
where x_ij is the jth observation from group i, μ_i is the mean of group i, and e_ij is the noise, which is normally distributed with mean 0 and common standard deviation σ.
27

How does One-Way ANOVA work? II

ANOVA monitors three different kinds of variation in the data:
• Within-group variation
• Between-group variation
• Total variation = within-group variation + between-group variation
Each of the above variations is represented by a sum of squares (SS) of the variations.
The statistic of interest in ANOVA is F, where
F = between-group variation / within-group variation
with each variation taken as its sum of squares divided by its degrees of freedom (its mean square).
Larger F's demonstrate greater statistical significance than smaller ones. As for z and t, there are tables of significance levels associated with the F-ratio.
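The F ratio can be sketched as follows. As is standard practice, each sum of squares is divided by its degrees of freedom (k−1 between groups, N−k within groups) before taking the ratio; the function name is illustrative.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations:
    between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    # divide each SS by its degrees of freedom to obtain mean squares
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For example, groups [1, 2, 3] and [4, 5, 6] give SS_between = 13.5 with 1 degree of freedom and SS_within = 4 with 4 degrees of freedom, hence F = 13.5 / 1.0 = 13.5.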
28

How does One-Way ANOVA work? III

The goal of ANOVA is to find out whether or not the differences in means between different groups are statistically significant.
To do so, ANOVA partitions the total variance into variance caused by random error (the within-group SS) and variance caused by actual differences between means (the between-group SS).
If the null hypothesis holds, then the within-group mean square should be about the same as the between-group mean square.
We can compare these two quantities using the F test, which checks whether their ratio is significantly greater than 1.
29

What is Multi-Way ANOVA?

In One-Way ANOVA, we simply considered several groups.
For example, this could correspond to comparing the performance of 10 different classifiers on one domain.
How about the case where we compare the performance of these same 10 classifiers on 5 domains?
Two-Way ANOVA can help with that.
If we were to use an additional dimension, such as the consideration of 6 different (but matched) threshold levels (as in AUC) for each classifier on the same 5 domains, then Three-Way ANOVA could be used, and so on.
30

How Does Multi-Way ANOVA work?

In our example, the difference between One-Way ANOVA and Two-Way ANOVA can be illustrated as follows:
• In One-Way ANOVA, we would calculate the within-group SS by collapsing the results obtained on all the data sets together within each classifier's results.
• In Two-Way ANOVA, we would calculate all the within-classifier, within-domain variances separately and group the results together.
• As a result, the pooled within-group SS of Two-Way ANOVA would be smaller than the pooled within-group SS of One-Way ANOVA.
Multi-Way ANOVA is thus a more statistically powerful test than One-Way ANOVA, since we need fewer observations to find significant effects.