SOS Package Statistics for the Sciencesmdahn.com/Statistics/SOS2.pdf · 2019. 12. 24. · AN#...

SOS Package Statistics for the Sciences Tutor Abdullah Nasser [email protected] Course Coordinator Olivia Hensz Annie Kanwar


    Discrete versus Continuous random variables

    • Two important things to note here:

    1. You cannot list all the values a continuous random variable can assume;
    2. Prob(X=x) for a continuous random variable is always zero for any value of x.

    This is unlike our discrete variable distributions. Take a minute to make sure you understand this point.

    • For continuous random variables, we only care about the probability of ranges of X, and not exact values of X. E.g. P(X≤x), which is exactly the same as P(X<x).


    o It gets trickier if you are asked about the probability that your friend’s TA weighs more than 65 kg. The N(0,1) table gives you Prob(Z<z). But remember that:

    Prob(Z>z) = 1 − Prob(Z<z)
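The complement rule can be checked numerically. Below is a minimal Python sketch, assuming Python 3.8+ so that `statistics.NormalDist` is available; the z-score used is a hypothetical value chosen for illustration, not a number from these notes.

```python
from statistics import NormalDist

# Standard normal distribution N(0, 1)
std_normal = NormalDist(mu=0.0, sigma=1.0)


def upper_tail(z: float) -> float:
    """Return Prob(Z > z) as 1 - Prob(Z < z), the value a z-table gives."""
    return 1.0 - std_normal.cdf(z)


z_example = 1.0  # hypothetical z-score, for illustration only
print(round(std_normal.cdf(z_example), 4))  # lower tail, what the table lists
print(round(upper_tail(z_example), 4))      # upper tail, via the complement
```

The two printed values always add up to 1, which is the whole point of the rule.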


    normal. In other words, the sampling distribution of any mean becomes Normal as the sample size grows. All we need is for the observations to be independent and collected with randomization.

    o This way, we know that: x̄ ~ N(µ, σ²/n)

    o If you are asked about the probability of x̄ being bigger or smaller than a certain number, follow the same procedure as calculating the z-score above, but remember that we are using σ/√n instead of just σ, such that:

    Z = (x̄-µ) / (σ/√n)

    Example: It’s your birthday and you decide to buy your friends lunch (15 of you in total). Each of you orders an entree at random from a menu that has a mean price of $10 per dish and a standard deviation of $2 (prices are normally distributed). You only have $100 in your wallet. What is the probability you will spend more than $100?

    Answer: Let’s see the central limit theorem in action! We know that the price of a single dish is:

    X ~ N(10, 2²)

    The average cost, x̄, is therefore distributed as:

    x̄ ~ N(10, 2²/15)

    If the total cost exceeds $100, then the average cost per plate exceeds 100/15 = 6.667. The question now becomes about the probability of x̄ exceeding 6.667. We can calculate the z-score as usual:

    z = (6.667 − 10) / (2/√15) = −6.45

    With a z-score this extreme, we know the probability is essentially 1.
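The arithmetic in the lunch example can be reproduced with a short Python sketch. It uses the numbers given in the example and assumes Python 3.8+ for `statistics.NormalDist`; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 10.0, 2.0, 15   # mean dish price, SD of dish price, number of diners
budget = 100.0

threshold = budget / n          # average price per dish that uses up the budget
se = sigma / sqrt(n)            # standard error of the sample mean
z = (threshold - mu) / se       # z-score of the threshold
prob_over_budget = 1.0 - NormalDist().cdf(z)  # Prob(x̄ > threshold)

print(round(threshold, 3), round(z, 2), round(prob_over_budget, 6))
```

Note that the standard error, not σ itself, sits in the denominator, because the question is about the average of 15 dishes rather than a single dish.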

    • The sampling distribution of the sample proportion

    o Similarly, if you take N samples of n individuals each, and figure out the proportion p̂ of some characteristic of interest in each sample and average them out, then:

    E(p̂) = p
    Var(p̂) = pq/n, where q = 1 − p

    o The formal definition: p̂ ~ N(p, p(1−p)/n)

    which tells us that for big n, the sample proportion follows a bell curve.

    Large Sample Estimation

    • Recall that we are usually interested in a certain population parameter, say the average weight of all students or the average number of Apple products Windsorites own, but we cannot usually measure everyone’s weight or ask them about their Apple products. We usually have to resort to sampling some of them, and infer information about the population.

    • In other words, we can measure sample statistics and use them as close approximations of population parameters. For instance, we have x̄ and s from the sample, and use them as approximations for µ and σ, respectively.

    • If the sampling distribution centers on the population mean, we say that the estimator is unbiased.

    What is a confidence interval

    o Commonly, it is described as the range inside which you are X% sure the real value of the population mean is. More accurately, if you repeated the sampling many times, X% of the intervals built this way would contain the true value.

    o The general formula for calculation:

    100(1−α)% CI = point estimate ± z(α/2) × SE

    o Essentially, you multiply the standard error by the z(α/2) value. The latter is chosen based on the confidence level you desire. Here is a table to help you out:

    CI      z(α/2)
    90%     1.645
    95%     1.96
    98%     2.33
    99%     2.58

    Example: If you want to increase the confidence level from 90% to 98%, would you have a wider or narrower confidence interval? Answer: Wider! Think about it: if you want to be more confident your range includes the true value, you’d need to include more values in your range.
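The table values and the "wider or narrower" question can both be checked with a small Python sketch. It assumes Python 3.8+ for `statistics.NormalDist`; the function names and the point estimate and SE in the width comparison are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Two-sided critical value z(α/2) for a confidence level such as 0.95."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def confidence_interval(estimate: float, se: float, confidence: float = 0.95):
    """Return (lower, upper) for estimate ± z(α/2) × SE."""
    margin = z_critical(confidence) * se
    return estimate - margin, estimate + margin


# Reproduce the table (the notes round the last two entries to 2.33 and 2.58)
for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: {z_critical(level):.3f}")

# Hypothetical estimate and SE: the 98% interval comes out wider than the 90% one
low90, high90 = confidence_interval(50.0, 2.0, 0.90)
low98, high98 = confidence_interval(50.0, 2.0, 0.98)
print(high90 - low90 < high98 - low98)
```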


    • How to calculate CI’s for proportions

    o It’s the same idea. Just multiply the SE for proportions by the appropriate z(α/2) value. For instance, the 95% confidence interval is calculated as follows:

    p̂ ± 1.96 √(pq/n)

    Example: Your friend is conducting research on the use of iPhones on campus. He lost some pages from his notebook and wants your help reconstructing his data. This is what he knows: he found that 72 out of 90 students in his sample used iPhones. He is X% confident that the true proportion is no less than 0.6912. What was his confidence level?

    Answer: We will have to solve this backwards. Let’s list what we know:

    o The lower bound of the X% CI is 0.6912,
    o The number of students who have iPhones is 72, and the sample size is 90.

    This means p̂ is 72/90 = 0.8. Let’s write out the proportion CI formula, and plug in what we know:

    0.8 − z(α/2) √((0.8 × 0.2)/90) = 0.6912

    If we solve for z(α/2), we find that it is 2.58. Looking at our table above, we find that it corresponds with a CI of 99%.

    • One and two sided methods for constructing CI’s

    o One-sided intervals are bounded by one number and either infinity or negative infinity, e.g. (76,∞) or (-∞, 356)

    o For a one-sided 95% CI, the interpretation would take a form similar to this example based on the two intervals above:

    ▪ I’m 95% confident there are at least 76 Xboxes in Windsor; or
    ▪ I’m 95% confident there are at most 356 students in my section.

    o To construct one-sided intervals, we use the same formula as before, but instead of 1.96 (z(α/2)), we use 1.64 (z(α)) for the 95% CI.
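As a rough illustration of the one-sided construction, here is a Python sketch assuming Python 3.8+ for `statistics.NormalDist`. The function name, the point estimate and the SE are hypothetical. The exact critical value is about 1.645, which the notes round to 1.64.

```python
from statistics import NormalDist


def one_sided_bounds(estimate: float, se: float, confidence: float = 0.95):
    """Return the lower bound of (lower, ∞) and the upper bound of (-∞, upper)."""
    z_alpha = NormalDist().inv_cdf(confidence)  # all of α sits in one tail
    lower = estimate - z_alpha * se
    upper = estimate + z_alpha * se
    return lower, upper


z_one_sided = NormalDist().inv_cdf(0.95)    # one-sided 95% critical value
z_two_sided = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value
print(round(z_one_sided, 3), round(z_two_sided, 3))

# Hypothetical estimate of 100 with SE 10
lower, upper = one_sided_bounds(100.0, 10.0)
print(round(lower, 2), round(upper, 2))
```

Only one of the two bounds is reported in practice, depending on whether the claim is "at least" or "at most".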

    Difference between means

    o If you want to know whether two populations are the same (have the same mean), you need to know this equation. An example where this might arise is if you want to see whether, say, life expectancy is higher in a population that exercises than in one that doesn’t.

    o To do this, we use the same exact equation above for CI.

    o Here is how you get the SE:

    SE = √(s1²/n1 + s2²/n2)

    o For proportions, this is the SE:

    SE = √(p1q1/n1 + p2q2/n2)
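The two standard errors for differences can be sketched in Python as follows. The function names and all input numbers are hypothetical, picked only so the arithmetic is easy to follow by hand.

```python
from math import sqrt


def se_diff_means(s1: float, n1: int, s2: float, n2: int) -> float:
    """SE of the difference between two sample means."""
    return sqrt(s1**2 / n1 + s2**2 / n2)


def se_diff_props(p1: float, n1: int, p2: float, n2: int) -> float:
    """SE of the difference between two sample proportions (q = 1 - p)."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical samples: s1=3, n1=36 and s2=4, n2=64 give 9/36 + 16/64 = 0.5
print(round(se_diff_means(3.0, 36, 4.0, 64), 4))
# Hypothetical proportions: p1=p2=0.5 with n1=n2=100 give 0.0025 + 0.0025 = 0.005
print(round(se_diff_props(0.5, 100, 0.5, 100), 4))
```

Either SE then goes into the same CI formula: (difference of the two estimates) ± critical value × SE.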

    Sample size

    o Before conducting studies, you might want to know how many people you need to sample. You base this calculation on the margin of error (critical value × SE) you’d like to have, and solve backwards for n.

    o For proportions, when we have no prior estimate of p, we take p=q=0.5, which gives the largest possible SE.
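Solving backwards for n can be sketched in Python like this. It assumes Python 3.8+ for `statistics.NormalDist`; the function names and the margins used in the calls are hypothetical and only show the mechanics. The result is always rounded up, since a fraction of a person cannot be sampled.

```python
from math import ceil
from statistics import NormalDist


def z_two_sided(confidence: float) -> float:
    """Two-sided critical value z(α/2)."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


def sample_size_proportion(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    """Smallest n with z * sqrt(p*q/n) <= margin; p = 0.5 is the conservative default."""
    z = z_two_sided(confidence)
    return ceil(z**2 * p * (1.0 - p) / margin**2)


def sample_size_mean(margin: float, sd: float, confidence: float = 0.95) -> int:
    """Smallest n with z * sd / sqrt(n) <= margin."""
    z = z_two_sided(confidence)
    return ceil((z * sd / margin) ** 2)


# Hypothetical targets: ±5 percentage points, and ±2 units with an SD of 10
print(sample_size_proportion(0.05))
print(sample_size_mean(2.0, 10.0))
```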

    Example: The distribution of heart rates for adults in Canada is normal with a mean of 69 beats per minute and a standard deviation of 6 beats per minute.

    a) What proportion of adults in Canada have a heart rate below 65 beats per minute?
    b) A survey of 20 students found the mean heart rate to be 65.8 beats per minute, with a standard deviation of 7.1. Construct a 95% confidence interval from this sample for the true mean heart rate at Windsor.
    c) How many students would need to be sampled in order for the confidence interval to have a margin of error of 1 beat per minute or smaller?

    Answer:

    a) We need to find the z-score and the associated probability. The z-score is (65−69)/6 = −0.667. The probability associated with that is about 25.25%.

    b) First, we find the SE. SE = 7.1/√20 = 1.588. Next, we find the margin of error as follows: 1.96 × 1.588 = 3.112, leading to a 95% CI of (62.7, 68.9).

    c) We have to do the reverse operations. 1.96 × (7.1/√n) = 1 gives n = 193.66, which we have to round up to 194.
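All three parts of the heart-rate example can be checked with a short Python sketch. It uses the numbers from the example and the 1.96 critical value from the table, and assumes Python 3.8+ for `statistics.NormalDist`; variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

# (a) proportion of adults below 65 bpm, population N(69, 6²)
population = NormalDist(mu=69.0, sigma=6.0)
p_below_65 = population.cdf(65.0)

# (b) 95% CI from the sample of 20 students
n, xbar, s = 20, 65.8, 7.1
z = 1.96                        # two-sided 95% critical value from the table
se = s / sqrt(n)
margin = z * se
ci = (xbar - margin, xbar + margin)

# (c) smallest n giving a margin of error of at most 1 bpm
target_margin = 1.0
n_needed = ceil((z * s / target_margin) ** 2)

print(round(p_below_65, 4))
print(round(se, 3), round(margin, 3), tuple(round(b, 1) for b in ci))
print(n_needed)
```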

    • Null and alternative hypothesis

    o Null hypothesis: this is the status quo hypothesis. We denote it as Ho.

    o Alternative hypothesis: this is the alternative hypothesis to the null. We denote it as Ha.

    o Notation: if you are testing whether your training program is making a difference in workers’ performance, then your null hypothesis would be “there is no difference” and your alternative would be “there is a positive difference.” Here is the notation you will use to refer to them:

    Ho : diff = 0
    Ha : diff > 0 (where diff = after − before)

    o You cannot accept the null hypothesis or prove the alternative hypothesis. You can only:

    a. Reject the null hypothesis and accept the alternative; or
    b. Fail to reject the null hypothesis.

    o Always state your null and alternative hypotheses. If you are doing a test, state clearly what you are testing. Don’t lose points over this.

    • Type I and II error

    o Type I: when the null is true and we reject it.
    o Type II: when the null is false and we fail to reject it.

    • α and β

    o α: Prob(type I error)
    o β: Prob(type II error)

    • Test-statistic

    o Our general approach will be to calculate the t-statistic for a certain hypothesis then compare it to one of two values to see if that value is statistically significant;

    o For instance, if you were asked to determine whether there is significant evidence the mean of a certain population is at a particular value, you would have to use the t-statistic. E.g. you are interested in the mean number of Apple products Windsor students have. You go and sample 100 of them. You find the mean of your sample is 2. You can use the t-statistic to test the hypothesis that the real mean of the entire Windsor population is, say, 3, or the hypothesis that Windsor students have more than 5 Apple products.

    o You calculate the t-statistic as follows:

    t = (x̄ − µ) / (s/√n)

    where x̄ is your sample mean, µ is what you hypothesize the real mean to be (3 in the example above), s is the sample standard deviation, and n is the sample size (100 in the example above).

    o For 95% confidence, if your alternative hypothesis is:

    ▪ µ ≠ [a certain value], then you reject the null hypothesis if the absolute value of your t-statistic is greater than 1.96 (two-tailed);
    ▪ µ < [a certain value], then you reject the null hypothesis if your calculated t-statistic is less than -1.64 (one-tailed);
    ▪ µ > [a certain value], then you reject the null hypothesis if your calculated t-statistic is greater than 1.64 (one-tailed).

    • If you want to increase or decrease your confidence, adjust the critical values accordingly.
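The three rejection rules can be written as a small Python sketch. The function names and the labels for the alternatives are hypothetical. The call uses the Apple-products numbers from above (sample mean 2, hypothesized mean 3, n = 100), but the standard deviation of 1.5 is an assumption made up for illustration, since the notes do not give one.

```python
from math import sqrt

CRITICAL_TWO_TAILED = 1.96   # 95% confidence, two-tailed
CRITICAL_ONE_TAILED = 1.64   # 95% confidence, one-tailed (value used in the notes)


def t_statistic(xbar: float, mu0: float, s: float, n: int) -> float:
    """t = (x̄ - µ) / (s / √n)."""
    return (xbar - mu0) / (s / sqrt(n))


def reject_null(t: float, alternative: str) -> bool:
    """Apply the 95% rule for the alternative 'not_equal', 'less' or 'greater'."""
    if alternative == "not_equal":
        return abs(t) > CRITICAL_TWO_TAILED
    if alternative == "less":
        return t < -CRITICAL_ONE_TAILED
    if alternative == "greater":
        return t > CRITICAL_ONE_TAILED
    raise ValueError(f"unknown alternative: {alternative!r}")


# s = 1.5 is a hypothetical standard deviation, not given in the notes
t = t_statistic(xbar=2.0, mu0=3.0, s=1.5, n=100)
print(round(t, 2), reject_null(t, "not_equal"), reject_null(t, "greater"))
```

Changing the confidence level only means swapping the two constants for the matching critical values.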

    • P-values

    o The part of statistics you are likely to encounter a lot!
    o It can be anywhere from 0 to 1;
    o The bigger the p-value, the more consistent the data are with our null hypothesis;
    o We generally want p-values to be small so we can reject the null. A p-value of 0.05 or less is what we usually take as convincing evidence that our null hypothesis can be rejected.
    o The p-values are usually specific to the alternative hypothesis being tested, so make sure you quote the right values in your answers.
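To see how the p-value depends on the alternative hypothesis, here is a Python sketch that uses the large-sample normal approximation, in line with the critical values used in these notes. It assumes Python 3.8+ for `statistics.NormalDist`; the function name and the labels for the alternatives are hypothetical.

```python
from statistics import NormalDist


def p_value(stat: float, alternative: str) -> float:
    """Normal-approximation p-value for 'greater', 'less' or 'not_equal'."""
    z = NormalDist()
    if alternative == "greater":
        return 1.0 - z.cdf(stat)               # upper tail
    if alternative == "less":
        return z.cdf(stat)                     # lower tail
    if alternative == "not_equal":
        return 2.0 * (1.0 - z.cdf(abs(stat)))  # both tails
    raise ValueError(f"unknown alternative: {alternative!r}")


# The same test statistic gives different p-values under different alternatives
for alt in ("greater", "less", "not_equal"):
    print(alt, round(p_value(1.96, alt), 4))
```

This is why the same test statistic can be significant under one alternative and not under another.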

    Example: The average intake of coffee at the University of Windsor is 3 cups per day per student. A student is interested in whether this number increases during exam period. He surveys 100 students and finds that the average number of coffee cups they consumed during exam period is 3.4 with a standard deviation of 0.7. Is this difference significant at the 95% confidence level? Answer using:

    a) Confidence interval for the sample mean
    b) Test statistic to test if exam time increases coffee consumption
    c) P-value for the same test

    Answer: First things first, let’s state our null and alternative hypotheses:

    Ho : µnormal = µexams
    Ha : µnormal < µexams

    b) The rule for the test statistic:

    t = (x̄ − µ) / (s/√n)

    t = (3.4 − 3) / (0.7/√100)

    t = 5.7

    The test statistic is >1.64, which means we can reject the null hypothesis and conclude the increase in coffee is significant at the 95% level.
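The coffee calculation can be reproduced with a short Python sketch using the numbers from the example. It assumes Python 3.8+ for `statistics.NormalDist`. The last step computes the upper-tail probability from the normal approximation instead of reading it from a table; it is shown as one way to do the lookup, not as the value quoted in these notes.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 3.0                  # usual average cups per day (null hypothesis value)
xbar, s, n = 3.4, 0.7, 100  # exam-period sample mean, SD and sample size

t = (xbar - mu0) / (s / sqrt(n))   # test statistic
reject = t > 1.64                  # one-tailed rule for "consumption increases"
p_upper = 1.0 - NormalDist().cdf(t)  # upper-tail p-value, normal approximation

print(round(t, 1), reject)
print(p_upper < 0.05)
```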

    c) Let’s look at the p-value of our test statistic. Looking at the table, we find that p is


    Additional Questions

    Question: A political scientist is studying the number of hours an average person spends surfing the net per day. From a sample of 340, she finds the average number of hours is 4.2 with a standard deviation of 0.5. Because she likes to apply her research to her family, she finds out her granddad ranks at the 30th percentile in terms of Internet usage. How many hours does her granddad spend per day surfing the web? You can assume her results are unbiased and s=σ.

    Answer: We usually get x, calculate the z-score, and then get the probability. In this question, we get the probability, and have to calculate the z-score, and then get our desired x. Because we know the percentile is 0.30, we get the corresponding z-score, which is -0.5244. Using the z-score formula, we have:

    -0.5244 = (x − 4.2)/0.5

    -0.2622 = x − 4.2

    x = 3.9378 hours
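Going from a percentile back to a value is the inverse of the usual table lookup. Here is a Python sketch with the numbers from this question, assuming Python 3.8+ for `statistics.NormalDist`; variable names are illustrative.

```python
from statistics import NormalDist

mean_hours, sd_hours = 4.2, 0.5
percentile = 0.30

# Step 1: z-score that has 30% of the standard normal distribution below it
z30 = NormalDist().inv_cdf(percentile)

# Step 2: undo the z-score formula, x = mean + z * sd
x = mean_hours + z30 * sd_hours

# The same answer in one step, working directly with N(4.2, 0.5²)
x_direct = NormalDist(mu=mean_hours, sigma=sd_hours).inv_cdf(percentile)

print(round(z30, 4), round(x, 4), round(x_direct, 4))
```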

    Question: A student wants to determine what percentage of college students smoke. How large a sample should she take to be 95% confident that her sample proportion is off by no more than 3.5%?

    Answer:

    0.035 = 1.96 × SE [recall the general formula]
    0.035 = 1.96 × √((0.5 × 0.5)/n) [recall we use p=q=0.5]
    n = 784 [solving for n, we get this result]

    Question: A food inspector tests several samples from the local restaurant and concludes it is safe. State what type of error, if any, he committed if the restaurant in reality:

    a. Does serve safe food
    b. Serves rotten tomatoes from the 1997 harvest

    Answer:

    a. No error committed
    b. Type I error, taking the null hypothesis to be “the food is not safe”: the inspector rejected a null that was in fact true. (If the null were instead “the food is safe”, failing to reject it here would be a Type II error.)


    Addendum

    Parameter                      Standard Error
    Sample mean                    σ/√n
    Sample proportion              √(pq/n)
    Sample mean difference         √(s1²/n1 + s2²/n2)
    Sample proportion difference   √(p1q1/n1 + p2q2/n2)