Xiaobo Sheng. Overview CH 1 Introduction CH 2-3 Concepts, Descriptive Statistics of one variable CH...

Introduction to Applied Statistics Xiaobo Sheng

Transcript of Xiaobo Sheng. Overview CH 1 Introduction CH 2-3 Concepts, Descriptive Statistics of one variable CH...

  • Slide 1
  • Xiaobo Sheng
  • Slide 2
  • Overview CH 1 Introduction CH 2-3 Concepts, Descriptive Statistics of one variable CH 6-8 Probability, A few common probability distributions and models CH 9-13 Statistical Inference CH 15 Linear Regression
  • Slide 3
  • Introduction: What is statistics? A collection of numerical information, or the branch of mathematics dealing with the theory and techniques of collecting, organizing, and interpreting numerical information. (We will focus on the first definition.)
  • Slide 4
  • Why do we need statistics? Pepsi vs. Coca-Cola, horse racing, casino games.
  • Slide 5
  • How do we deal with statistics? Input: data set (a collection of information). Process: data analysis (making sense of a data set). Output: statistical inference (drawing conclusions about a population based on a sample from that population). Example data set: Barney A-, Ted B+, Lily A, Robin C, Marshall F.
  • Slide 6
  • A few basic definitions to know. Population: the group or collection of interest to us. Usually it is very large and messy. Sample: a subset of the population, reasonably small and capable of being analyzed using statistical tools. We use the observations in the sample to learn about the population. Examples: income of teachers, average age, etc.
  • Slide 7
  • Descriptive statistic: a number used to summarize information in a set of data values; which one to use varies by problem. Variable: a particular piece of information. Two types: a quantitative variable has numerical values that are measurements; a categorical variable has values that cannot be interpreted as numbers.
  • Slide 8
  • Slide 9
  • 1st quartile (25th percentile): at least three-fourths of the values are greater than or equal to the first quartile. 3rd quartile (75th percentile): at least three-fourths of the values are less than or equal to the third quartile. Page 49
  • Slide 10
  • Range: difference between the largest and smallest values of a data set. Interquartile range: difference between the 3rd and 1st quartiles.
  • Slide 11
  • Standard deviation: used to measure the variation of values about the mean. Population standard deviation: σ. Sample standard deviation: s. P82
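The summary measures on the last few slides can be sketched in code. This is a minimal illustration using Python's standard `statistics` module; the data values are hypothetical, and the quartile cut points follow one of several common conventions, so they may differ slightly from the textbook's method.

```python
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)             # largest minus smallest value
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartile cut points (one convention of several)
iqr = q3 - q1                                  # interquartile range
sigma = statistics.pstdev(data)                # population standard deviation (divides by n)
s = statistics.stdev(data)                     # sample standard deviation (divides by n - 1)

print(data_range, iqr, sigma, s)
```

The sample standard deviation is slightly larger than the population version because it divides by n - 1 instead of n.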
  • Slide 12
  • Lists, Tables, and Plots Data list A listing of the values of a variable in a data set.
  • Slide 13
  • Table
  • Slide 14
  • Table: usually the values in a table are ordered or sorted by some criterion. If not, we can use Excel to sort them.
  • Slide 15
  • Plots Dot Plot
  • Slide 16
  • Frequency Table
  • Slide 17
  • Slide 18
  • Histogram
  • Slide 19
  • Distribution: a description of how the values of the variable are positioned along an axis or number line. Symmetric. Skewed to the left (negatively skewed): there is a concentration of relatively large values, with some scatter over a range of smaller values. Skewed to the right (positively skewed): there is a concentration of relatively small values, with some scatter over a range of larger values.
  • Slide 20
  • Slide 21
  • Peak A major concentration of values.
  • Slide 22
  • A unimodal distribution has one major peak. A bimodal distribution has two major peaks. A multimodal distribution has several major peaks.
  • Slide 23
  • Box plot
  • Slide 24
  • Box graph
  • Slide 25
  • CH4 Scatterplot two-dimensional graphical display of two quantitative variables.
  • Slide 26
  • Slide 27
  • Slide 28
  • Transformation of a variable a mathematical manipulation of each value of the variable. logarithmic transformation(common one) square root transformation power transformation
  • Slide 29
  • Logarithmic transformation take the logarithm of each value of the variable.
  • Slide 30
  • Further analysis of relationships between variables is in Ch. 15. Homework
  • Slide 31
  • Ch 15 Correlation, Regression Study relationship between quantitative variables Linear Correlation Coefficient
  • Slide 32
  • Mathematical Notation (1) Another form (2)
  • Slide 33
  • Formal Definition: Correlation coefficient (Pearson's correlation coefficient). A measure of linear association between two quantitative variables. r has no unit and takes values from -1 to 1.
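The formulas labelled (1) and (2) on the notation slide did not survive in this transcript. As a hedged sketch, the code below computes r in two standard, algebraically equivalent ways, which may or may not match the slide's exact notation; the data values and the names `r_a` and `r_b` are illustrative assumptions.

```python
import math
import statistics

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample standard deviations

# Form A: sum of products of standardized values, divided by n - 1.
r_a = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
          for xi, yi in zip(x, y)) / (n - 1)

# Form B: sum of cross products over the square root of the product of sums of squares.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r_b = s_xy / math.sqrt(s_xx * s_yy)

print(r_a, r_b)
```

Both values agree up to floating-point rounding, and the result is unit-free and lies between -1 and 1.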
  • Slide 34
  • A correlation coefficient near 0 suggests there is little or no linear association between those two variables
  • Slide 35
  • Example
  • Slide 36
  • What exactly does the correlation coefficient measure? It measures the extent of clustering of plotted points about a straight line. A correlation coefficient that is large in absolute value suggests strong linear association between the two variables. A correlation coefficient that is near zero suggests little linear association between the two variables.
  • Slide 37
  • Can the correlation coefficient be misleading? Yes. We should always plot two quantitative variables to get a visual feel for their relationship. Then we can use the correlation coefficient to supplement the plot.
  • Slide 38
  • Slide 39
  • r is 0.66. By itself, this correlation coefficient might suggest linear association between these two variables. But the figure itself suggests a curved relationship. A stronger linear relationship exists between life expectancy and the logarithm of per capita gross national product.(r = 0.84)
  • Slide 40
  • Outlier An observation that is far from the other observations.
  • Slide 41
  • Slide 42
  • Slide 43
  • Simple Linear Regression Method of least squares
  • Slide 44
  • Slide 45
  • Example
  • Slide 46
  • Scatterplot
  • Slide 47
  • Calculation table
  • Slide 48
  • Scatterplot with least-squares line
  • Slide 49
  • Intercept has no physical meaning here.
  • Slide 50
  • Definition of Linear Regression: Simple linear regression refers to fitting a straight-line model by the method of least squares and then assessing the model. Applications: find the relationship between two quantitative variables; can be used to make predictions.
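A minimal sketch of the least-squares calculation for the line y = b0 + b1*x, using the standard closed-form estimates (slope equals the sum of cross products divided by the sum of squared x deviations). The function name `least_squares_line` and the data are hypothetical and are not the example from the slides.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical data set.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(b0, b1)
```

A prediction for a new x value is then b0 + b1 * x, which is most trustworthy inside the range of the observed x values.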
  • Slide 51
  • Standard deviation line
  • Slide 52
  • What is the relationship between those two lines? Explain why.
  • Slide 53
  • Homework 1. Prove that the two forms (1) and (2) are the same. 2. Prove that r is always between -1 and 1. 3. Show the details of calculating b_0 and b_1. 4. After standardizing a variable, why does its variance become 1?
  • Slide 54
  • Project Report Requirements 1. Details of the experiment you did (including environment, time, process, and the variables you want to measure), or the source of the data you picked (either online or somewhere else); describe what those numbers stand for and which two variables you are measuring. 2. Data set (variable name, unit, trials). 3. Scatter plot (including axis labels and the unit of each variable). 4. Linear regression line (y = b_0 + b_1 x). 5. Detailed calculation of b_0 and b_1 for your data set. 6. Correlation coefficient. 7. Summary. (Make it 25 pages)
  • Slide 55
  • Ch6 Probability. 50% chance of showers. 1/2 chance to get a head, 1/2 a tail. Celtics vs Lakers: 60% chance the Celtics will win. The probability of an event is the chance or likelihood of the event occurring.
  • Slide 56
  • Experiment, Outcome, Sample Space, Events Experiment A process leading to a well-defined observation or outcome. Ex: Roll a die, toss a coin Sample space A set of all possible outcomes of the experiment Ex: Head/Tail 1,2,3,4,5,6
  • Slide 57
  • Finite sample space: a sample space that contains a finite number of outcomes (Ex: rolling dice). Continuous sample space: a sample space that equals an interval of values. Ex: class length [0, 75], height. Discrete sample space: a sample space that contains discontinuous values. Ex: grades A, B, C, D; age; # of Facebook friends.
  • Slide 58
  • More Examples: Measure the annual snowfall. A researcher carries out an experiment as part of a study of cigarette smoking and lung cancer. She selects a male smoker at random from among all male smokers, then keeps in touch with him until he either develops lung cancer or dies with no evidence of lung cancer. Outcomes. Discrete sample space: either with cancer or without cancer. Continuous sample space: assuming he develops cancer, what is the time until he gets the disease?
  • Slide 59
  • Difference between continuous and discrete sample spaces: with a discrete sample space it is usually easy to calculate the probability of a particular outcome. With a continuous sample space, probabilities may depend on the distribution function.
  • Slide 60
  • Event: a subset of the sample space. Examples: Roll two dice; the sum of the two numbers is greater than 7. The class lasts from 49 to 51 minutes. The grade is B or above.
  • Slide 61
  • Probability function: assigns a unique number (a probability) to each outcome in a finite sample space S. A probability is always greater than or equal to 0 and less than or equal to 1. P(E) denotes the probability of an event E. P(entire sample space) = ? P(not E) = 1 - P(E)
  • Slide 62
  • Examples Toss a coin three times
  • Slide 63
  • Slide 64
  • P(at least two heads) = ? P(at most two tails) = ? P( at least one tail) = ?
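One way to check answers to questions like these is to list the whole sample space and count. The sketch below assumes a fair coin, so that all eight outcomes of three tosses are equally likely; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses: 8 equally likely outcomes (fair coin assumed).
outcomes = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event, given as a true/false test on a single outcome."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

p_at_least_two_heads = prob(lambda o: o.count("H") >= 2)
p_at_most_two_tails = prob(lambda o: o.count("T") <= 2)
p_at_least_one_tail = prob(lambda o: o.count("T") >= 1)

print(p_at_least_two_heads, p_at_most_two_tails, p_at_least_one_tail)
```

The last two events are complements of "three tails" and "three heads" respectively, which is why the rule P(not E) = 1 - P(E) gives the same answers with less counting.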
  • Slide 65
  • Comments: Independent events: if the outcome of A has no effect on the possible outcome of B, and conversely the outcome of B has no effect on the possible outcome of A, then we say that A and B are independent. In that case P(A and B) = P(A) * P(B)
  • Slide 66
  • Slide 67
  • Homework: If the probability of a head is 75% and of a tail is 25%, and the tosses are independent: P(at least two heads) = ? P(at most two tails) = ? P(at least one tail) = ?
  • Slide 68
  • Conditional Probability: The conditional probability of event A given event B, denoted P(A|B), is the probability that events A and B occur together, divided by the probability of event B: P(A|B) = P(A and B) / P(B)
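A small counting sketch of the definition P(A|B) = P(A and B) / P(B). The two-dice events used here are an illustrative assumption, not the example from the slides: A is "the sum is greater than 7" and B is "the first die shows 6".

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely results of rolling two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a true/false test on a single roll."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def in_a(r):
    return r[0] + r[1] > 7   # A: the sum is greater than 7

def in_b(r):
    return r[0] == 6         # B: the first die shows 6

p_a = prob(in_a)
p_b = prob(in_b)
p_a_and_b = prob(lambda r: in_a(r) and in_b(r))
p_a_given_b = p_a_and_b / p_b

print(p_a, p_a_given_b)
```

Because P(A|B) differs from P(A), knowing that B occurred changes the probability of A, so these two events are dependent.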
  • Slide 69
  • Dependent events Ex:
  • Slide 70
  • Outcomes and their probabilities: (prisoner black, victim black): P_1; (prisoner black, victim white): P_2; (prisoner white, victim black): P_3; (prisoner white, victim white): P_4. Page 186
  • Slide 71
  • Denote B = {(prisoner white, victim black), (prisoner white, victim white)} C = {(prisoner black, victim white), (prisoner white, victim white)} P(B and C) =? P(B|C) = ?
  • Slide 72
  • Independent Events: Two events are independent if knowing that one of them occurred does not change the calculated probability that the other occurred. Then P(A and B) = P(A)*P(B). This formula can be extended: P(A and B and C) = P(A)*P(B)*P(C) if A, B, and C are independent.
  • Slide 73
  • Example: Toss coins P(three heads in a row) = ? P(two heads) = ? P( one tail) = ?
  • Slide 74
  • Bayes Rule
  • Slide 75
  • Sensitivity of a diagnostic test is the probability that a person with the condition under study will test positive. Specificity of a diagnostic test is the probability that a person without the condition under study will test negative.
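Sensitivity and specificity combine with the prevalence of the condition through Bayes' rule to give the probability that a person who tests positive actually has the condition. The sketch below is illustrative only: the function name `posterior` and the numbers (prevalence 0.01, sensitivity 0.95, specificity 0.90) are assumptions, not values from the slides or the textbook.

```python
def posterior(prevalence, sensitivity, specificity):
    """P(condition | positive test) by Bayes' rule."""
    true_positive = prevalence * sensitivity               # P(condition and positive)
    false_positive = (1 - prevalence) * (1 - specificity)  # P(no condition and positive)
    return true_positive / (true_positive + false_positive)

# Hypothetical inputs for illustration.
p = posterior(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(round(p, 4))
```

With a rare condition, most positive results come from the large group of people without it, so the probability stays low even for a fairly accurate test.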
  • Slide 76
  • Example 6-11, Page 189. Two different approaches: 1. Bayes's Rule 2. Tree diagram
  • Slide 77
  • Random Variable A rule that assigns a number to each outcome in the sample space. Finite random variable is a random variable that takes on a finite number of values. Continuous random variable is a random variable that takes on values in an interval of numbers. Examples p192-193
  • Slide 78
  • Apply for a job and either be offered it or not. Sample space = {success, failure}. Assign X(success) = 1, X(failure) = 0. Letter grade sample space S = {A, B, C, D, F}. Assign a number (GPA) to each letter grade: G(A) = 4, G(B) = 3, G(C) = 2, G(D) = 1, G(F) = 0.
  • Based on the assignment, we can write probabilities: P(X > 54), P(120 ≤ Y ≤ 150), P(Z = 2)
  • Slide 81
  • Probability Distribution: The probability distribution of a random variable is the collection of probabilities assigned to events defined by the random variable. Details in Chapter 8.
  • Slide 82
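  • Mean, Variance and Standard Deviation of a Finite Random Variable. The mean (or expected value) of a finite random variable X equals the sum of x * P(X = x) over all possible values x; denote it as E(X) or μ. The variance equals the sum of (x - μ)^2 * P(X = x) over all possible values x; denote it as Var(X) or σ^2. The standard deviation is the square root of the variance. Example: P195-196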
  • Slide 83
  • Three statistics students volunteer for a taste test comparing Coke and Pepsi. Each student tastes samples in two identical-looking cups and decides which beverage he or she prefers. Suppose the students make selections independently of one another. Suppose also that the probability of picking Pepsi is 3/5 and Coke 2/5 for all three students. Random variable Y: number of Pepsi selections. Possible values of Y: 0, 1, 2, 3
  • Slide 84
  • Questions: Expected value of Y? E(Y) = 3*P(Y=3)+2*P(Y=2)+1*P(Y=1)+0*P(Y=0). P(Y=3) = 27/125, P(Y=2) = 54/125, P(Y=1) = 36/125, P(Y=0) = 8/125. E(Y) = 225/125 = 1.8
  • Slide 85
  • What is the variance of Y?
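The mean and variance of Y in the taste-test example can be checked directly from the definitions for a finite random variable. This sketch builds the probabilities from the stated assumptions (three independent students, each picking Pepsi with probability 3/5) and uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p = Fraction(3, 5)  # probability that one student picks Pepsi
n = 3               # number of students

# P(Y = k): choose which k students pick Pepsi, then multiply the probabilities.
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

mean = sum(k * pk for k, pk in pmf.items())                  # E(Y)
variance = sum((k - mean) ** 2 * pk for k, pk in pmf.items())  # Var(Y)

print(pmf, mean, variance)
```

The standard deviation is the square root of the variance.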
  • Slide 86
  • Homework P(A) = 0.008 P(T|A) = 0.85 P(not T|not A) = 0.90 Find P(A|T) using two different methods. Verify your answers.
  • Slide 87
  • Chapter 7 Permutations, combinations Binomial distribution Hypergeometric distribution
  • Slide 88
  • Permutation: an ordered arrangement of a finite number of items. Example: Suppose you are playing in the FIFA World Cup, and you are in Group A with three other teams: Brazil, Italy, Spain. How many ways are there to rank those four teams? 1st: you have 4 choices. 2nd: after you pick one, you have 3 choices. 3rd: you now have only 2 choices. 4th: after you decide the first 3, the one left is your only choice. In total: 4*3*2*1 = 4! = 24
  • Slide 89
  • Generally, if you have n objects to arrange in order, there are n! ways to do it. A different permutation example: a lottery with 5 boxes. In each box there are 10 balls numbered from 0 to 9. To win, the lottery ticket you bought has to have exactly the same numbers in exactly the same order as the winning numbers. Permutations: 10^5. Probability of winning: 1/10^5
  • Slide 90
  • Combination: a group of objects selected from a larger collection without regard to order of selection. Example: the previous FIFA example. Suppose we divide the 4 teams in Group A into 2 suits. (One suit is the teams ranked 1st and 2nd, who go to the quarterfinal. The other suit is the teams ranked 3rd and 4th, who go home.) Suppose we only care about which teams are in which suit, not the order of the 2 teams within a suit. How many possible ways are there to arrange this? 4!/(2!2!) = 6
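The counts in the ranking, lottery, and grouping examples can be reproduced with Python's `math` module. The variable names are illustrative.

```python
from math import comb, factorial

# Ordered rankings of the 4 teams in the group: 4!
rankings = factorial(4)

# Lottery: 5 boxes, each with 10 possible digits, order matters: 10^5 tickets.
lottery_tickets = 10 ** 5
win_probability = 1 / lottery_tickets

# Choosing which 2 of the 4 teams advance, ignoring order within each suit: 4!/(2!2!)
advancing_pairs = comb(4, 2)

print(rankings, lottery_tickets, win_probability, advancing_pairs)
```

Here `comb(4, 2)` is the same quantity as `factorial(4) // (factorial(2) * factorial(2))`.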
  • Slide 91
  • Slide 92
  • Examples: Celtics and Lakers meet in the championship final. In how many ways can the Celtics win 2 of the first 4 games? In how many ways can the Celtics win the series in exactly 6 games?
  • Slide 93
  • Binomial Distribution Pre-requisite Bernoulli experiment(Bernoulli trial) An experiment that has exactly two possible outcomes.
  • Slide 94
  • Example: Roll a fair die four times. Win $1 each time the result is divisible by 3. What is the probability of winning at least $2? Let the random variable Y equal the number of results divisible by 3 in four rolls. What is the expected gain? If you pay $1 to play, do you expect to win or not?
  • Slide 95
  • Binomial experiment: consists of n independent repetitions of a Bernoulli experiment (which has only two possible outcomes). The probability of success is the same for each repetition. Random variable X counts the number of successes in n repetitions. We say X has a binomial distribution, denoted B(n, p). n: number of repetitions. p: probability of success on each repetition.
  • Slide 96
  • Question
  • Slide 97
  • Straight Forward approach:
  • Slide 98
  • What if n is very large? Any other way to do this?
  • Slide 99
  • How?
  • Slide 100
  • Expected Value, Variance
  • Slide 101
  • A few properties of expected values and variances: E(X+Y) = E(X) + E(Y); E(cX) = cE(X); E(X-Y) = E(X) - E(Y); E(X+c) = E(X) + c; Var(X+Y) = Var(X) + Var(Y) and Var(X-Y) = Var(X) + Var(Y) (both for independent X and Y); Var(cX) = c^2 Var(X)
  • Slide 102
  • Example: Suppose I am on a vacation (that would be awesome!), and I plan to visit 12 countries. The probability that I like a given country is 2/3 (for all 12 of them). What is the probability that I like more than 9 of them? This time the straightforward approach (listing all possible outcomes) is practically impossible. Use the binomial distribution.
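A sketch of the binomial calculation for this example, assuming the 12 visits are independent so that the number of countries liked follows B(12, 2/3). The function name `binomial_pmf` is hypothetical; it implements the standard formula C(n, k) * p^k * (1 - p)^(n - k).

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) when X has a binomial distribution B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 12, Fraction(2, 3)

# "More than 9" means 10, 11, or 12 successes.
p_more_than_9 = sum(binomial_pmf(k, n, p) for k in range(10, n + 1))

print(p_more_than_9, float(p_more_than_9))
```

Only three terms are needed, compared with 2^12 = 4096 outcomes in a full listing.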
  • Slide 103
  • Hypergeometric Distribution Suppose RV X counts the number of Type 1 objects in a sample selected at random from a finite collection of objects, each classified as either Type 1 or Type 2. Then we say X has a hypergeometric probability distribution, and we call X a hypergeometric random variable.
  • Slide 104
  • General Problem: Suppose in a group of N objects, m_1 are Type 1 and m_2 are Type 2. Select a sample of n at random from the N objects. Random variable X counts the number of Type 1 objects in the sample. What is the probability that, of the n objects selected, k are Type 1? P(X = k)
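The slides that follow did not keep their formulas in this transcript. As a hedged sketch, the standard counting argument gives P(X = k) = C(m_1, k) * C(m_2, n - k) / C(N, n) with N = m_1 + m_2: choose k of the Type 1 objects and n - k of the Type 2 objects, out of all ways to choose n objects. The function name `hypergeom_pmf` and the numbers below are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, m1, m2, n):
    """P(X = k): k Type 1 objects in a sample of n drawn from m1 Type 1 and m2 Type 2 objects."""
    return Fraction(comb(m1, k) * comb(m2, n - k), comb(m1 + m2, n))

# Hypothetical setup: 4 Type 1 and 6 Type 2 objects, sample of 3.
m1, m2, n = 4, 6, 3
p_two_type1 = hypergeom_pmf(2, m1, m2, n)
total = sum(hypergeom_pmf(k, m1, m2, n) for k in range(n + 1))

print(p_two_type1, total)
```

Unlike the binomial setting, the draws here are made without replacement, so they are not independent.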
  • Slide 105
  • Slide 106
  • Slide 107
  • Slide 108
  • Ch8 Gaussian(Normal)Distribution Gaussian(Normal) Distribution Standard Normal Distribution Central Limit Theorem
  • Slide 109
  • Slide 110
  • Normal probability function curve vs Standard normal probability function curve Standard normal distribution has mean of 0 and standard deviation of 1.
  • Slide 111
  • Area under the curve is 1! Now the probability of random variable Z can be represented by the area!
  • Slide 112
  • Slide 113
  • Cumulative probability: has the form P(X ≤ c), where X is a random variable and c is a constant. Tail probability: a probability that is small (less than 0.5) and has the form P(X ≤ c) or P(X ≥ c) for some number c.
  • Slide 114
  • Do some problems: P(Z ≤ 1), P(Z ≤ 2), P(-1 ≤ Z ≤ 2)
  • Slide 115
  • Standardize
  • Slide 116
  • Suppose we have a random variable X with mean 3 and variance 4. P(1 ≤ X ≤ 5) = ?
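A sketch of the standardizing step, under the assumption that X is Gaussian with mean 3 and variance 4 (standard deviation 2). Each endpoint is converted to Z = (X - mean) / standard deviation, and the standard normal cumulative probability is computed from the error function in Python's `math` module instead of a printed table.

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, variance = 3.0, 4.0
sigma = math.sqrt(variance)   # the standard deviation is the square root of the variance

z_low = (1 - mu) / sigma      # standardized lower endpoint
z_high = (5 - mu) / sigma     # standardized upper endpoint

p = standard_normal_cdf(z_high) - standard_normal_cdf(z_low)
print(z_low, z_high, round(p, 4))
```

The interval from 1 to 5 is exactly one standard deviation on either side of the mean, so the answer is the familiar "about 68%" area.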
  • Slide 117
  • Approximating Normal Distribution A distribution of data values is approximately Gaussian if the proportion of values in any interval approximately equals the area over that interval under the appropriate Gaussian curve. The distribution of a random variable is approximately Gaussian if the probability that the random variable is in any interval approximately equals the area over that interval under a Gaussian curve.
  • Slide 118
  • Central Limit Theorem Background Random Sample A collection of independent random variables with the same probability distribution.
  • Slide 119
  • Slide 120
  • Central Limit Theorem
  • Slide 121
  • How large does the sample size have to be? It depends on how different the probability distribution of the X_i is from a Gaussian distribution.
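The theorem statement itself was not preserved in this transcript, so the following is only a simulation sketch of the usual idea: means of random samples from a non-Gaussian distribution cluster around the population mean, with spread close to the population standard deviation divided by the square root of the sample size. The exponential distribution, the sample size of 50, the number of repetitions, and the seed are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n independent draws from an exponential distribution (mean 1, sd 1, skewed)."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

n = 50
means = [sample_mean(n) for _ in range(5000)]

center = statistics.mean(means)    # should be close to the population mean, 1
spread = statistics.stdev(means)   # should be close to 1 / sqrt(50), about 0.141

print(center, spread)
```

Repeating the run with a smaller n gives a wider and more skewed collection of sample means, which is the point of the question about how large the sample must be.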
  • Slide 122
  • Slide 123
  • Large-sample result related to the CLT
  • Slide 124
  • Chapter 9 Basic ideas in Statistics Hypothesis testing
  • Slide 125
  • Definitions. Statistical inference: the process of drawing conclusions about a population based on a sample from that population. Population: hypothetical, or has substance. Sample: random sample in the probability sense; random sample in the experimental sense.
  • Slide 126
  • Statistical inference: parametric, nonparametric. Parameter: mean, median, standard deviation, variance. Parameter types: point estimate, interval estimate (used to estimate a population parameter).
  • Slide 127
  • Interval Estimate. Confidence interval: an interval estimate of a parameter, with a probability interpretation. Hypothesis testing: a formal strategy for comparing two statements about the state of nature in an experimental situation (null hypothesis, alternative hypothesis).
  • Slide 128
  • Null hypothesis (H_0): a statement about the state of nature in an experimental situation, generally a "no difference" statement. Alternative hypothesis (H_a or H_1): a statement about the state of nature, providing an alternative to that specified in the null hypothesis.
  • Slide 129
  • Example 1. State a specific goal. Write this goal in terms of hypotheses to be compared. Design an experiment to meet this goal. State assumptions and describe how the experiment will be analyzed (Table 9.1). Carry out the experiment. Analyze and interpret the results.
  • Slide 130
  • Example 2
  • Slide 131
  • Hypothesis Testing General Strategy. Significance level approach. Test statistic: a measure of how much the sample observations differ from what we would expect if the null hypothesis were true. Significance level (errors: Type 1, Type 2; power of test): the probability of saying the observations are inconsistent with the null hypothesis when the null hypothesis is really true. p-value approach. p-value: the probability of seeing a test statistic as extreme as or more extreme (in the direction of the alternative) than the one observed, if the null hypothesis were really true.
  • Slide 132
  • Comments on Hypothesis Testing: How do we decide what the hypotheses should be? What assumptions do we make about the sample? What if the assumptions we make do not apply to the actual sampling process? How do we select a test statistic? How do we define an acceptance region and a rejection region when using the significance level approach? Why do we consider values of the test statistic more extreme than the one actually observed when calculating a p-value? When is a p-value small and when is it large? Should we use the significance level or the p-value approach to hypothesis testing? How do we evaluate the power of a test?
  • Slide 133
  • Comments on Experimental Design. Experimental design: the area of statistics concerned with designing an investigation to best meet the study goals, as well as the assumptions for statistical inference. See the book, P300.
  • Slide 134
  • Slide 135
  • Slide 136
  • Slide 137
  • Slide 138
  • Slide 139
  • Slide 140
  • Slide 141
  • Slide 142
  • Slide 143
  • Chapter 10 Large-sample inference about a population mean Large-sample inference about a proportion t-test and confidence interval on a t distribution(small sample)
  • Slide 144
  • Example 10.1(large sample population mean) Example 10.2(large sample proportion) Example 10.3(t distribution)
  • Slide 145
  • Slide 146