Post on 22-Feb-2016
STAT 231 MIDTERM 1, Fall 2010
Introduction
• Jeffrey Baer
• 3B Actuarial Science
• Work terms at Manulife and Towers Watson
• Waterloo SOS President, May 2009 – Aug 2010
Agenda
• 8:05 – 8:15 Data Types and Transformations
• 8:15 – 8:35 PPDAC
• 8:35 – 9:10 Data Summaries
• 9:10 – 9:15 Bivariate Risk Measures
• 9:15 – 9:40 Probability Models
• 9:40 – 10:00 Likelihood Functions and MLEs
What is Statistics?
Statistics is the science of design and collection of data used to draw conclusions about a larger population.
Data Types
• Discrete: countable (whole numbers), finite
– e.g. Number of students in Stat 231 born in 1991
• Continuous: measured data using real number line
– e.g. Age of Stat 231 students
• Categorical: non-numerical, pre-determined categories
– e.g. Months of birth of Stat 231 students
• Binary: categorical data with two categories
– e.g. Born in 1991?
Data Types continued
• Ordinal: data that has an underlying order
– e.g. Final Stat 230 grades of students in Stat 231
• Grouped/Frequency: numerical, # of occurrences in a category
– e.g. Number of Pure Math/Act Sci/Stats students in Stat 231
• A Dataset is a collection of data
– Can include several different data types
Transformations
• Transforming data from one form to another using a transformation function can simplify data and/or solve comparison issues
• Transformation types:
– Monotone increasing: preserves ranking, i.e. ranks of {x1,x2,...,xn} = ranks of {F(x1),F(x2),...,F(xn)}
  • Monotone decreasing reverses rankings
– Affine: linear transformation (y = Ax + B)
– Coding: categorical data to numerical data
– Ranking: ordering data from smallest to largest
Example 1
If the temperature at which a certain compound melts is a random variable with mean value 120°C and standard deviation 2°C, what are the mean and standard deviation of the temperature measured in °F? (Hint: °F = 1.8°C + 32).
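One way to check Example 1 is with the affine rules listed later in the deck, E(aX + b) = aE(X) + b and Var(aX + b) = a² Var(X), so SD(aX + b) = |a| SD(X). The sketch below applies them; the helper name `affine_mean_sd` is made up for illustration.

```python
def affine_mean_sd(mean, sd, a, b):
    """Mean and standard deviation of Y = a*X + b (hypothetical helper)."""
    # The shift b moves the mean but does not change the spread.
    return a * mean + b, abs(a) * sd


# Celsius to Fahrenheit: F = 1.8*C + 32
mean_f, sd_f = affine_mean_sd(mean=120.0, sd=2.0, a=1.8, b=32.0)
print(f"mean = {mean_f:.1f} °F, sd = {sd_f:.1f} °F")
```

Under these rules the mean becomes 1.8(120) + 32 = 248°F and the standard deviation becomes 1.8(2) = 3.6°F.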
PPDAC
Problem
• “A clear statement of what we are trying to achieve”
• Key Terms:
– Unit: individual in the population
– Variate: characteristic of a unit
– Attribute: characteristic of the population
• The problem is defined in terms of attributes of the population
Aspect
• Aspects (type of problem)
– Descriptive (exploring a target population attribute)
  • What is the average age of death for smokers in Canada?
  • What are the average marks for STAT 230 and STAT 231?
– Causative (linking explanatory and response variates)
  • Does smoking lead to lung cancer?
  • Does a high mark in STAT 230 indicate the individual will get a high mark in STAT 231?
– Predictive (predicting value of response variate)
  • Given that a male, age 30, smokes, what is the predicted age of mortality?
  • If I know an individual’s mark in STAT 230, can I predict his mark in STAT 231?
Population
• Target Pop. (units we want to investigate)
– University Students
• Study Pop. (units which could have been selected)
– Laurier Students
• Sample (units actually selected)
– Laurier Students selected for the study
• Subsets
– Sample is a subset of study population
– Study population not necessarily a subset of target population
Error and Plan
• Study Error (Study vs. Target)
– Possible consequence: making the wrong conclusion about our target population
• Sample Error (Sample vs. Study)
– Is present because we use a subset to make a conclusion on a larger population
– Can only be reduced, but never eliminated
• Plan: how we execute the study
– Experimental vs. Observational plans
Example 2
PROBLEM: An auto manufacturer wants to know the average distance cars registered in Ontario go between oil changes.
PLAN: Canadian Tire is asked to collect data on the distance driven since the last oil change for all cars registered in Ontario whose oil they change during the last week in February. If the odometer reading at the last oil change is not available, a car will not be included in the sample.
Data
• After we’ve collected data, it’s important to summarize it in a form that is clear and concise
• Potential Issues:
– Outliers: extreme observations
– Bias: systematic error from improper data collection
– Missing observations: suspicious -> omitted
Our Collected Data
Observed Data: Ages of 12 individuals randomly selected from a room.
{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 38 }
Sample Size: n = 12
Averages
Measures of Averages
• Mean
– Arithmetic: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
– Geometric: $\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
• Median
– Q2, 50% of the data lies above, 50% lies below
• Mode
– The most frequently occurring data point(s)
{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
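As a rough check of the averages above, the following sketch computes each measure for the dataset { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 } using Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean` and `multimode`).

```python
import statistics

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

arithmetic_mean = statistics.mean(ages)            # sum of the values divided by n
geometric_mean = statistics.geometric_mean(ages)   # n-th root of the product
median = statistics.median(ages)                   # average of the two middle values when n is even
modes = statistics.multimode(ages)                 # every value tied for most frequent

print("arithmetic mean:", round(arithmetic_mean, 4))
print("geometric mean: ", round(geometric_mean, 4))
print("median:         ", median)
print("mode(s):        ", modes)
```

The geometric mean comes out below the arithmetic mean here, which is expected whenever the values are positive and not all equal.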
Pie Chart
Pie Charts
• Frequency: # of occurrences
• Relative Frequency: proportion of occurrences
Histogram
Histograms
• Frequency Histogram
– Height (area) of each bar is the # of occurrences within each interval
• Relative Frequency Histogram
– Height (area) of each bar is the proportion of occurrences within each interval
• Determining an interval size
– (Max – Min)/desired # of intervals
Histogram
{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
Frequency Histogram
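The interval-size rule can be sketched in a few lines. The choice of 4 intervals below is an assumption made only for illustration; the slides do not say how many intervals were used for this dataset.

```python
ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

num_intervals = 4                       # assumed; pick whatever the plot calls for
low, high = min(ages), max(ages)
width = (high - low) / num_intervals    # (Max - Min) / desired # of intervals

counts = [0] * num_intervals
for x in ages:
    # Intervals are [left, right), except the last one, which also includes Max.
    index = min(int((x - low) / width), num_intervals - 1)
    counts[index] += 1

relative = [c / len(ages) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    left = low + i * width
    print(f"[{left:g}, {left + width:g}): frequency = {c}, relative frequency = {r:.3f}")
```

The frequencies sum to n and the relative frequencies sum to 1, which is a quick sanity check on any histogram table.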
Example 3
Estimate the number of electronic components in the sample which took at least 8 hours to fail, if there was a total of 300 items in the sample.
Relative Frequency Histogram
CDF
Cumulative Frequency Plot
X-axis: data points
Y-axis: sum of all relative frequencies for data points up to x
Lorenz Curves
Lorenz Curves
• CDF plot used to illustrate income inequality
– Shows percentage (y%) of total income held by poorest x% of households
– 45-degree line: line of perfect equality (LPE)
– Gini Coefficient: (Area between Lorenz curve and LPE) / (Area between Lorenz curve and LPI)
[Figure: Income Distribution in Canada. Series: Lorenz Curve, LPE, LPI. X-axis: Percentage of Households (0% to 100%); Y-axis: Percentage of Income (0% to 100%).]
Tipping Points
Model of Tipping Points
• How many people will do something, given how many other people are expected to do it
• Can be illustrated using a modified Lorenz curve
– Equilibria: points intersecting the 45° line
– Stable Equilibria: points at which small deviations from equilibrium will result in a return to equilibrium, regardless of the direction of deviation
– Unstable Equilibria: tipping points at which small deviations from equilibrium will not result in a return to equilibrium
Example 4 (from Asst. 1)
100 students are in a class. Let N = the actual number of students clapping and NE be the number of students expected to clap. The relationship between N and NE is given as follows:
N = 0.5NE if NE <= 20
N = 2NE – 30 if 20 < NE <= 50
N = 0.5NE + 45 if 50 < NE <= 90
N = 90 if NE >= 90
Illustrate this graphically. Equilibria? Stable Equilibria? Tipping Points?
[Figure: Model of Tipping Points. Series: N and LPE. X-axis: NE (0 to 120); Y-axis: N (0 to 120).]
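Example 4 can also be explored numerically. The sketch below encodes the piecewise rule, looks for whole-number equilibria (N = NE) between 0 and 100, and classifies each by nudging NE one student either way. The function names and the one-student nudge are illustrative assumptions, not part of the assignment.

```python
def actual_clappers(ne):
    """Actual number clapping, N, given the expected number, NE."""
    if ne <= 20:
        return 0.5 * ne
    if ne <= 50:
        return 2 * ne - 30
    if ne <= 90:
        return 0.5 * ne + 45
    return 90.0


CLASS_SIZE = 100

# Equilibria are the points where the curve meets the 45-degree line (N = NE).
equilibria = [ne for ne in range(CLASS_SIZE + 1) if actual_clappers(ne) == ne]


def is_stable(e, step=1):
    """Stable if a small deviation in either feasible direction is pulled back."""
    above_ok = e + step > CLASS_SIZE or actual_clappers(e + step) < e + step
    below_ok = e - step < 0 or actual_clappers(e - step) > e - step
    return above_ok and below_ok


stability = {e: is_stable(e) for e in equilibria}
for e, stable in stability.items():
    print(f"NE = {e}: {'stable equilibrium' if stable else 'unstable (tipping point)'}")
```

Under these assumptions the curve crosses the 45° line at NE = 0, 30 and 90. The slope at 30 is 2, so deviations grow and 30 acts as the tipping point, while 0 and 90 pull nearby values back.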
Variability and Spread
• Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Population Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
• Percentile
– The p-th percentile is the data point located at position number (p/100)*(n + 1)
– Use linear interpolation if necessary
• Interquartile Range (IQR) = Q3 (75th percentile) – Q1 (25th percentile)
{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
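A minimal sketch of the (p/100)*(n + 1) position rule with linear interpolation follows, applied to the dataset above. The `percentile` helper is hypothetical, and it simply rejects positions that fall outside 1..n rather than guessing how the course treats them. The two variances use the standard library, with the n − 1 and n denominators respectively.

```python
import statistics

data = sorted([4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40])


def percentile(sorted_data, p):
    """p-th percentile at 1-based position (p/100)*(n + 1), interpolating linearly."""
    n = len(sorted_data)
    pos = (p / 100) * (n + 1)
    if not 1 <= pos <= n:
        raise ValueError("position falls outside the data")
    k = int(pos)          # whole part: index of the lower neighbour (1-based)
    frac = pos - k        # fractional part: how far to move toward the upper neighbour
    if frac == 0:
        return float(sorted_data[k - 1])
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])


q1 = percentile(data, 25)   # position 3.25
q3 = percentile(data, 75)   # position 9.75
iqr = q3 - q1

sample_variance = statistics.variance(data)        # divides by n - 1
population_variance = statistics.pvariance(data)   # divides by n

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("sample variance =", round(sample_variance, 4))
print("population variance =", round(population_variance, 4))
```

For this dataset the rule gives Q1 = 16, Q3 = 24.75 and IQR = 8.75.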
How to find Percentiles
Box and Whisker Plot
Box and Whisker Plot Steps:
• Calculate Q1, Q2 (median), Q3, and IQR
• Draw a horizontal line representing scale of measurement, and a box surrounding Q1 and Q3, with a line drawn for Q2
• Calculate outlier boundaries (dotted lines):
– lower fence = Q1 – 1.5*IQR, upper fence = Q3 + 1.5*IQR
– Mark any outliers with a * or o on the graph
• Draw whiskers connecting the largest and smallest measurements (upper/lower adjacent values) that are not outliers to the box
Example 5
Draw a Box and Whisker Plot for the dataset
{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
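The numbers needed for the plot in Example 5 can be sketched as follows. This assumes the quartiles are taken with the (p/100)*(n + 1) position rule, which is what `statistics.quantiles` does with its default "exclusive" method (Python 3.8+); drawing the plot itself is left out.

```python
import statistics

data = sorted([4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40])

# Default method="exclusive" places the i-th quartile at position i*(n + 1)/4.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers stop at the most extreme observations that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, Q2, Q3 =", q1, q2, q3)
print("fences =", lower_fence, upper_fence)
print("outliers =", outliers)
print("whiskers =", whisker_low, whisker_high)
```

With these quartiles the fences are 2.875 and 37.875, so 40 is marked as an outlier and the whiskers run from 4 to 28.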
QQ Plot
QQ Plot
• Theoretical Quantiles
– Quartiles, percentiles, etc. of known distribution
– 95th Theoretical quantile: α
• Sample Quantiles
• 2 uses of QQ plots
– Sample vs. Theoretical Quantile (45° line = good fit)
– Sample vs. Sample Quantile (straight line = similar distribution)
Measures of Association
• Relative Risk (of event A provided event B occurs or does not occur): $RR = P(A \mid B) / P(A \mid \bar{B})$
– > 1 : positive association between A and B
– Association does not imply causation!
Example 6
Given the following frequency table for individuals grouped according to whether they smoke or not and their education level:

Smoker   High School   University   PHD
No       8             33           42
Yes      51            70           26

Calculate the relative risk of smoking if a person has a PHD education.
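A sketch of the Example 6 calculation, assuming relative risk is defined as P(smoker | PHD) divided by P(smoker | not PHD), with "not PHD" pooling the High School and University columns. The dictionary layout and helper name are illustrative.

```python
# Counts from the frequency table, keyed by education level.
counts = {
    "High School": {"No": 8, "Yes": 51},
    "University": {"No": 33, "Yes": 70},
    "PHD": {"No": 42, "Yes": 26},
}


def smoking_probability(levels):
    """Proportion of smokers among everyone in the given education levels."""
    smokers = sum(counts[level]["Yes"] for level in levels)
    total = sum(counts[level]["Yes"] + counts[level]["No"] for level in levels)
    return smokers / total


p_smoke_phd = smoking_probability(["PHD"])                            # 26 / 68
p_smoke_not_phd = smoking_probability(["High School", "University"])  # 121 / 162

relative_risk = p_smoke_phd / p_smoke_not_phd
print(f"relative risk = {relative_risk:.4f}")
```

A value below 1 would indicate a negative association between smoking and having a PHD in this sample; it does not by itself say anything about causation.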
Measures of Association
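Correlation Coefficient
• $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$
  or $\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$
  or $\frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$
• Measures linear relationship between two random variables
– ρ > 0 : positive correlation; vice-versa
– |ρ| = 1: X and Y are linearly related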
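The two sample formulas for the correlation coefficient are algebraically the same, which the sketch below checks on a small made-up dataset. The `xs` and `ys` values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

# Hypothetical paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Form 1: sums of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
r_deviation = s_xy / math.sqrt(s_xx * s_yy)

# Form 2: the computational shortcut using raw sums.
num = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
den = math.sqrt(
    (sum(x * x for x in xs) - n * x_bar ** 2)
    * (sum(y * y for y in ys) - n * y_bar ** 2)
)
r_shortcut = num / den

print(round(r_deviation, 4), round(r_shortcut, 4))
```

The shortcut form is convenient by hand, but it subtracts large numbers of similar size, so the deviation form tends to be the safer choice in floating-point code.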
Example 7
• (47, 41) is called an influential outlier
Time Series
Time Series Graphs
• The explanatory variate is time
• The response variate is the measured variable of interest at time t
• Neighbouring points are joined by straight lines rather than a simple scatter plot
• Time series graphs can be used to look at trends, seasonal patterns, etc.
Statistical Science
• Statistics is the science of design and collection of data used to draw conclusions about a larger population.
• When we collect this data, we’re always going to have uncertainty
• We fit our data to known probability models to quantify these uncertainties
Terminology
• Descriptive Statistics (Chapter 1)
– Tools and techniques used to describe certain attributes of a population
– Graphs, charts, numerical summaries
• Statistical Inference (Rest of Course)
– A problem solving method using data to draw general conclusions on a population
Statistical Inference
• Estimation Problems
– After collection of data, we fit the data to probability models
– Using the collected data, form estimates for the parameters of the models
• Hypothesis Testing
– Accepting or rejecting a statement about the target population
Probability Models
• Random Variables
– Represent what we’re going to measure in our experiment
• Realizations
– Represent the actual data we’ve collected from our experiment
Probability Functions
• CDF: $F(x) = \sum_{y \le x} f(y)$ (discrete) or $\int_{-\infty}^{x} f(y)\,dy$ (cts.)
• $E[g(X)] = \sum_{x} g(x) f(x)$ (discrete) or $\int_{-\infty}^{\infty} g(x) f(x)\,dx$ (cts.)
• $Var(X) = E(X^2) - [E(X)]^2$
• $E(aX + b) = aE(X) + b$
• $Var(aX + b) = a^2 Var(X)$
• $P(a \le Y \le b) = \sum_{y=a}^{b} P(Y = y)$ (discrete) or $\int_{a}^{b} f(y)\,dy$ (cts.)
Example 8
A random variable X has a continuous probability model with a cumulative distribution function (cdf)
$F(x) = P(X \le x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x)$
Give an expression for the expected value of $Y = \sin(X)$.
Do not evaluate any sums or integrals.
Probability Models
• Binomial (binary data)
– Fixed number of trials (n) and fixed probability (π) of success on each (Bernoulli) trial
– $P(X=x; n, \pi) = \binom{n}{x}\pi^x(1-\pi)^{n-x}$ ; x = 0,1,…,n
• Poisson (discrete data)
– Events occur at a constant rate (λ)
– $P(X=x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$ ; x = 0,1,2,…
• Exponential (continuous data)
– Waiting time between events occurring at rate λ
– $f(x; \lambda) = \lambda e^{-\lambda x}$ ; x > 0
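Gaussian Distribution and CLT
Gaussian Distribution
• $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
• If Y ~ G(μ,σ), then $Z = \frac{Y - \mu}{\sigma}$ ~ G(0,1)
• If Y1,Y2,…Yn are independent G(μ1,σ1), G(μ2,σ2), … , G(μn,σn):
– $\sum_{i=1}^{n} b_i Y_i$ ~ G( $\sum_{i=1}^{n} b_i \mu_i$ , $\sqrt{\sum_{i=1}^{n} b_i^2 \sigma_i^2}$ )
Central Limit Theorem (CLT)
• For any iid RVs W1,W2,…Wn with mean μ and s.d. σ:
– If $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i$, then $E(\bar{W}) = \mu$ and $SD(\bar{W}) = \frac{\sigma}{\sqrt{n}}$
– $\lim_{n \to \infty} \frac{\bar{W} - \mu}{\sigma/\sqrt{n}}$ ~ G(0,1)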
Example 9
We are given that non-diabetics have glucose levels represented by a random variable which follows a G(5.31, 0.58) distribution. Diabetics have glucose levels represented by a random variable which follows a G(11.74, 3.5) distribution. When taking a test, if the person’s glucose level measures higher than 6.5, they will be diagnosed as diabetic.
• If a person is diabetic, what is the probability that he/she is diagnosed correctly?
• What is the probability that a non-diabetic is diagnosed as diabetic?
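Both parts of Example 9 reduce to an upper-tail Gaussian probability at the 6.5 threshold. The sketch below assumes the second argument of G(μ, σ) is the standard deviation, as in the standardization Z = (Y − μ)/σ, and uses `statistics.NormalDist` from the Python standard library (3.8+).

```python
from statistics import NormalDist

threshold = 6.5  # diagnosed as diabetic when the glucose level exceeds this

non_diabetic = NormalDist(mu=5.31, sigma=0.58)
diabetic = NormalDist(mu=11.74, sigma=3.5)

# P(diagnosed as diabetic | diabetic) = P(Y > 6.5) under G(11.74, 3.5)
p_correct = 1 - diabetic.cdf(threshold)

# P(diagnosed as diabetic | non-diabetic) = P(Y > 6.5) under G(5.31, 0.58)
p_false_positive = 1 - non_diabetic.cdf(threshold)

print(f"P(correct diagnosis | diabetic)    = {p_correct:.4f}")
print(f"P(diagnosed diabetic | non-diabetic) = {p_false_positive:.4f}")
```

By hand, the same values come from standardizing: z = (6.5 − 11.74)/3.5 ≈ −1.50 for diabetics and z = (6.5 − 5.31)/0.58 ≈ 2.05 for non-diabetics, then reading the G(0,1) table.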
Response Model
• Problem: what is μ, the average of the attribute of interest in the target population?
• We will use our collected data to estimate μ
• Let Y be a random variable that represents the measured response variate
• Y = μ + R, where R ~ G(0, σ)
– Y ~ G(μ, σ)
– μ is systematic (no risk), while R is random (variable)
Maximum Likelihood Estimation
• Binomial: $\hat{\pi} = \frac{x}{n}$ ; x = # of successes
• Response: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i$ ; yi is the ith realization
• Maximum Likelihood Estimation
– A procedure used to determine a parameter estimate given any model
Maximum Likelihood Estimation
• First, we assume our data collected will follow a distribution
• Before we collect the sample: random variables
– {Y1, Y2, …, Yn}
• After we collect the sample: realizations
– {y1, y2, …, yn}
• We know the distribution of Yi (with unknown parameters), hence we know the PDF/PMF
Likelihood Function
• The Likelihood Function:
– Continuous: $L(\theta) = \prod_{i=1}^{n} f(y_i; \theta), \ \theta \in \Omega$
– Discrete: $L(\theta) = P(Y = y; \theta), \ \theta \in \Omega$
• Likelihood: the probability of observing the dataset you have
– We want to choose an estimate of the parameter θ that gives the largest such probability
– Ω is the parameter space, the set of possible values for θ
– Relative Likelihood: $R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}$
MLE Process
• Step One: Define the likelihood function L(θ)
• Step Two: Define the log likelihood function ln[L(θ)]
• Step Three: Take the derivative with respect to θ
• Step Four: Set the derivative equal to zero and solve for θ to arrive at the maximum likelihood estimate $\hat{\theta}$
• Step Five: Plug in data values (if given) to arrive at a numerical maximum likelihood estimate
Examples 10/11
Discrete: What is the MLE of θ for a geometric distribution with pmf $P(X = x; \theta) = \theta(1-\theta)^{x-1}, \ x \ge 1$?
Continuous: Given Y ~ Exp(θ), with realizations y1,y2,…yn, find the maximum likelihood estimate of θ. What is the MLE for the realizations {3, 2, 1, 4}?
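The five MLE steps can be traced for the continuous example, assuming Exp(θ) uses the rate parametrization from the Probability Models slide, f(y; θ) = θe^(−θy). Then L(θ) = θ^n e^(−θΣy), ln L(θ) = n ln θ − θΣy, and setting the derivative n/θ − Σy to zero gives θ̂ = n/Σy. If the course instead treats θ as the mean, the same steps give θ̂ = ȳ. The sketch below compares the closed form with a simple grid search over the log likelihood.

```python
import math

ys = [3, 2, 1, 4]  # realizations from the example


def log_likelihood(theta, data):
    """ln L(theta) for iid exponential data with rate theta (assumed parametrization)."""
    return len(data) * math.log(theta) - theta * sum(data)


# Closed form from solving d/dtheta ln L(theta) = n/theta - sum(y) = 0.
theta_hat = len(ys) / sum(ys)

# Numerical cross-check: evaluate ln L on a grid and keep the best value.
grid = [k / 1000 for k in range(1, 3001)]   # 0.001, 0.002, ..., 3.000
theta_grid = max(grid, key=lambda t: log_likelihood(t, ys))

print("closed-form MLE:", theta_hat)
print("grid-search MLE:", theta_grid)
```

For the discrete example the same steps applied to the geometric pmf lead to θ̂ = 1/x̄ for a sample with mean x̄.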