STAT 231 MIDTERM 1 Fall 2010


Introduction

• Jeffrey Baer
• 3B Actuarial Science
• Work terms at Manulife and Towers Watson
• Waterloo SOS President, May 2009 – Aug 2010

Agenda

• 8:05 – 8:15 Data Types and Transformations
• 8:15 – 8:35 PPDAC
• 8:35 – 9:10 Data Summaries
• 9:10 – 9:15 Bivariate Risk Measures
• 9:15 – 9:40 Probability Models
• 9:40 – 10:00 Likelihood Functions and

What is Statistics?

Statistics is the science of design and collection of data used to draw conclusions about a larger population.

Data Types

• Discrete: countable (whole numbers), finite
  – e.g. Number of students in Stat 231 born in 1991

• Continuous: measured data using real number line
  – e.g. Age of Stat 231 students

• Categorical: non-numerical, pre-determined categories
  – e.g. Months of birth of Stat 231 students

• Binary: categorical data with two categories
  – e.g. Born in 1991?

Data Types continued

• Ordinal: data that has an underlying order
  – e.g. Final Stat 230 grades of students in Stat 231

• Grouped/Frequency: numerical, # of occurrences in a category
  – e.g. Number of Pure Math/Act Sci/Stats students in Stat 231

• A Dataset is a collection of data
  – Can include several different data types

Transformations

• Transforming data from one form to another using a transformation function can simplify data and/or solve comparison issues

• Transformation types:
  – Monotone increasing: preserves ranking, i.e. ranks of {x1,x2,...,xn} = ranks of {F(x1),F(x2),...,F(xn)}
    • Monotone decreasing reverses rankings
  – Affine: linear transformation (y = Ax + B)
  – Coding: categorical data to numerical data
  – Ranking: ordering data from smallest to largest

Example 1
If the temperature at which a certain compound melts is a random variable with mean value 120°C and standard deviation 2°C, what are the mean temperature and standard deviation measured in °F? (Hint: °F = 1.8°C + 32).
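A minimal Python sketch of Example 1, assuming the affine rules E(aX + b) = aE(X) + b and SD(aX + b) = |a| SD(X). The helper name `affine_mean_sd` is hypothetical and only for illustration.

```python
def affine_mean_sd(mean_x, sd_x, a, b):
    """Mean and standard deviation of Y = a*X + b, given those of X."""
    mean_y = a * mean_x + b      # the mean is scaled and shifted
    sd_y = abs(a) * sd_x         # the sd is scaled only; the shift b drops out
    return mean_y, sd_y

# Celsius to Fahrenheit: F = 1.8*C + 32
mean_f, sd_f = affine_mean_sd(120.0, 2.0, a=1.8, b=32.0)
print(f"mean = {mean_f:.1f} F, sd = {sd_f:.1f} F")
```

Under these rules the mean becomes 1.8(120) + 32 = 248°F and the standard deviation becomes 1.8(2) = 3.6°F. The shift of 32 moves the centre but does not change the spread.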

Problem

• “A clear statement of what we are trying to achieve”

• Key Terms:
  – Unit: individual in the population
  – Variate: characteristic of a unit
  – Attribute: characteristic of the population

• The problem is defined in terms of attributes of the population

Aspect

• Aspects (type of problem)
  – Descriptive (exploring a target population attribute)
    • What is the average age of death for smokers in Canada?
    • What are the average marks for STAT 230 and STAT 231?
  – Causative (linking explanatory and response variates)
    • Does smoking lead to lung cancer?
    • Does a high mark in STAT 230 indicate the individual will get a high mark in STAT 231?
  – Predictive (predicting value of response variate)
    • Given that a male, age 30, smokes, what is the predicted age of mortality?
    • If I know an individual’s mark in STAT 230, can I predict his mark in STAT 231?

Population

• Target Pop. (units we want to investigate)
  – University Students

• Study Pop. (units which could have been selected)
  – Laurier Students

• Sample (units actually selected)
  – Laurier Students selected for the study

• Subsets
  – Sample is a subset of study population
  – Study population not necessarily a subset of target population

Error and Plan

• Study Error (Study vs. Target)
  – Possible consequence: making the wrong conclusion about our target population

• Sample Error (Sample vs. Study)
  – Is present because we use a subset to make a conclusion on a larger population
  – Can only be reduced, but never eliminated

• Plan: how we execute the study
  – Experimental vs. Observational plans

Example 2

PROBLEM: An auto manufacturer wants to know the average distance cars registered in Ontario go between oil changes.

PLAN: Canadian Tire is asked to collect data on the distance driven since the last oil change for all cars registered in Ontario whose oil they change during the last week in February. If the odometer reading at the last oil change is not available, a car will not be included in the sample.

• After we’ve collected data, it’s important to summarize it in a form that is clear and concise

• Potential Issues:
  – Outliers: extreme observations
  – Bias: systematic error from improper data collection
  – Missing observations: suspicious -> omitted

Our Collected Data

Observed Data: Ages of 12 individuals randomly selected from a room.

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 38 }

Sample Size: n = 12

Averages

Measures of Averages
• Mean
  – Arithmetic: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  – Geometric: $\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
• Median
  – Q2, 50% of the data lies above, 50% lies below
• Mode
  – The most frequently occurring data point(s)

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
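The measures above can be checked on the 12-point dataset with a short Python sketch. It uses only the standard library `statistics` module (Python 3.8 or later for `geometric_mean` and `multimode`); the variable names are illustrative.

```python
import statistics

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

arithmetic_mean = statistics.mean(ages)            # sum of the values divided by n
geometric_mean = statistics.geometric_mean(ages)   # n-th root of the product
median = statistics.median(ages)                   # average of the 6th and 7th sorted values
modes = statistics.multimode(ages)                 # list of the most frequent value(s)

print(f"arithmetic mean = {arithmetic_mean:.3f}")
print(f"geometric mean  = {geometric_mean:.3f}")
print(f"median          = {median}")
print(f"mode(s)         = {modes}")
```

With n even, the median is the average of the two middle values, here (19 + 20)/2 = 19.5. The value 16 is the only one that appears twice, so it is the mode.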

Pie Chart

Pie Charts

• Frequency: # of occurrences

• Relative Frequency: proportion of occurrences

Histogram

Histograms
• Frequency Histogram
  – Height (area) of each bar is the # of occurrences within each interval
• Relative Frequency Histogram
  – Height (area) of each bar is the proportion of occurrences within each interval
• Determining an interval size
  – (Max – Min)/desired # of intervals

Histogram

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }

Frequency Histogram
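A rough Python sketch of how the frequency and relative frequency counts for this dataset could be tabulated. The choice of 4 intervals is an assumption made for illustration, not a value taken from the slides; the interval width follows the (Max – Min)/desired # of intervals rule.

```python
ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

num_intervals = 4                      # assumed number of intervals
low, high = min(ages), max(ages)
width = (high - low) / num_intervals   # (Max - Min) / desired # of intervals

counts = [0] * num_intervals
for x in ages:
    # Each interval is [left, left + width); the maximum goes in the last interval.
    k = min(int((x - low) / width), num_intervals - 1)
    counts[k] += 1

relative = [c / len(ages) for c in counts]

for k in range(num_intervals):
    left = low + k * width
    print(f"{left:g} to {left + width:g}: frequency {counts[k]}, relative {relative[k]:.3f}")
```

With these assumptions the width is 9 and the intervals start at 4, 13, 22 and 31. A different number of intervals changes the shape of the histogram, which is why the interval size is a judgment call.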

Example 3

Estimate the number of electronic components in the sample which took at least 8 hours to fail, if there was a total of 300 items in the sample.

Relative Frequency Histogram

CDF
Cumulative Frequency Plot
X-axis: data points
Y-axis: sum of all relative frequencies for data points up to x

Lorenz Curves

Lorenz Curves
• CDF plot used to illustrate income inequality
  – Shows percentage (y%) of total income held by poorest x% of households
  – 45-degree line: line of perfect equality (LPE)
  – Gini coefficient = (area between Lorenz curve and LPE) / (area between LPE and LPI), where LPI is the line of perfect inequality

[Figure: Income Distribution in Canada. Lorenz curve plotted with the LPE and LPI; x-axis: Percentage of Households.]

Tipping Points

Model of Tipping Points
• How many people will do something, given how many other people are expected to do it
• Can be illustrated using a modified Lorenz curve
  – Equilibria: points intersecting the 45° line
  – Stable Equilibria: points at which small deviations from equilibria will result in a return to equilibria, regardless of the direction of deviation
  – Unstable Equilibria: tipping points at which small deviations from equilibria will not result in a return to equilibria

Example 4 (from Asst. 1)
100 students are in a class. Let N = the actual number of students clapping and NE be the number of students expected to clap. The relationship between N and NE is given as follows:

N = 0.5NE if NE <= 20
N = 2NE – 30 if 20 < NE <= 50
N = 0.5NE + 45 if 50 < NE <= 90
N = 90 if NE > 90

Illustrate this graphically. Equilibria? Stable Equilibria? Tipping Points?

[Figure: Model of Tipping Points]
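One way to check Example 4 numerically is sketched below in Python. The function names are hypothetical, NE is restricted to whole numbers from 0 to 100 (an assumption, since students are counted), and "stable" is tested by asking whether a deviation of one student on either side is pulled back toward the equilibrium.

```python
def actual_clappers(expected):
    """Piecewise response N as a function of NE, as given in Example 4."""
    if expected <= 20:
        return 0.5 * expected
    if expected <= 50:
        return 2 * expected - 30
    if expected <= 90:
        return 0.5 * expected + 45
    return 90.0

def is_stable(point, step=1):
    """True if a small deviation on either side moves back toward the point."""
    for neighbour in (point - step, point + step):
        if 0 <= neighbour <= 100:
            # The response must land closer to the equilibrium than the deviation did.
            if abs(actual_clappers(neighbour) - point) >= abs(neighbour - point):
                return False
    return True

# Equilibria are the points where the curve crosses the 45-degree line, N = NE.
equilibria = [ne for ne in range(101) if actual_clappers(ne) == ne]
stable = [ne for ne in equilibria if is_stable(ne)]
tipping_points = [ne for ne in equilibria if not is_stable(ne)]

print("equilibria:", equilibria)
print("stable:", stable)
print("tipping points:", tipping_points)
```

The same answer follows by hand: solving N = NE on each piece gives NE = 0, 30 and 90. The curve has slope 0.5 (less than 1) at 0 and at 90, so those are stable, while the slope of 2 at 30 makes it a tipping point.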

Variability and Spread

• Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Population Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$

• Percentile
  – The p-th percentile is the data point located at position number (p/100)*(n + 1)
  – Use linear interpolation if necessary

• Interquartile Range (IQR) = Q3 (75th percentile) – Q1 (25th percentile)

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
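The percentile rule can be sketched in Python as follows. The `percentile` helper is hypothetical and implements only the (p/100)*(n + 1) position rule with linear interpolation; other textbooks and software packages use slightly different conventions, so results may not match theirs. Clamping the position to the range 1 to n is an added assumption for extreme values of p.

```python
import statistics

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

def percentile(sorted_data, p):
    """p-th percentile at 1-indexed position (p/100)*(n + 1), linearly interpolated."""
    n = len(sorted_data)
    position = (p / 100) * (n + 1)
    position = min(max(position, 1), n)   # assumption: clamp to the valid positions
    lower = int(position)                 # whole part of the position
    fraction = position - lower           # how far to move toward the next point
    if lower >= n:
        return float(sorted_data[-1])
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

data = sorted(ages)
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1

sample_variance = statistics.variance(data)        # divides by n - 1
population_variance = statistics.pvariance(data)   # divides by n

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"sample variance = {sample_variance:.3f}, population variance = {population_variance:.3f}")
```

For n = 12 the positions are 3.25, 6.5 and 9.75, which give Q1 = 16, Q2 = 19.5 and Q3 = 24.75, so IQR = 8.75.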

How to find Percentiles

Box and Whisker Plot
Box and Whisker Plot Steps:
• Calculate Q1, Q2 (median), Q3, and IQR
• Draw a horizontal line representing scale of measurement, and a box surrounding Q1 and Q3, with a line drawn for Q2
• Calculate outlier boundaries (dotted lines):
  – lower fence = Q1 – 1.5*IQR, upper fence = Q3 + 1.5*IQR
  – Mark any outliers with a * or o on the graph
• Draw whiskers connecting the largest and smallest measurements (upper/lower adjacent values) that are not outliers to the box

Example 5
Draw a Box and Whisker Plot for the dataset
{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
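The numbers needed for Example 5 can be worked out with a short Python sketch. The quartiles are entered directly, assuming the (p/100)*(n + 1) position rule with linear interpolation (positions 3.25 and 9.75 for n = 12); a different percentile convention would shift the fences slightly.

```python
ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

# Quartiles under the (p/100)*(n + 1) rule: Q1 at position 3.25, Q3 at position 9.75.
q1, q3 = 16.0, 24.75
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are outliers; whiskers stop at the adjacent values.
outliers = [x for x in ages if x < lower_fence or x > upper_fence]
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"fences: {lower_fence} to {upper_fence}")
print(f"outliers: {outliers}")
print(f"whiskers: {whisker_low} to {whisker_high}")
```

The fences come out to 2.875 and 37.875, so 40 is marked as an outlier while 4 is not, and the whiskers run from 4 to 28.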

QQ Plot

QQ Plot
• Theoretical Quantiles
  – Quartiles, percentiles, etc. of known distribution
  – 95th theoretical quantile: the value α such that P(X ≤ α) = 0.95
• Sample Quantiles
• 2 uses of QQ plots
  – Sample vs. Theoretical Quantile (45° line = good fit)
  – Sample vs. Sample Quantile (straight line = similar distribution)

Measures of Association

• Relative Risk (of event A provided event B occurs or does not occur): $RR = \frac{P(A \mid B)}{P(A \mid \bar{B})}$
  – > 1 : positive association between A and B
  – Association does not imply causation!

Example 6

Given the following frequency table for individuals grouped according to whether they smoke or not and their education level:

Smoker    High School    University    PhD
No        8              33            42
Yes       51             70            26

Calculate the relative risk of smoking if a person has a PhD education.
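A small Python sketch of Example 6, assuming relative risk is defined as P(A | B) / P(A | not B) with A = "smokes" and B = "has a PhD". The dictionary layout and variable names are illustrative only.

```python
# Frequency table from Example 6: counts[smoker][education]
counts = {
    "No":  {"High School": 8,  "University": 33, "PhD": 42},
    "Yes": {"High School": 51, "University": 70, "PhD": 26},
}

# P(smoker | PhD): smokers with a PhD over everyone with a PhD
phd_total = counts["Yes"]["PhD"] + counts["No"]["PhD"]
p_smoke_given_phd = counts["Yes"]["PhD"] / phd_total

# P(smoker | no PhD): pool the High School and University columns
other_levels = ["High School", "University"]
other_smokers = sum(counts["Yes"][level] for level in other_levels)
other_total = other_smokers + sum(counts["No"][level] for level in other_levels)
p_smoke_given_other = other_smokers / other_total

relative_risk = p_smoke_given_phd / p_smoke_given_other
print(f"relative risk = {relative_risk:.3f}")
```

Here P(smoker | PhD) = 26/68 and P(smoker | no PhD) = 121/162, so the relative risk is about 0.51. A value below 1 points to a negative association between smoking and holding a PhD in this table, which by itself says nothing about causation.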

Measures of Association

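Correlation Coefficient
• $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$, estimated from data by $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$ or $r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}}$

• Measures linear relationship between two random variables
  – ρ > 0 : positive correlation; ρ < 0 : negative correlation
  – |ρ| = 1: X and Y are perfectly linearly related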

Example 7

• (47, 41) is called an influential outlier

Time Series

Time Series Graphs
• The explanatory variate is time
• The response variate is the measured variable of interest at time t
• Neighbouring points are joined by straight lines rather than a simple scatter plot
• Time series graphs can be used to look at trends, seasonal patterns, etc.

Statistical Science

• Statistics is the science of design and collection of data used to draw conclusions about a larger population.

• When we collect this data, we’re always going to have uncertainty

• We fit our data to known probability models to quantify these uncertainties

Terminology

• Descriptive Statistics (Chapter 1)
  – Tools and techniques used to describe certain attributes of a population
  – Graphs, charts, numerical summaries

• Statistical Inference (Rest of Course)
  – A problem solving method using data to draw general conclusions on a population

Statistical Inference

• Estimation Problems
  – After collection of data, we fit the data to probability models
  – Using the collected data, form estimates for the parameters of the models

• Hypothesis Testing
  – Accepting or rejecting a statement about the target population

Probability Models

• Random Variables
  – Represent what we’re going to measure in our experiment

• Realizations
  – Represent the actual data we’ve collected from our experiment

Probability Functions

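• CDF: $F(x) = P(X \le x) = \sum_{t \le x} P(X = t)$ (discrete) or $\int_{-\infty}^{x} f(y)\,dy$ (cts.)

• $E[g(X)] = \sum_{x} g(x) f(x)$ (discrete) or $\int_{-\infty}^{\infty} g(x) f(x)\,dx$ (cts.)

• Var(X) = E(X^2) – [E(X)]^2

• E(aX + b) = aE(X) + b

• Var(aX + b) = a^2 Var(X)

• $P(a \le Y \le b) = \sum_{a \le y \le b} P(Y = y)$ (discrete) or $\int_{a}^{b} f(y)\,dy$ (cts.)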

Example 8

A random variable X has a continuous probability model with a cumulative distribution function (cdf)

$F(x) = P(X \le x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x)$

Give an expression for the expected value of $Y = \sin(X)$.

Do not evaluate any sums or integrals.
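One possible route to the requested expression, sketched in LaTeX. It assumes the cdf reads F(x) = 1/2 + (1/π) arctan(x) and uses the continuous-case rule E[g(X)] = ∫ g(x) f(x) dx: differentiate the cdf to get the pdf, then substitute g(x) = sin(x).

```latex
\begin{align*}
f(x) &= \frac{d}{dx} F(x)
      = \frac{d}{dx}\left[\frac{1}{2} + \frac{1}{\pi}\arctan(x)\right]
      = \frac{1}{\pi\,(1 + x^2)}, \qquad -\infty < x < \infty \\[4pt]
E(Y) &= E[\sin(X)]
      = \int_{-\infty}^{\infty} \sin(x)\, f(x)\, dx
      = \int_{-\infty}^{\infty} \frac{\sin(x)}{\pi\,(1 + x^2)}\, dx
\end{align*}
```

The question asks only for the expression, so the integral is left unevaluated.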

Probability Models

• Binomial (binary data)
  – Fixed number of trials (n) and fixed probability (π) of success on each (Bernoulli) trial
  – $P(X = x; n, \pi) = \binom{n}{x}\pi^{x}(1 - \pi)^{n - x}$ ; x = 0,1,…,n

• Poisson (discrete data)
  – Events occur at a constant rate (λ)
  – $P(X = x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}$ ; x = 0,1,2,…

• Exponential (continuous data)
  – Waiting time between events occurring at rate λ
  – $f(x; \lambda) = \lambda e^{-\lambda x}$ ; x > 0

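Gaussian Distribution and CLT
Gaussian Distribution
• $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$
• If Y ~ G(μ,σ), then $Z = \frac{Y - \mu}{\sigma}$ ~ G(0,1)
• If Y1,Y2,…,Yn are independent G(μ1,σ1), G(μ2,σ2), … , G(μn,σn):
  – $\sum_{i=1}^{n} Y_i \sim G\left(\sum_{i=1}^{n} \mu_i,\ \sqrt{\sum_{i=1}^{n} \sigma_i^2}\right)$

Central Limit Theorem (CLT)
• For any iid RVs W1,W2,…,Wn with mean μ and s.d. σ:
  – If $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i$, then $E(\bar{W}) = \mu$ and $SD(\bar{W}) = \frac{\sigma}{\sqrt{n}}$
  – $\frac{\bar{W} - \mu}{\sigma / \sqrt{n}}$ ~ G(0,1), approximately, for large n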

Example 9
We are given that non-diabetics have glucose levels represented by a random variable which follows a G(5.31, 0.58) distribution. Diabetics have glucose levels represented by a random variable which follows a G(11.74, 3.5) distribution. When taking a test, if the person’s glucose level measures higher than 6.5, they will be diagnosed as diabetic.

• If a person is diabetic, what is the probability that he/she is diagnosed correctly?

• What is the probability that a non-diabetic is diagnosed as diabetic?
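A hedged Python sketch of Example 9, assuming G(μ, σ) denotes a Gaussian with mean μ and standard deviation σ. The helper `gaussian_cdf` is hypothetical and is built on the standard library's `math.erf` rather than a statistics table.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """P(Y <= x) for Y ~ G(mu, sigma), with sigma the standard deviation."""
    z = (x - mu) / sigma                      # standardize: Z ~ G(0, 1)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

threshold = 6.5  # a glucose level above this is diagnosed as diabetic

# Diabetic diagnosed correctly: P(Y > 6.5) with Y ~ G(11.74, 3.5)
p_correct = 1.0 - gaussian_cdf(threshold, mu=11.74, sigma=3.5)

# Non-diabetic diagnosed as diabetic: P(Y > 6.5) with Y ~ G(5.31, 0.58)
p_false_positive = 1.0 - gaussian_cdf(threshold, mu=5.31, sigma=0.58)

print(f"P(correct diagnosis | diabetic)   = {p_correct:.4f}")
print(f"P(diagnosed diabetic | non-diab.) = {p_false_positive:.4f}")
```

By hand, the standardized cut-offs are about z = -1.50 for diabetics and z = 2.05 for non-diabetics, which give probabilities of roughly 0.93 and 0.02 from a G(0,1) table.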

Response Model

• Problem: what is μ, the average of the attribute of interest in the target population

• We will use our collected data to estimate μ
• Let Y be a random variable that represents the measured response variate
• Y = μ + R, where R ~ G(0, σ)
  – Y ~ G(μ, σ)
  – μ is systematic (no risk), while R is random (variable)

Maximum Likelihood Estimation

• Binomial: $\hat{\pi} = \frac{x}{n}$ ; x = # of successes

• Response: $\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ ; yi is the ith realization

• Maximum Likelihood Estimation
  – A procedure used to determine a parameter estimate given any model

Maximum Likelihood Estimation

• First, we assume our data collected will follow a distribution

• Before we collect the sample: random variables
  – {Y1, Y2, …, Yn}

• After we collect the sample: realizations
  – {y1, y2, …, yn}

• We know the distribution of Yi (with unknown parameters), hence we know the PDF/PMF

Likelihood Function

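• The Likelihood Function:
  – Discrete: $L(\theta) = P(Y = y; \theta) = \prod_{i=1}^{n} P(Y_i = y_i; \theta), \quad \theta \in \Omega$
  – Continuous: $L(\theta) = \prod_{i=1}^{n} f(y_i; \theta), \quad \theta \in \Omega$

• Likelihood: the probability of observing the dataset you have
  – We want to choose an estimate of the parameter θ that gives the largest such probability
  – Ω is the parameter space, the set of possible values for θ
  – Relative Likelihood: $R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}$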

MLE Process

• Step One: Define the likelihood function

• Step Two: Define the log likelihood function ln[L(θ)]

• Step Three: Take the derivative with respect to θ

• Step Four: Solve for zero to arrive at the maximum likelihood estimate

• Step Five: Plug in data values (if given) to arrive at a numerical maximum likelihood estimate

Examples 10/11
Discrete: What is the MLE of θ for a geometric distribution with pmf $P(X = x; \theta) = \theta(1 - \theta)^{x - 1}$, x ≥ 1?

Continuous: Given Y ~ Exp(θ), with realizations y1,y2,…,yn, find the maximum likelihood estimate of θ. What is the MLE for the realizations {3, 2, 1, 4}?
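The continuous example can be checked with the Python sketch below. It assumes Exp(θ) is parametrized by its mean, f(y; θ) = (1/θ) e^(-y/θ); the slides do not state which parametrization is meant, so treat this as an assumption. Following the five MLE steps, ln L(θ) = -n ln θ - Σyi/θ, and setting the derivative to zero gives the sample mean as the estimate. A coarse grid search is included only as a sanity check.

```python
import math

realizations = [3, 2, 1, 4]

def log_likelihood(theta, data):
    """ln L(theta), assuming f(y; theta) = (1/theta) * exp(-y/theta) (mean parametrization)."""
    n = len(data)
    return -n * math.log(theta) - sum(data) / theta

# Steps three and four: d/dtheta ln L = -n/theta + sum(y)/theta^2 = 0  =>  theta_hat = y_bar
closed_form_estimate = sum(realizations) / len(realizations)

# Sanity check: maximize ln L over a grid of candidate theta values in (0, 10].
grid = [k / 1000 for k in range(1, 10001)]
grid_estimate = max(grid, key=lambda theta: log_likelihood(theta, realizations))

print(f"closed-form MLE = {closed_form_estimate}")
print(f"grid-search MLE = {grid_estimate}")
```

If Exp(θ) is instead read as a rate, f(y; θ) = θ e^(-θy), the same steps give the reciprocal of the sample mean, 1/2.5 = 0.4. The geometric example follows the same five steps and leads to an estimate of n divided by the sum of the observations.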
