Copyright (c) Bani Mallick 1
Lecture 1
STAT 651
Topics in Lecture #1
Welcome and basic mechanics of the course
Samples and populations
Relative frequency histograms
The sample mean
Book Sections Covered in Lecture #1
Chapter 1
Chapter 3.3, pages 46-53
The Web Site
Go to http://stat.tamu.edu/~bmallick/651/651.html
Please make sure to check the web site regularly for notes from me and the TA
I apologize in advance for any typos
Emails and all that
The TA will answer detailed questions about homework
Check the web site for the TA name, office, email address and office hours
My email ([email protected]) should only be used as a last resort, or to set appointments (Spring 2004 only).
Office Hours (Spring 2003 only)
My office hours are as follows (Spring 2003)
Tuesdays: 11:00-12:30, 4:00-5:00
Thursdays: 11:00-12:30
The TA will also have office hours
I am not available outside the office hours
Printing The Lectures
The lectures are set up as PowerPoint files. You can download them from the STAT 651 web site
You can print them 2 or 3 per page
Go to “file”, then “print”. A little box will open, and in the bottom left you will see “print what”. It should simply say “slides”, but you can click to open the available “handout” options
Other Web Material
All homework assignments
All data sets
Who Am I?
You can check out my personal web site
http://stat.tamu.edu/~bmallick
Course Mechanics
Exams (3).
You are encouraged to prepare “cheat sheets”, 3 pages for each exam.
No formulae memorization: this is an applied statistics class
Course Mechanics
You may bring the book to exams, but the cheat sheets will be more useful.
I will expect you to be able to interpret computer output: both mechanically and conceptually
The exams are multiple choice. Exam scores are curved.
Course Mechanics
We will use SPSS.
SPSS is available throughout campus
Once you learn SPSS, other packages such as SAS will be easy
The TA can give you help with SPSS
SPSS
• You are entitled to get SPSS at no additional cost
• Go to http://cis.tamu.edu/customer-sales/sell/student.php (Ignore any statement about cost)
• Go to http://stat.tamu.edu/~mspeed/spss for help
Course Mechanics
Course rules (dates of exams, percent they count, policy on late homework, policy on missed exams) are available at the class web site.
Please print them out and please read them.
NHANES
National Health and Nutrition Examination Survey
The major survey whereby the federal government monitors nutrition and health in the U.S.
I will focus on women aged 30-50
First some important definitions
NHANES
Population: The entire collection of individuals of interest
In NHANES, the population is all women in the U.S. aged 30-50
Since there are millions of such women, it is impractical to figure out the health and nutrition for all of them: it would cost billions of dollars to do so
NHANES
Sample: A subset of the population that is measured in lieu of measuring everyone in the population
Since we want the sample to represent the population, the goal is to make sure we sample a representative subset of the population
In NHANES, women were sampled at random from the population, the randomness meant to ensure that the sample is representative.
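The idea of drawing a sample at random can be sketched in a few lines of Python. This is only an illustration of simple random sampling, not a description of how NHANES was actually run: the population of ID numbers, its size, the sample size and the seed are all made-up assumptions.

```python
import random

# Hypothetical population: ID numbers standing in for individual women.
# The population size (100,000) and sample size (60) are illustrative assumptions.
population_ids = list(range(100_000))

rng = random.Random(651)  # fixed seed so the sketch is reproducible

# Simple random sample without replacement: every subset of 60 IDs
# is equally likely to be chosen.
sample_ids = rng.sample(population_ids, k=60)

print(len(sample_ids), "of", len(population_ids), "individuals sampled")
```

Only the 60 sampled individuals would be measured; everything said about the other 99,940 is inference.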
Samples and Populations
Warning: I will make a big deal about the difference between samples and populations
You will be asked multiple questions on every exam about this distinction.
They will be phrased in various ways.
This is the conceptually hardest part of this course
The sample is not the population: learn this!
Variables
What we measure: variables are things that we measure in a sample and population
They can be numerical: your height
They can be binary: your gender
They can be categorical: preference in soft drinks (Pepsi, Coke, Dr. Pepper, None, Other)
Random Variation
Different samples lead to different outcomes: This is a hard conceptual point
First we will do an experiment, then discuss the implications
Random Variation
Different samples lead to different outcomes: consider heights of males in this class
Sample #1: males whose SSNs end in 1, 2, 3 or 4
Sample #2: males whose SSNs end in 6, 7, 8 or 9
Note how the numbers will not be identical
Random Variation
Different samples lead to different outcomes: samples do not equal populations
One of the main goals of statistics: ascertain how far a sample result is from the population result
For example, how far is the sample mean height of 10 males from the population mean height?
This will require probability statements
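A small simulation may make the point concrete. It is a sketch under invented assumptions: the population of 5,000 heights (in cm, generated from a bell-shaped distribution centered at 175) is hypothetical and is not class data. Each sample of 10 gives a different sample mean, and none of them equals the population mean.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 5,000 male heights in cm (invented numbers).
population = [rng.gauss(175, 7) for _ in range(5000)]
population_mean = mean(population)


def sample_mean(size):
    """Mean height of one simple random sample of the given size."""
    return mean(rng.sample(population, size))


# Draw several samples of 10 and compare each sample mean
# with the population mean.
sample_means = [sample_mean(10) for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m:.1f} cm, off by {m - population_mean:+.1f} cm")
print(f"population mean = {population_mean:.1f} cm")
```

In practice the population mean is unknown, so the "off by" column cannot be computed; that gap is what the probability statements are meant to describe.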
A Warning!
Fancy statistical methods cannot rescue garbage data
Fancy statistical methods can help you gain insight into your data, over and above what seems obvious on its face
You should always worry about whether the sampled results are representative of the population, and whether your sample allows you to make inferences about the population.
Histograms
A graphical means of looking at a sample from a population.
Can be used to compare two populations.
Allows you to judge central tendency, variation, and other odd features of the data
A very useful graphical tool
Relative Frequency Histograms
Simplest graphical technique to describe a sample.
Divide range of variable into intervals of nearly equal length.
Plot the % of the data which falls in each interval.
Computers have various ways of choosing the intervals.
You’ll not do these by hand, ever, with me
Relative Frequency Histograms
Numerical Example: ages 26,29,30,34,37,38,39,41,43,45
Interval (selected arbitrarily by me): 26-30 31-35 36-40 41-45 46-50
Count # in each interval: 3 1 3 3 0
Compute % in each interval (relative frequencies): 30 10 30 30 0
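The counting steps in this numerical example can be reproduced with a short Python sketch. The ages and intervals are the ones on the slide; the variable names are just illustrative, and the course itself uses SPSS rather than Python.

```python
# Ages and intervals taken from the numerical example.
ages = [26, 29, 30, 34, 37, 38, 39, 41, 43, 45]
intervals = [(26, 30), (31, 35), (36, 40), (41, 45), (46, 50)]

# Count how many ages fall in each interval (both endpoints included).
counts = [sum(lo <= a <= hi for a in ages) for lo, hi in intervals]

# Relative frequency = percent of the sample in each interval.
n = len(ages)
percents = [100 * c / n for c in counts]

for (lo, hi), c, p in zip(intervals, counts, percents):
    print(f"{lo}-{hi}: count = {c}, relative frequency = {p:.0f}%")
```

The relative frequencies always add up to 100%, which is what makes histograms from samples of different sizes comparable.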
Numerical Example
Intervals: 26-30 31-35 36-40 41-45 46-50
% in interval: 30 10 30 30 0
[Figure: relative frequency histogram of the five age intervals; vertical axis from 0 to 30 percent]
NHANES
Two subpopulations (yes, populations can have subpopulations)
Subpopulation #1: All women in U.S. aged 30-50 and healthy in 1980 who developed breast cancer by 1995
Subpopulation #2: All women in U.S. aged 30-50 and healthy in 1980 who did not develop breast cancer by 1995
NHANES
Two samples
Sample #1: 59 women in U.S. aged 30-50 and healthy in 1980 who developed breast cancer by 1995
Sample #2: 60 women in U.S. aged 30-50 and healthy in 1980 who did not develop breast cancer by 1995
NHANES
One Variable Measured on each (sub)population
Saturated Fat intake in Diet:
This was measured by a 24-hour recall: each woman was asked once what she had eaten the previous day, and saturated fat intake was computed from her answer
This is a terrible measure of saturated fat intake (garbage data?), but all that is available
I would have done multiple days, at least
NHANES: What do we Expect?
Saturated Fat intake in Diet:
One would expect that the women who developed breast cancer tended to have higher levels of saturated fat in their diet.
What do the relative frequency histograms say?
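NHANES Saturated Fat Relative Frequency Histograms
(the scales are the same). What do you see?
[Figure: two relative frequency histograms of saturated fat intake, one panel for Cancer and one for Healthy; horizontal axis ticks at 25, 50, 75, 100; vertical axis: Percent, 0% to 30%]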
NHANES log(Saturated Fat) Relative Frequency Histograms
(the scales are the same). What do you see?
[Figure: two relative frequency histograms of Log(Saturated Fat), one panel for Cancer and one for Healthy; horizontal axis ticks at 2.00, 3.00, 4.00; vertical axis: Percent, 0% to 15%]
Construction in SPSS
• I will now show you a few things about SPSS
Construction in SPSS
• Select graphs in SPSS menu
• Select interactive
• Select Histogram
• Select percent instead of count for a relative frequency histogram
• Place variable of interest on X-axis
Construction in SPSS
• Select variable defining the populations and put it in “Panel Variables”
• The histograms will be side-by-side. I like them one on top of the other
• Double click on graph (may need to do this twice)
• A menu will pop up, go to “Arrangement”
Construction in SPSS
• Select “Down then Across”
• Then take over to PowerPoint (copy and paste)
• Click on the histogram in your PowerPoint presentation, and convert it to a Microsoft picture
• Change sizes, and edit as you wish
What Histograms Say
• Because each box is a relative frequency (percentage), you can use a histogram to learn a few things about the population
• You can also use them to compare two populations
• Whether one population has generally larger values
• Whether one population is more closely clumped
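What percentage of the healthy women ate less than 25 grams of saturated fat?
[Figure: two relative frequency histograms of saturated fat intake, one panel for Cancer and one for Healthy; horizontal axis ticks at 25, 50, 75, 100; vertical axis: Percent, 0% to 30%]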
What percentage of the healthy women ate less than 25 grams of saturated fat?
Look at the 3 bars, of about 18%, 20% and 28%, for a total of about 66%
[Figure: two relative frequency histograms of saturated fat intake, one panel for Cancer and one for Healthy; horizontal axis ticks at 25, 50, 75, 100; vertical axis: Percent, 0% to 30%]
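Reading a percentage off a relative frequency histogram is just adding bar heights. The tiny sketch below repeats that arithmetic; the three heights are the approximate eyeball readings quoted on the slide, not exact data, and the variable names are illustrative.

```python
# Approximate heights (in percent) of the three bars to the left of
# 25 grams in the Healthy panel, as read off the slide by eye.
bars_below_25_grams = [18, 20, 28]

# Each bar is the percent of the sample in its interval, so the percent
# below 25 grams is the sum of the bars to the left of 25.
percent_below_25 = sum(bars_below_25_grams)

print(f"About {percent_below_25}% of the sampled healthy women ate less than 25 grams")
```

Note that this describes the sample of healthy women; what it says about the corresponding population is a separate question.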
Histograms and Shifts: note how bottom plot has higher values
[Figure: two stacked relative frequency histograms, a sample from population A and a sample from population B (panel labels .00 and 1.00); horizontal axis: response, from -2 to 3; vertical axis: Percent, 0% to 12%]
Histograms and Variability: note how top plot has more concentrated values
[Figure: two stacked relative frequency histograms, a sample from population A and a sample from population B (panel labels .00 and 1.00); horizontal axis: v2, from -4 to 4; vertical axis: Percent, 0% to 20%]
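Shifts and differences in variability are easiest to see when both histograms use the same intervals. The sketch below builds two text histograms on shared bins. Everything in it is an assumption made for illustration: the two samples are simulated (B is shifted up by 1 relative to A), and the bin edges and the helper `relative_frequencies` are hypothetical choices, not what SPSS does internally.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Two hypothetical samples (invented for illustration): B is shifted
# upward relative to A but has the same spread.
sample_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_b = [rng.gauss(1.0, 1.0) for _ in range(200)]


def relative_frequencies(data, edges):
    """Percent of data in each bin [edges[i], edges[i+1]).

    Values outside the outermost edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return [100 * c / len(data) for c in counts]


# Use the SAME bin edges for both samples so the histograms are comparable.
edges = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
freq_a = relative_frequencies(sample_a, edges)
freq_b = relative_frequencies(sample_b, edges)

# One '#' per 2 percentage points.
for i in range(len(edges) - 1):
    label = f"[{edges[i]:>2}, {edges[i + 1]:>2})"
    print(f"{label}  A {'#' * round(freq_a[i] / 2):<25} B {'#' * round(freq_b[i] / 2)}")
```

Changing the second argument of `rng.gauss` for one sample (for example, 0.5 instead of 1.0) gives the variability comparison instead: same center, but one set of bars is more tightly clumped.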
The Population Mean
• In many problems, the goal is to make inference about the population mean of a numerical variable, e.g., saturated fat intake
• Define in words what you mean by the population mean!
The Population Mean
• In many problems, the goal is to make inference about the population mean of a numerical variable, e.g., saturated fat intake
• You’re right! The population mean is the average of all the outcomes in the population
• It cannot be measured, hence we take samples.
• BTW, what’s an average?
The Sample Mean
• Formal definition: If the sample is of size n and the data are X_1, …, X_n, then the sample mean is
\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}
• This is the sum over all the observed values, divided by the number of observations
Sample Mean: Example
\bar{X} = the sum over all the observed values, divided by the number of observations
Data: -4, -2, -2, -1, 0, 0, 0, 0, 2, 2, 3, 5
n =
sum =
\bar{X} =
Sample Mean: Example
\bar{X} = the sum over all the observed values, divided by the number of observations
Data: -4, -2, -2, -1, 0, 0, 0, 0, 2, 2, 3, 5
n = 12
sum =
\bar{X} =
Sample Mean: Example
\bar{X} = the sum over all the observed values, divided by the number of observations
Data: -4, -2, -2, -1, 0, 0, 0, 0, 2, 2, 3, 5
n = 12
sum = 3
\bar{X} =
Sample Mean: Example
\bar{X} = the sum over all the observed values, divided by the number of observations
Data: -4, -2, -2, -1, 0, 0, 0, 0, 2, 2, 3, 5
n = 12
sum = 3
\bar{X} = 3/12 = 0.25
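The same three steps (count, add, divide) can be checked in Python. The data are the twelve values from the example; the variable names are illustrative only.

```python
# The twelve observed values from the example.
data = [-4, -2, -2, -1, 0, 0, 0, 0, 2, 2, 3, 5]

n = len(data)      # number of observations
total = sum(data)  # sum over all the observed values
x_bar = total / n  # sample mean

print(f"n = {n}, sum = {total}, sample mean = {x_bar}")
```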