Introduction to Statistics Data description and summary.
Date posted: 22-Dec-2015
Transcript of Introduction to Statistics Data description and summary.
![Page 1: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/1.jpg)
Introduction to Statistics
Data description and summary
![Page 2: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/2.jpg)
Statistics
Derived from the word "state": originally, the collection of facts of interest to the state
The art of learning from data
Statistics are no substitute for judgment.
A scientific discipline used to collect, describe, summarize, and analyze data
Descriptive vs. inferential
We usually expect the collected data to support a meaningful conclusion beyond a merely descriptive figure or table
Inference extrapolates from the sample to the population, so it is a method of induction
![Page 3: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/3.jpg)
Probability
Drawing logical conclusions requires some assumptions about the chances of obtaining the different data values
The totality of these assumptions is referred to as a probability model
A deductive approach: it reasons from the assumed model to the data
![Page 4: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/4.jpg)
Statistics vs. probability
Source: http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/
![Page 5: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/5.jpg)
Data: a set of measurements
Character
  Nominal, e.g., color: red, green, blue
  Binary, e.g., (M, F), (H, T), (0, 1)
  Ordinal, e.g., attitude to war: agree, neutral, disagree
Numeric
  Discrete, e.g., number of children
  Continuous, e.g., distance, time, temperature
  Interval, e.g., Fahrenheit/Celsius temperature
  Ratio (real zero), e.g., distance, number of children
![Page 6: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/6.jpg)
Concepts about data
Population: The set of all units of interest (finite or infinite). E.g., all students at NCNU.
Sample: A subset/subgroup of the population actually observed. E.g., students in this room.
Variable: A property or attribute of each unit, e.g., age, height (a column field within a table)
Observation: Values of all variables for an individual unit (a row record in the table)
![Page 7: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/7.jpg)
Matrix form of raw data
[Figure: the sample laid out as a data matrix; each column is a variable and each row is an observation]
![Page 8: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/8.jpg)
Properties of measurements
Parameter: Numerical characteristic of the population, defined for each variable, e.g., proportion opposed to war
Statistic: Numerical function of the sample used to estimate a population parameter
Precision: Spread of the estimator of a parameter
Accuracy: How close the estimator is to the true value
Bias: Systematic deviation of the estimate from the true value
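To make the distinction between bias and precision concrete, here is a minimal Python sketch. The function name `bias_and_spread` and both lists of repeated estimates are hypothetical, invented only for illustration; the spread is measured with the sample standard deviation, which is one reasonable choice rather than the only one.

```python
from statistics import mean, stdev

def bias_and_spread(estimates, true_value):
    """Return (bias, spread) for repeated estimates of one parameter."""
    bias = mean(estimates) - true_value   # systematic deviation from the truth
    spread = stdev(estimates)             # small spread means high precision
    return bias, spread

# Hypothetical repeated estimates of a parameter whose true value is 10.
precise_but_biased = [12.0, 12.1, 11.9, 12.0]
unbiased_but_noisy = [8.0, 12.0, 9.0, 11.0]

print(bias_and_spread(precise_but_biased, 10.0))
print(bias_and_spread(unbiased_but_noisy, 10.0))
```

The first estimator is precise but inaccurate (tight spread, bias near 2); the second is unbiased but imprecise.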
![Page 9: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/9.jpg)
Accuracy vs. Precision
Source: http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/
![Page 10: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/10.jpg)
Is it a good sample?
Is it a representative sample of the population of interest? Pre-existing bias? Unavoidable errors?
![Page 11: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/11.jpg)
Describing data sets
Frequency tables and graphs
  Scatter plot, bar/pie chart (eye-catching)
Relative frequency tables and graphs
Grouped data with histograms, ogive (cumulative frequency), e.g., the Lorenz curve for national wealth distribution
Stem-and-leaf plot
Always plot your data appropriately - try several ways!
![Page 12: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/12.jpg)
Scatter plot
[Figure: scatter plot; horizontal axis: variable x or observation number; vertical axis: variable y or observation]
![Page 13: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/13.jpg)
Line graph (chart)
![Page 14: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/14.jpg)
Bar chart
![Page 15: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/15.jpg)
Relative frequency
[Table: relative frequency = frequency / n, where n = 200; e.g., 42/200 = 0.21]
![Page 16: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/16.jpg)
Pie chart
![Page 17: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/17.jpg)
Histogram
Class intervals: a trade-off between too few and too many classes
Class boundaries: left-end inclusion convention
  E.g., the interval 20-30 contains all values that are both greater than or equal to 20 and less than 30
  cf. right-end inclusion (MS Excel)
Pareto histogram: a bar chart with categories arranged from the highest to the lowest frequency
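The left-end inclusion convention can be sketched in a few lines of Python. The helper `class_frequencies`, the class width, and the data are all hypothetical choices for illustration; the only point is that a value on a boundary (20, 30, ...) is counted in the class that starts there.

```python
def class_frequencies(values, start, width, n_classes):
    """Count values per class using left-end inclusion: [lower, upper)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical data: 20 is counted in 20-30, and 30 is counted in 30-40.
data = [12, 20, 25, 29.9, 30, 31, 38, 40]
counts = class_frequencies(data, start=10, width=10, n_classes=4)
relative = [c / len(data) for c in counts]   # relative frequencies

print(counts)
print(relative)
```

Dividing each count by n gives the relative frequencies, which sum to 1.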
![Page 18: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/18.jpg)
The life hours of lamps
![Page 19: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/19.jpg)
Interpretation of histogram
Area under the histogram represents sample proportion
If too many intervals, too jagged (polygon graph); if too few, too smooth
Detecting the data distribution (chart)
  Symmetric or skewed
  Uni-modal or bi-modal
Used only for numerical data grouped into classes
![Page 20: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/20.jpg)
Ogive (cumulative relative frequency graph)
![Page 21: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/21.jpg)
Stem-and-leaf plot
The case of city minimum temperatures
Stem: the tens digit
Leaf: the ones digit
• Sort the data from smallest to largest before assigning stems and leaves
The length of a leaf row gives the frequency of that stem (interval)
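A stem-and-leaf plot of this kind can be built as follows. This is a minimal sketch that assumes non-negative whole numbers with tens-digit stems; the temperatures are hypothetical, not the city data from the slide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    plot = defaultdict(list)
    for v in sorted(values):        # sort first, as advised above
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical minimum temperatures.
temps = [35, 41, 28, 33, 47, 35, 22, 40]
for stem, leaves in stem_and_leaf(temps).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each printed row is one class interval, and the number of leaves in the row is its frequency.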
![Page 22: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/22.jpg)
Run chart
For time series data, it is often useful to plot the data in time sequence.
[Figure: run chart of electric cost by month; horizontal axis: month; vertical axis: electric cost (0 to 50000); points labeled 1 to 12]
![Page 23: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/23.jpg)
Summarizing data sets
Measures of location & central tendency
  Sample mean, sample median, sample mode
Measures of dispersion
  Sample variance, sample standard deviation
  Sample percentile (quartiles, quantiles)
Box (and whiskers) plots, QQ plots
![Page 24: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/24.jpg)
Mean
Simple average
Weighted average
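The slide's formulas did not survive extraction, so the following is a sketch under the usual definitions: the simple average divides the sum by n, and the weighted average divides the sum of value times weight by the total weight. The function names and the small frequency table are hypothetical.

```python
def simple_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical frequency table: 1 occurs twice, 2 occurs three times, 5 once.
values, freqs = [1, 2, 5], [2, 3, 1]
raw = [1, 1, 2, 2, 2, 5]            # the same data written out in full

print(weighted_mean(values, freqs))
print(simple_mean(raw))
```

Using frequencies as weights reproduces the simple mean of the raw data, which is why the weighted form is convenient for frequency tables.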
![Page 25: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/25.jpg)
Median
The middle value when the data are arranged in increasing (or decreasing) order.
![Page 26: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/26.jpg)
Mode
The value that occurs most frequently
If no single value occurs most frequently, all the values that occur at the highest frequency are called modal values.
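As a brief illustration, Python's standard library (3.8 or later) returns every value tied for the highest frequency. The data below are hypothetical.

```python
from statistics import multimode

# Hypothetical scores: 3 and 5 both occur twice, more often than any other value.
scores = [3, 5, 3, 4, 5, 1]
modal_values = multimode(scores)

print(modal_values)
```

With a single most frequent value the returned list has one element; with ties it lists all modal values.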
![Page 27: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/27.jpg)
Skewness
Right (positive) skew: adjusted by the log transformation
Left (negative) skew: adjusted by the exponential or squared transformation
Exercise: justify it yourselves
![Page 28: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/28.jpg)
A case of a bimodal histogram
![Page 29: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/29.jpg)
Mean or median?
Appropriate summary of the center of the data?
Mean: if the data have a symmetric distribution with light tails (i.e., a relatively small proportion of the observations lie away from the center of the data).
Median: if the distribution has heavy tails or is asymmetric.
Extreme values that are far removed from the main body of the data are called outliers. They have a large influence on the mean but not on the median.
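The effect of an outlier on the two summaries can be checked directly. The numbers below are hypothetical and chosen only to make the contrast obvious.

```python
from statistics import mean, median

# Hypothetical values, then the same values with one extreme observation added.
values = [30, 32, 35, 38, 40]
with_outlier = values + [400]

print(mean(values), median(values))
print(mean(with_outlier), median(with_outlier))
```

Adding the single extreme value moves the mean from 35 to almost 96, while the median only moves from 35 to 36.5.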
![Page 30: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/30.jpg)
Sample variance
(Check it!)
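The formula on this slide was lost in extraction. The sketch below assumes the usual definition, the sum of squared deviations from the sample mean divided by n - 1, and it assumes that "(Check it!)" refers to the computational identity that the sum of squared deviations equals the sum of squares minus n times the squared mean. Both function names and the data are hypothetical.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Same quantity via (sum of x^2 - n * xbar^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

data = [3, 4, 6, 7, 10]             # hypothetical data with mean 6
print(sample_variance(data), sample_variance_shortcut(data))
print(sqrt(sample_variance(data)))  # sample standard deviation
```

The two functions agree (7.5 for this data set); the square root of the variance is the sample standard deviation.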
![Page 31: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/31.jpg)
Linear computation of sample variance
if
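The condition after "if" was lost in extraction. A plausible reading, stated here as an assumption, is the standard result for a linear transformation: if y_i = a + b x_i, then the sample variance of y equals b squared times the sample variance of x, and the shift a has no effect. The values of a, b and the data below are hypothetical.

```python
from statistics import variance

x = [3, 4, 6, 7, 10]                 # hypothetical data
a, b = 32, 1.8                       # hypothetical shift and scale
y = [a + b * xi for xi in x]         # y_i = a + b * x_i

print(variance(y))
print(b ** 2 * variance(x))
```

Both printed values agree up to floating-point rounding.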
![Page 32: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/32.jpg)
Sample standard deviation
![Page 33: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/33.jpg)
Percentiles, Quartiles
The sample 100p percentile (p quantile) is that data value such that at least 100p percent of the data are less than or equal to it and at least 100(1-p) percent are greater than or equal to it.
The sample 25th percentile is called the first quartile, Q1; the sample 50th percentile is called the sample median or the second quartile, Q2; the sample 75th percentile is called the third quartile, Q3.
![Page 34: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/34.jpg)
Finding the sample percentiles
To determine the sample 100p percentile, Xp, of a data set of size n, we need to determine the data value such that
(1) At least np of the values are less than or equal to it.
(2) At least n(1-p) of the values are greater than or equal to it.
If np is NOT an integer, round up to the next integer and set the corresponding ordered observation as Xp.
If np is an integer K, average the Kth and (K+1)st ordered values. This average is then Xp.
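The rule above can be sketched in Python as follows. The function name `sample_percentile` and the data are hypothetical, and the small tolerance used to decide whether np is an integer is an implementation choice to guard against floating-point rounding.

```python
import math

def sample_percentile(values, p):
    """Sample 100p percentile by the rule above, for 0 < p < 1."""
    xs = sorted(values)
    np_ = len(xs) * p
    k = round(np_)
    if abs(np_ - k) < 1e-9:
        # np is an integer K: average the Kth and (K+1)st ordered values
        return (xs[k - 1] + xs[k]) / 2
    # otherwise round np up and take that ordered value
    return xs[math.ceil(np_) - 1]

data = [2, 4, 5, 7, 8, 10, 12, 15]   # hypothetical data, n = 8
q1, q2, q3 = (sample_percentile(data, p) for p in (0.25, 0.5, 0.75))
print(min(data), q1, q2, q3, max(data))
```

For this data set np is an integer for each quartile, so each is an average of two neighbouring ordered values; the printed line is the minimum, the three quartiles, and the maximum.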
![Page 35: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/35.jpg)
Five number summary
The minimum, the maximum, and the three quartiles Q1, Q2, Q3
![Page 36: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/36.jpg)
Box (and Whiskers) plots
A “box” starts at Q1 and continues to Q3; the length of the box is called the interquartile range (the middle 50% of the distribution).
The value of Q2 is indicated by a vertical line.
A straight line segment (i.e., the whiskers) stretching from the smallest to the largest data value (i.e., the range) is drawn on a horizontal axis.
[Figure: Case 1, box plot marking Min., Q1, Q2, Q3, Max.]
![Page 37: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/37.jpg)
Lower fence and upper fence
[Figure: Case 2, box plot marking Min., Q1, Median, Q3, Max., with possible outliers (*) beyond the fences]
Lower whisker extends to the adjacent value, the lowest value within the lower fence = Q1 - 1.5 (Q3 - Q1)
Upper whisker extends to the adjacent value, the highest value within the upper fence = Q3 + 1.5 (Q3 - Q1)
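The fence rule can be sketched as below. The quartiles are passed in as given numbers so that the example does not depend on any particular percentile convention; the helper names and the data are hypothetical.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 interquartile ranges beyond the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def whiskers_and_outliers(values, q1, q3):
    """Adjacent values (whisker ends) and the points outside the fences."""
    low, high = fences(q1, q3)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return min(inside), max(inside), outliers

# Hypothetical data whose quartiles are taken to be Q1 = 4.5 and Q3 = 11.
data = [2, 4, 5, 7, 8, 10, 12, 30]
print(fences(4.5, 11.0))
print(whiskers_and_outliers(data, 4.5, 11.0))
```

Here the value 30 lies above the upper fence, so the upper whisker stops at 12 and 30 is flagged as a possible outlier.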
![Page 38: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/38.jpg)
Normal sample distribution
For normal data and large samples
50% of the data values fall between mean ± 0.67s
68% of the data values fall between mean ± 1s
95% of the data values fall between mean ± 2s
99.7% of the data values fall between mean ± 3s
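These percentages can be checked against the standard normal distribution using Python's standard library. This is a sketch of the theoretical coverage only; real samples will match it approximately, and the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

def coverage(k):
    """Probability that a normal value lies within k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

for k in (0.67, 1, 2, 3):
    print(k, round(coverage(k), 4))
```

The printed values are close to 0.50, 0.68, 0.95 and 0.997, so the figures on the slide are rounded.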
![Page 39: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/39.jpg)
QQ (normal) plots
Sequentially compare the sample data to the quantiles of the theoretical (normal) distribution
The ith ordered data value is the pth quantile, p = (i - 0.5)/n
[Figure: QQ plot of raw data against quantiles of the standard normal]
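The points of a normal QQ plot can be computed as follows, using p = (i - 0.5)/n for the ith ordered value. The function name `normal_qq_points` and the data are hypothetical, and plotting is left out so the sketch needs only the standard library.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each ordered data value with its standard normal quantile."""
    xs = sorted(values)
    n = len(xs)
    z = NormalDist()
    return [(z.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical raw data.
for q, x in normal_qq_points([5.1, 4.8, 6.0, 5.5, 4.2]):
    print(round(q, 3), x)
```

If the data are roughly normal, plotting these pairs gives points close to a straight line.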
![Page 40: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/40.jpg)
Paired data sets (X, Y) and the sample correlation coefficient, r
![Page 41: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/41.jpg)
Illustrations of correlation
![Page 42: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/42.jpg)
r vs. Linear relation
If the two paired data sets x and y possess a linear relation, y = a + bx, with b > 0, then r = 1.
If the two paired data sets x and y possess a linear relation, y = a + bx, with b < 0, then r = -1.
r is just an indicator of how close the relation between x and y is to a perfect linear one.
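The formula for r did not survive extraction. The sketch below assumes the usual definition: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the data are hypothetical.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Sample correlation coefficient r of paired data."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
up = [2 + 3 * xi for xi in x]      # exact linear relation with b > 0
down = [2 - 3 * xi for xi in x]    # exact linear relation with b < 0
noisy = [2, 1, 4, 3, 6]            # roughly increasing, not exactly linear

print(sample_correlation(x, up), sample_correlation(x, down))
print(sample_correlation(x, noisy))
```

Exact linear relations give r = 1 or r = -1; the noisy series gives a value strictly between 0 and 1.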
![Page 43: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/43.jpg)
Properties of r
|r| ≤ 1 (why? See 2.6.1)
If r is positive, x and y tend to change in the same direction.
If r is negative, x and y tend to change in opposite directions.
Correlation measures association, not causation
Causation still needs the other necessary conditions: time sequence, exclusion
E.g., wealth and health problems both go up with age. Does wealth cause health problems?
![Page 44: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/44.jpg)
Chebyshev’s inequality
Let Set
(The lower bound)
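The definitions on this slide were lost in extraction. As an assumption, the sketch below uses the standard sample form of Chebyshev's inequality: if N(k) is the number of data values within k sample standard deviations of the sample mean, then N(k)/n is at least 1 - (n - 1)/(n k^2), which in turn exceeds 1 - 1/k^2. The function names and the data are hypothetical.

```python
from statistics import mean, stdev

def fraction_within(values, k):
    """N(k)/n: share of values with |x - mean| < k * s."""
    xbar, s = mean(values), stdev(values)
    inside = sum(1 for x in values if abs(x - xbar) < k * s)
    return inside / len(values)

def chebyshev_lower_bound(n, k):
    """Assumed sample-form bound: 1 - (n - 1) / (n * k^2)."""
    return 1 - (n - 1) / (n * k ** 2)

data = [2, 4, 5, 7, 8, 10, 12, 30]   # hypothetical data
for k in (1.5, 2, 3):
    print(k, fraction_within(data, k), chebyshev_lower_bound(len(data), k))
```

For every k the observed fraction is at least the bound, whatever the shape of the data; the bound is usually far from tight.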
![Page 45: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/45.jpg)
Proof
Dividing both sides by
The next step? And the upper bound of N(k)/n
![Page 46: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/46.jpg)
Categorizing the bi-variate data
![Page 47: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/47.jpg)
Simpson’s paradox
Lurking variables excluded from consideration can change or reverse a relation between two categorical variables
![Page 48: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/48.jpg)
Gender bias of graduate admissions
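| Group | Gender | Admitted | Rejected | Admission rate |
|---|---|---|---|---|
| Overall | Male | 35 | 45 | 35/80 |
| Overall | Female | 20 | 40 | 20/60 |
| Engineering school | Male | 30 | 30 | 30/60 |
| Engineering school | Female | 10 | 10 | 10/20 |
| Art school | Male | 5 | 15 | 5/20 |
| Art school | Female | 10 | 30 | 10/40 |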
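The admission counts above can be recomputed in a few lines to see where the apparent bias comes from. The dictionary layout and function names are hypothetical; the counts are the ones from the slide, with each pair read as (admitted, rejected).

```python
# (admitted, rejected) per school and gender, as read from the slide.
admissions = {
    "Engineering school": {"Male": (30, 30), "Female": (10, 10)},
    "Art school": {"Male": (5, 15), "Female": (10, 30)},
}

def rate(admitted, rejected):
    return admitted / (admitted + rejected)

def overall_rate(gender):
    """Admission rate after pooling both schools."""
    admitted = sum(school[gender][0] for school in admissions.values())
    rejected = sum(school[gender][1] for school in admissions.values())
    return rate(admitted, rejected)

for school, by_gender in admissions.items():
    print(school, {g: rate(*counts) for g, counts in by_gender.items()})
print("Overall", {g: overall_rate(g) for g in ("Male", "Female")})
```

Within each school the two rates are equal (0.5 and 0.25), yet the pooled rates differ (35/80 against 20/60) because a larger share of the female applicants applied to the school with the lower admission rate. The school is the lurking variable.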
![Page 49: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/49.jpg)
Homework #1
Chapter 1: Problems 2, 6
Chapter 2: Problem 15 (You should use Excel or the software included with the book to compute the data.)
![Page 50: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/50.jpg)
Graphical Excellence
“Complex ideas communicated with clarity, precision, and efficiency”
Shows the data
Makes you think about substance rather than method, graphic design, or something else
Many numbers in a small space
Makes large data sets coherent
Encourages the eye to compare different pieces of the data
![Page 51: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/51.jpg)
ACCENT Principles for effective graphical display
Apprehension: Ability to correctly perceive relations among variables.
Does the graph maximize apprehension of the relations among variables?
Clarity: Ability to visually distinguish all the elements of a graph.
Are the most important elements or relations visually most prominent?
Consistency: Ability to interpret a graph based on similarity to previous graphs.
Are the elements, symbol shapes and colors consistent with their use in previous graphs?
![Page 52: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/52.jpg)
ACCENT Principles for effective graphical display (Cont.)
Efficiency: Ability to portray a possibly complex relation in as simple a way as possible.
Are the elements of the graph economically used? Is the graph easy to interpret?
Necessity: The need for the graph, and the graphical elements.
Is the graph a more useful way to represent the data than alternatives (table, text)?
Are all the graph elements necessary to convey the relations?
Truthfulness: Ability to determine the true value represented by any graphical element by its magnitude relative to the implicit or explicit scale.
Are the graph elements accurately positioned and scaled?
Source: http://www.math.yorku.ca/SCS/Gallery/, Adapted from: D. A. Burn (1993), "Designing Effective Statistical Graphs", in C. R. Rao, ed., Handbook of Statistics, vol. 9, Chapter 22.
![Page 53: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/53.jpg)
Lies on graphical display (1)
![Page 54: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/54.jpg)
Lies on graphical display (2)
![Page 55: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/55.jpg)
Lies on graphical display (3)
![Page 56: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/56.jpg)
Lies on graphical display (4)
![Page 57: Introduction to Statistics Data description and summary.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d7f5503460f94a62c08/html5/thumbnails/57.jpg)
Advice on graphical display
Changes in the scale of the graphic should always correspond to changes in the data being represented
Avoid confusing dimensions
Beware of misunderstandings caused by goosed-up (exaggerated) graphics
Don’t quote data out of context