Introduction to Statistics


Data description and summary

Statistics

Derived from the word "state": the collection of facts of interest to the state

The art of learning from data

Statistics are no substitute for judgment.

A scientific discipline used to collect, describe, summarize, and analyze data

Descriptive vs. inferential

We usually expect to draw a meaningful conclusion from the collected data, beyond a merely descriptive figure or table

An extrapolative inference from the sample to the population, a method of induction

Probability

Assumptions about the chances of obtaining the different data values, used to draw logical conclusions

The totality of these assumptions is referred to as a probability model

A deductive approach

Statistics vs. probability

Source: http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/

Data: a set of measurements

Character

  Nominal, e.g., color: red, green, blue

  Binary, e.g., (M, F), (H, T), (0, 1)

  Ordinal, e.g., attitude to war: agree, neutral, disagree

Numeric

  Discrete, e.g., number of children

  Continuous, e.g., distance, time, temperature

  Interval, e.g., Fahrenheit/Celsius temperature

  Ratio (real zero), e.g., distance, number of children

Concepts about data

Population: The set of all units of interest (finite or infinite), e.g., all students at NCNU

Sample: A subset/subgroup of the population actually observed. E.g., students in this room.

Variable: A property or attribute of each unit, e.g., age, height (a column field within a table)

Observation: Values of all variables for an individual unit (a row record in the table)

Matrix form of raw data

[Figure: data matrix; each column is a variable, each row is an observation, and the observed rows together form the sample]
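The matrix form of raw data can be sketched in a few lines. This is only an illustration: the variable names and every value below are made up, and a list of lists stands in for whatever table software is actually used.

```python
# Hypothetical sample: each row is one observation, each column one variable.
variables = ["age", "height_cm"]
sample = [
    [20, 172.0],   # observation 1
    [22, 165.5],   # observation 2
    [19, 180.2],   # observation 3
]

n_observations = len(sample)     # number of rows
n_variables = len(variables)     # number of columns

# All observed values of one variable: read down a column.
height_column = variables.index("height_cm")
heights = [row[height_column] for row in sample]

print(n_observations, n_variables, heights)
```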

Properties of measurements

Parameter: Numerical characteristic of the population, defined for each variable, e.g., proportion opposed to war

Statistic: Numerical function of the sample used to estimate a population parameter

Precision: Spread of the estimator of a parameter

Accuracy: How close the estimator is to the true value

Bias: Systematic deviation of the estimate from the true value
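A small numerical sketch may help separate bias from precision. The true value and both sets of repeated estimates are invented for illustration; bias is measured as the average estimate minus the truth, and precision is judged by the standard deviation of the estimates.

```python
from statistics import mean, stdev

true_value = 10.0  # hypothetical population parameter

# Hypothetical repeated estimates from two estimators.
estimates_a = [9.9, 10.1, 10.0, 9.8, 10.2]    # centered on the truth, wider spread
estimates_b = [12.0, 12.1, 11.9, 12.0, 12.0]  # tightly clustered but off target


def bias(estimates, truth):
    """Systematic deviation: average estimate minus the true value."""
    return mean(estimates) - truth


def spread(estimates):
    """Spread of the estimates; a smaller value means a more precise estimator."""
    return stdev(estimates)


for name, est in (("A", estimates_a), ("B", estimates_b)):
    print(name, "bias:", round(bias(est, true_value), 3), "spread:", round(spread(est), 3))
```

Estimator B is the more precise of the two but is biased; estimator A is less precise but unbiased.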

Accuracy vs. Precision

Source: http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/

Is it a good sample?

Is it a representative sample of the population of interest?

Pre-existing bias? Unavoidable errors?

Describing data sets

Frequency tables and graphs

Scatter plot, bar/pie chart (for visual appeal)

Relative frequency tables and graphs

Grouped data with histograms, ogive (cumulative frequency), e.g., the Lorenz curve for national wealth distribution

Stem-and-leaf plot

Always plot your data appropriately - try several ways!

Scatter plot

[Figure: scatter plot; horizontal axis: variable X or observation number; vertical axis: variable Y or observation]

Line graph (chart)

Bar chart

Relative frequency = frequency / n, e.g., with n = 200, a frequency of 42 gives 42/200 = 0.21
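A relative frequency is a count divided by the sample size n, so 42 observations out of n = 200 give 0.21. The sketch below builds a small frequency and relative frequency table; the function name and the color data are made up for illustration.

```python
from collections import Counter


def relative_frequency_table(values):
    """Map each distinct value to (frequency, relative frequency)."""
    n = len(values)
    counts = Counter(values)
    return {value: (freq, freq / n) for value, freq in sorted(counts.items())}


data = ["red", "green", "red", "blue", "red", "green"]  # made-up observations
table = relative_frequency_table(data)

for value, (freq, rel) in table.items():
    print(f"{value:6s} {freq:3d} {rel:.3f}")
```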

Pie chart

Histogram

Class intervals: a trade-off between too few and too many classes

Class boundaries: left-end inclusion convention

E.g., the interval 20-30 contains all values that are both greater than or equal to 20 and less than 30

cf. the right-end inclusion convention (MS Excel)

Pareto histogram: a bar chart with categories arranged from the highest to the lowest
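The left-end inclusion convention can be sketched as follows. The function name, its parameters, and the data are hypothetical; classes are assumed to have equal width, and values outside the covered range are simply ignored.

```python
def class_counts(values, start, width, n_classes):
    """Count values per class [start + i*width, start + (i+1)*width).

    Left-end inclusion: a value equal to a boundary goes to the class
    that begins at that boundary.
    """
    counts = [0] * n_classes
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts


# Two classes: 20-30 and 30-40. The value 30 falls in the second class,
# and 40 falls outside both.
print(class_counts([20, 25, 29.9, 30, 39, 40], start=20, width=10, n_classes=2))
```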

The life hours of lamps

Interpretation of histogram

Area under the histogram represents the sample proportion

If too many intervals, too jagged (polygon graph)

If too few, too smooth

Detecting the data distribution (chart)

Symmetric or skewed

Uni-modal or bi-modal

Only used for categorizing numerical data

Ogive (cumulative relative frequency graph)

Stem-and-leaf plot

The case of city minimum temperatures

[Figure: stem-and-leaf plot; the stem is the tens digit, the leaf is the ones digit]

Sort the data from the smallest to the largest before the stem-and-leaf assignment

The length of a leaf row gives the frequency of its stem (interval)
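A stem-and-leaf assignment for non-negative whole numbers can be sketched like this. The temperatures are invented, not the city data from the slide, and the function name is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sort first, as the slide advises
        stem, leaf = divmod(v, 10)    # tens part, ones digit
        plot[stem].append(leaf)
    return dict(plot)


temps = [24, 31, 18, 27, 35, 21, 30, 19]  # made-up minimum temperatures
for stem, leaves in sorted(stem_and_leaf(temps).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

The number of leaves printed on a row is the frequency of that stem.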

Run chart

For time series data, it is often useful to plot the data in time sequence.

[Figure: run chart of electric cost by month; horizontal axis: month (1 to 12); vertical axis: electric cost (0 to 50000)]

Summarizing data sets

Measures of location & central tendency: sample mean, sample median, sample mode

Measures of dispersion: sample variance, sample standard deviation

Sample percentile (quartiles, quantiles)

Box (and whiskers) plots, QQ plots

Mean

Simple average

Weighted average
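A minimal sketch of the two averages, with hypothetical function names. The weighted form is shown with frequencies as weights, which reproduces the simple average of the expanded data.

```python
def simple_average(xs):
    """Sum of the values divided by how many there are."""
    return sum(xs) / len(xs)


def weighted_average(values, weights):
    """Sum of value * weight divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up data: number of children per family, stored as value/frequency pairs.
values = [0, 1, 2]
freqs = [2, 3, 5]
expanded = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]

print(simple_average(expanded), weighted_average(values, freqs))
```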

Median

The middle value when the data are arranged in increasing/decreasing order.

Mode

The value that occurs most frequently

If no single value occurs most frequently, all the values that occur at the highest frequency are called mode values.
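The median and mode definitions can be sketched as below. One assumption is added: for an even number of values the two middle values are averaged, which is the usual convention but is not stated on the slide. Function names are hypothetical.

```python
from collections import Counter


def sample_median(xs):
    """Middle value of the sorted data (assumed: average the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def mode_values(xs):
    """All values that occur at the highest frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(sample_median([4, 1, 3, 2]), mode_values([1, 2, 2, 3, 3]))
```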

Skewness

Right-skewed (long right tail): adjusted by the log transformation

Left-skewed (long left tail): adjusted by the exponential or squared transformation

Exercise: justify it yourselves

A case of bimodal histogram

Mean or median?

Appropriate summary of the center of the data?

Mean: if the data have a symmetric distribution with light tails (i.e., a relatively small proportion of the observations lie away from the center of the data).

Median: if the distribution has heavy tails or is asymmetric.

Extreme values that are far removed from the main body of the data are called outliers. They have a large influence on the mean but not on the median.
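The influence of an outlier on the two summaries can be seen with a tiny made-up data set, using Python's standard `statistics` module.

```python
from statistics import mean, median

data = [10, 11, 12, 13, 14]       # made-up values, symmetric around 12
with_outlier = data + [100]       # the same data plus one extreme value

print("mean:  ", mean(data), "->", round(mean(with_outlier), 2))
print("median:", median(data), "->", median(with_outlier))
```

The single extreme value more than doubles the mean, while the median barely moves.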

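Sample variance

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \quad \text{(Check it!)} \]

Linear computation of sample variance

\[ s_y^2 = b^2 s_x^2 \quad \text{if } y_i = a + b x_i,\ i = 1, \dots, n \]

Sample standard deviation

\[ s = \sqrt{s^2} \]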
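A sketch of these quantities, assuming the usual definition of the sample variance with divisor n - 1. It also checks two standard facts numerically: the shortcut that replaces the sum of squared deviations by the sum of squares minus n times the squared mean, and the rule that a linear change y = a + b x multiplies the variance by b squared. All names and data are made up.

```python
from math import isclose, sqrt


def sample_mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Same quantity via sum(x^2) - n * mean^2."""
    n = len(xs)
    m = sample_mean(xs)
    return (sum(x * x for x in xs) - n * m * m) / (n - 1)


def sample_std(xs):
    return sqrt(sample_variance(xs))


xs = [3, 4, 6, 7, 10]              # made-up data
a, b = 2.0, 3.0                    # hypothetical linear transformation
ys = [a + b * x for x in xs]

print(sample_variance(xs), sample_variance_shortcut(xs), sample_variance(ys))
```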

Percentiles, quartiles

The sample 100p percentile (p quantile) is that data value such that 100p percent of the data are less than or equal to it and 100(1-p) percent are greater than or equal to it.

The sample 25th percentile is called the first quartile, Q1; the sample 50th percentile is called the sample median or the second quartile, Q2; the sample 75th percentile is called the third quartile, Q3.

Finding the sample percentiles

To determine the sample 100p percentile of a data set of size n, Xp, we need to determine the data value such that

(1) At least np of the values are less than or equal to it.

(2) At least n(1-p) of the values are greater than or equal to it.

If np is NOT an integer, round up to the next integer and take the corresponding ordered observation as Xp.

If np is an integer K, average the Kth and (K+1)st ordered values. This average is then Xp.
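The rule above can be sketched directly. The function name is hypothetical, p is assumed to lie strictly between 0 and 1, and a small tolerance is used to decide whether np is an integer because of floating-point rounding.

```python
from math import ceil


def sample_percentile(data, p):
    """Sample 100p percentile using the round-up / average rule."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    np_ = n * p
    k = round(np_)
    if k >= 1 and abs(np_ - k) < 1e-9:
        # np is an integer K: average the Kth and (K+1)st ordered values.
        return (xs[k - 1] + xs[k]) / 2
    # np is not an integer: round up and take that ordered value.
    return xs[ceil(np_) - 1]


data = list(range(1, 11))  # made-up data: 1, 2, ..., 10
q1, q2, q3 = (sample_percentile(data, p) for p in (0.25, 0.5, 0.75))
print(q1, q2, q3)
```

With the minimum and maximum added, the three quartiles computed this way give the five number summary.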

Five number summary

The minimum, the maximum, and the three quartiles Q1, Q2, Q3

Box (and whiskers) plots

A “box” starts at Q1 and continues to Q3; the length of the box is the interquartile range (the middle 50% of the distribution)

The value of Q2 is indicated by a vertical line

A straight line segment (the whiskers) stretching from the smallest to the largest data value (i.e., the range) is drawn on a horizontal axis.

Case 1. [Figure: box plot on a horizontal axis marked Min., Q1, Q2, Q3, Max.]

Lower fence and upper fence

Case 2. [Figure: box plot marked Min., Q1, Median, Q3, Max., with possible outliers (*) beyond the fences]

The lower whisker extends to the adjacent value, the lowest value within the lower fence = Q1 - 1.5 (Q3 - Q1)

The upper whisker extends to the adjacent value, the highest value within the upper fence = Q3 + 1.5 (Q3 - Q1)
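The fence rule can be sketched as follows. The function name and data are made up, and Q1 and Q3 are passed in rather than computed, since quartile conventions differ between tools.

```python
def box_plot_fences(data, q1, q3):
    """Fences, whisker ends (adjacent values) and possible outliers."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "lower_whisker": min(inside),   # lowest value within the lower fence
        "upper_whisker": max(inside),   # highest value within the upper fence
        "outliers": outliers,
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # made-up data with one extreme value
summary = box_plot_fences(data, q1=3, q3=8)
print(summary)
```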

Normal sample distribution

For normal data and large samples:

50% of the data values fall between mean ± 0.67s

68% of the data values fall between mean ± 1s

95% of the data values fall between mean ± 2s

99.7% of the data values fall between mean ± 3s
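These percentages can be checked against the standard normal distribution with Python's `statistics.NormalDist`. The helper name is hypothetical; the slide's figures are rounded, so the check is approximate.

```python
from statistics import NormalDist


def share_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (0.67, 1, 2, 3):
    print(k, round(share_within(k), 4))
```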

QQ (normal) plots

Sequentially compare the sample data to the quantiles of the theoretical (normal) distribution

The ith ordered data value is the pth quantile, p = (i-0.5)/n

[Figure: normal QQ plot; horizontal axis: quantiles of standard normal; vertical axis: raw data]
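The points of a normal QQ plot can be computed without any plotting library. This sketch pairs the ith ordered data value with the standard normal quantile at p = (i-0.5)/n; the function name and the data are made up.

```python
from statistics import NormalDist


def normal_qq_pairs(data):
    """Return (standard normal quantile, ordered data value) pairs."""
    xs = sorted(data)
    n = len(xs)
    z = NormalDist()
    return [(z.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


pairs = normal_qq_pairs([5.1, 4.8, 5.6])   # made-up raw data
for quantile, value in pairs:
    print(round(quantile, 3), value)
```

If the data are roughly normal, these points fall close to a straight line.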

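Paired data sets (X, Y) and the sample correlation coefficient, r

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]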

Illustrations of correlation

r vs. Linear relation

If two paired data sets x and y possess a linear relation, y = a + bx, with b > 0, then r = 1.

If two paired data sets x and y possess a linear relation, y = a + bx, with b < 0, then r = -1.

r is just an indicator of how close to a perfect linear relation x and y are

Properties of r

|r| ≤ 1 (why? See 2.6.1)

If r is positive, x and y tend to change in the same direction.

If r is negative, x and y tend to change in opposite directions.
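A sketch of r, assuming the usual definition: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. It checks the two linear cases and the bound on made-up data; the function name is hypothetical.

```python
from math import isclose, sqrt


def sample_correlation(xs, ys):
    """Sample correlation coefficient of paired data (assumed usual definition)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = [2 + 3 * x for x in xs]    # y = a + bx with b > 0
falling = [2 - 3 * x for x in xs]   # y = a + bx with b < 0
noisy = [2, 1, 4, 3, 5]             # made-up data, roughly increasing

print(sample_correlation(xs, rising), sample_correlation(xs, falling), sample_correlation(xs, noisy))
```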

Correlation measures association, not causation

Causation still needs the other necessary conditions: time sequence, exclusion of other explanations

E.g., wealth and health problems both go up with age.

Does wealth cause health problems?

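Chebyshev’s inequality

Let \bar{x} and s be the sample mean and sample standard deviation of x_1, ..., x_n, with s > 0. For k ≥ 1, set

\[ S_k = \{ i : |x_i - \bar{x}| < ks \} \]

and let N(S_k) be the number of elements of S_k. Then

\[ \frac{N(S_k)}{n} \ge 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2} \quad \text{(The lower bound)} \]

Proof

\[ (n-1)s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \ge \sum_{i \notin S_k} (x_i - \bar{x})^2 \ge \bigl(n - N(S_k)\bigr) k^2 s^2 \]

Dividing both sides by nk^2 s^2

\[ \frac{n-1}{nk^2} \ge 1 - \frac{N(S_k)}{n} \]

The next step? And the upper bound of N(k)/n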
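A numerical check may make the lower bound concrete. The sketch assumes the usual sample form of Chebyshev's inequality: for k ≥ 1, the fraction of data values strictly within k sample standard deviations of the sample mean is at least 1 - (n-1)/(nk²). The function name and data are made up.

```python
from math import sqrt


def chebyshev_check(data, k):
    """Return (observed fraction within k*s of the mean, Chebyshev lower bound)."""
    n = len(data)
    m = sum(data) / n
    s = sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    inside = sum(1 for x in data if abs(x - m) < k * s)
    return inside / n, 1 - (n - 1) / (n * k * k)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # made-up, deliberately skewed data
for k in (1, 1.5, 2, 3):
    observed, bound = chebyshev_check(data, k)
    print(k, observed, round(bound, 3))
```

The bound holds for any data set, which is why it is usually much weaker than the percentages quoted for normal data.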

Categorizing the bi-variate data

Simpson’s paradox

Lurking variables excluded from consideration can change or reverse a relation between two categorical variables

Gender bias of graduate admissions

All schools          Male     Female
Ad.                  35       20
Rej.                 45       40
Admission rate       35/80    20/60

Engineering school   Male     Female
Ad.                  30       10
Rej.                 30       10
Admission rate       30/60    10/20

Art school           Male     Female
Ad.                  5        10
Rej.                 15       30
Admission rate       5/20     10/40
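The reversal can be reproduced from the admission counts. This sketch assumes the reading in which the Engineering school figures are 30/60 and 10/20 and the Art school figures are 5/20 and 10/40; the dictionary layout and function names are hypothetical.

```python
# (admitted, rejected) per school and gender, as read from the slide.
admissions = {
    "Engineering school": {"Male": (30, 30), "Female": (10, 10)},
    "Art school": {"Male": (5, 15), "Female": (10, 30)},
}


def admission_rate(admitted, rejected):
    return admitted / (admitted + rejected)


def pooled_rate(gender):
    """Admission rate for one gender with both schools combined."""
    admitted = sum(school[gender][0] for school in admissions.values())
    rejected = sum(school[gender][1] for school in admissions.values())
    return admission_rate(admitted, rejected)


for name, school in admissions.items():
    print(name, {g: admission_rate(*counts) for g, counts in school.items()})
print("Pooled", {g: round(pooled_rate(g), 4) for g in ("Male", "Female")})
```

Within each school the two rates are equal, yet the pooled rate is higher for men, because most men applied to the school with the higher admission rate. The school is the lurking variable.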

Homework #1

Chapter 1: Problems 2, 6

Chapter 2: Problem 15 (You had better use Excel or the book-included software to compute the data.)

Graphical Excellence

“Complex ideas communicated with clarity, precision, and efficiency”

Shows the data

Makes you think about substance rather than method, graphic design, or something else

Many numbers in a small space

Makes large data sets coherent

Encourages the eye to compare different pieces of the data

ACCENT Principles for effective graphical display

Apprehension: Ability to correctly perceive relations among variables.

Does the graph maximize apprehension of the relations among variables?

Clarity: Ability to visually distinguish all the elements of a graph.

Are the most important elements or relations visually most prominent?

Consistency: Ability to interpret a graph based on similarity to previous graphs.

Are the elements, symbol shapes and colors consistent with their use in previous graphs?


Efficiency: Ability to portray a possibly complex relation in as simple a way as possible

Are the elements of the graph economically used? Is the graph easy to interpret?

Necessity: The need for the graph, and the graphical elements.

Is the graph a more useful way to represent the data than alternatives (table, text)?

Are all the graph elements necessary to convey the relations?

Truthfulness: Ability to determine the true value represented by any graphical element by its magnitude relative to the implicit or explicit scale.

Are the graph elements accurately positioned and scaled?

Source: http://www.math.yorku.ca/SCS/Gallery/, adapted from D. A. Burn (1993), "Designing Effective Statistical Graphs", in C. R. Rao, ed., Handbook of Statistics, vol. 9, Chapter 22.

Lies in graphical display (1)

Advice on graphical display

Changes in the scale of the graphic should always correspond to changes in the data being represented

Avoid confusing dimensions

Beware of misleading impressions from goosed-up graphics

Don’t quote data out of context
