
K236: Basis of Data Science

Lecture 3. Review of univariate statistics

Lecturer: Tu Bao Ho and Hieu Chi Dam

TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai


Schedule of K236

1. Introduction to data science 6/9

2. Introduction to data science 6/13

3. Data and databases 6/16

4. Review of univariate statistics 6/20

5. Review of linear algebra 6/23

6. Data mining software 6/27

7. Data preprocessing 6/30

8. Classification and prediction (1) 7/4

9. Knowledge evaluation 7/7



12. Mining association rules (1) 7/18

13. Mining association rules (2) 7/21

14. Cluster analysis 7/25

15. Review and Examination (the date is not fixed) 7/27

Basic mathematics for data science


Outline

• Brief review of probability
  - Probability distribution
  - Normal distribution
  - Sampling distribution

• Brief review of statistics
  - Estimation
  - Hypothesis testing


What is statistics?

Statistics provides principles and methodology for designing the process of:

- Data collection
- Summarizing and interpreting the data
- Drawing conclusions or generalities

What is statistics?

Interest in football, baseball, and basketball (over the last decades)

Sport        1990   1981   1972   1960   1948   Mean
Football     35%    38%    36%    21%    17%    29.4%
Baseball     16%    16%    21%    34%    39%    25.2%
Basketball   15%     9%     8%     9%    10%    10.2%
Others       33%    37%    35%    36%    34%    35.0%

Data collection → Summarization and interpretation → Drawing conclusions

Population and sample

• The population is the complete collection of persons, objects, … whose characteristics are of interest.

• A sample from a population is the set of objects whose data are actually collected in the course of an investigation.

• A good sample should be randomly collected (a random sample).

Is this a good sample?

Topic: Which motorbikes are preferred by different groups of people for daily transportation?

Essence of statistics

[Figure: population (variables, parameters), data sample (statistic), and statistical inference (estimation, hypothesis testing)]

• A parameter is a numerical feature of the population, such as the mean, proportion, or standard deviation of an attribute.

• Statistical inference is the way of drawing conclusions about population parameters from an analysis of the sample data.

We want to know about the population parameter!

Probability

Random variables

• An experiment is the process of observing a phenomenon that has variation in its outcomes.

• The experiment's outcomes can be numeric (1, 2, …, 6) or non-numeric (ten or house). For computation, we quantify the experiment's outcomes by assigning each of them a numerical value related to a characteristic of interest.

• A random variable X is a function that associates a numerical value with each outcome of an experiment.

• “Random” means that before the experiment we do not know its outcome or the associated value of X.

A random variable $X: \Omega \to \mathbb{R}$ is a measurable function from the set of possible outcomes $\Omega$ to $\mathbb{R}$.

Random variables

• A random variable is discrete if it has either a finite number of values or infinitely many values that can be arranged in a sequence.
  - Example: the number of cars in the JAIST parking lot during one day.

• A continuous random variable is a random variable that represents some measurement on a continuous scale and is therefore capable of assuming all values in an interval.
  - Example: the rainfall from each rain during the rainy season.

Probability distribution of a discrete variable

• The probability distribution of a discrete random variable describes the probability of occurrence of each value of the variable.

• The probability distribution (or distribution) of a discrete random variable X is a list of the distinct numerical values of X along with their associated probabilities.

Probability distribution of X

Value of X   Probability
0            1/8
1            3/8
2            3/8
3            1/8
Total        1

Form of a discrete probability distribution

The probability distribution of a discrete random variable is often described as the function

$f(x_i) = P[X = x_i]$

which gives the probability for each value and satisfies

1. $f(x_i) \ge 0$ for each value $x_i$ of $X$

2. $\sum_{i=1}^{k} f(x_i) = 1$
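The values 0, 1, 2, 3 with probabilities 1/8, 3/8, 3/8, 1/8 match, for instance, the number of heads in three tosses of a fair coin. The slides do not name the underlying experiment, so the coin-toss setting below is an assumption used only for illustration. A minimal Python sketch that builds the distribution by enumerating outcomes and checks the two conditions:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed experiment (not stated in the slides): three tosses of a fair coin.
outcomes = list(product("HT", repeat=3))  # the 8 equally likely outcomes


def number_of_heads(outcome):
    """Random variable X: associates a numerical value with each outcome."""
    return outcome.count("H")


# f(x_i) = P[X = x_i], obtained by counting outcomes that map to each value.
counts = Counter(number_of_heads(o) for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"f({x}) = P[X = {x}] = {p}")

# Condition 1: f(x_i) >= 0 for each value x_i of X.
assert all(p >= 0 for p in pmf.values())
# Condition 2: the probabilities sum to 1.
assert sum(pmf.values()) == 1
```

Exact fractions are used instead of floats so that the sum-to-one check is not affected by rounding.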

Probability distribution of a continuous variable

• The probability distribution of a continuous random variable describes the probabilities of the possible values of the variable.

• The probability that a continuous random variable X falls in a range of values is defined as the area under the curve of its probability density function (PDF) over that range.

[Figure: probability that a man weighs between 160 and 170 pounds]
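As a rough illustration of "probability as area under the curve", the sketch below computes the probability of a weight between 160 and 170 pounds. The slides give no distribution for the weights, so the normal shape and the parameters (mean 170, standard deviation 20) are hypothetical choices made only for this example.

```python
from statistics import NormalDist

# Hypothetical distribution of weights (assumption, not from the slides).
weights = NormalDist(mu=170.0, sigma=20.0)


def prob_between(dist, a, b):
    """P[a <= X <= b]: area under the PDF between a and b, via the CDF."""
    return dist.cdf(b) - dist.cdf(a)


p = prob_between(weights, 160.0, 170.0)
print(f"P[160 <= X <= 170] = {p:.4f}")

# Cross-check: approximate the same area by summing thin rectangles
# under the PDF (midpoint rule).
steps = 10_000
width = (170.0 - 160.0) / steps
area = sum(weights.pdf(160.0 + (i + 0.5) * width) for i in range(steps)) * width
print(f"Numerical area under the PDF = {area:.4f}")
```

Note that for a continuous variable the probability of any single exact value is zero; only ranges carry positive probability.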

Essence of statistics

[Figure: population (variables, parameters; probability distribution of variables, i.e. joint distribution), data sample (statistic; probability distribution of a statistic), and statistical inference (estimation, hypothesis testing)]

• A parameter is a numerical feature of the population, such as the mean, proportion, or standard deviation.

• A statistic is a single measure of some attribute of a sample. It is defined as a numerical-valued function of the sample observations.

Statistical inference is the way of drawing conclusions about population parameters from an analysis of the sample data.

The normal distribution

decreasing in a symmetric manner.

• The normal distribution plays a central role in statistics, and inference procedures derived from it have wide applicability and form the backbone of current methods of statistical analysis.

Essence of statistics

• A numerical feature of a population is called a parameter. The true value of a population parameter is an unknown constant. A numerical characteristic of a sample is called a statistic. The value of a statistic varies in repeated sampling.

• Generalizations in statistics (statistical inference) are founded on the understanding of the manner in which variation in the population is transmitted, by sampling, to variation in statistics like the sample mean.

Sampling distribution

• Random sampling from a population refers to independent selections where each observation has the same distribution as the population.

• When random sampling from a population, a statistic is a random variable. The probability distribution of a statistic is called its sampling distribution.

Sampling distribution

Let $\bar{X}$ denote the mean of a random sample of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, and let $N(\mu, \sigma)$ denote the normal distribution with mean $\mu$ and standard deviation $\sigma$.

With increasing $n$, the distribution of $\bar{X}$ is more concentrated around $\mu$. If the population distribution is normal $N(\mu, \sigma)$, the distribution of $\bar{X}$ is $N(\mu, \sigma/\sqrt{n})$.

Regardless of the shape of the population distribution, the distribution of $\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$, provided that $n$ is large. This result is called the central limit theorem.
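The behaviour of the sample mean under repeated sampling can be checked by simulation. The sketch below is only an illustration: the exponential population (mean 1, standard deviation 1), the sample sizes, and the number of repetitions are all assumptions, not taken from the slides. It draws many random samples and compares the spread of the sample means with $\sigma/\sqrt{n}$.

```python
import random
import statistics

random.seed(0)  # fixed seed so that repeated runs give the same numbers

# Assumed non-normal population: exponential with mean 1 and sd 1.
mu, sigma = 1.0, 1.0


def draw():
    """One observation from the assumed population."""
    return random.expovariate(1.0 / mu)


def sample_means(n, repetitions=5000):
    """Means of `repetitions` independent random samples of size n."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(repetitions)]


results = {}
for n in (5, 30, 100):
    means = sample_means(n)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(
        f"n={n:3d}  mean of sample means={results[n][0]:.3f}  "
        f"sd of sample means={results[n][1]:.3f}  "
        f"sigma/sqrt(n)={sigma / n ** 0.5:.3f}"
    )
```

The sample means should center near the population mean, and their standard deviation should shrink roughly like $\sigma/\sqrt{n}$ as $n$ grows, even though the population itself is skewed.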


Statistical inference

Statistical inference deals with drawing conclusions about population parameters from an analysis of the sample data.

The two most important types of inference:

1. Estimation of parameter(s)
   - Point estimation: the use of sample data to calculate a single value (a statistic) that serves as a “best guess” or “best estimate” of an unknown population parameter.
   - Interval estimation: the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter.

2. Testing of statistical hypotheses

Point estimation of a population mean

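• Random data sample: $X_1, X_2, \ldots, X_n$.

• Estimated standard error of the sample mean: $\mathrm{S.E.}(\bar{X}) = S/\sqrt{n}$, where $S = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$.

• For large $n$, the $100(1-\alpha)\%$ error margin is $z_{\alpha/2}\,\sigma/\sqrt{n}$.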
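The quantities above can be computed directly from a sample. The following is a minimal sketch under two assumptions that are not in the slides: the data are simulated (hypothetical values), and the unknown $\sigma$ in the error margin is replaced by the sample standard deviation $S$, a common practice for large $n$.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def point_estimate(sample, alpha=0.05):
    """Return (x_bar, S, estimated S.E., error margin) for the population mean."""
    n = len(sample)
    x_bar = fmean(sample)                     # point estimate of the mean
    s = stdev(sample)                         # S, computed with divisor n - 1
    se = s / math.sqrt(n)                     # estimated S.E. of the sample mean
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    return x_bar, s, se, z * se               # margin uses S in place of sigma


# Hypothetical data: 64 simulated measurements, for illustration only.
random.seed(1)
data = [random.gauss(50.0, 8.0) for _ in range(64)]

x_bar, s, se, margin = point_estimate(data, alpha=0.05)
print(f"sample mean     = {x_bar:.2f}")
print(f"S               = {s:.2f}")
print(f"estimated S.E.  = {se:.2f}")
print(f"95% error margin = {margin:.2f}")
```

The function returns all four quantities so that the same call can feed either a point estimate report or an interval of the form sample mean ± error margin.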

$z_{\alpha/2}$ = the upper $\alpha/2$ point of the standard normal distribution. That is, the area to the right of $z_{\alpha/2}$ is $\alpha/2$, and the area between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1 - \alpha$.

Some values of $z_{\alpha/2}$

$1-\alpha$       0.80   0.85   0.90    0.95   0.99
$z_{\alpha/2}$   1.28   1.44   1.645   1.96   2.58
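The tabulated upper $\alpha/2$ points can be reproduced with the inverse CDF of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `upper_point` is introduced here only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def upper_point(tail_area):
    """z such that the area to the right of z under N(0, 1) equals tail_area."""
    return standard_normal.inv_cdf(1.0 - tail_area)


z_table = {}
for confidence in (0.80, 0.85, 0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    z = upper_point(alpha / 2)
    z_table[confidence] = z
    # The area between -z and z should recover 1 - alpha.
    central_area = standard_normal.cdf(z) - standard_normal.cdf(-z)
    print(f"1 - alpha = {confidence:.2f}  z = {z:.3f}  central area = {central_area:.2f}")
```

The computed values agree with the table up to rounding.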


Confidence interval for a parameter

$(L, U)$: $100(1-\alpha)\%$ confidence interval

P[L<

Example: weight loss diet

• A new diet program states that the participants are expected to lose over 22 pounds in 5 weeks. From the data of the 5-week weight losses of 56 participants, the sample mean and the standard deviation are found to be 23.5 and 10.2 pounds.

• Is the statement substantiated on the basis of these findings? Test with level of significance $\alpha = 0.05$. Calculate the P-value and interpret the result.

• SOLUTION: We have

$n = 56$; $\mu_0 = 22$; $\bar{x} = 23.5$; $S = 10.2$; $\alpha = 0.05$

1. Hypotheses: $H_0: \mu = 22$ versus $H_1: \mu > 22$

Example (continued)

2. Test statistic: $Z = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} = \dfrac{\bar{X} - 22}{S/\sqrt{56}}$

3. $H_1$ is right-sided, so the rejection region for $H_0$ is $R: Z \ge c$. With $\alpha = 0.05$, $z_{0.05} = 1.645$, so $R: Z \ge 1.645$.

4. Observed value: $z = \dfrac{23.5 - 22}{10.2/\sqrt{56}} = 1.10$, which is not in the region $R: Z \ge 1.645$. We do not reject $H_0$: with $\alpha = 0.05$, the stated claim that $\mu > 22$ is not substantiated.

5. P-value $= P[Z \ge 1.10] = 0.1357$. This is the smallest $\alpha$ at which $H_0$ could be rejected. As it is not negligible, the data do not provide a strong basis for rejection of $H_0$.
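The weight-loss example can be rechecked numerically. The sketch below plugs the numbers from the slide ($n = 56$, $\bar{x} = 23.5$, $S = 10.2$, $\mu_0 = 22$, $\alpha = 0.05$) into a right-sided large-sample Z test; the function name and its return layout are illustrative choices, not part of the lecture.

```python
import math
from statistics import NormalDist


def right_sided_z_test(x_bar, s, n, mu0, alpha):
    """Large-sample test of H0: mu = mu0 against H1: mu > mu0."""
    z = (x_bar - mu0) / (s / math.sqrt(n))      # observed test statistic
    critical = NormalDist().inv_cdf(1 - alpha)  # z_alpha; reject H0 if z >= z_alpha
    p_value = 1 - NormalDist().cdf(z)           # P[Z >= observed z]
    return z, critical, p_value, z >= critical


z, critical, p_value, reject = right_sided_z_test(
    x_bar=23.5, s=10.2, n=56, mu0=22.0, alpha=0.05
)
print(f"observed z     = {z:.2f}")
print(f"critical value = {critical:.3f}")
print(f"P-value        = {p_value:.4f}")
print("reject H0" if reject else "do not reject H0")
```

The slide rounds the observed statistic to 1.10 before looking up the P-value, so the unrounded computation here may differ from 0.1357 in the last digit; the conclusion (do not reject $H_0$) is the same.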

Statistical thinking in data science

• Statistical thinking relates processes and statistics, and is based on the following principles:
  - All work occurs in a system of interconnected processes.
  - Variation exists in all processes.
  - Understanding and reducing variation are keys to success.

• Statistical thinking plays an essential role in data science.

Homework

Based on the key issues mentioned in class, choose suitable documents of your own and use them to study or review what you have learnt about statistics.

(No report needs to be submitted.)
