K236: Basis of Data Science (bao/K236/K236-L3-print.pdf)


  • K236: Basis of Data Science. Lecture 3. Review of univariate statistics

    Lecturer: Tu Bao Ho and Hieu Chi Dam

    TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

    Schedule of K236

    1. Introduction to data science 6/9

    2. Introduction to data science 6/13

    3. Data and databases 6/16

    4. Review of univariate statistics 6/20

    5. Review of linear algebra 6/23

    6. Data mining software 6/27

    7. Data preprocessing 6/30

    8. Classification and prediction (1) 7/4

    9. Knowledge evaluation 7/7

    10. Classification and prediction (2) 7/11

    11. Classification and prediction (3) 7/14

    12. Mining association rules (1) 7/18

    13. Mining association rules (2) 7/21

    14. Cluster analysis 7/25

    15. Review and Examination (the date is not fixed) 7/27

    Basic mathematics for data science


    Outline

    • Brief review of probability
      - Probability distribution
      - Normal distribution
      - Sampling distribution

    • Brief review of statistics
      - Estimation
      - Hypothesis testing


    What is statistics?

    Statistics provides principles and methodology for designing the process of:

      - Data collection
      - Summarizing and interpreting the data
      - Drawing conclusions or generalities

    Interest in football, in baseball, in basketball (last decade)

    Sport       1990  1981  1972  1960  1948  Mean
    Football    35%   38%   36%   21%   17%   29.40%
    Baseball    16%   16%   21%   34%   39%   25.20%
    Basketball  15%   9%    8%    9%    10%   10.20%
    Others      33%   37%   35%   36%   34%   35.00%

    [Figure labels: Data Collection; Summarization and Interpretation; Draw Conclusions]
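As a small illustration of the summarization step, the Mean column can be recomputed directly from the five yearly percentages. This is only a sketch: the dictionary name `interest` and the helper `row_means` are illustrative choices, and the numbers are the yearly values as printed in the table.

```python
from statistics import mean

# Yearly percentages copied from the table (columns: 1990, 1981, 1972, 1960, 1948).
interest = {
    "Football":   [35, 38, 36, 21, 17],
    "Baseball":   [16, 16, 21, 34, 39],
    "Basketball": [15, 9, 8, 9, 10],
    "Others":     [33, 37, 35, 36, 34],
}

def row_means(table):
    """Return the arithmetic mean of each row of the table."""
    return {sport: mean(values) for sport, values in table.items()}

means = row_means(interest)
for sport, m in means.items():
    print(f"{sport:<11}{m:6.2f}%")
```

Recomputing a summary column in this way is a cheap check that a reported mean really follows from the collected data.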

    Population and sample

    • The population is the complete collection of persons, objects, … whose characteristics are of interest.

    • A sample from a population is the set of objects whose data are actually collected in the course of an investigation.

    • A good sample should be randomly collected (a random sample).

    Is this a good sample?

    Topic: Which motorbikes are preferred by different groups of people for daily transportation?
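One way to collect a sample at random is simple random sampling without replacement, where every subset of the chosen size is equally likely. The sketch below assumes a complete list of the population is available; the list `commuters` and the function name `simple_random_sample` are hypothetical and not part of the lecture.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame: ID numbers of 1,000 commuters.
commuters = list(range(1000))
sample = simple_random_sample(commuters, n=50, seed=0)
print(sorted(sample)[:10])
```

A sample taken only from, say, one parking lot would not have this property, which is the concern behind the question on the slide.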

    Essence of statistics

    [Diagram labels: population, sample data, variables, parameters, statistic, statistical inference (estimation, hypothesis testing)]

    • A parameter is a numerical feature of the population, such as the mean, proportion, or standard deviation of an attribute.

    • Statistical inference is the way of drawing conclusions about population parameters from an analysis of the sample data.

    We want to know about the population parameter!

    Probability


    Random variables

    • An experiment is the process of observing a phenomenon that has variation in its outcomes.

    • The experiment’s outcomes can be numeric (1, 2, …, 6) or non-numeric (ten or house). For computation, we quantify the experiment’s outcomes by assigning each of them a numerical value related to a characteristic of interest.

    • A random variable X is a function that associates a numerical value with each outcome of an experiment.

    • “Random” means that before the experiment we do not know its outcome or the associated value of X.

    A random variable $X: \Omega \to \mathbb{R}$ is a measurable function from the set of possible outcomes $\Omega$ to $\mathbb{R}$.

    Random variables

    • A random variable is discrete if it has either a finite number of values or infinitely many values that can be arranged in a sequence.
      - Example: the number of cars in the JAIST parking lot during one day.

    • A continuous random variable is a random variable that represents some measurement on a continuous scale and is therefore capable of assuming all values in an interval.
      - Example: the amount of rainfall from each rain during the rainy season.

    Probability distribution of a discrete variable

    • The probability distribution of a discrete random variable describes the probability of occurrence of each value of the variable.

    • The probability distribution (or distribution) of a discrete random variable X is a list of the distinct numerical values of X along with their associated probabilities.

    Probability distribution of X

    Value of X    Probability
    0             1/8
    1             3/8
    2             3/8
    3             1/8
    Total         1
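The slide does not say which experiment produces this table. One experiment that yields exactly these probabilities is three tosses of a fair coin with X counting the heads; that interpretation is an assumption made only for this sketch. The code lists all equally likely outcomes, applies the random variable to each, and tallies the result.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(tosses=3):
    """Distribution of X = number of heads in `tosses` fair coin tosses."""
    outcomes = list(product("HT", repeat=tosses))           # sample space, equally likely
    counts = Counter(outcome.count("H") for outcome in outcomes)  # X applied to each outcome
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

dist = heads_distribution(3)
for x, p in dist.items():
    print(x, p)
print("Total", sum(dist.values()))
```

Exact fractions are used so that the result can be compared with the table without rounding.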

    Form of a discrete probability distribution

    The probability distribution of a discrete random variable is often described as the function

    $f(x_i) = P[X = x_i]$

    which gives the probability for each value and satisfies

    1. $f(x_i) \ge 0$ for each value $x_i$ of $X$

    2. $\sum_{i=1}^{k} f(x_i) = 1$, where $k$ is the number of distinct values of $X$
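The two conditions can be checked mechanically for any candidate table of probabilities. In this sketch the function name `is_valid_pmf` is an illustrative choice, and the distribution is stored as a dictionary mapping each value to its probability.

```python
import math
from fractions import Fraction

def is_valid_pmf(f):
    """Check that all probabilities are non-negative and that they sum to 1."""
    probs = list(f.values())
    non_negative = all(p >= 0 for p in probs)
    # isclose tolerates rounding when the probabilities are floats.
    sums_to_one = math.isclose(float(sum(probs)), 1.0)
    return non_negative and sums_to_one

f = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
print(is_valid_pmf(f))
print(is_valid_pmf({0: 0.5, 1: 0.25}))  # probabilities sum to 0.75 only
```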

  • Probability distribution of a continuous variable


    • The probability distribution of a continuous random variable describes the probabilities of the possible values of the variable.

    • The probability that a continuous random variable X takes a value in a given range is defined as the area under the curve of its probability density function (PDF) over that range.

    [Figure: probability that a man weighs between 160 and 170 pounds]
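To make the "area under the curve" idea concrete, the sketch below computes the probability of a weight between 160 and 170 pounds in two ways: from the cumulative distribution function and by numerically integrating the density. The slide gives neither the shape nor the parameters of the weight distribution, so the normal model with mean 172 and standard deviation 29 pounds is purely hypothetical.

```python
from statistics import NormalDist

MEAN_LB = 172.0  # hypothetical mean weight, not from the lecture
SD_LB = 29.0     # hypothetical standard deviation, not from the lecture

def prob_between(dist, a, b):
    """P[a <= X <= b] from the cumulative distribution function."""
    return dist.cdf(b) - dist.cdf(a)

def area_under_pdf(dist, a, b, steps=1000):
    """Trapezoidal approximation of the area under the density on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    total += sum(dist.pdf(a + i * h) for i in range(1, steps))
    return total * h

weights = NormalDist(MEAN_LB, SD_LB)
p_cdf = prob_between(weights, 160, 170)
p_area = area_under_pdf(weights, 160, 170)
print(f"from CDF: {p_cdf:.4f}, from area: {p_area:.4f}")
```

Both numbers agree to several decimals, which is the point: for a continuous variable, a probability is an area, and the probability of any single exact value is zero.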

    Essence of statistics

    [Diagram labels: population, sample data, variables (probability distribution of variables, joint distribution), parameters, statistic (probability distribution of a statistic), statistical inference (estimation, hypothesis testing)]

    • A parameter is a numerical feature of the population, such as the mean, proportion, or standard deviation.

    • A statistic is a single measure of some attribute of a sample. It is defined as a numerical-valued function of the sample observations.

    • Statistical inference is the way of drawing conclusions about population parameters from an analysis of the sample data.

    The normal distribution

    • Has a bell-shaped density, peaking at the mean and decreasing in a symmetric manner on both sides.

    • Plays a central role in statistics, and inference procedures derived from it have wide applicability and form the backbone of current methods of statistical analysis.

    Essence of statistics

    • A numerical feature of a population is called a parameter. The true value of a population parameter is an unknown constant. A numerical characteristic of a sample is called a statistic. The value of a statistic varies in repeated sampling.

    • Generalizations in statistics (statistical inference) are founded on the understanding of the manner in which variation in the population is transmitted, by sampling, to variation in statistics like the sample mean.

    Sampling distribution

    • Random sampling from a population refers to independent selections where each observation has the same distribution as the population.

    • Under random sampling from a population, a statistic is a random variable. The probability distribution of a statistic is called its sampling distribution.

    Sampling distribution

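    • For a random sample of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, the sample mean $\bar{X}$ has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. With increasing $n$, the distribution of $\bar{X}$ is more concentrated around $\mu$.

    • If the population distribution is normal $N(\mu, \sigma)$, the distribution of $\bar{X}$ is $N(\mu, \sigma/\sqrt{n})$.

    • Regardless of the shape of the population distribution, the distribution of $\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$, provided that $n$ is large. This result is called the central limit theorem.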
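The behaviour of the sample mean can be observed by simulation. This sketch draws repeated samples from a strongly skewed population and records the sample mean each time. The choice of an exponential population with mean 1 and standard deviation 1, the sample sizes, and the number of repetitions are all assumptions made for illustration.

```python
import math
import random
import statistics

def sample_means(n, repetitions, rng):
    """Means of `repetitions` random samples of size n from an Exp(1) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]

rng = random.Random(42)
for n in (1, 5, 30, 100):
    means = sample_means(n, 5000, rng)
    print(
        f"n={n:3d}  mean of sample means={statistics.fmean(means):.3f}  "
        f"sd of sample means={statistics.stdev(means):.3f}  "
        f"sigma/sqrt(n)={1 / math.sqrt(n):.3f}"
    )
```

The mean of the sample means should stay near 1 while their spread shrinks roughly like 1/sqrt(n); a histogram of `means` for large n would also look close to a normal curve even though the population itself is skewed.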


    Outline

    • Brief review of probability
      - Probability distribution
      - Normal distribution
      - Sampling distribution

    • Brief review of statistics
      - Estimation
      - Hypothesis testing


    Statistical inference

    Statistical inference deals with drawing conclusions about population parameters from an analysis of the sample data.

    The two most important types of inference:

    1. Estimation of parameter(s)
      - Point estimation: the use of sample data to calculate a single value (statistic) that serves as a “best guess” or “best estimate” of an unknown population parameter.
      - Interval estimation: the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter.

    2. Testing of statistical hypotheses

  • Point estimation of a population mean

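    • Random data sample: $X_1, X_2, \ldots, X_n$.

    • Estimated standard error of the sample mean: $\mathrm{S.E.}(\bar{X}) = \dfrac{S}{\sqrt{n}}$, where $S = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$.

    • For large $n$, the $100(1-\alpha)\%$ error margin is $z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$.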
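The point estimate and its estimated standard error are straightforward to compute from raw data. In this sketch the eight observations are hypothetical, and `statistics.stdev` is used because it divides by n - 1, matching the usual definition of the sample standard deviation S.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_standard_error(sample):
    """Return (sample mean, sample standard deviation S, estimated S.E. = S / sqrt(n))."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)  # divides the sum of squared deviations by n - 1
    return xbar, s, s / sqrt(n)

# Hypothetical observations, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]
xbar, s, se = mean_and_standard_error(data)
print(f"x-bar = {xbar}, S = {s:.3f}, S.E. = {se:.3f}")
```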

    $z_{\alpha/2}$ = the upper $\alpha/2$ point of the standard normal distribution. That is, the area to the right of $z_{\alpha/2}$ is $\alpha/2$, and the area between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1 - \alpha$.

    Some values of $z_{\alpha/2}$

    $1-\alpha$       0.80   0.85   0.90    0.95   0.99
    $z_{\alpha/2}$   1.28   1.44   1.645   1.96   2.58
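The upper-tail points of the standard normal distribution can be obtained from its inverse CDF rather than from a printed table, and then plugged into the large-sample error margin. This is a sketch: the helper names `z_upper` and `error_margin` are illustrative, and the values sigma = 10 and n = 100 in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_upper(tail_area):
    """Point of the standard normal distribution with `tail_area` to its right."""
    return NormalDist().inv_cdf(1 - tail_area)

def error_margin(sigma, n, confidence=0.95):
    """Large-sample 100(1 - alpha)% error margin z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - confidence
    return z_upper(alpha / 2) * sigma / sqrt(n)

for confidence in (0.80, 0.85, 0.90, 0.95, 0.99):
    alpha = 1 - confidence
    print(f"1 - alpha = {confidence:.2f}   z_(alpha/2) = {z_upper(alpha / 2):.3f}")

# Hypothetical inputs: sigma = 10, n = 100, 95% confidence.
print(f"error margin: {error_margin(10, 100, 0.95):.3f}")
```

An interval of the form sample mean plus or minus this margin is the corresponding large-sample confidence interval for the population mean.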

    Confidence interval for a parameter

    (L, U)   $100(1-\alpha)\%$

    P[L<


    Example: weight loss diet

    • A new diet program states that the participants are expected to lose over 22 pounds in 5 weeks. From the data on the 5-week weight losses of 56 participants, the sample mean and the standard deviation are found to be 23.5 and 10.2 pounds.

    • Is the statement substantiated on the basis of these findings? Test with level of significance 0.05. Calculate the P-value and interpret the result.

    • SOLUTION: We have $n = 56$; $\mu_0 = 22$; $\bar{x} = 23.5$; $S = 10.2$; $\alpha = 0.05$.

    1. Hypothesis: $H_0: \mu = 22$ versus $H_1: \mu > 22$.

    2. Test statistic: $Z = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} = \dfrac{\bar{X} - 22}{S/\sqrt{56}}$.

    3. $H_1$ is right-sided, so the rejection region for $H_0$ is $R: Z \ge c$. With $z_{\alpha} = z_{0.05} = 1.645$, the region is $R: Z \ge 1.645$.

    4. Observed value: $z = \dfrac{23.5 - 22}{10.2/\sqrt{56}} = 1.10$, which is not in the region $R: Z \ge 1.645$. We do not reject $H_0$ with $\alpha = 0.05$; the stated claim that $\mu > 22$ is not substantiated.

    5. P-value $= P[Z \ge 1.10] = 0.1357$. 0.1357 is the smallest $\alpha$ at which $H_0$ could be rejected. As it is not negligible, the data do not provide a strong basis for rejection of $H_0$.
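The numbers in the weight-loss example can be reproduced with a few lines of code. This is a sketch of a right-sided large-sample Z test using the figures stated in the example (n = 56, sample mean 23.5, S = 10.2, hypothesized mean 22, significance level 0.05); the function name `right_sided_z_test` is an illustrative choice. Because the code keeps the unrounded test statistic instead of rounding it to 1.10 first, its P-value may differ from 0.1357 in the fourth decimal.

```python
from math import sqrt
from statistics import NormalDist

def right_sided_z_test(xbar, mu0, s, n, alpha):
    """Large-sample test of H0: mu = mu0 against H1: mu > mu0."""
    std_normal = NormalDist()
    z = (xbar - mu0) / (s / sqrt(n))         # observed test statistic
    critical = std_normal.inv_cdf(1 - alpha)  # rejection region is Z >= critical
    p_value = 1 - std_normal.cdf(z)           # P[Z >= z] under H0
    return z, critical, p_value, z >= critical

z, critical, p_value, reject = right_sided_z_test(
    xbar=23.5, mu0=22, s=10.2, n=56, alpha=0.05
)
print(f"z = {z:.2f}, critical value = {critical:.3f}, P-value = {p_value:.4f}")
print("reject H0" if reject else "do not reject H0")
```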

    Statistical thinking in data science

    • Statistical thinking relates processes and statistics, and is based on the following principles:
      - All work occurs in a system of interconnected processes.
      - Variation exists in all processes.
      - Understanding and reducing variation are keys to success.

    • Statistical thinking plays an essential role in data science.

    Homework

    Based on the key issues mentioned in class, choose suitable materials of your own and use them to study or review what you have learned about statistics.

    (No report needs to be submitted.)