Basic statistics 1

Data Science Statistical Analysis : Estimation and Testing By Kumar P

Transcript of Basic statistics 1

Page 1: Basic statistics  1

Data Science

Statistical Analysis: Estimation and Testing

By Kumar P

Page 2: Basic statistics  1

Managerial Decisions

How many programmers should I staff for?

What is the right level of inventory for manufacturing our new product?

Where should we open our new retail store?

What will next year's revenue be? Are we on the right track or the wrong one?

How much should I invest in advertising?

Page 3: Basic statistics  1

Flow Diagram

Acknowledge uncertainty

Characterize uncertainty

Make Inferences under uncertainty

Make predictions under uncertainty

Make optimal decisions under uncertainty

Page 4: Basic statistics  1

Types of Statistics

Statistics

Descriptive

Inferential

Page 5: Basic statistics  1

Descriptive statistics

Descriptive statistics uses numerical and graphical methods to look for patterns in a data set, to summarize the information it reveals, and to present that information in a convenient form.

• Average
• Spread
• Range
• Frequency
• Histogram
• Mode
• Scatter plot
• Interquartile range

Page 6: Basic statistics  1

Inferential statistics

• Hypothesis test
• Z score
• ANOVA
• Confidence interval
• Margin of error
• Ordinary least squares
• T test
• F test

Page 7: Basic statistics  1

Types of Data

Type of data | Definition | Example
Nominal | Categories with no logical order and no particular relationship | Your previous degree
Ordinal | Can be ranked/ordered but not measured | College rankings
Interval scale | Set of numerical measurements in which the distance between numbers is known | Temperature in Celsius
Ratio scale | Ratios are meaningful | Sales of a new product

Source of data | Definition | Example
Observational | Analyst does not control the data generation process | Stock returns on BSE
Experimental | Analyst has good control over data generation | Clinical trials for drug efficacy

Page 8: Basic statistics  1

A Few Examples

1. The length of time until a pain reliever begins to work.
2. Ranking of racers in MotoGP.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer's hard disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.

Page 9: Basic statistics  1

Population & Sample

Population: A collection, or set, of individuals, objects or events whose properties are to be analyzed.

Typically, there are too many experimental units in a population to consider every one.

Sample: A subset of the population

Page 10: Basic statistics  1

Measures of Central Tendency

Mode: The value in the data that occurs most frequently

Mean: The average of a given set of numbers

Mean of sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Percentiles: The pth percentile of a group of numbers is the value below which p% of the numbers in the group lie. In the sorted data, the pth percentile sits at position (n+1)p/100, where n is the number of data points.

Median: 50th percentile

Quartiles: Percentiles that break the distribution of the data into quarters: 1st quartile (25th percentile), 3rd quartile (75th percentile)

Interquartile range (IQR): Difference between the 3rd and 1st quartiles

Value | Frequency
18 | 4
19 | 1
20 | 3
21 | 1
22 | 2
23 | 2
24 | 1
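As a rough illustration of the definitions on this slide, the frequency table can be expanded into raw observations and summarized with Python's standard `statistics` module. The variable names are illustrative only and are not part of the original slides.

```python
from statistics import mean, median, multimode

# Frequency table from the slide: value -> number of occurrences
freq = {18: 4, 19: 1, 20: 3, 21: 1, 22: 2, 23: 2, 24: 1}

# Expand the table into the list of individual observations
data = [value for value, count in freq.items() for _ in range(count)]

n = len(data)                 # total number of data points
sample_mean = mean(data)      # sum of observations divided by n
sample_median = median(data)  # 50th percentile
modes = multimode(data)       # most frequent value(s)

print(n, round(sample_mean, 3), sample_median, modes)
```

With an even number of observations, `median` averages the two middle values.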

Page 11: Basic statistics  1

Quick Exercise

Data: 33 26 24 21 18 52 19

Mean ??

Mode ??

Median ??

IQR ??
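One way to check your answers is the sketch below, which uses Python's standard `statistics` module. It assumes the (n+1)p/100 position rule for percentiles, which corresponds to `method="exclusive"` in `statistics.quantiles`. Other software may use a different quartile convention and give a slightly different IQR.

```python
from statistics import mean, median, multimode, quantiles

data = [33, 26, 24, 21, 18, 52, 19]

data_mean = mean(data)
data_median = median(data)

# multimode returns every value tied for the highest count;
# if all values are tied, the data has no single mode
modes = multimode(data)
has_single_mode = len(modes) == 1

# Quartiles using the (n+1)p/100 position rule
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(round(data_mean, 3), data_median, has_single_mode, q1, q3, iqr)
```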

Page 12: Basic statistics  1

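Measures of Variability

Range: Difference between the largest and smallest numbers in a data set

Variance: The average squared deviation of the data points from their mean

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean

Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the population mean

Standard deviation: Square root of the variance of the data set

Sample sd: $s = \sqrt{s^2}$

Population sd: $\sigma = \sqrt{\sigma^2}$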
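The sample and population versions differ only in the denominator. The sketch below compares them using Python's standard `statistics` module. The data set is the one from the quick exercise, reused here purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [33, 26, 24, 21, 18, 52, 19]
n = len(data)

data_range = max(data) - min(data)   # largest minus smallest

sample_var = variance(data)          # divides by n - 1
population_var = pvariance(data)     # divides by n

sample_sd = stdev(data)              # square root of sample variance
population_sd = pstdev(data)         # square root of population variance

print(data_range, round(sample_var, 2), round(population_var, 2))
```

Both variances share the same sum of squared deviations, so `sample_var * (n - 1)` equals `population_var * n`.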

Page 13: Basic statistics  1

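Spare some thoughts

Why do we need both SD and variance?

Why do the sample and population formulas use different denominators?

Why not use the absolute value (mod) of the deviations instead of squaring them?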

Page 14: Basic statistics  1

Histogram

• A histogram is a chart made of bars where the height of each bar represents the frequency of values

• Frequencies can be absolute (counts) or relative

• Relative frequency: the count of data points in a bar divided by the total number of data points
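A minimal sketch of absolute versus relative frequency, using the frequency table from the central tendency slide as illustrative data. The text-based bars are only a stand-in for a real chart.

```python
from collections import Counter

# Observations matching the earlier frequency table (illustrative reuse)
data = [18] * 4 + [19] + [20] * 3 + [21] + [22] * 2 + [23] * 2 + [24]

absolute = Counter(data)         # absolute frequency: count per value
total = sum(absolute.values())   # total number of data points

# Relative frequency: count divided by the total number of data points
relative = {value: count / total for value, count in sorted(absolute.items())}

for value, share in relative.items():
    print(f"{value}: {'#' * absolute[value]} ({share:.2f})")
```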

Page 15: Basic statistics  1

Boxplot

A boxplot is a graphical display of the five-number summary of the distribution of the data: minimum, first quartile, median, third quartile and maximum
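The five numbers behind a boxplot can be computed directly. The helper `five_number_summary` below is a hypothetical name, and the sketch assumes the (n+1)p/100 quartile rule (`method="exclusive"` in Python's `statistics.quantiles`). Plotting libraries may use other quartile conventions.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, med, q3, ordered[-1]

# Data from the quick exercise, reused for illustration
summary = five_number_summary([33, 26, 24, 21, 18, 52, 19])
print(summary)
```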

Page 16: Basic statistics  1

Skewness

Skewness is a measure of the degree of asymmetry of a frequency distribution

Page 17: Basic statistics  1

Kurtosis

Kurtosis is a measure of the peakedness of a distribution

Kurtosis for a normal distribution is 3
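The slides do not give formulas for skewness or kurtosis. The sketch below assumes the common moment-based definitions (third and fourth central moments divided by powers of the second central moment). The function names and sample data are illustrative only.

```python
from statistics import NormalDist, fmean

def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])

def skewness(values):
    # Zero for symmetric data, positive for a long right tail
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # About 3 for normally distributed data
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
normal_draws = NormalDist(0, 1).samples(100_000, seed=1)

print(skewness(symmetric), skewness(right_skewed))
print(round(kurtosis(normal_draws), 2))
```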

Page 18: Basic statistics  1

Random Variable

What is a random variable?

How do we summarize a random variable?

How do we represent a probability distribution pictorially?

Page 19: Basic statistics  1

Random Variable

A random variable describes the probabilities for an uncertain future numerical outcome of a random process

It is a variable that can take on several possible values

It is random because there is some chance associated with each possible value

A random variable is one of two types:
• Discrete
• Continuous

Page 20: Basic statistics  1

Probability Distribution

• Probability
o Long-run relative frequency with which a random event occurs
o Different from subjective beliefs

• A probability distribution is a rule that identifies the possible outcomes of a random variable and assigns a probability to each

• A discrete distribution has a finite or countable number of values
o E.g. face value of a card, number of students in a class

• A continuous distribution has all possible values in some range
o E.g. salaries per month, temperature in a month

Page 21: Basic statistics  1

PDF & CDF of Random Variable

The PDF (probability distribution function) of a discrete random variable x is the relative frequency distribution of x. It is a graph, table or formula that gives the possible values of x and the probability p(x) associated with each value.

For all $x_i$, the PDF must satisfy

$0 \le p(x_i) \le 1$ and $\sum_i p(x_i) = 1$

The CDF (cumulative distribution function) F(x) of a discrete random variable is

$F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i)$

x | p(X=x) | F(x)
0 | 0.1 | 0.1
1 | 0.2 | 0.3
2 | 0.3 | 0.6
3 | 0.2 | 0.8
4 | 0.1 | 0.9
5 | 0.1 | 1.00
Total | 1.00 |
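The F(x) column is a running total of the p(x) column. The sketch below rebuilds it from the table and checks the two conditions a discrete distribution must satisfy. Exact fractions are used here only to avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import accumulate

# p(X = x) from the table, written as exact tenths
pmf = {
    0: Fraction(1, 10),
    1: Fraction(2, 10),
    2: Fraction(3, 10),
    3: Fraction(2, 10),
    4: Fraction(1, 10),
    5: Fraction(1, 10),
}

# Every probability lies in [0, 1] and the probabilities sum to 1
assert all(0 <= p <= 1 for p in pmf.values())
assert sum(pmf.values()) == 1

# F(x) = P(X <= x): running sum of the probabilities in order of x
cdf = dict(zip(pmf, accumulate(pmf.values())))

for x in pmf:
    print(x, float(pmf[x]), float(cdf[x]))
```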

Page 22: Basic statistics  1

Example

Toss a fair coin three times and define x = number of heads.

Outcome | Probability | x
HHH | 1/8 | 3
HHT | 1/8 | 2
HTH | 1/8 | 2
THH | 1/8 | 2
HTT | 1/8 | 1
THT | 1/8 | 1
TTH | 1/8 | 1
TTT | 1/8 | 0

P(x = 0) = 1/8
P(x = 1) = 3/8
P(x = 2) = 3/8
P(x = 3) = 1/8

x | p(x)
0 | 1/8
1 | 3/8
2 | 3/8
3 | 1/8

[Figure: Probability histogram for x]
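The distribution in this example can be reproduced by listing all eight equally likely outcomes and counting heads in each. A minimal sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of three tosses, e.g. "HHT"
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# x = number of heads in each outcome; count how often each x occurs
heads = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome has probability 1/8, so p(x) = (number of outcomes with x heads) / 8
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(heads.items())}

for x, p in pmf.items():
    print(x, p)
```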

Page 23: Basic statistics  1

Quick exercise

A card is chosen at random from a deck of cards

What is the probability of getting an ace?

What is the probability of getting a card less than 3?

What is the probability of getting 1 head if I toss 2 unbiased coins?

What is the probability of getting 2 heads if I toss 3 unbiased coins?

Page 24: Basic statistics  1

An Example

• Daily sales of TVs at a store

X | p(X=x)
0 | 0.4
1 | 0.25
2 | 0.2
3 | 0.05
4 | 0.1

• What is the probability of a sale?
• What is the probability of selling at least three TVs?
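One way to answer both questions from the table is sketched below. It assumes that "a sale" means at least one TV is sold on a given day. Exact fractions avoid floating-point rounding.

```python
from fractions import Fraction

# Daily TV sales distribution from the table, as exact hundredths
pmf = {
    0: Fraction(40, 100),
    1: Fraction(25, 100),
    2: Fraction(20, 100),
    3: Fraction(5, 100),
    4: Fraction(10, 100),
}

# P(at least one TV sold) = 1 - P(X = 0)
p_sale = 1 - pmf[0]

# P(X >= 3) = P(X = 3) + P(X = 4)
p_at_least_three = sum(p for x, p in pmf.items() if x >= 3)

print(float(p_sale), float(p_at_least_three))
```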

Page 25: Basic statistics  1

Expected Value or Mean

• The expected value or mean (µ) of a random variable is the weighted average of its values
‒ The probabilities serve as weights
‒ $E(X) = \mu = \sum_i x_i \, p(x_i)$

• What is the mean number of TVs sold per day?

• What does this imply?
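A sketch of the weighted-average calculation for the TV sales distribution from the previous slide:

```python
from fractions import Fraction

# Daily TV sales distribution from the previous slide, as exact hundredths
pmf = {
    0: Fraction(40, 100),
    1: Fraction(25, 100),
    2: Fraction(20, 100),
    3: Fraction(5, 100),
    4: Fraction(10, 100),
}

# E(X): each value weighted by its probability
expected_sales = sum(x * p for x, p in pmf.items())

print(float(expected_sales))
```

The result is not a whole number. It is a long-run average over many days, not a count the store will see on any single day.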

Page 26: Basic statistics  1

Variance and Standard Deviation

• Both are measures of the variation or uncertainty in a random variable

• Variance (σ²): The weighted average of the squared deviations from the mean
‒ Probabilities serve as weights
‒ $\sigma^2 = \sum_i (x_i - \mu)^2 \, p(x_i)$
‒ Units are the square of the units of the variable
‒ Another way: Var(X) = E(X²) − [E(X)]²

• Standard deviation (σ): Square root of the variance
‒ Has the same units as the variable
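The two variance formulas on this slide give the same number. The sketch below checks this for the TV sales distribution used earlier in the deck.

```python
from fractions import Fraction
from math import sqrt

# Daily TV sales distribution from the earlier example, as exact hundredths
pmf = {
    0: Fraction(40, 100),
    1: Fraction(25, 100),
    2: Fraction(20, 100),
    3: Fraction(5, 100),
    4: Fraction(10, 100),
}

mu = sum(x * p for x, p in pmf.items())

# Definition: probability-weighted average of squared deviations from the mean
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Shortcut: E(X^2) - [E(X)]^2
var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2

sd = sqrt(var_definition)  # back in the original units (TVs per day)

print(float(var_definition), float(var_shortcut), round(sd, 3))
```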

Page 27: Basic statistics  1

Sum of Random Variables

Let X1 and X2 be two random variables with means µ1 and µ2 and standard deviations σ1 and σ2. Suppose Y = aX1 + bX2.

‒ What is the mean of Y?
E[Y] = aE[X1] + bE[X2]

‒ What is the standard deviation of Y?
If X1 and X2 are independent, Var(Y) = a²Var(X1) + b²Var(X2), so SD(Y) = √(a²σ1² + b²σ2²)

• Independent: When the value taken by one random variable does not affect the value taken by the other random variable
‒ E.g. rolls of 2 dice

• Dependent: When the value of one random variable gives us more information about the other random variable
‒ E.g. height and weight of students

Page 28: Basic statistics  1

Example

Let X1 and X2 be the outcomes of a toss of a pair of dice

E(X1) = E(X2) = 3.5
SD(X1) = SD(X2) = 1.708

Compute the following:
E(X1 + X2) = ??
SD(X1 + X2) = ??
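One way to check the answer is to enumerate all 36 equally likely outcomes of the two dice and compare the result with the formula for independent variables (a = b = 1). The helper name `mean_and_sd` is illustrative only.

```python
from itertools import product
from math import sqrt

faces = range(1, 7)

def mean_and_sd(values):
    """Mean and standard deviation of equally likely values."""
    values = list(values)
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, sqrt(var)

# A single die
mean_one, sd_one = mean_and_sd(faces)

# Sum of two dice, by listing every (first die, second die) pair
sums = [a + b for a, b in product(faces, repeat=2)]
mean_sum, sd_sum = mean_and_sd(sums)

# Formula for independent variables: Var(X1 + X2) = Var(X1) + Var(X2)
sd_formula = sqrt(sd_one ** 2 + sd_one ** 2)

print(mean_one, round(sd_one, 3), mean_sum, round(sd_sum, 3), round(sd_formula, 3))
```

The standard deviations do not simply add. The variances add, and the standard deviation is the square root of that total.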

Page 29: Basic statistics  1

The Empirical Rule

For bell-shaped (approximately normal) data:

• Approximately 68% of the data points will be within 1 standard deviation of the mean

• Approximately 95% of the data points will be within 2 standard deviations of the mean

• The vast majority (almost all) will lie within 3 standard deviations of the mean
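These percentages can be checked against the normal distribution itself. The sketch below uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```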

Page 30: Basic statistics  1

Normal distribution

• The graph of the PDF is a bell-shaped curve
• The normal random variable takes values from -∞ to +∞
• It is symmetric and centered around the mean (which is also the median and the mode)
• Any normal distribution can be specified with just 2 parameters: the mean (µ) and the standard deviation (σ)
• We write this as X~N(µ,σ²)

Page 31: Basic statistics  1

Comparing multiple normal distributions

Page 32: Basic statistics  1

Probability Calculation for Continuous Distribution

• The probability associated with any single value of the random variable is always zero

• Probability of values being in a range = area under the PDF curve in that range

• The area under the entire curve always equals 1
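To make "probability = area" concrete, the sketch below approximates the area under a standard normal PDF with the trapezoid rule and compares it with the CDF difference. The helper `area_under_pdf`, the interval endpoints and the step count are all illustrative choices.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)

def area_under_pdf(dist, a, b, steps=10_000):
    """Trapezoid-rule approximation of the area under dist.pdf on [a, b]."""
    width = (b - a) / steps
    heights = [dist.pdf(a + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

# P(-1 <= X <= 2) as an area, and the same probability from the CDF
area = area_under_pdf(dist, -1, 2)
exact = dist.cdf(2) - dist.cdf(-1)

# Area over a very wide range is essentially the whole curve
total = area_under_pdf(dist, -10, 10)

print(round(area, 6), round(exact, 6), round(total, 6))
```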

Page 33: Basic statistics  1

Z-scores, Standard Normal Distribution

For every value x of the random variable X, we calculate its z-score:

$z = \frac{x - \mu}{\sigma}$

Interpretation: How many standard deviations away from the mean is the value?

If X~N(µ,σ²) then

‒ Z-scores have a normal distribution with µ=0 and σ=1
‒ i.e. Z~N(0,1)
‒ This is the standard normal distribution

• Inverse transformation
‒ X = µ + zσ
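A small sketch of the z-score and its inverse transformation. The mean, standard deviation and x value are hypothetical numbers chosen for illustration, and the function names are not from the slides.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Inverse transformation: recover x from its z-score."""
    return mu + z * sigma

# Hypothetical parameters, for illustration only
mu, sigma = 100.0, 15.0
x = 130.0

z = z_score(x, mu, sigma)
x_back = from_z(z, mu, sigma)

print(z, x_back)
```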

Page 34: Basic statistics  1

Probability calculation for normal distribution

• Consider a normal distribution X~N(µ,σ²)

• Methods to calculate P(X ≤ x)
‒ Use R: pnorm(x, µ, σ)
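For readers working outside R, the same cumulative probability can be sketched in Python with `NormalDist` from the standard `statistics` module. This is an analogue of the R call, not something from the original slides, and the parameter values are hypothetical.

```python
from statistics import NormalDist

# Hypothetical parameters, for illustration only
mu, sigma, x = 100.0, 15.0, 130.0

# P(X <= x) directly from the N(mu, sigma^2) distribution
p_direct = NormalDist(mu, sigma).cdf(x)

# Same probability via the z-score and the standard normal distribution
p_via_z = NormalDist().cdf((x - mu) / sigma)

print(round(p_direct, 5), round(p_via_z, 5))
```

Both routes agree because standardizing X gives a standard normal variable.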