Basic statistics 1
Data Science
Statistical Analysis: Estimation and Testing
By Kumar P
Managerial Decisions
• How many programmers should I staff?
• What is the right level of inventory for manufacturing our new product?
• Where should we open our new retail store?
• What will next year's revenue be? Are we on the right track or the wrong one?
• How much should I invest in advertising?
Flow Diagram
1. Acknowledge uncertainty
2. Characterize uncertainty
3. Make inferences under uncertainty
4. Make predictions under uncertainty
5. Make optimal decisions under uncertainty
Types of Statistics
• Descriptive
• Inferential
Descriptive statistics
Descriptive statistics uses numerical and graphical methods to look for patterns in a data set, to summarize the information it reveals, and to present that information in a convenient form.
• Average
• Spread
• Range
• Frequency
• Histogram
• Mode
• Scatter plot
• Interquartile range
Inferential statistics
• Hypothesis test
• Z score
• ANOVA
• Confidence interval
• Margin of error
• Ordinary least squares
• T test
• F test
Types of Data
Type of Data | Definition | Example
Nominal | The categories are in no logical order and have no particular relationship | Your previous degree
Ordinal | Can be ranked/ordered but not measured | College rankings
Interval scale | Set of numerical measurements in which the distance between numbers is known | Temperature in Celsius
Ratio scale | Ratios are meaningful | Sales of a new product
Source of Data | Definition | Example
Observational | Analyst does not control the data generation process | Stock returns on BSE
Experimental | Analyst has good control over data generation | Clinical trials for drug efficacy
A Few Examples
1. The length of time until a pain reliever begins to work.
2. Ranking of racers in MotoGP.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer's hard disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.
Population & Sample
Population: A collection, or set, of individuals, objects or events whose properties are to be analyzed.
Typically, there are too many experimental units in a population to consider every one.
Sample: A subset of the population.
Measure of Central Tendency
Mode: The value in the data that occurs most frequently
Mean: The average of a given set of numbers
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Percentiles: The pth percentile of a group of numbers is the value below which p% of the numbers in the group lie. In the sorted data, its position is (n+1)p/100, where n is the number of data points.
Median: The 50th percentile
Quartiles: The percentiles that divide the distribution of the data into quarters: the 1st quartile (25th percentile) and the 3rd quartile (75th percentile)
Interquartile Range (IQR): Difference between the 3rd and 1st quartiles
Value | Frequency
18 | 4
19 | 1
20 | 3
21 | 1
22 | 2
23 | 2
24 | 1
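The measures above can be tried out on the frequency table. The following is a minimal Python sketch, not part of the original slides: the helper name percentile is hypothetical, and linear interpolation between neighbouring values (when the position (n+1)p/100 is not a whole number) is an assumption, since the slides only give the position formula.

```python
from statistics import mean, multimode

# Frequency table from the slide: value -> frequency
freq = {18: 4, 19: 1, 20: 3, 21: 1, 22: 2, 23: 2, 24: 1}

# Expand the table into a sorted list of raw observations
data = sorted(v for v, f in freq.items() for _ in range(f))

def percentile(sorted_vals, p):
    """Value at 1-based position (n + 1) * p / 100 in the sorted data.

    Assumption: interpolate linearly between the two neighbouring
    observations when the position is not a whole number.
    """
    n = len(sorted_vals)
    pos = (n + 1) * p / 100
    pos = min(max(pos, 1), n)      # clamp the position to 1..n
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part of the position
    if lower == n:
        return sorted_vals[-1]
    below, above = sorted_vals[lower - 1], sorted_vals[lower]
    return below + frac * (above - below)

data_mean = mean(data)
data_modes = multimode(data)       # list of the most frequent value(s)
median = percentile(data, 50)
q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1

print("mean:", round(data_mean, 3))
print("mode(s):", data_modes)
print("median:", median, "Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Other quartile conventions exist, so a statistics package may report slightly different quartiles for the same data.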
Quick Exercise
Data: 33 26 24 21 18 52 19
Mean?
Mode?
Median?
IQR?
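Measure of Variability
Range: Difference between the largest and smallest numbers in a given data set
Variance: The average squared deviation of the data points from their mean
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Standard Deviation: Square root of the variance of the data set
Sample SD: $s = \sqrt{s^2}$
Population SD: $\sigma = \sqrt{\sigma^2}$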
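To see the two denominators side by side, here is a small Python sketch (an illustration, not from the slides) that computes the range, the sample variance (divide by n - 1) and the population variance (divide by n) for the seven numbers of the quick exercise, and cross-checks them against Python's statistics module.

```python
from math import sqrt
from statistics import pvariance, variance

data = [33, 26, 24, 21, 18, 52, 19]
n = len(data)

data_range = max(data) - min(data)

# Sum of squared deviations from the mean, shared by both formulas
m = sum(data) / n
ss = sum((x - m) ** 2 for x in data)

sample_var = ss / (n - 1)      # treats the data as a sample
population_var = ss / n        # treats the data as the whole population

sample_sd = sqrt(sample_var)
population_sd = sqrt(population_var)

print("range:", data_range)
print("sample variance:", round(sample_var, 3), "sample SD:", round(sample_sd, 3))
print("population variance:", round(population_var, 3), "population SD:", round(population_sd, 3))

# Cross-check against the standard library
print("matches statistics.variance:", abs(sample_var - variance(data)) < 1e-9)
print("matches statistics.pvariance:", abs(population_var - pvariance(data)) < 1e-9)
```

The sample variance is always the larger of the two, because the same sum of squares is divided by a smaller number.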
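Spare some thoughts
• Why do we need both SD and variance?
• Why do the sample and population formulas use different denominators?
• Why not use the modulus (absolute value) of the deviations instead of squaring them?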
Histogram
• A histogram is a chart made of bars, where the height of each bar represents the frequency of values
• Frequencies can be absolute (counts) or relative
• Relative frequency of a value = count of data points with that value divided by the total number of data points
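As a rough illustration (not from the slides), the Python sketch below counts absolute and relative frequencies and prints a text-only histogram. The data list is the frequency table from the central tendency slide, written out value by value.

```python
from collections import Counter

# Raw observations (the frequency table written out value by value)
data = [18] * 4 + [19] + [20] * 3 + [21] + [22] * 2 + [23] * 2 + [24]

counts = Counter(data)                   # absolute frequencies
total = len(data)
relative = {v: c / total for v, c in counts.items()}   # relative frequencies

# One row per value: bar of '#' marks, count, relative frequency
for value in sorted(counts):
    bar = "#" * counts[value]
    print(f"{value:>3} | {bar:<4} {counts[value]}  ({relative[value]:.2f})")
```

The relative frequencies always add up to 1, which is what lets a relative-frequency histogram be read as an estimate of a probability distribution.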
Boxplot
A boxplot is a graphical display of the five-number summary of the distribution of the data: minimum, 1st quartile, median, 3rd quartile and maximum
Skewness
Skewness is a measure of the degree of asymmetry of a frequency distribution
Kurtosis
Kurtosis is a measure of the peakedness of a distribution
The kurtosis of a normal distribution is 3
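The slides give no formulas for skewness and kurtosis, so the Python sketch below assumes the common moment-based definitions: skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the kth central moment. The function names and the small data sets are hypothetical.

```python
import random

def central_moment(data, k):
    """Average of (x - mean) ** k over the data (population version)."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    # Assumed definition: third central moment scaled by variance ** 1.5
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # Assumed definition: fourth central moment scaled by variance ** 2
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]            # mirror image around the mean
right_skewed = [1, 1, 2, 2, 3, 10]     # long tail to the right

# Large simulated sample from a standard normal distribution
random.seed(0)
normal_sample = [random.gauss(0, 1) for _ in range(100_000)]

print("skewness, symmetric data:", skewness(symmetric))
print("skewness, right-skewed data:", round(skewness(right_skewed), 3))
print("kurtosis, normal sample:", round(kurtosis(normal_sample), 3))
```

With these definitions, symmetric data has skewness 0, a long right tail gives positive skewness, and a large normal sample has kurtosis close to 3.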
What is a random variable?
How do we summarize a random variable?
How do we represent a probability distribution pictorially?
Random Variable
• A random variable describes the probabilities for an uncertain future numerical outcome of a random process
• It is a variable that can take on several possible values
• It is random because there is some chance associated with each possible value
• Random variables are of 2 types
‒ Discrete
‒ Continuous
Probability Distribution
• Probability
‒ Long-run relative frequency of a random event occurring
‒ Different from subjective beliefs
• A probability distribution is a rule that identifies the possible outcomes of a random variable and assigns a probability to each
• A discrete distribution has a finite (or countable) number of values
‒ E.g. face value of a card, number of students in a class
• A continuous distribution has all possible values in some range
‒ E.g. salaries per month, temperature in a month
PDF & CDF of Random Variable
The PDF (probability distribution function) for a discrete random variable x is the relative frequency distribution of x. It is a graph, table or formula that gives the possible values of x and the probability p(x) associated with each value.
For all $x_i$, the PDF must satisfy
$0 \le p(x_i) \le 1$ and $\sum_i p(x_i) = 1$
The CDF (cumulative distribution function), F(x), of a discrete random variable is
$F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i)$
x | p(X = x) | F(x)
0 | 0.1 | 0.1
1 | 0.2 | 0.3
2 | 0.3 | 0.6
3 | 0.2 | 0.8
4 | 0.1 | 0.9
5 | 0.1 | 1.00
Total | 1.00 |
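The F(x) column is just a running total of the p(x) column. A minimal Python sketch of that step, using the numbers from the table (the variable names are illustrative):

```python
from itertools import accumulate

xs = [0, 1, 2, 3, 4, 5]
pmf = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]    # p(X = x) for each x in xs

# Check the two conditions every discrete distribution must satisfy
valid_range = all(0 <= p <= 1 for p in pmf)
sums_to_one = abs(sum(pmf) - 1) < 1e-9

# F(x) = P(X <= x): cumulative sum of the probabilities
cdf = list(accumulate(pmf))

print("valid:", valid_range and sums_to_one)
for x, p, big_f in zip(xs, pmf, cdf):
    print(f"{x} | {p:.2f} | {big_f:.2f}")
```

The last value of the CDF is always 1, because it adds up every probability in the distribution.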
Example
Toss a fair coin three times and define x = number of heads.
Outcome | Probability | x
HHH | 1/8 | 3
HHT | 1/8 | 2
HTH | 1/8 | 2
THH | 1/8 | 2
HTT | 1/8 | 1
THT | 1/8 | 1
TTH | 1/8 | 1
TTT | 1/8 | 0
x | p(x)
0 | 1/8
1 | 3/8
2 | 3/8
3 | 1/8
[Figure: Probability Histogram for x]
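The same distribution can be obtained by listing all eight equally likely outcomes and counting heads. A short Python sketch of that enumeration (an illustration, not from the slides):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2 * 2 * 2 = 8 equally likely outcomes of three tosses
outcomes = list(product("HT", repeat=3))

# Number of outcomes for each possible count of heads
heads = Counter(outcome.count("H") for outcome in outcomes)

# p(x) = (outcomes with x heads) / (total outcomes), kept as exact fractions
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(heads.items())}

for x, p in pmf.items():
    print(f"P(x = {x}) = {p}")
```

Changing repeat=3 to another number of tosses gives the corresponding distribution for that many fair coins.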
Quick Exercise
A card is chosen at random from a deck of cards.
What is the probability of getting an ace?
What is the probability of getting a card less than 3?
What is the probability of getting 1 head if I toss 2 unbiased coins?
What is the probability of getting 2 heads if I toss 3 unbiased coins?
An Example
• Daily sales of TVs at a store
x | p(X = x)
0 | 0.4
1 | 0.25
2 | 0.2
3 | 0.05
4 | 0.1
• What is the probability of a sale?
• What is the probability of selling at least three TVs?
Expected Value or Mean
• The expected value or mean (µ) of a random variable is the weighted average of its values
‒ The probabilities serve as weights
‒ $E(X) = \mu = \sum_i x_i \, p(x_i)$
• What is the mean number of TVs sold per day?
• What does this imply?
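For the TV sales example, the weighted average can be worked out in a few lines of Python. This is a sketch for checking your own answer; the dictionary simply restates the table of daily sales probabilities.

```python
# Daily TV sales: number sold -> probability (from the example table)
pmf = {0: 0.4, 1: 0.25, 2: 0.2, 3: 0.05, 4: 0.1}

# Probability of a sale = 1 - P(no TV sold)
p_sale = 1 - pmf[0]

# Probability of selling at least three TVs = P(3) + P(4)
p_at_least_three = sum(p for x, p in pmf.items() if x >= 3)

# Expected value: each value weighted by its probability
expected = sum(x * p for x, p in pmf.items())

print(f"P(sale) = {p_sale:.2f}")
print(f"P(at least 3) = {p_at_least_three:.2f}")
print(f"E(X) = {expected:.2f} TVs per day")
```

The mean does not have to be a value the variable can actually take. It is the long-run average number of TVs sold per day.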
Variance and Standard Deviation
• Both are measures of variation or uncertainty in a random variable
• Variance (σ²): The weighted average of the squared deviations from the mean
‒ Probabilities serve as weights
‒ $\sigma^2 = \mathrm{Var}(X) = \sum_i (x_i - \mu)^2 \, p(x_i)$
‒ Units are the square of the units of the variable
‒ Another way: $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$
• Standard Deviation (σ): Square root of the variance
‒ Has the same units as the variable
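Both variance formulas can be checked against each other. The Python sketch below (an illustration, reusing the TV sales probabilities from the earlier example) computes the variance from the definition and from E(X²) - [E(X)]².

```python
from math import sqrt

# Daily TV sales: number sold -> probability (from the earlier example)
pmf = {0: 0.4, 1: 0.25, 2: 0.2, 3: 0.05, 4: 0.1}

mu = sum(x * p for x, p in pmf.items())                 # E(X)

# Definition: probability-weighted average of squared deviations
var_def = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Shortcut: E(X^2) - [E(X)]^2
e_x2 = sum(x ** 2 * p for x, p in pmf.items())
var_alt = e_x2 - mu ** 2

sd = sqrt(var_def)

print(f"Var(X) by definition: {var_def:.4f}")
print(f"Var(X) by shortcut:   {var_alt:.4f}")
print(f"SD(X): {sd:.4f}")
```

The two routes agree up to floating-point rounding. The shortcut is often easier by hand because it avoids subtracting the mean from every value.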
Sum of Random Variables
Let X1 and X2 be 2 random variables with means µ1 and µ2 and standard deviations σ1 and σ2. Suppose Y = aX1 + bX2.
‒ What is the mean of Y?
E[Y] = aE[X1] + bE[X2]
‒ What is the standard deviation of Y?
If X1 and X2 are independent, Var(Y) = a²Var(X1) + b²Var(X2), and SD(Y) is the square root of Var(Y)
• Independent: When the value taken by one random variable does not affect the value taken by the other random variable
‒ E.g. rolls of 2 dice
• Dependent: When the value of one random variable gives us more information about the other random variable
‒ E.g. height and weight of students
Example
Let X1 and X2 be the outcomes associated with a toss of a pair of dice
E(X1) = E(X2) = 3.5
SD(X1) = SD(X2) = 1.708
Compute the following:
E(X1 + X2) =
SD(X1 + X2) =
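One way to check an answer to this exercise is to compare the rules for sums with a brute-force enumeration of all 36 outcomes. The Python sketch below is an illustration with a = b = 1, and it assumes the two dice are fair and independent.

```python
from itertools import product
from math import sqrt

faces = range(1, 7)

# Mean and variance of a single fair die
mean_one = sum(faces) / 6
var_one = sum((f - mean_one) ** 2 for f in faces) / 6
sd_one = sqrt(var_one)

# Rules for Y = X1 + X2 with independent X1 and X2
mean_sum_rule = mean_one + mean_one
sd_sum_rule = sqrt(var_one + var_one)     # add variances, not SDs

# Brute force: all 36 equally likely pairs of faces
sums = [a + b for a, b in product(faces, repeat=2)]
mean_sum = sum(sums) / len(sums)
sd_sum = sqrt(sum((s - mean_sum) ** 2 for s in sums) / len(sums))

print(f"one die: mean = {mean_one}, SD = {sd_one:.3f}")
print(f"sum, by the rules:   mean = {mean_sum_rule}, SD = {sd_sum_rule:.3f}")
print(f"sum, by enumeration: mean = {mean_sum}, SD = {sd_sum:.3f}")
```

Note that the standard deviations themselves do not add. Only the variances do, and only because the two rolls are independent.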
The Empirical Rule
• Approximately 68% of data points will be within 1 standard deviation of the mean
• Approximately 95% of data points will be within 2 standard deviations of the mean
• A vast majority (almost all, about 99.7%) will lie within 3 standard deviations of the mean
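For normally distributed data, these percentages can be reproduced from the normal CDF. A minimal Python sketch using NormalDist from the standard library's statistics module (Python 3.8 or later):

```python
from statistics import NormalDist

standard_normal = NormalDist()     # mean 0, standard deviation 1

# P(-k <= Z <= k) = F(k) - F(-k) for k = 1, 2, 3 standard deviations
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} SD of the mean: {prob:.4f}")
```

The rule is exact only for the normal distribution. For data with other shapes it is a rough guide.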
Normal Distribution
• The graph of the PDF is a bell-shaped curve
• The normal random variable takes values from -∞ to +∞
• It is symmetric and centered around the mean (which is also the median and the mode)
• Any normal distribution can be specified with just 2 parameters: the mean (µ) and the standard deviation (σ)
• We write this as X ~ N(µ, σ²)
Comparing multiple normal distributions
Probability Calculation for Continuous Distribution
• The probability associated with any single value of the random variable is always zero
• Probability of values being in a range = area under the PDF curve in that range
• The area under the entire curve always equals 1
Z-scores, Standard Normal Distribution
For every value (x) of the random variable X, we calculate its z-score:
$z = \frac{x - \mu}{\sigma}$
Interpretation: How many standard deviations away is the value from the mean?
If X ~ N(µ, σ²) then
‒ Z-scores have a normal distribution with µ = 0 and σ = 1
‒ i.e. Z ~ N(0, 1)
‒ This is the standard normal distribution
• Inverse Transformation
‒ X = µ + zσ
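A tiny Python sketch of the transformation and its inverse. The function names and the numbers (µ = 100, σ = 15, x = 130) are hypothetical, chosen only for illustration.

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Inverse transformation: recover x from its z-score."""
    return mu + z * sigma

# Hypothetical values, not from the slides
mu, sigma = 100.0, 15.0
x = 130.0

z = z_score(x, mu, sigma)
x_back = from_z(z, mu, sigma)

print(f"x = {x} is {z} standard deviations above the mean")
print(f"converting z = {z} back gives x = {x_back}")
```

A negative z-score means the value lies below the mean.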
Probability Calculation for Normal Distribution
• Consider a normal distribution X ~ N(µ, σ²)
• Methods to calculate P(X ≤ x)
‒ Use R: pnorm(x, µ, σ)
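For readers without R, a rough Python counterpart is the cdf method of statistics.NormalDist, which also takes the mean and the standard deviation (not the variance). The numbers below (µ = 100, σ = 15) are hypothetical, and the sketch also shows that going through the z-score gives the same probability.

```python
from statistics import NormalDist

# Hypothetical parameters, not from the slides
mu, sigma = 100.0, 15.0
x = 130.0

dist = NormalDist(mu, sigma)       # second argument is the SD, as in pnorm

# P(X <= x) directly from the distribution
p_direct = dist.cdf(x)

# Same probability through the standard normal and the z-score
z = (x - mu) / sigma
p_via_z = NormalDist().cdf(z)

# Probability of a range = difference of two CDF values
p_between = dist.cdf(115.0) - dist.cdf(85.0)

print(f"P(X <= {x}) = {p_direct:.4f}")
print(f"via z-score  = {p_via_z:.4f}")
print(f"P(85 <= X <= 115) = {p_between:.4f}")
```

The range 85 to 115 is one standard deviation either side of the mean, so its probability comes out close to the 68% of the empirical rule.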