EXCEL STATISTICAL ANALYSIS (info.marshall.usc.edu/.../Excel/Statistics_Workshop_V3.pdf)

Post on 14-Jul-2020


EXCEL

STATISTICAL Functions

Average vs. Weighted Average

Average Price

Transaction Date   Item     Price Per Unit   Units Sold
5/7/2007           Coffee   $20              500
6/7/2007           Coffee   $25              750
7/6/2007           Coffee   $35              200
8/9/2007           Coffee   $30              300

Total: 20 + 25 + 35 + 30 = 110

Average Price: 110 / 4 = 27.50

Sum of Prices / Count of Prices

Weighted Average Price

Date        Item     Price Per Unit   Units Sold   Price Per Unit * Units Sold
5/7/2007    Coffee   $20              500          10,000
6/7/2007    Coffee   $25              750          18,750
7/6/2007    Coffee   $35              200           7,000
8/9/2007    Coffee   $30              300           9,000
Total                                 1,750        44,750

Units Sold: 500 + 750 + 200 + 300 = 1,750

Weighted Average Price: 44,750 / 1,750 = $25.57

∑(Price per Unit * Units Sold) / ∑ Units Sold

Weighted Average Exercise

“Weighted_Avg” sheet

Weighted Average

Solutions

“Weighted_Avg” sheet

Wt. Avg. with SumProduct() Exercise

“Weighted_Avg_SumProduct” sheet

=SumProduct(Array1,Array2)

Multiplies corresponding numbers in two or more different ranges together then sums them.

                     Array 1          Array 2
Date        Item     Price Per Unit   Units Sold
5/7/2007    Coffee   $20              500          → 20*500 = 10,000
6/7/2007    Coffee   $25              750          → 25*750 = 18,750
7/6/2007    Coffee   $35              200          → 35*200 = 7,000
8/9/2007    Coffee   $30              300          → 30*300 = 9,000
                                                   Sum: 44,750
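As a cross-check outside Excel, the same arithmetic can be sketched in Python. This is only an illustration using the coffee figures from the table above; the variable names are made up for the example and are not part of the workbook.

```python
# Prices and units sold from the coffee table in the slides.
prices = [20, 25, 35, 30]
units = [500, 750, 200, 300]

# Plain average: sum of prices / count of prices.
simple_average = sum(prices) / len(prices)

# SumProduct(Array1, Array2): multiply matching entries, then add them up.
sumproduct = sum(p * u for p, u in zip(prices, units))

# Weighted average: SumProduct / total units sold.
weighted_average = sumproduct / sum(units)

print(simple_average, sumproduct, round(weighted_average, 2))
```

The weighted figure is lower than the plain average because the largest sale (750 units) happened at one of the lower prices.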

Wt. Avg. with SumProduct() Solution

“Weighted_Avg_SumProduct” sheet

Summations

\[ \sum_{i=1}^{10} i^2 \]

Sum(1^2, 2^2, 3^2, 4^2, 5^2, 6^2, 7^2, 8^2, 9^2, 10^2)

Starting Number: 1 (below the sigma)

Ending Number: 10 (above the sigma)

Mathematical Operation: i^2 (to the right of the sigma)

Using Sum() as an Array

\[ \sum_{i=1}^{10} i^2 \] = Sum(1^2,2^2,3^2,4^2,5^2,6^2,7^2,8^2,9^2,10^2)

With the numbers 1 through 10 in A1:A10, this can be shortened to:

Sum(A1:A10^2)

However, if you just press Enter, Excel cannot evaluate the equation. We need to tell it to process each cell in the range one at a time.

Press Control + Shift + Enter to make Excel process the cells in a range one at a time.

Sum(A1:A10^2)

= Sum(A1^2,A2^2,A3^2,A4^2,A5^2,A6^2,A7^2,A8^2,A9^2,A10^2)

= Sum(1^2,2^2,3^2,4^2,5^2,6^2,7^2,8^2,9^2,10^2)

= Sum(1,4,9,16,25,36,49,64,81,100)

= 385
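The cell-by-cell evaluation that the array formula performs can be mimicked in a few lines of Python. This is a sketch only; `cells` stands in for the values assumed to be in A1:A10.

```python
# Stand-in for A1:A10 holding the numbers 1 through 10.
cells = list(range(1, 11))

# Square each cell one at a time, then add the results,
# which is what Sum(A1:A10^2) does as an array formula.
squares = [value ** 2 for value in cells]
total = sum(squares)

print(squares)
print(total)
```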

Summation Exercises

“Summations” sheet

Summation Solutions

“Summations” sheet

Be sure to press Control + Shift + Enter after typing or editing the formulas.

Factorials in Excel

\[ \frac{(n+k-1)!}{k!\,(n-1)!} \]

Factorials (!)

4! = 4 * 3 * 2 * 1 = 24

6! = 6 * 5 * 4 * 3 * 2 * 1 = 720

4! = Fact(4) = 24

6! = Fact(6) = 720

Factorial Exercise using Combinations

You have an unlimited supply of Coconuts and Pineapples. How many different ways can you fill a bowl which will contain exactly 3 items?

Combinations (With Replacement, Order Not Important)

• Order of selection is not important.
• Items can be repeated (with replacement).

\[ \frac{(n+k-1)!}{k!\,(n-1)!} \]

n = The number of different items available (2).
k = The number of items selected (3).
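A short Python sketch can confirm the formula by also listing the bowls outright. The function name `combos_with_replacement` is invented for this illustration; `itertools.combinations_with_replacement` is the standard-library routine that enumerates the same selections.

```python
from itertools import combinations_with_replacement
from math import factorial


def combos_with_replacement(n, k):
    """(n + k - 1)! / (k! * (n - 1)!) for n item types and k picks."""
    return factorial(n + k - 1) // (factorial(k) * factorial(n - 1))


# n = 2 kinds of fruit, k = 3 items in the bowl.
count = combos_with_replacement(2, 3)

# Enumerate the bowls directly to check the count.
bowls = list(combinations_with_replacement(["Coconut", "Pineapple"], 3))

print(count)
for bowl in bowls:
    print(bowl)
```

In Excel the same count would be =Fact(2+3-1)/(Fact(3)*Fact(2-1)).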

Factorial Exercise with Combinations

“Factorials” sheet

Factorial Exercise with Combinations: Solution

“Factorials” sheet

Frequency

[Histogram: Number Occurring (vertical axis, 1 to 7) by Grade (horizontal axis: F, D, C, B, A)]

We wish to determine the number of students getting an A, B, C, D and F based on the conditions below.

>89.99 and <=100 ---------> A

>79.99 and <=89.99 ------> B

>69.99 and <=79.99 ------> C

>59.99 and <=69.99 ------> D

<= 59.99 ---------------------> F

=Frequency(Data Array, Bins Array)

Data Array: This is the range of values you are analyzing.

Bins Array: This is your grouping. The values listed are the maximums for the group and are inclusive.

In this example the groupings are:

>89.99 and <=100
>79.99 and <=89.99
>69.99 and <=79.99
>59.99 and <=69.99
<=59.99

=Frequency(Data Array, Bins Array)

You must highlight your output area prior to typing the formula.

When done typing, you must press: Control +Shift + Enter
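To see how inclusive bin maximums sort scores into groups, here is a rough Python equivalent. The `frequency` helper and the sample scores are hypothetical; the sketch assumes the bins are sorted in ascending order and, like Excel's Frequency(), returns one extra count for values above the last bin.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; each bin value is an inclusive maximum."""
    counts = [0] * (len(bins) + 1)  # last slot: values above the top bin
    for value in data:
        # bisect_left finds the first bin whose maximum is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts


bins = [59.99, 69.99, 79.99, 89.99, 100]        # F, D, C, B, A
scores = [55, 62, 69.99, 70, 78, 85, 91, 100]   # made-up scores

grade_counts = frequency(scores, bins)
print(dict(zip(["F", "D", "C", "B", "A", "over 100"], grade_counts)))
```

Note that 69.99 lands in the D group and 70 in the C group, which is what "inclusive maximum" means.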

“Frequency” sheet

=Frequency(Data Array, Bins Array)

You must:
• Highlight First
• Press Control + Shift + Enter

Rank and Percentile

(Comparisons within the Group Only)

Rank & Percentile: Rank

Two Factors to Consider with Rank:

• Order (Ascending or Descending)

• How Ties are Handled

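Rank: Handling Ties

Rank.Eq(): Ties Given the Same First Rank

Rank.Avg(): Ties Given the Same Average Rank

=Rank.Eq(Score,Range,Order)

Score: Value being ranked

Range: Set of values being ranked

Order:
0 (or omitted) ranks in descending order, so the highest score is ranked 1 and low scores get low rankings.
1 (or any nonzero value) ranks in ascending order, so the lowest score is ranked 1.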

“Ranking” sheet


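=Rank.Avg(Score,Range,Order)

Score: Value being ranked

Range: Set of values being ranked

Order:
0 (or omitted) ranks in descending order, so the highest score is ranked 1 and low scores get low rankings.
1 (or any nonzero value) ranks in ascending order, so the lowest score is ranked 1.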

“Ranking” sheet
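The difference between the two tie rules is easiest to see on a small list. The Python below is a sketch of the behaviour, not Excel's implementation; the function names mirror the Excel ones, the scores are invented, and `rank_avg` assumes the score being ranked appears in the list.

```python
def rank_eq(score, values, order=0):
    """Tied scores share the first (best) rank they would occupy."""
    if order == 0:  # descending: highest score is ranked 1
        return 1 + sum(1 for v in values if v > score)
    return 1 + sum(1 for v in values if v < score)  # ascending


def rank_avg(score, values, order=0):
    """Tied scores share the average of the ranks they occupy."""
    ties = values.count(score)
    return rank_eq(score, values, order) + (ties - 1) / 2


scores = [90, 85, 85, 70]  # hypothetical test scores

print(rank_eq(85, scores))     # both 85s share rank 2
print(rank_avg(85, scores))    # both 85s share rank 2.5
print(rank_eq(70, scores, 1))  # ascending order: lowest score is rank 1
```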


Percent Rank

Expressing a Score as the Percent of Scores It Is Greater Than

PercentRank.Inc()
This returns the percent of students who scored less than the score in question. The "Inc" indicates that the result can range from 0% to 100% inclusive. The outcome is that the highest score gets 100%, indicating that 100% of the other scores are less than the score in question.

PercentRank.Exc()
This also returns the percent of students who scored less than the score in question. The "Exc" indicates that 0% and 100% are excluded from the possible results. The outcome is that the highest score might get 95%, but it is never expressed as being better than 100% of the scores. This is a more commonly used method of expressing percentiles.
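The two variants can be sketched in Python for the simple case where the score is in the list and there are no ties. Those two conditions are assumptions of this sketch (Excel also interpolates for scores that are not in the list, which is not reproduced here), and the scores are invented.

```python
def percentrank_inc(values, x):
    """Share of the other scores that are below x (0 to 1 inclusive)."""
    below = sum(1 for v in values if v < x)
    return below / (len(values) - 1)


def percentrank_exc(values, x):
    """Position of x among n scores as rank / (n + 1); never 0 or 1."""
    below = sum(1 for v in values if v < x)
    return (below + 1) / (len(values) + 1)


scores = [50, 60, 70, 80, 90]  # hypothetical, no ties

print(percentrank_inc(scores, 90), percentrank_inc(scores, 50))
print(round(percentrank_exc(scores, 90), 3), round(percentrank_exc(scores, 50), 3))
```

With only five scores the highest one gets 100% under Inc but about 83% under Exc.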

Percent Rank

PercentRank.Exc(Array,k)

Array: Set of Scores
k: Specific Score

“PercentRank” sheet

Percent Rank

PercentRank.Inc(Array,k)

Array: Set of Scores
k: Specific Score

“PercentRank” sheet

Percentile

Returns the Score for a Given Percentile (i.e. Breakpoint)

Percentile.Inc(Range of Scores, Percentile)
This returns the score associated with a given breakpoint. For example, suppose you want to know what test score begins the 90th percentile; you would enter .9 as the percentile. The "Inc" indicates that 0 and 100 percent are included when calculating the score associated with the percent given.

Percentile.Exc(Range of Scores, Percentile)
Just like Percentile.Inc except that it excludes 0 and 100 percent.
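One common way to describe the two breakpoints is linear interpolation between sorted scores, at position k*(n-1) for the inclusive version and k*(n+1)-1 for the exclusive one (counting from zero). The Python below sketches that rule; the helper names and data are made up, and it is meant as an illustration rather than a guaranteed match for every Excel edge case.

```python
def _interpolate(data, pos):
    """Value at a fractional zero-based position in sorted data."""
    lower = int(pos)
    if lower + 1 >= len(data):
        return float(data[-1])
    frac = pos - lower
    return data[lower] + frac * (data[lower + 1] - data[lower])


def percentile_inc(values, k):
    data = sorted(values)
    return _interpolate(data, k * (len(data) - 1))


def percentile_exc(values, k):
    data = sorted(values)
    pos = k * (len(data) + 1) - 1
    if pos < 0 or pos > len(data) - 1:
        raise ValueError("k is outside the range this data can support")
    return _interpolate(data, pos)


scores = [1, 2, 3, 4, 5]  # hypothetical
print(percentile_inc(scores, 0.9))
print(percentile_exc(scores, 0.5))
```

With only five scores the exclusive version cannot answer k = 0.9 at all, which is why small data sets are usually handled with the inclusive version.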

Rank, Rank Percent & Percentile Exercises

“Percent_and_Rank” sheet

Normal Distribution

Is the Data Normally Distributed?

“Normal_Distribution” sheet

Creating a Histogram

Skew(Range)
Degree of symmetry. If the skew is greater than +1 or lower than -1, the skew is substantial and the distribution is far from symmetrical.

Kurt(Range)
Kurtosis is the degree of flatness. In Excel, a value greater than zero indicates a sharper peak than a normal curve, while a value less than zero indicates a flatter curve. A value of 0 matches a normal curve.

Average(Range)
Answers the question: Where are most of the data points? Average is less useful when data is not normally distributed.

Skew(), Kurt(), & Average()

“Normal_Distribution” sheet
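For readers who want to see what sits behind Skew() and Kurt(), the sketch below uses the sample-adjusted formulas that Excel documents for these functions (Kurt() reports excess kurtosis, so a normal curve scores about 0). The function names and the data sets are chosen for this illustration only.

```python
from statistics import mean, stdev


def skew(values):
    """Sample skewness (needs at least 3 values)."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def kurt(values):
    """Sample excess kurtosis (needs at least 4 values)."""
    n, m, s = len(values), mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return scale * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


symmetric = [1, 2, 3, 4, 5]    # evenly spread, so flat and symmetric
right_tail = [1, 1, 2, 10]     # one large value pulls the tail right

print(round(skew(symmetric), 4), round(kurt(symmetric), 4))
print(round(skew(right_tail), 4))
```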

Standard Deviation

How spread out from the average is the data?

If your data is normally distributed then you can use Standard Deviation to see how spread out the data is.

For data that is normally distributed:
68% of your data will be within 1 standard deviation of the average.
95% of your data will be within 2 standard deviations of the average.
99.7% of your data will be within 3 standard deviations of the average.

Stdev.P(Range): Standard Deviation of a Population

Stdev.S(Range): Standard Deviation of a Sample

“Normal_Distribution” sheet

“Normal_Distribution” sheet

We already know that the average is 50.7. Using the normal distribution, we now know that:

68% of the students' scores are within 21 points of the average (i.e. 29.7 to 71.7).
95% of the students' scores are within 42 points (2 x 21) of the average (i.e. 8.7 to 92.7).
99.7% of the students' scores are within 63 points (3 x 21) of the average (i.e. -12.3 to 113.7).

Note: because of the thinness of the tails at 3 standard deviations, you should think of this last range as 0 to 100.
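The bands around the average, and the share of a normal distribution that falls inside each, can be reproduced with Python's standard library. The average (50.7) and standard deviation (21) come from the slides; the small `sample` list is invented purely to show the population versus sample versions of the calculation.

```python
from statistics import NormalDist, pstdev, stdev

average, sd = 50.7, 21

# Ranges 1, 2 and 3 standard deviations either side of the average.
bands = {k: (round(average - k * sd, 1), round(average + k * sd, 1))
         for k in (1, 2, 3)}

# Share of a normal distribution inside each band.
shares = {k: 2 * NormalDist().cdf(k) - 1 for k in (1, 2, 3)}

for k in (1, 2, 3):
    print(k, bands[k], round(shares[k] * 100, 1))

# Stdev.P divides by n, Stdev.S divides by n - 1.
sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(pstdev(sample), round(stdev(sample), 3))
```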

Standard Normal Distribution

The Standard Normal Distribution is the Normal Distribution adjusted so the mean falls on zero and the standard deviation is one.

The image above shows test scores (0-150) plotted on a histogram. The data is normally distributed.

Standard Normal Distribution

and Z Scores

The data points are converted to Z-Scores, which represent the number of standard deviations from the mean, using this formula:

\[ Z\ \text{Score} = \frac{\text{Data Point} - \text{Average}}{\text{Standard Deviation for the Sample}} \]

or, in Excel:

Standardize(Data point, Mean, Standard Deviation)

For example:
• A Z-Score of 0 indicates that the score is the same as the mean.
• A positive Z-Score of 1.62 means that the data point is 1.62 standard deviations above the mean.
• A negative Z-Score means it is below the mean.

Standard Normal Distribution

and Z Scores

Z-Scores allow you to:

• Compare scores from different normal distributions.

• Combine scores from data with different means and standard deviations.

Standard Normal Distribution

Z Scores as Probabilities

Z-Scores are often converted to probabilities (percentages) to make them easier to understand. They are expressed as a 0% to 100% probability of occurring.

For example, a test score of 85 might turn out to have a z-score of 1.62 which in turn might have a corresponding probability of 94%. This indicates that getting a score of 85 means you did better than 94% of the population. It also can be taken to mean that if you randomly sample a score, there is a 94% probability that it will be less than 85.

Score of 85

Z Score of 1.62

94% Probability

Functions for Returning Z Scores and Probabilities in Excel

Returning a Probability
• Data Point to Probability: Norm.Dist(Data point, Mean, Standard deviation, Cumulative?)
• Z-Score to a Probability: Norm.S.Dist(Z-Score, Cumulative?)

Returning a Z Score
• Data Point to a Z Score: Standardize(DataPoint, Mean, Standard Deviation)

Returning Data Point for a Given Probability
• Probability to Data Point: Norm.Inv(Probability Percent, Mean, Standard deviation)

About Cumulative

Returning a Probability
• Data Point to Probability: Norm.Dist(Data point, Mean, Standard deviation, Cumulative?)
• Z-Score to a Probability: Norm.S.Dist(Z-Score, Cumulative?)

Cumulative is set to either True or False:

True: Excel returns the percentage of students whose scores were less than or equal to yours.

False: Excel returns the height of the normal curve at your score (the probability density). This indicates how common scores near yours are; for a continuous distribution it is not literally the percentage of students with exactly your score.
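Python's `statistics.NormalDist` offers close analogues of these Excel functions, which makes it handy for checking a worksheet. The sketch below assumes a mean of 65 and a standard deviation of 15 (the national figures used in the next exercise) and a score of 85 picked for illustration.

```python
from statistics import NormalDist

national = NormalDist(mu=65, sigma=15)
score = 85

p_below = national.cdf(score)    # like Norm.Dist(85, 65, 15, TRUE)
density = national.pdf(score)    # like Norm.Dist(85, 65, 15, FALSE)
z = national.zscore(score)       # like Standardize(85, 65, 15)
p_from_z = NormalDist().cdf(z)   # like Norm.S.Dist(z, TRUE)
back = national.inv_cdf(p_below) # like Norm.Inv(p, 65, 15)

print(round(z, 3), round(p_below, 4), round(density, 5), round(back, 2))
```

The cumulative value and the value obtained by going through the Z-Score agree, which is the point of standardizing.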

Standard Normal Distribution Exercise

Compare Local Test Scores with National

Use Norm.Dist() to see where our students’ scores are when compared with National test scores.

National Average Score: 65
National Standard Deviation: 15

“Probability_Ex1_National” sheet

Standard Normal Distribution Exercise

Compare Local Test Scores with National

What is the National cutoff Score for the 90th Percentile? Norm.Inv()

How many of our students are in the National 90th Percentile? CountIf()

Percent of our Students in the National 90th Percentile? # in 90th / Total Count

“Probability_Ex1_National” sheet
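The three questions chain together naturally. A minimal Python sketch follows, using the national mean of 65 and standard deviation of 15 from the exercise; the `local_scores` list is hypothetical and stands in for the scores on the worksheet.

```python
from statistics import NormalDist

national = NormalDist(mu=65, sigma=15)

# Norm.Inv(0.9, 65, 15): the score that begins the 90th percentile.
cutoff = national.inv_cdf(0.90)

# Hypothetical local scores standing in for the worksheet data.
local_scores = [58, 72, 85, 91, 66, 79, 88, 60, 95, 70]

# CountIf(range, ">=" & cutoff)
in_90th = sum(1 for s in local_scores if s >= cutoff)

# Number in the 90th percentile / total count
share = in_90th / len(local_scores)

print(round(cutoff, 2), in_90th, share)
```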

Standard Normal Distribution Exercise: Solution

Compare Local Test Scores with National

“Probability_Ex1_National” sheet

What is the 90th Percentile Cutoff Score?

Number of Local Students Who Made the National 90th Percentile?

Percent of Local Students Who Made the National 90th Percentile?

Standard Normal Distribution Exercise

“Probability_Ex2” sheet

Using Norm.Dist() to Calculate a Z-Score Probability

A volleyball coach is considering hiring a new player who is 78" tall but is wondering if his height is really that impressive. The coach knows that in the USA, the average height of a man is 69.1" with a standard deviation of 2.9". What percent of the national population is the player taller than?


Solution
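One way to check the worksheet answer is with Python's `statistics.NormalDist`, used here as a stand-in for Norm.Dist(78, 69.1, 2.9, TRUE). This is a sketch for verification, not the workbook's own solution.

```python
from statistics import NormalDist

men = NormalDist(mu=69.1, sigma=2.9)  # US male height figures from the exercise

z = men.zscore(78)          # how many standard deviations above average
taller_than = men.cdf(78)   # share of the population shorter than 78"

print(round(z, 2), round(taller_than * 100, 2))
```

At roughly three standard deviations above the mean, the player is taller than nearly everyone in the population.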

Standard Normal Distribution Exercise

“Probability_Ex3” sheet

Using Norm.Inv() to Calculate a Data Point from a Probability

Our volleyball coach also wants to make sure his male players are all taller than 95% of the population. What is the minimum height his players must meet to get on the team? The coach knows that in the USA, the average height of a man is 69.1" with a standard deviation of 2.9".


Solution
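The inverse lookup can be checked the same way, with `inv_cdf` standing in for Norm.Inv(0.95, 69.1, 2.9). Again, this is an illustrative sketch rather than the workbook's solution.

```python
from statistics import NormalDist

men = NormalDist(mu=69.1, sigma=2.9)

# Height that 95% of the population falls below.
min_height = men.inv_cdf(0.95)

print(round(min_height, 2))
```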

Standard Normal Distribution Exercise

“Probability_Ex4” sheet

Using Norm.Dist() for Probability Between Two Data Points

A new student is being sent to our height-minded volleyball coach to be on his team. Again, Google tells him that in the USA, the average height of a man is 69.1" with a standard deviation of 2.9".

• What is the probability that the student will be exactly 70 inches tall?

• What is the probability that the student will be 70 inches tall or less?

• What is the probability that the student will be between 70 to 74 inches tall?

• What is the probability that the student will be greater than 70 inches tall?

Standard Normal Distribution Exercise

Norm.Dist(DataPoint, Mean, Standard Deviation, Cumulative True)

“Probability_Ex4” sheet


Solution
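The four questions differ only in which part of the curve is measured. The Python sketch below uses `statistics.NormalDist` with the height figures from the exercise; the variable names are illustrative. For the "exactly 70 inches" question it reports the curve height, which is what Norm.Dist returns with Cumulative set to False, and is best read as a relative likelihood.

```python
from statistics import NormalDist

men = NormalDist(mu=69.1, sigma=2.9)

density_at_70 = men.pdf(70)            # Norm.Dist(70, 69.1, 2.9, FALSE)
p_70_or_less = men.cdf(70)             # Norm.Dist(70, 69.1, 2.9, TRUE)
p_70_to_74 = men.cdf(74) - men.cdf(70) # difference of two cumulative values
p_over_70 = 1 - men.cdf(70)            # everything above 70

print(round(density_at_70, 4), round(p_70_or_less, 4),
      round(p_70_to_74, 4), round(p_over_70, 4))
```

The "between" case is the key idea: subtract the cumulative probability at the lower point from the cumulative probability at the upper point.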

T-Test

Used to determine how similar two sets of data are. More specifically, it tests the hypothesis that the means of the two sets of data do not differ (i.e. the null hypothesis). In other words, are they different simply due to chance, or will they continue to be different over time?

Requirements:
• Standard Deviation of the Population is unknown.
• Sample Size is less than 30 (n<30).
• Data is Normally Distributed.

T-Tests

(Comparing Two Means)

Interpreting the Results
• T-Test results are typically expressed in probabilities (p).
• Because a T-Test evaluates the hypothesis that there is no real difference, the smaller the p value, the more likely it is that there is a true difference between the means. A 95% level of confidence is typically used, so:

A p-value less than .05 indicates that the data sets do differ, and not just due to randomness.

T-Tests

(Comparing Two Means)

Factors to Consider: Number of Tails

One Tailed: testing is in just one direction, e.g. testing that the mean of sample A is less than the mean of sample B.

Two Tailed: testing is in both directions, e.g. testing that the mean of sample A is less than or greater than the mean of sample B.

T-Tests

(Comparing Two Means)

Factors to Consider: Paired vs. Unpaired

Paired: Each value in a sample is paired with a value in another sample. Each sample has the same number of data points. Often used in before and after comparisons.

• Testing a person’s reflexes before and after drinking alcohol.
• Weighing each individual before and after taking a new weight loss drug.

T-Tests

(Comparing Two Means)

Factors to Consider: Paired vs. Unpaired

Unpaired: Data is gathered by randomly sampling each set of data separately under different conditions. The size of each group does not have to be equal.

• Comparing the performance of two different stocks.
• Comparing the battery life of batteries from two different suppliers.

Unpaired tests can assume either equal or unequal population variances. Because you generally don’t know the population variance, unequal variances is used much more often.

T-Tests

(Comparing Two Means)

Factors to Consider: One-Sample vs. Two-Sample

One-Sample: You wish to compare your data to a known value or population mean.

• Measuring a school’s test scores against the national average.
• Measuring birth weight in a certain city against a mean taken 50 years ago in the same city.

T-Tests

(Comparing Two Means)

Factors to Consider: One-Sample vs. Two-Sample

Two-Sample: The means were sampled independently from two different population groups.

• Comparing the mean of people on a new weight loss drug against the mean of a control group taking a placebo.
• Comparing the battery life of batteries from two different suppliers.

T-Tests with the Data Analysis Toolpak

T-Test: Example 1

A researcher gave one group a drug to lower toxins in their cells and another group a placebo. She wishes to test the hypothesis that the amount of toxins in the group treated with the drug will be less than the group given the placebo.

“Ttest_Ex1” Sheet

Number of Tails?
One - Because we are testing that the difference will be in one direction (less than), this is a one-tailed test.

One or Two Sample?
Two Sample - Because the data comes from two independent samples (drug group and placebo group).

Paired or Unpaired?
Unpaired - Two separate groups are measured. Also, we don't know the variance of the two populations so we will assume they are unequal.

T-Test: Example 1 Solution

Because P is less than .05 (.029), we reject the null hypothesis. The difference is not due to randomness and the drug does reduce the amount of toxins in the cells.

T-Test: Example 2

A quality control manager wishes to see if there is any difference between the numbers of defects per day produced by machines maintained by company employees versus those maintained by an outside contractor. Over a 10 day period she has recorded the number of defects per day produced by all in-house maintained machines and all outside contractor maintained machines.

“Ttest_Ex2” Sheet

Number of Tails?
Two - Because we are testing for any difference (less than or greater than).

One or Two Sample?
Two - Because the data comes from two independent samples, it is a two-sample test.

Paired or Unpaired?
Unpaired - Two separate groups are measured. Also, we don't know the variance of the two populations so we will assume they are unequal.

T-Test: Example 2 Solution

“Ttest_Ex2” Sheet

Because P is greater than .05 (.118), we fail to reject the null hypothesis. The data does not show a real difference between the two means.

T-Test: Example 3

A company wishes to see if a new drug it has developed for diabetes has any effect on people's weight. They weighed several individuals before giving them the drug and then again after giving them the drug.

“Ttest_Ex3” Sheet

Number of Tails?
Two - Because we are testing for any difference (less than or greater than).

One or Two Sample?
Two - Because the data comes from two samples, it is a two-sample test.

Paired or Unpaired?
Paired - We are weighing the same individuals both before and after the test.

T-Test: Example 3 Solution

“Ttest_Ex3” Sheet

Because P is less than .05 (.0000103), we reject the null hypothesis. The drug does affect people’s weight.

T-Test: Example 4

"TV's R Us" is opening a store in Sunnydale, CA and wishes to see if the town’s average daily TV viewing differs from the national average of 290 minutes per day. The data from the 15 residents they surveyed is below.

“Ttest_Ex4” Sheet

Number of Tails?
Two - Because we are testing for any difference (less than or greater than).

One or Two Sample?
One (Sort of) - We are comparing our data with a national average so this is a one-sample test; however, the Data Analysis Toolpak doesn’t have a one-sample test so we must make a column of 290 and specify a Two Sample test.

Paired or Unpaired?
Unpaired - Two separate groups are measured. Also, we don't know the variance of the two populations so we will assume they are unequal.

T-Test: Example 4 Solution

“Ttest_Ex4” Sheet

Because P is less than .05, we reject the null hypothesis. Sunnydale’s viewing does differ from the national average.

T.Test() Function

T.Test(Array1,Array2,Tails,Type)

Tails has Two Choices:
1 for a one-tailed test.
2 for a two-tailed test.

Type has Three Choices:
1 for Paired.
2 for Unpaired and Equal Variance.
3 for Unpaired and Unequal Variance.
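To make the paired and unpaired (unequal variance) cases concrete, the sketch below computes the t statistic and degrees of freedom that sit behind each test. It deliberately stops short of the p-value, which T.Test() obtains from the t distribution and which Python's standard library does not provide. The function names and all data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def welch_t(a, b):
    """Unpaired, unequal variances: t statistic and degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def paired_t(before, after):
    """Paired: one-sample t on the before/after differences."""
    diffs = [x - y for x, y in zip(before, after)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up measurements for illustration only.
t_unpaired, df_unpaired = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
t_paired, df_paired = paired_t([10, 12, 9], [8, 11, 7])

print(round(t_unpaired, 3), round(df_unpaired, 1))
print(round(t_paired, 3), df_paired)
```

The larger the absolute t value for a given number of degrees of freedom, the smaller the p-value Excel reports.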


Z-Test (When N>30)

Like a T-Test this is used to determine how similar two sets of data are. When the number of data points is greater than 30, use a Z-Test rather than a T-Test.

Requirements:
• Variance of Samples should be close (F-Test).
• Does not have to be normally distributed.
• Sample sizes don’t have to be the same but should be close.
• Data points should be independent of each other.

Note that if you know the standard deviation of the population you should always use a Z-Test regardless of the sample size.

Z-Test (n>30)

Steps in Excel

1. Compute the Variance of both data sets: Var.S(Range)

2. Perform a Z-Test: Data Analysis Toolpak

Z-Test (n>30)

Exercise 1

A company manufacturing “Easy-Start” fire logs wants to know if a new manufacturing treatment increases the logs' burning time. They have tested 41 of their old logs and 35 of their new logs and have recorded the burning times in the table shown to the right. They now wish to see if the difference they are seeing in the two means is due to randomness or whether the new logs actually do burn longer.

“Ztest_Ex1” Sheet

Z-Test (n>30)

Exercise 1

Step 1: Find the Variance for both the Old and New logs.

=VAR.S(A5:A45)

and

=VAR.S(B5:B39)

“Ztest_Ex1” Sheet

Z-Test (n>30)

Exercise 1

Step 2: Perform a Z-Test to see if the samples differ

Because our P value for a one-tail test (0.00913) is less than 0.05, we will reject the null hypothesis that the means are the same. Or in other words, the difference in our means is not due to randomness and the new logs do burn longer than the older logs.
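The two steps can be sketched in Python as a two-sample z statistic built from the sample variances, with the p-value read from the standard normal curve. The burn times below are generated placeholders (41 old logs, 35 new logs) rather than the workshop data, and `two_sample_z` is a name made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(a, b):
    """z statistic for mean(a) - mean(b), plus one- and two-tail p-values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p_one_tail = 1 - NormalDist().cdf(abs(z))
    return z, p_one_tail, 2 * p_one_tail


# Placeholder burn times in minutes; NOT the data from the Ztest_Ex1 sheet.
old_logs = [60 + (i % 7) for i in range(41)]
new_logs = [62 + (i % 7) for i in range(35)]

z, p_one, p_two = two_sample_z(new_logs, old_logs)
print(round(z, 2), p_one < 0.05)
```

A one-tail p-value below 0.05 with a positive z would support the claim that the new logs burn longer.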

Linear Regression

Cause and Effect of Two Variables

Linear Regression - Slope

Rise Over Run

The slope of a line tells you how much a dependent variable (Sales) changes for each one-unit change in an independent variable (Advertising Costs) and is the RISE / RUN.

Slope = (Y2-Y1)/(X2-X1) = (230-200)/(20-10) = 30/10 = 3

A slope of 3 indicates that for every dollar of advertising we spend, our sales increase by 3 dollars.

Slope(Dependent Range Y, Independent Range X)

Linear Regression – Slope Exercise 1

For each dollar of advertising we spend, how much do our sales go up?

Use Excel’s Slope() function. It uses the Least Squares Method (Line of Best Fit) to find the slope using multiple data points.

“Slope_and_Intercept” Sheet

Linear Regression – Slope Exercise 1 Solution

For each dollar of advertising we spend, how much do our sales go up?

“Slope_and_Intercept” Sheet

For every dollar of advertising we spend, our sales go up $2.86.

Linear Regression – Y Intercept

The Y-Intercept is where your line intersects the Y axis and tells you what the value of your Y coordinate would be if its corresponding X coordinate were zero. In this example, it tells us what sales would be if we spent nothing on advertising.

Intercept: y = mx+b

Where:
y = the y coordinate of any data point on the line.
x = the corresponding x coordinate for the y coordinate given above.
m = the slope of the line.
b = the y-intercept.

Intercept(Dependent Range Y, Independent Range X)
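The least-squares calculations behind Slope() and Intercept() are short enough to write out. In the Python sketch below the argument order follows Excel (Y values first, X values second); the two data points are the ones from the rise-over-run example, and the function names are simply lower-case copies of the Excel names.

```python
from statistics import mean


def slope(ys, xs):
    """Least-squares slope: how much y changes per unit of x."""
    mx, my = mean(xs), mean(ys)
    rise = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    run = sum((x - mx) ** 2 for x in xs)
    return rise / run


def intercept(ys, xs):
    """Value of y where the fitted line crosses x = 0."""
    return mean(ys) - slope(ys, xs) * mean(xs)


advertising = [10, 20]   # X values from the slope example
sales = [200, 230]       # Y values from the slope example

m = slope(sales, advertising)
b = intercept(sales, advertising)
print(m, b)  # y = m*x + b
```

With only two points the fitted line passes through both, so the slope matches the hand calculation of 3.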

Linear Regression – Intercept Exercise 1

If we didn’t spend anything on advertising, what would our sales be?

Use Excel’s Intercept() function to find the Y Intercept.

Intercept(Dependent Range Y, Independent Range X)

“Slope_and_Intercept” Sheet

Linear Regression – Intercept Exercise 1 Solution

If we didn’t spend anything on advertising, what would our sales be?

“Slope_and_Intercept” Sheet

Solution
If we spent nothing on advertising our sales would be $179,192.

Plotting The Line

Meaning of R Squared

R² is the square of the correlation and represents the share of the variation in the dependent variable (sales) that the independent variable (advertising) "explains". In other words, it checks how closely sales follow advertising. It is always between 0 and 1, with 1 being a 100% correlation and 0 being no correlation.

In this example R² is .8548 (or 85.5%), which is a fairly strong correlation indicating that advertising does indeed affect sales. It also means that the remaining 14.5% of the variation in sales is not explained by advertising and reflects other, unknown factors.
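R² can be computed directly as one minus the ratio of unexplained variation to total variation. The Python sketch below does that for a small invented data set; the numbers are placeholders, not the workshop's advertising data, so the result will not be .8548.

```python
from statistics import mean

# Hypothetical advertising spend (x) and sales (y).
advertising = [10, 20, 30, 40]
sales = [200, 230, 250, 290]

mx, my = mean(advertising), mean(sales)
m = (sum((x - mx) * (y - my) for x, y in zip(advertising, sales))
     / sum((x - mx) ** 2 for x in advertising))
b = my - m * mx

# Variation left over after fitting the line vs. total variation in sales.
ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(advertising, sales))
ss_total = sum((y - my) ** 2 for y in sales)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 4), round(1 - r_squared, 4))
```

The second printed number is the unexplained share, the counterpart of the 14.5% mentioned above.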

Forecasting Future Values

(Linear Regression)

[Chart: Hardware Sales and New Home Startups by year, 2005 to 2010]

Regression Formula - Predicts a single data point.

Forecast() - Predicts a single data point.

Trend() - Predicts multiple data points.

Exercise 1: Predicting a Y Value Using Slope,

Y-Intercept, and the corresponding X Value

Y Value = Slope * X Value + Y Intercept (Y = mX + b)

“Regression_Ex1” Sheet

Forecast sales for the second half of the year based on sales for the first half using Y=mX+b

Exercise 1: Predicting a Y Value Using Slope, Y-Intercept, and the corresponding X Value: Solution

Y Value = Slope * X Value + Y Intercept (Y = mX + b)

“Regression_Ex1” Sheet

1. Calculate Slope in F2:
=SLOPE(C2:C7,A2:A7)

2. Calculate Y-Intercept in F3:
=INTERCEPT(C2:C7,A2:A7)

3. Predict July Sales in C8 (Slope * X Value + Y Intercept):
=$F$2*A8+$F$3

4. Copy down.
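The same slope, intercept and copy-down steps can be sketched in Python. The monthly sales figures below are invented (month numbers 1 to 6 as X, sales as Y) and simply stand in for whatever is on the Regression_Ex1 sheet.

```python
from statistics import mean

# Hypothetical first-half data: month number (x) and sales (y).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

mx, my = mean(months), mean(sales)

# Step 1: slope. Step 2: y-intercept.
m = (sum((x - mx) * (y - my) for x, y in zip(months, sales))
     / sum((x - mx) ** 2 for x in months))
b = my - m * mx

# Steps 3 and 4: apply y = m*x + b to months 7 through 12 ("copy down").
forecast = [m * x + b for x in range(7, 13)]
print(m, b, forecast)
```

Excel's Forecast() and Trend() perform this same calculation in one step, for a single new X and for a range of new X values respectively.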

Exercise 2: Forecasting a Y Value using Forecast()

“Forecast_Ex2” Sheet

Forecast(x, Known y's, known x's)

X is the known matching independent variable for the dependent variable Y you are trying to predict.

Known Y's are the known dependent variables. (Hardware sales in this example.)

Known X's are the known independent variables. (New Home Start-ups in this example.)

We believe that the number of new home start-ups in our state has a strong impact on our sales and is a good indicator of what our hardware sales will be. Predict what our Hardware Sales will be in 2014.

Exercise 2: Forecasting a Y Value using Forecast(): Solution

“Forecast_Ex2” Sheet

Exercise 3: Forecasting a Y Value using Forecast()

“Forecast_Ex3” Sheet

Use Forecast() to predict future sales based upon Advertising Cost and Sales

Exercise 3: Forecasting a Y Value using Forecast()

“Forecast_Ex3” Sheet

Use Forecast() to predict future sales based upon Advertising Cost and Sales

Solution

Exercise 4: Forecasting a Y Value using Forecast()

“Forecast_Ex4” Sheet

How much should you spend on advertising to achieve your desired sales?

Switch what you use for your dependent and independent variable ranges.

Exercise 4: Forecasting a Y Value using Forecast()

“Forecast_Ex4” Sheet

How much should you spend on advertising to achieve your desired sales?

Solution

Trend() Exercise 1: Using Trend() to Predict Multiple Values

“Trend_Ex1” Sheet

=Trend(Known Y's, Known X's, New X's, Constant T/F)

Known Y's
These are the historic dependent variables (e.g. Sales for the 1st half of the year.)

Known X's
These are the historic independent variables (e.g. Ad Costs for the 1st half of the year.)

New X's
These are the independent variables we plan to use (e.g. Ad Costs for the 2nd half of the year.)

Constant (Optional)
If True or omitted, b is calculated normally. If False, b is set to 0 and m-values are adjusted so y=mx.

Trend() is an array function so:

a. Highlight where you want your results (C9:C14).
b. Type your formula.
c. Press Control + Shift + Enter

Trend() Exercise 1: Using Trend() to Predict Multiple Values: Solution

“Trend_Ex1” Sheet

Trend() Exercise 2: Using Trend() For Multiple Regression

“Trend_Ex2” Sheet

=Trend(Known Y's, Known X's, New X's, Constant T/F)

If the array of known Y values (E3:E13) is in a single column, then Trend() interprets each column of known X values (A3:D13) as a separate variable.

Exercise
In this example we have an office building we wish to sell but we are unsure of what our asking price should be. To help us determine our asking price, we have gathered data on office buildings similar to ours and in our area that have sold recently (A3:E13).

Trend() Exercise 2: Using Trend() For Multiple Regression: Solution

“Trend_Ex2” Sheet

=Trend(E3:E13,A3:D13,A17:D17)

Chi-Square Test

Used to test how likely it is that observed data is due to chance. For example, we randomly asked 44 men and 44 women whether they preferred red or white wine. As the table to the right shows, it appears that men prefer red and women prefer white, but is this true or just random chance?

About Chi-Square

• Chi-Square Tests the Null Hypothesis
Chi-square tests the null hypothesis (that there is no relationship between two measured phenomena) against our data. Therefore, you will need two data sets to compare: expected and actual. A value of less than .05 indicates that the null hypothesis is weak and our observed result is not due to randomness; in other words, our data does appear to reflect reality.

• Difference Between Chi-Square and a T or Z Test
Chi-Square tests the null hypothesis about a relationship between (usually) two variables (e.g. do men and women have a preference for red wine, white wine, or beer?). The t-test tests the null hypothesis between two means (e.g. whether boys and girls in kindergarten have the same average height).

Using Chi-Square

• Use Chi-Square with Categorized Data Only (i.e. data that has been counted and divided into categories).

• Do not use Chi-Square with uncategorized continuous data such as height, weight, or scores. (However, you can use Chi-Square if you have categorized your ranges. For example: Count of ages 0-25, 26-50, 51-75, etc.)

• Do Not use with Derived Data, Percentages, or Ratios (e.g. estimations, assumptions, averages or some other statistical manipulation).

• Do Not use Chi-Square if the count in any Category is less than 5.

• Variables must be Mutually Exclusive (e.g. a person cannot like both red and white wine. It is one or the other).

• Data in the Set Should Never be Excluded (e.g. if you asked people if they liked red wine, white wine or Champagne, you must use all three).

Exercise 1: Chi-Square

We asked 44 men and 44 women if they preferred red or white wine. Our results are shown in the table to the right.

“Chi_Square_2x2” Sheet

Exercise 1: Chi-Square

Testing our data involves three steps:

Step A: Categorize the Results

Step B: Build a Model of Expected Data Representing the Null Hypothesis

Step C: Perform Chi-Square Test

“Chi_Square_2x2” Sheet

Exercise: Chi-Square
Part A: Categorize the Results Using CountifS()

“Chi_Square_2x2” Sheet

=CountifS(Criteria Range1, Criteria1, Criteria Range2, Criteria2)

1. =COUNTIFS($A:$A,E$3,$B:$B,$D4)

2. Copy and Total

Exercise: Chi-Square
Part B: Build a Model of Expected Data Representing the Null Hypothesis

“Chi_Square_2x2” Sheet

Use the same column and row totals as above. We had 44 males and 44 females, and 38 people liked red and 50 people liked white wine.

The Null Hypothesis states that there is no preference, so we need the male and female figures to reflect that. Each expected count is its column total times its row total divided by the grand total. In E11 type the formula below, then copy.

=E$13*$G11/$G$13

Exercise: Chi-Square
Part C: Determine Chi-Square using ChiSq.Test()

“Chi_Square_2x2” Sheet

=ChiSq.Test(Actual Range, Expected Range)

=ChiSq.Test(E4:F5,E11:F12)

.000107 is less than .05, indicating that the null hypothesis is weak and our results are not due to random chance. In other words, for the people who answered our survey, the assertion that more women preferred white wine while more men preferred red wine is valid and not due to chance.
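The three steps can be traced in Python for a 2x2 table. The observed counts below are hypothetical: they were chosen only to be consistent with the stated totals (44 men, 44 women, 38 red, 50 white) and are not taken from the worksheet. The p-value shortcut `erfc(sqrt(chi2 / 2))` is valid only for one degree of freedom, which is the 2x2 case.

```python
from math import erfc, sqrt

# Rows: men, women. Columns: red, white. Hypothetical counts.
observed = [[28, 16],
            [10, 34]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under the null hypothesis:
# row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# p-value for 1 degree of freedom (2x2 table only).
p_value = erfc(sqrt(chi2 / 2))

print(expected)
print(round(chi2, 3), round(p_value, 6))
```

Under the null hypothesis each group is expected to split 19 red to 25 white; the further the observed counts are from that split, the smaller the p-value.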

Correlation
Is there really a correlation between our X and Y variables?

Correlation

Is there really a correlation between our hardware sales and the number of new housing starts?

=Correl(Array1,Array2)

-1 -.75 -.5 -.25 0 .25 .5 .75 1

-1 = Inverse Correlation
0 = No Correlation
1 = Perfect Correlation

Forecast_Ex2 sheet

Correlation Exercise 1:
Determine if there really is a correlation between our hardware sales and new home start-ups.

Solution to Correlation Exercise 1
There is nearly a perfect correlation. New home start-ups are an excellent predictor of the sales we can expect in our hardware store.

Correl(Array1,Array2)

Is there a correlation between Interest rates and average home price in the last few years?

Is there a correlation between the age of an automobile and its average resale value?