Data Analysis.ppt
7/29/2019 Data Analysis.ppt
1/28
Data Analysis
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
-
Three Types of Analysis
We can classify analysis into three types:
1. Univariate, involving a single variable at a time,
2. Bivariate, involving two variables at a time, and
3. Multivariate, involving three or more variables simultaneously.
-
Revision: Application Areas: Correlation
1. Correlation and regression are generally performed together. The application of correlation analysis is to measure the degree of association between two sets of quantitative data. The correlation coefficient measures this association. It ranges from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).
-
2. For example, how are sales of product A correlated with sales of product B? Or, how is the advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?
-
3. Correlation does not necessarily mean there is a causal effect. Given any two strings of numbers, there will be some correlation among them. It does not imply that one variable is causing a change in another, or is dependent upon another.
4. Correlation is usually followed by regression analysis in many applications.
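As a rough illustration of the correlation coefficient described above, the sketch below computes Pearson's r from its definition for the SALES and POTENTL columns of the worked example that appears later in this deck. The function name pearson_r is an assumption made for illustration, and the figures are transcribed from the deck's data file.

```python
import math

# SALES and POTENTL for the 15 territories, transcribed from the deck's data file.
sales = [5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81]
potential = [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150]


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(sales, potential)
print(f"r(SALES, POTENTL) = {r:.3f}")
```

A value near +1 or -1 only says the two columns move together; as point 3 notes, it says nothing about which one, if either, causes the other.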
-
Application Areas: Regression
1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).
2. The application areas are in explaining variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all the above variables.
3. If there is only one dependent variable and one independent variable is used to explain the variation in it, then the model is known as a simple regression model.
-
4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.
5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.
-
The general regression model (linear) is of the type
Y = b0 + b1x1 + b2x2 + ... + bnxn
(OR Y = a + b1x1 + b2x2 + ... + bnxn)
where
Y is the dependent variable,
x1, x2, x3, ..., xn are the independent variables expected to be related to Y and expected to explain or predict Y,
b1, b2, b3, ..., bn are the coefficients of the respective independent variables, which will be estimated from the input data.
-
Purposes of Regression Analysis
To establish the relationship between a dependent variable (outcome) and a set of independent (explanatory) variables
To identify the relative importance of the different independent (explanatory) variables on the outcome
To make predictions
-
Steps of Regression Analysis
Step 1: Construct a regression model
Step 2: Estimate the regression and interpret the result
Step 3: Conduct diagnostic analysis of the results
Step 4: Change the original regression model if necessary
Step 5: Make predictions
-
DATA (INPUT / OUTPUT)
1. Input data on y and each of the x variables is required to do a regression analysis. This data is input into a computer package to perform the regression analysis.
2. The output consists of the b coefficients for all the independent variables in the model. It also gives the results of a t test for the significance of each variable in the model, and the results of the F test for the model on the whole.
-
3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95%), the coefficient of determination or R2 of the model is an important part of the output. The R2 value is the percentage (or proportion) of the total variance in y explained by all the independent variables in the regression equation.
-
Requirements for applying Multiple regression analysis
1. The variables used (independent and dependent) are assumed to be either interval scaled or ratio scaled.
2. Nominally scaled variables can be used as independent variables in a regression model, with dummy variable coding.
3. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be the technique used instead of regression.
4. Dependent variable: essentially metric. Independent variables: metric or dummy.
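Dummy variable coding (point 2 above) can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the helper dummy_code and the "region" variable are hypothetical and do not appear in the slides. A nominal variable with k levels is turned into k-1 columns of 0/1 values, with one level left out as the baseline so that the dummies are not perfectly collinear with the intercept.

```python
def dummy_code(values, baseline):
    """Turn a nominal variable into 0/1 indicator columns, omitting the baseline level."""
    levels = sorted(set(values) - {baseline})
    return {f"is_{level}": [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical nominal variable: region of each sales territory (not from the slides).
region = ["north", "south", "east", "north", "east"]
dummies = dummy_code(region, baseline="north")
print(dummies)
```

Each resulting column can then be entered in the regression like any metric independent variable; its coefficient is read as the difference from the baseline level.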
-
Worked Example: Problem
A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.
-
The data are for a particular year, in different sales territories in which the company operates, and the variables on which data are collected are as follows:
-
Dependent Variable
Y = sales in Rs. lakhs in the territory
Independent Variables
X1 = market potential in the territory (in Rs. lakhs).
X2 = No. of dealers of the company in the territory.
X3 = No. of salespeople in the territory.
X4 = Index of competitor activity in the territory on a 5 point scale (1 = low, 5 = high level of activity by competitors).
X5 = No. of service people in the territory.
X6 = No. of existing customers in the territory.
The following slide gives the data file:
-
          1        2        3       4       5        6       7
      SALES  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM
   1      5       25        1       6       5        2      20
   2     60      150       12      30       4        5      50
   3     20       45        5      15       3        2      25
   4     11       30        2      10       3        2      20
   5     45       75       12      20       2        4      30
   6      6       10        3       8       2        3      16
   7     15       29        5      18       4        5      30
   8     22       43        7      16       3        6      40
   9     29       70        4      15       2        5      39
  10      3       40        1       6       5        2       5
  11     16       40        4      11       4        2      17
  12      8       25        2       9       3        3      10
  13     18       32        7      14       3        4      31
  14     23       73       10      10       4        3      43
  15     81      150       15      35       4        7      70
-
Regression
We will first run the regression model of the following form, by entering all the 6 'x' variables in the model:
Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1
[OR Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1]
and determine the values of b0, b1, b2, b3, b4, b5 and b6.
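One way to estimate Equation 1 is ordinary least squares. The sketch below assumes NumPy is available and uses its least-squares solver; the deck itself used a statistics package, so this is a stand-in rather than the author's procedure. The data are transcribed from the deck's data file, and if the transcription is faithful the printed figures should come out close to the results the deck reports.

```python
import numpy as np

# Columns: SALES, POTENTL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM (15 territories),
# transcribed from the deck's data file.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20],
    [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25],
    [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30],
    [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30],
    [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39],
    [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17],
    [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31],
    [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)

y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1:]])  # leading column of ones for b0

# Ordinary least squares: choose b to minimise the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b
ss_total = np.sum((y - y.mean()) ** 2)
ss_residual = np.sum(residuals ** 2)
r_squared = 1 - ss_residual / ss_total

names = ["b0", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]
for name, coef in zip(names, b):
    print(f"{name:8s} {coef: .5f}")
print(f"R-squared = {r_squared:.4f}")
```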
-
MULTIPLE REGRESSION RESULTS:
All independent variables were entered in one block
Dependent Variable: SALES
Multiple R: .988531605
Multiple R-Square: .977194734
Adjusted R-Square: .960090784
Number of cases: 15
-
The ANOVA Table
STAT. MULTIPLE REGRESS.
Analysis of Variance; Depen. Var: SALES (regdata1.sta)

Effect     Sums of Squares   df   Mean Squares   F          p-level
Regress.   6609.484          6    1101.581       57.13269   .000004
Residual   154.249           8    19.281
Total      6763.733

From the analysis of variance table, the last column indicates the p-level to be 0.000004. This indicates that the model is statistically significant at a confidence level of (1-0.000004)*100 or (0.999996)*100, or 99.9996 percent.
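The entries of the ANOVA table are linked by simple arithmetic, and the sketch below recomputes them from the reported sums of squares and degrees of freedom. The variable names are chosen for illustration; the formulas are the standard definitions of mean square, F ratio, R-square, adjusted R-square and standard error of estimate.

```python
import math

# Sums of squares and degrees of freedom as reported in the ANOVA table.
ss_regression, df_regression = 6609.484, 6
ss_residual, df_residual = 154.249, 8
ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1  # 15 cases

# Mean square = sum of squares / degrees of freedom; F is the ratio of the two.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

# R-square is the share of total variation explained by the regression.
r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
std_error_of_estimate = math.sqrt(ms_residual)

print(f"F({df_regression},{df_residual}) = {f_ratio:.3f}")
print(f"R-square = {r_squared:.6f}, adjusted = {adjusted_r_squared:.6f}")
print(f"Std. error of estimate = {std_error_of_estimate:.4f}")
```

These agree, up to rounding of the printed sums of squares, with the Multiple R-Square, Adjusted R-Square and F values reported in the output.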
-
STAT. MULTIPLE REGRESS.
Regression Summary for Dependent Variable: SALES
R = .98853160   R2 = .97719473   Adjusted R2 = .96009078
F(6,8) = 57.133   p < .00000   Std. Error of Estimate: 4.3910
N = 15

            BETA       St. Err. of BETA   B          St. Err. of B   t(8)       p-level
Intercept                                 -3.1729    5.813394        -.54581    .600084
POTENTL     .439073    .144411            .22685     .074611         3.04044    .016052
DEALERS     .164315    .126591            .81938     .631266         1.29800    .230457
PEOPLE      .413967    .158646            1.09104    .418122         2.60937    .031161
COMPET      -.084871   .060074            -1.89270   1.339712        -1.41276   .195427
SERVICE     -.040806   .116511            -.54925    1.568233        -.35024    .735204
CUSTOM      .050490    .149302            .06594     .095002         .33817     .743935
-
Column 4 of the table, titled B, lists all the coefficients for the model. These are:
a (intercept) = -3.17298
b1 = .22685
b2 = .81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594
Substituting these values of a, b1, b2, ..., b6 in equation 1, we can write the equation (rounding off all coefficients to 2 decimals) as
-
Sales = -3.17 + .23 (potential) + .82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)
[Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1]
The estimated change in sales for every unit increase in an independent variable is given by the coefficient of that variable. For instance, if the number of salespeople is increased by 1, sales in Rs. lakhs are estimated to increase by 1.09, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by 0.82 lakh, if all other variables are unchanged.
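The "all other variables unchanged" reading of a coefficient can be checked directly. The sketch below plugs territory 1 from the data file into the rounded equation, then adds one salesperson and compares the two estimates. The function and dictionary key names are assumptions made for illustration.

```python
# Rounded coefficients from the fitted equation above.
COEFFICIENTS = {
    "potential": 0.23,
    "dealers": 0.82,
    "salespeople": 1.09,
    "competitor_activity": -1.89,
    "service_people": -0.55,
    "existing_customers": 0.07,
}
INTERCEPT = -3.17


def predict_sales(territory):
    """Estimated sales (Rs. lakhs) for a territory given as a dict of the six variables."""
    return INTERCEPT + sum(COEFFICIENTS[name] * territory[name] for name in COEFFICIENTS)


# Territory 1 from the data file, then the same territory with one more salesperson.
base = {"potential": 25, "dealers": 1, "salespeople": 6,
        "competitor_activity": 5, "service_people": 2, "existing_customers": 20}
one_more = dict(base, salespeople=base["salespeople"] + 1)

change = predict_sales(one_more) - predict_sales(base)
print(f"Change in estimated sales: {change:.2f} lakhs")
```

Whatever starting territory is used, the difference equals the salespeople coefficient, because the model is linear and every other term cancels.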
-
The SERVICE variable does not make too much intuitive sense. If we increase the number of service people, sales are estimated to decrease according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).
Now, looking at the individual variable t tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.
Strictly speaking, only two variables, potential (POTENTL) and No. of salespeople (PEOPLE), are significant statistically at 90 percent confidence level since their p-level is less than 0.10. One should ...
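The screening described on this slide amounts to comparing each p-level with the chosen cut-off. A minimal sketch, using the p-levels from the regression summary table and a cut-off of 0.10 for the 90 percent confidence level (the names p_levels and ALPHA are illustrative):

```python
# p-levels of the individual t tests, taken from the regression summary table.
p_levels = {
    "POTENTL": 0.016052,
    "DEALERS": 0.230457,
    "PEOPLE": 0.031161,
    "COMPET": 0.195427,
    "SERVICE": 0.735204,
    "CUSTOM": 0.743935,
}

ALPHA = 0.10  # 90 percent confidence level

# Keep only the variables whose coefficient is significant at the chosen level.
significant = [name for name, p in p_levels.items() if p < ALPHA]
print(significant)
```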
-
Different modes of entering independent variables in the model
Enter
Forward stepwise regression
Backward stepwise regression
Stepwise regression
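To give a feel for forward entry, the sketch below adds one variable at a time and keeps going only while adjusted R-square improves. This stopping rule is an assumption chosen for simplicity: statistics packages usually decide entry with F-to-enter or p-level thresholds, so the variables selected here need not match the package output or the deck's final model. NumPy is assumed to be available, and the data are transcribed from the deck's data file.

```python
import numpy as np

NAMES = ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# Columns: SALES followed by the six independent variables (15 territories).
data = np.array([
    [5, 25, 1, 6, 5, 2, 20], [60, 150, 12, 30, 4, 5, 50], [20, 45, 5, 15, 3, 2, 25],
    [11, 30, 2, 10, 3, 2, 20], [45, 75, 12, 20, 2, 4, 30], [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30], [22, 43, 7, 16, 3, 6, 40], [29, 70, 4, 15, 2, 5, 39],
    [3, 40, 1, 6, 5, 2, 5], [16, 40, 4, 11, 4, 2, 17], [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31], [23, 73, 10, 10, 4, 3, 43], [81, 150, 15, 35, 4, 7, 70],
], dtype=float)
y, X_all = data[:, 0], data[:, 1:]


def adjusted_r_squared(columns):
    """Fit y on an intercept plus the given columns and return adjusted R-square."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in columns])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ b) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - len(columns) - 1)


def forward_select():
    """Greedy forward entry: add the best remaining variable while the score improves."""
    selected, best = [], 0.0
    while len(selected) < len(NAMES):
        remaining = [j for j in range(len(NAMES)) if j not in selected]
        scores = {j: adjusted_r_squared(selected + [j]) for j in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no remaining variable improves the model
        selected.append(candidate)
        best = scores[candidate]
    return [NAMES[j] for j in selected], best


chosen, score = forward_select()
print(chosen, round(score, 4))
```

Backward entry works the other way round: start with all six variables and drop the least useful one at each step.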
-
The final model
Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE) ... Equation 3
Predictions:
If potential in a territory were to be Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be
= -10.6164 + .2433(50) + 1.4244(6)
= 10.095 lakhs.
Similarly, we could use this model to make predictions regarding sales in any territory for which Potential and No. of Sales People were known.
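The prediction above can be wrapped in a small function so that it can be repeated for any territory. The function name and argument names are illustrative; the coefficients are those of Equation 3.

```python
def predict_sales(potential, salespeople):
    """Expected sales in Rs. lakhs from Equation 3 of the deck."""
    return -10.6164 + 0.2433 * potential + 1.4244 * salespeople


# Territory with potential of Rs. 50 lakhs and 6 salespeople, as in the slide.
expected = predict_sales(potential=50, salespeople=6)
print(f"{expected:.3f} lakhs")  # 10.095 lakhs
```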
-
Recommended usage
1. It is recommended that for serious decision-making, there has to be a-priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.
2. For exploratory research, the hit-and-trial approach may be used.
3. It is also recommended that unless the model is itself significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R value should not be interpreted.
-
Multicollinearity and how to tackle it
Multicollinearity: interrelationship of the various independent variables
It is essential to verify whether independent variables are highly correlated with each other. If they are, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.
Independent variables which are highly correlated with each other should not be included in the model together.
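A simple first check for multicollinearity is the correlation matrix of the independent variables. The sketch below assumes NumPy is available and flags pairs whose correlation exceeds 0.8 in absolute value; that cut-off is an assumption for illustration, not a figure from the deck. The data are the six independent variables from the deck's data file.

```python
import numpy as np

NAMES = ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# The six independent variables for the 15 territories.
X = np.array([
    [25, 1, 6, 5, 2, 20], [150, 12, 30, 4, 5, 50], [45, 5, 15, 3, 2, 25],
    [30, 2, 10, 3, 2, 20], [75, 12, 20, 2, 4, 30], [10, 3, 8, 2, 3, 16],
    [29, 5, 18, 4, 5, 30], [43, 7, 16, 3, 6, 40], [70, 4, 15, 2, 5, 39],
    [40, 1, 6, 5, 2, 5], [40, 4, 11, 4, 2, 17], [25, 2, 9, 3, 3, 10],
    [32, 7, 14, 3, 4, 31], [73, 10, 10, 4, 3, 43], [150, 15, 35, 4, 7, 70],
], dtype=float)

corr = np.corrcoef(X, rowvar=False)  # 6 x 6 matrix of pairwise correlations

THRESHOLD = 0.8  # assumed cut-off for "highly correlated"; not from the deck
flagged = [
    (NAMES[i], NAMES[j], corr[i, j])
    for i in range(len(NAMES))
    for j in range(i + 1, len(NAMES))
    if abs(corr[i, j]) > THRESHOLD
]
for first, second, r in flagged:
    print(f"{first} and {second}: r = {r:.2f}")
```

Any pair that is flagged is a candidate for keeping only one of the two variables in the model.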