Data Mining Project


Decision Mining

Introduction:

Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting their meaning. It is the extraction of hidden, predictive patterns from large databases. Data mining is especially useful nowadays, when the amount of data is massive and identifying the useful portions of it can be a tedious job in itself.

With data mining we can try to predict future trends rather than identify them after they have already taken place. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. XLMiner is an add-in for MS Excel that allows us to perform data mining on data sets.

Problem Solving:

In this project we address Southwest's problem of calculating air fares for newly introduced airports and providing discounts for its customers.

Using the dataset, we explore FARE by creating correlation tables.

We explore the categorical predictors by computing the percentage of flights in each category, using pivot tables.

We compare stepwise regression, used to reduce the number of predictors, with exhaustive search, to find the best model in terms of predictors.

We predict the average fare on a route using exhaustive search.

We compare the predictive accuracy of the models.

Data Description :

The dataset “Airfares” was provided by the professor.

It contains real data collected for the third quarter of 1996, a period in which several new airports opened in major cities, opening the market for new routes.

The dataset has a total of 639 records; each record consists of 18 attributes or variables.

In order to price flights on these routes, a major airline collected information on 638 air routes in the United States.

Some factors are known about these new routes. A major unknown factor is whether Southwest or another discount airline will travel on these new routes.

Southwest's strategy of covering only major cities, using secondary airports, keeping a standardized fleet, and charging low fares has been very different from the model followed by the older and bigger airlines.

The presence of discount airlines is therefore believed to reduce fares greatly.

Data Code:

S_CODE: Starting airport's code

S_CITY: Starting city

E_CODE: Ending airport's code

E_CITY: Ending city

COUPON: Average number of coupons for that route (a one-coupon flight is a non-stop flight, a two-coupon flight is a one-stop flight, etc.)

NEW: Number of new carriers entering that route between Q3-96 and Q2-97

VACATION: Whether a vacation route (Yes) or not (No)

SW: Whether Southwest Airlines serves that route (Yes) or not (No)

HI: Herfindahl index, a measure of market concentration

S_INCOME: Starting city's average personal income

E_INCOME: Ending city's average personal income

S_POP: Starting city's population

E_POP: Ending city's population

SLOT: Whether either endpoint airport is slot controlled or not; this is a measure of airport congestion

GATE: Whether either endpoint airport has gate constraints or not; this is another measure of airport congestion

DISTANCE: Distance between the two endpoint airports in miles

PAX: Number of passengers on that route during the period of data collection

FARE (the response): Average fare on that route

Excel sheet: Airfares.xls (complete data)

Step-1: To find out the best numerical predictor.

A scatter plot is used to examine the correlation between two characteristics, where correlation means that as one variable changes, the other tends to change as well. A scatter plot may suggest a cause-and-effect relationship between the two characteristics, but it does not prove one: a third characteristic (or more) may be affecting both characteristics of interest.

When we know that there is good correlation between two characteristics, we can use one variable to predict the other, particularly if one characteristic is easy to measure and the other isn't. For example, if we show that weight gain in the first trimester of pregnancy correlates well with fetal development, we can use weight gain as a predictor. The alternative would be expensive tests to monitor the actual development of the fetus.

Scatter plots of each numerical predictor (horizontal axis) against the response FARE (vertical axis, $0.00 to $400.00):

Scatter Plot of Numerical Predictor (New) Vs Response (Fare)

Scatter Plot of Numerical Predictor (HI) Vs Response (Fare)

Scatter Plot of Numerical Predictor (S_Income) Vs Response (Fare)

Scatter Plot of Numerical Predictor (E_Income) Vs Response (Fare)

Scatter Plot of Numerical Predictor (S_POP) Vs Response (Fare)

Scatter Plot of Numerical Predictor (E_POP) Vs Response (Fare)

Scatter Plot of Numerical Predictor (Distance) Vs Response (Fare)

Scatter Plot of Numerical Predictor (Pax) Vs Response (Fare)

By plotting each numerical predictor against the response (Fare), we can say that the numerical variable Distance has the strongest correlation with the response (Fare). Therefore, we consider “Distance” the best numerical predictor.
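The visual comparison of the scatter plots can also be checked numerically. The following is a minimal sketch, assuming the Pearson correlation coefficient is the measure of interest; the `routes` values are hypothetical toy numbers and not rows from Airfares.xls, and the function names are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy routes, NOT taken from the Airfares data.
routes = {
    "DISTANCE": [300, 600, 900, 1500, 2400],
    "PAX": [9000, 4000, 12000, 5000, 7000],
    "FARE": [80.0, 120.0, 150.0, 230.0, 330.0],
}

def rank_predictors(data, response="FARE"):
    """Return (name, r) pairs sorted by |r| with the response, strongest first."""
    scores = [(name, pearson(col, data[response]))
              for name, col in data.items() if name != response]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

ranking = rank_predictors(routes)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

Ranking by the absolute value of r treats strong negative and strong positive relationships alike, which matches the goal of finding the predictor that moves most consistently with Fare.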

Step-2: To find the categorical predictor.

A pivot table is a reporting tool that sorts and sums data independently of the original layout in the spreadsheet, by dragging and dropping columns into different row, column, or summary positions. It can automatically sort, count, total, or average the data stored in one table or spreadsheet. The result is a summarized view of the data, shown in the pivot table.

A pivot table is a flexible tool for displaying data, rearranging it in a variety of ways. A report generated with a pivot table is easy to use: the reader can decide what to look at, and from which perspective, just by dragging and dropping fields graphically. A pivot table can also switch the placement of rows and columns graphically, which automatically reorganizes the data.

Pivot Table:

VACATION  SW   SLOT        GATE         Average of FARE

No        No   Controlled  Constrained  $214.62

No        No   Controlled  Free         $170.07

No        No   Controlled Result        $178.64

No        No   Free        Constrained  $195.95

No        No   Free        Free         $181.43

No        No   Free Result              $186.90

No        No Result                     $184.64

No        Yes  Controlled  Constrained  $74.28

No        Yes  Controlled  Free         $91.71

No        Yes  Controlled Result        $89.97

No        Yes  Free        Constrained  $136.09

No        Yes  Free        Free         $106.02

No        Yes  Free Result              $108.95

No        Yes Result                    $105.23

No Result                               $156.90

Yes       No   Controlled  Free         $154.74

Yes       No   Controlled Result        $154.74

Yes       No   Free        Constrained  $138.19

Yes       No   Free        Free         $142.70

Yes       No   Free Result              $142.01

Yes       No Result                     $143.07

Yes       Yes  Controlled  Free         $138.85

Yes       Yes  Controlled Result        $138.85

Yes       Yes  Free        Free         $101.83

Yes       Yes  Free Result              $101.83

Pivot Table of the Categorical Variables (Vacation, SW, Slot, Gate) and Response (Fare)

After building the pivot table by dragging and dropping fields, we take the variable with the highest total across its categories as the best categorical variable. On that basis, we can say that Slot is the best categorical variable in relation to Fare.
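The pivot-table summary (average Fare per combination of categories) and the percentage of flights in each category can be sketched outside Excel as well. This is only an illustration under stated assumptions: the `flights` records are hypothetical toy values, not rows from Airfares.xls, and `pivot_mean` and `category_share` are names invented for this sketch.

```python
from collections import defaultdict

# Hypothetical toy records, NOT taken from the Airfares data.
flights = [
    {"SW": "No", "SLOT": "Controlled", "FARE": 210.0},
    {"SW": "No", "SLOT": "Free", "FARE": 180.0},
    {"SW": "Yes", "SLOT": "Free", "FARE": 100.0},
    {"SW": "Yes", "SLOT": "Free", "FARE": 110.0},
]

def pivot_mean(rows, keys, value="FARE"):
    """Average `value` for every combination of the fields listed in `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {combo: sum(vals) / len(vals) for combo, vals in groups.items()}

def category_share(rows, key):
    """Percentage of rows that fall in each category of `key`."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return {cat: 100.0 * n / len(rows) for cat, n in counts.items()}

fare_by_sw = pivot_mean(flights, ["SW"])
fare_by_sw_slot = pivot_mean(flights, ["SW", "SLOT"])
sw_share = category_share(flights, "SW")
print(fare_by_sw, fare_by_sw_slot, sw_share)
```

Passing one field to `pivot_mean` corresponds to a "Result" (subtotal) row of the pivot table; passing several fields corresponds to the detail rows.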

Step-3: To find the best suited model

Multiple linear regression (MLR) analysis is a model used to find the linear relationship between a quantitative dependent variable and a set of predictors. The choice of modeling steps and of performance assessment depends on the goal of the analysis.

MLR Excel sheet (complete MLR data)

From the above result set we can say that using predictors that are uncorrelated with the dependent variable increases the variance of predictions. Therefore, we try to drop such predictors, keeping in mind that dropping a predictor that is actually correlated with the dependent variable can increase the average error of predictions.
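For reference, the linear relationship described above can be written out as follows. The notation is an assumption made for this sketch (n routes, p predictors x_{i1}, ..., x_{ip} for route i, coefficients \beta_j, error term \varepsilon_i), and ordinary least squares is assumed as the fitting criterion.

```latex
\mathrm{FARE}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \Bigl( \mathrm{FARE}_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```

Each additional predictor adds one coefficient to estimate, which is why including predictors unrelated to Fare adds variance to the predictions without reducing the error.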

Step-4: To find result set of Exhaustive Search

Exhaustive search evaluates all possible subsets of predictors, which is practical for a moderate number of predictors. To avoid the artificial increase in R2 that results simply from increasing the number of predictors, the subsets are compared on criteria that penalize model size. Another such criterion is Mallows' Cp, which assumes that the full model is unbiased and looks for a subset with fewer predictors that still gives the best result.
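The idea of exhaustive search can be sketched in a few lines. This is a minimal illustration, not XLMiner's implementation: it assumes ordinary least squares with an intercept, uses adjusted R2 as the size-penalizing criterion alongside Mallows' Cp, and runs on hypothetical toy values rather than the Airfares data. All function names are invented for the sketch.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(columns, y):
    """Sum of squared errors of a least-squares fit of y on `columns` plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((yi - sum(c * v for c, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def exhaustive_search(data, y):
    """Score every non-empty predictor subset by adjusted R^2 and Mallows' Cp."""
    n, names = len(y), list(data)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    # Error variance estimated from the full model (assumed unbiased).
    sigma2_full = sse([data[name] for name in names], y) / (n - len(names) - 1)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            err = sse([data[name] for name in subset], y)
            adj_r2 = 1 - (err / (n - size - 1)) / (sst / (n - 1))
            cp = err / sigma2_full + 2 * (size + 1) - n
            results.append((subset, adj_r2, cp))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Hypothetical toy values, NOT taken from the Airfares data.
predictors = {
    "DISTANCE": [300, 500, 800, 1000, 1400, 1800, 2200, 2600],
    "PAX": [9000, 4000, 12000, 5000, 7000, 3000, 11000, 6000],
    "NEW": [0, 1, 0, 2, 1, 3, 0, 2],
}
fare = [85.0, 102.0, 148.0, 160.0, 215.0, 255.0, 310.0, 352.0]

results = exhaustive_search(predictors, fare)
for subset, adj_r2, cp in results:
    print(subset, round(adj_r2, 4), round(cp, 2))
```

With p candidate predictors there are 2^p - 1 non-empty subsets to fit, which is why the approach is only practical for a moderate p. A subset is usually considered good when its Cp is close to its number of coefficients (predictors plus intercept).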

Excel sheet of the logistic regression (LR) model and its output (complete LR data)

Step-5: Deciding the best suited model

Choosing the modeling process to be used in the analysis is a very important step, because a good predictive model may fit the data only loosely, while a good explanatory model may have low predictive accuracy. We treat a model as good when it uses a minimum number of predictors. R2 rises as the number of predictors increases, so a higher R2 does not by itself indicate better performance of the model.

Comparison of Lift and Decile Charts (MLR and LR)

MLR, Lift chart (training dataset): cumulative FARE when sorted using predicted values vs. cumulative FARE using average. Horizontal axis: # cases; vertical axis: Cumulative.

MLR, Decile-wise lift chart (training dataset). Horizontal axis: Deciles; vertical axis: Decile mean / Global mean.

LR, Lift chart (training dataset): cumulative SW when sorted using predicted values vs. cumulative SW using average. Horizontal axis: # cases; vertical axis: Cumulative.

LR, Decile-wise lift chart (training dataset). Horizontal axis: Deciles; vertical axis: Decile mean / Global mean.

Therefore, by comparing the RMSE, average errors, lift charts and decile-wise lift charts of both models on the training dataset, we conclude that the model with the minimum number of predictors performs best.

The RMSE and average error values are smaller for the logistic regression model than for the MLR output. We also see from the lift charts that the LR output maintains a linear structure.

Finally, we state that the logistic regression model is the best model for calculating Southwest airfares for newly introduced airports.
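The measures used in this comparison can be computed as in the sketch below. It assumes that "average error" means the mean of the signed residuals (actual minus predicted) and that the decile-wise lift is the mean actual value in each group, sorted by descending prediction, divided by the global mean. The `actual` and `predicted` values are hypothetical toy numbers, not output from the MLR or LR sheets.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def average_error(actual, predicted):
    """Mean of the signed residuals (actual minus predicted)."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def decile_lift(actual, predicted, bins=10):
    """Mean actual value per bin (cases sorted by descending prediction) over the global mean.

    Requires at least `bins` cases so that no bin is empty.
    """
    n = len(actual)
    order = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    global_mean = sum(actual) / n
    lifts = []
    for b in range(bins):
        idx = order[b * n // bins:(b + 1) * n // bins]
        lifts.append(sum(actual[i] for i in idx) / len(idx) / global_mean)
    return lifts

# Hypothetical toy values, NOT taken from the project's result sheets.
actual = [100.0, 150.0, 200.0, 250.0, 300.0, 120.0, 180.0, 220.0, 280.0, 200.0]
predicted = [110.0, 140.0, 210.0, 240.0, 290.0, 130.0, 170.0, 230.0, 270.0, 210.0]

model_rmse = rmse(actual, predicted)
model_avg_error = average_error(actual, predicted)
lifts = decile_lift(actual, predicted)
print(model_rmse, model_avg_error, lifts)
```

A model whose first deciles have lift well above 1 and whose last deciles fall well below 1 is ranking the cases usefully. These measures are only comparable between two models when both predict the same response variable.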

Conclusion:

We conclude that data mining techniques are among the best techniques for finding the model best suited to a particular requirement. The software tool is easy to use and is available on the market for obtaining the best result set. By comparing the result sets obtained from the two models, we conclude that logistic regression (exhaustive search) is the best suited model for this particular problem.