Data Mining Beyond Adventure Works (Redmond WA 10/3/2009)

Data Mining beyond Adventure Works Mark Tabladillo Ph.D. http://marktab.net October 3, 2009

description

(Delivered at Redmond, WA, October 3, 2009) Microsoft provides excellent tutorials and information about data mining through the fictional Adventure Works demos. However, what happens when you stray off that neat-and-tidy path? Data miners should be concerned about data preparation, proper algorithm selection, and correct interpretation. This interactive experience will consist of succinct audience-participation demos to introduce some practical issues in real-world data mining.

Transcript of Data Mining Beyond Adventure Works (Redmond WA 10/3/2009)

Data Mining beyond Adventure Works

Mark Tabladillo Ph.D.

http://marktab.net

October 3, 2009

Approach of this Presentation

• Emphasize

– Conceptual value of data mining

– Relationship of data mining to the real world

• Reserve

– Specific procedures and mechanics

– Specific mathematics

– Production implementation

2 © 2009 Mark Tabladillo Ph.D.

Outline

• Data Mining Fundamentals

• Interactive Demos

• Conclusion


Interactive Demos

• Sports

• Government Forecasting


Data Mining Definitions

• Data mining is the automatic or semi-automatic process of exploring data for meaningful or useful patterns.

• Data mining algorithms typically use estimation or optimization to achieve results (as opposed to only calculations).


Microsoft Data Mining

• Microsoft Data Mining refers to Microsoft’s specific implementation of certain common data mining algorithms for the DMX (Data Mining Extensions) language.

• Also called SQL Server Data Mining, the technology is integrated into SQL Server rather than presented as an independent application.


Data Mining Tasks

• Supervised

– Answer known, what is correlated?

• Unsupervised

– Answer unknown (unspecified), what are the groups?

• Forecasting

– Given a trend, what is next?


List the Data Mining Algorithms

• Ten Answers

• Each one is a field of academic focus


The Data Mining Algorithms

• Microsoft Naive Bayes

• Microsoft Linear Regression

• Microsoft Decision Trees

• Microsoft Time Series

• Microsoft Clustering

• Microsoft Sequence Clustering

• Microsoft Association Rules

• Microsoft Neural Networks

• Microsoft Logistic Regression

• Text Mining

The Analyze Tab

Menu Option                      Data Mining Algorithm
-------------------------------  ---------------------
Analyze Key Influencers          Naïve Bayes
Detect Categories                Clustering
Fill from Example                Logistic Regression
Forecast                         Time Series
Highlight Exceptions             Clustering
Scenario Analysis (Goal Seek)    Logistic Regression
Scenario Analysis (What If)      Logistic Regression
Prediction Calculator            Logistic Regression
Shopping Basket Analysis         Association Rules

Demo One: National League Baseball

• Directions: You are on the management team for the Atlanta Braves. To better serve the team, you have been instructed by the owner to group the players by considering both their position and their salary.


Demo One: National League Baseball

• The following rules apply:

– You must make more than one group

– Each group must have at least two players

– Players of different positions may be in the same group
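
One way to approach this grouping exercise is with a clustering algorithm. The sketch below runs a small k-means on a hypothetical roster and then checks the two size rules from the slide. The player names, the salaries, the 0.5 weight on the pitcher flag, and the choice of three groups are all assumptions made for illustration. Plain k-means stands in here for a clustering algorithm in general; it is not meant to reproduce Microsoft Clustering.

```python
# Hypothetical roster: (name, position, salary in millions of dollars).
# Names and figures are invented for illustration, not real Braves data.
PLAYERS = [
    ("Player A", "pitcher", 12.0),
    ("Player B", "pitcher", 10.5),
    ("Player C", "pitcher", 0.5),
    ("Player D", "infield", 11.0),
    ("Player E", "infield", 0.6),
    ("Player F", "outfield", 0.4),
    ("Player G", "outfield", 9.5),
    ("Player H", "catcher", 0.7),
]

POSITION_WEIGHT = 0.5  # assumption: position counts half as much as salary


def to_features(players):
    """Encode each player as (weighted pitcher flag, salary scaled to 0..1)."""
    top = max(salary for _, _, salary in players)
    return [
        (POSITION_WEIGHT * (position == "pitcher"), salary / top)
        for _, position, salary in players
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def satisfies_rules(groups):
    """Slide rules: more than one group, at least two players per group."""
    return len(groups) > 1 and all(len(names) >= 2 for names in groups.values())


labels = kmeans(to_features(PLAYERS), k=3)
groups = {}
for (name, _, _), label in zip(PLAYERS, labels):
    groups.setdefault(label, []).append(name)

for label, names in sorted(groups.items()):
    print(label, names)
print("rules satisfied:", satisfies_rules(groups))
```

With these invented figures, the two highly paid pitchers form one group, the two highly paid position players form another, and the low-salary players of mixed positions share the third, which the rules allow.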


Demo One: National League Baseball

• Individual attributes can be used to make groups

• Historical statistics can be used to group new players

• Both supervised and unsupervised algorithms can be applied to the same data
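
To illustrate grouping new players from historical statistics, the sketch below treats group membership as a known answer and trains a nearest-centroid classifier on it. This is a simple supervised stand-in, not any specific Microsoft algorithm. The history, the group labels, and the feature encoding (a weighted pitcher flag plus scaled salary) are hypothetical.

```python
# Hypothetical labeled history: (position, salary in $M, group label).
# The labels play the role of the "known answer" in a supervised task.
HISTORY = [
    ("pitcher", 12.0, "high-paid pitchers"),
    ("pitcher", 10.5, "high-paid pitchers"),
    ("infield", 11.0, "high-paid position players"),
    ("outfield", 9.5, "high-paid position players"),
    ("pitcher", 0.5, "entry-level"),
    ("infield", 0.6, "entry-level"),
    ("outfield", 0.4, "entry-level"),
    ("catcher", 0.7, "entry-level"),
]

MAX_SALARY = 12.0      # assumption: used only to scale salary to 0..1
POSITION_WEIGHT = 0.5  # assumption: position counts half as much as salary


def features(position, salary):
    """Encode a player as (weighted pitcher flag, scaled salary)."""
    return (POSITION_WEIGHT * (position == "pitcher"), salary / MAX_SALARY)


def train(history):
    """Return one centroid (mean feature vector) per group label."""
    sums = {}
    for position, salary, label in history:
        x, y = features(position, salary)
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, position, salary):
    """Assign a new player to the group with the nearest centroid."""
    x, y = features(position, salary)
    return min(
        centroids,
        key=lambda label: (x - centroids[label][0]) ** 2
        + (y - centroids[label][1]) ** 2,
    )


centroids = train(HISTORY)
print(predict(centroids, "pitcher", 9.0))
print(predict(centroids, "catcher", 1.0))
```

The same table of players can feed both approaches: drop the label column and cluster it, or keep the labels and learn to predict them.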


Demo Two: Government Forecasting

• Directions: The President is asking your opinion on how the following numbers will increase over the next few months. Because this project is sensitive, you do not know what these numbers measure. However, based on the available history, make your best projection for the next six periods.


Demo Two: Government Forecasting

[Chart: monthly values, Jan 2007 through Dec 2008; vertical axis from 0 to 8]

Demo Two: Government Forecasting

[Chart: monthly values, Sep 2007 through Aug 2009; vertical axis from 0 to 12]

Demo Two: Government Forecasting

• Rapid response is as useful as prediction

• Seek intelligent correlations among related metrics

• Projections depend on time frame; modeling is continual
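
The dependence on time frame can be sketched with a least-squares linear trend fitted to two different windows of the same history and projected six periods ahead. The series below is hypothetical and is not the data plotted on the slides. A straight-line fit is a deliberately simple stand-in; it is not how Microsoft Time Series works.

```python
def fit_line(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(values, periods=6, window=None):
    """Fit a line to the last `window` values (all if None) and extend it."""
    recent = values if window is None else values[-window:]
    slope, intercept = fit_line(recent)
    start = len(recent)  # index of the first future period
    return [intercept + slope * (start + i) for i in range(periods)]


# Hypothetical 24-period history: flat for a year, then rising 0.5 per period.
HISTORY = [4.0] * 12 + [4.0 + 0.5 * i for i in range(1, 13)]

full_projection = project(HISTORY)              # fitted on all 24 periods
recent_projection = project(HISTORY, window=6)  # fitted on the last 6 only

for step, (full, recent) in enumerate(zip(full_projection, recent_projection), 1):
    print(f"period +{step}: full history {full:.2f}, last six {recent:.2f}")
```

With this invented series, the six-period window projects noticeably higher values than the full-history fit, because the full fit averages in the flat first year. Refitting as each new period arrives is one way to keep the modeling continual.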


Forecasting Algorithms

• Microsoft Time Series


Supervised Algorithms

• Microsoft Naive Bayes

• Microsoft Linear Regression

• Microsoft Decision Trees

• Microsoft Neural Networks

• Microsoft Logistic Regression


Unsupervised Algorithms

• Microsoft Clustering

• Microsoft Sequence Clustering

• Microsoft Association Rules

• Text Mining


Resources

• MarkTab.NET: links, video resources, and information for data mining

• Data Mining with Microsoft SQL Server 2008, by Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat

• Smart Business Intelligence Solutions with Microsoft® SQL Server® 2008 (PRO-Developer), by Lynn Langit and Matthew Roche


Regroup and Conclusion

• Main Points from this Presentation


Contact Information

• Mark Tabladillo, Twitter @marktabnet

• Also on: LinkedIn, Facebook


Bonus: Sequence Clustering Ideas

• Trading players in professional sports

• Assigning players to certain positions

• Moving from city to city

• Store path at the mall

• Cancer treatment path

• Taking up a musical instrument

• Taking up sports

• Blogging

• Viral news
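
Each idea above involves ordered steps, and sequence clustering builds on how often one step follows another. The sketch below estimates first-order transition probabilities from a few hypothetical store paths at a mall. It shows only that transition-counting building block, not the clustering of sequences, and the store names and paths are invented.

```python
from collections import Counter, defaultdict

# Hypothetical store paths at a mall, one list per shopper.
PATHS = [
    ["entrance", "food court", "shoes", "exit"],
    ["entrance", "shoes", "food court", "exit"],
    ["entrance", "food court", "shoes", "exit"],
]


def transition_probabilities(sequences):
    """Estimate P(next step | current step) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each step with the one that follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {
            nxt: count / sum(followers.values())
            for nxt, count in followers.items()
        }
        for state, followers in counts.items()
    }


probabilities = transition_probabilities(PATHS)
for state, row in probabilities.items():
    print(state, row)
```

Sequences whose transitions look alike could then be grouped together, for example shoppers who head to the food court first versus those who shop first.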
