Data Mining David Klein

Data Mining

David Klein & Adam Cogan

Admin Stuff

• Attendance
– You initial the sheet

• Hands-on Lab
– You get me to initial the sheet

• Certificate
– At the end of 10 sessions
– If I say you have completed successfully

About

• David Klein is a Senior Software Architect at SSW, specialising in .NET, SQL Server and BI solutions
– Current clients: Sally Knox Medical & Pisces

• Adam Cogan is Chief Architect at SSW and one of two Microsoft Regional Directors in Australia, specialising in Office, SQL and .NET solutions

Course Overview

The 5 Sessions (Part B)

1. SSIS and Creating a Data Warehouse
2. Creating a Cube and Cube Issues
3. Reporting Services
4. Other Cube Browsers
5. Data Mining

http://www.ssw.com.au/ssw/events/2006SQL/


Session 5: Tonight’s Agenda

1. Why Data Mining?
2. Uses
3. Algorithms
4. Implementation
– SQL Server Management Studio (SSMS)
– Reporting Services
– SQL Server Integration Services
– DMX
5. Demo?
6. Hands-on Lab

Why Data Mining?

• Marketing
– Who picks the movie? The kids, the wife, me.
– Who are our customers and what sort of films do they hire?
– Is a 30-year-old woman with 2 children going to hire Arnie’s latest film?

• Validation
– Is this data sensible? Terminator 2 and Toy Story

• Prediction
– Sales next year

Complete Set Of Algorithms

• Decision Trees
• Clustering
• Time Series
• Sequence Clustering
• Association
• Naïve Bayes
• Neural Nets

Naïve Bayes

• Quickly builds mining models that can be used for classification and prediction

• It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute
– This can later be used to predict an outcome of the predicted attribute based on the known input attributes

• This makes the model a good option for exploring the data
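
The probability counting described above can be sketched in a few lines of Python. This is an illustrative sketch only, not how SQL Server implements the algorithm: the rental records, the attribute names (age, kids, movie) and the add-one smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical rental history; "movie" is the predictable attribute.
cases = [
    {"age": "adult", "kids": "yes", "movie": "Toy Story"},
    {"age": "adult", "kids": "yes", "movie": "Toy Story"},
    {"age": "teen", "kids": "yes", "movie": "Toy Story"},
    {"age": "adult", "kids": "no", "movie": "Terminator 2"},
    {"age": "teen", "kids": "no", "movie": "Terminator 2"},
    {"age": "teen", "kids": "no", "movie": "Terminator 2"},
]

def train_naive_bayes(cases, target):
    """Count each input state against each state of the predictable attribute."""
    class_counts = Counter(case[target] for case in cases)
    state_counts = defaultdict(Counter)  # (attribute, class) -> input state -> count
    states = defaultdict(set)            # attribute -> all states seen in training
    for case in cases:
        for attr, value in case.items():
            if attr != target:
                state_counts[(attr, case[target])][value] += 1
                states[attr].add(value)
    return class_counts, state_counts, states

def predict(model, inputs):
    """Return a normalised probability for each state of the predictable attribute."""
    class_counts, state_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of this class
        for attr, value in inputs.items():
            # P(input state | class), with add-one smoothing (an assumption here)
            count = state_counts[(attr, cls)][value]
            score *= (count + 1) / (n + len(states[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

model = train_naive_bayes(cases, "movie")
result = predict(model, {"age": "adult", "kids": "yes"})
print(max(result, key=result.get))
```

With this toy data, an adult with kids comes out as far more likely to hire Toy Story, because every case with kids in the training set hired it.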

Naïve Bayes – Toy Story 2

Decision Trees (1)

• Decision Trees assign (classify) each case to one of a few discrete, broad categories of a selected attribute (variable) and explain the classification with a few selected input variables

• The building process is recursive partitioning: splitting the data into partitions, then splitting those partitions further

• Initially all cases are in one big box

Decision Trees (2)

• The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions the data into the purest classes of the searched variable
– Several measures of purity exist

• Then it repeats the splitting for each new class
– Again testing all possible breaks

• Branches that are not useful can be pre-pruned or post-pruned
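
One step of the split search can be sketched as follows. This is a minimal illustration, assuming Gini impurity as the purity measure (one of several possible measures) and a small made-up rental data set; the function names gini and best_split are hypothetical, and the sketch covers neither the recursion nor the pruning.

```python
from collections import Counter

# Hypothetical rental history; "movie" is the attribute to be explained.
cases = [
    {"age": "adult", "kids": "yes", "movie": "Toy Story"},
    {"age": "adult", "kids": "yes", "movie": "Toy Story"},
    {"age": "teen", "kids": "yes", "movie": "Toy Story"},
    {"age": "adult", "kids": "no", "movie": "Terminator 2"},
    {"age": "teen", "kids": "no", "movie": "Terminator 2"},
    {"age": "teen", "kids": "no", "movie": "Terminator 2"},
]

def gini(labels):
    """Gini impurity: 0.0 means the partition is perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(cases, target):
    """Try every (attribute, value) break and keep the purest partitioning."""
    best = None  # (weighted impurity, attribute, value)
    attributes = [a for a in cases[0] if a != target]
    for attr in attributes:
        for value in sorted({case[attr] for case in cases}):
            left = [c[target] for c in cases if c[attr] == value]
            right = [c[target] for c in cases if c[attr] != value]
            if not left or not right:
                continue  # not a real split
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(cases)
            if best is None or weighted < best[0]:
                best = (weighted, attr, value)
    return best

split = best_split(cases, "movie")
print(split)
```

A full tree builder would call best_split again on each resulting partition until the partitions are pure or too small to split.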

Decision Trees (3)

• Decision trees are used for classification and prediction

• Typical questions:
– Predict which customers will leave
– Help in mailing and promotion campaigns
– Explain reasons for a decision
– What are the movies young female customers like to buy?

Decision Trees – Who Decides

Cluster Analysis (1)

• Grouping data into clusters
– Objects within a cluster have high similarity based on the attribute values

• The class label of each object is not known

• Several techniques
– Partitioning methods
– Hierarchical methods
– Density based methods
– Model based methods
– And more…
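
As an illustration of a partitioning method, here is a minimal k-means sketch. It is not the clustering algorithm shipped with SQL Server; the customer points (age, rentals per month), the choice of k and the naive "first k points" initialisation are all assumptions made for the example.

```python
# Hypothetical customers as (age, rentals per month); no class labels are given.
points = [(20, 8), (22, 9), (21, 7), (45, 2), (47, 1), (50, 3)]

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centres to cluster means."""
    centers = [points[i] for i in range(k)]  # naive initialisation (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centre moves to the mean of its cluster.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print(clusters)
```

On this toy data the younger, frequent renters and the older, occasional renters end up in separate clusters, even though no labels were supplied.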

Cluster Analysis (2)

• Segments a heterogeneous population into a number of more homogeneous subgroups or clusters

• Some typical questions:
– Discover distinct groups of customers
– Identification of groups of houses in a city
– In biology, derive animal and plant taxonomies
– Find outliers

Conclusion: When To Use What

Analytical problem | Examples | Algorithms

Classification: assign cases to predefined classes | Credit risk analysis, churn analysis, customer retention | Decision Trees, Naïve Bayes, Neural Nets

Segmentation: taxonomy for grouping similar cases | Customer profile analysis, mailing campaign | Clustering, Sequence Clustering

Association: advanced counting for correlations | Market basket analysis, advanced data exploration | Decision Trees, Association

Time series forecasting: predict the future | Forecast sales, predict stock prices | Time Series

Prediction: predict a value for a new case based on values for similar cases | Quote insurance rates, predict customer income | All

Deviation analysis: discover how a case or segment differs from others | Credit card fraud detection, network intrusion analysis | All

Summary

• Why Data Mining?
• Uses
• Algorithms
• Implementation
– SQL Server Management Studio (SSMS)
– Reporting Services
– SQL Server Integration Services
– DMX
• Demo?
• Hands-on Lab

Book

Data Mining with SQL Server 2005
ZhaoHui Tang and Jamie MacLennan
Wiley Press

2 things

[email protected]@ssw.com.au


BI is Cool
