Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.
![Page 1: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/1.jpg)
DATA MINING METHODS COURSE
Dr. Russell Anderson, Dr. Musa Jafar
West Texas A&M University
![Page 2: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/2.jpg)
What is Data Mining?
The process of discovering useful information in large data repositories. (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006)
Discovered information should be:
- Valid
- Previously unknown
- Actionable
![Page 3: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/3.jpg)
Course Objectives
Seven objectives of Lenox and Cuff in 2002 (based on ACM 2001 Ironman Report):
- Prepare and warehouse data
- Process data based on a set of DM algorithms
- Analyze results
- Make predictions
- Select proper algorithm
- Make application
- Motivated to continue graduate studies in DM

We have added:
- Get to know data using statistical analysis tools
- Use visualization tools for analysis and review
![Page 4: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/4.jpg)
Overall Approach
1. Get to know the data.
2. Select an appropriate data mining algorithm based on the data and the mining objective.
3. Construct a model using the selected algorithm.
4. Analyze the results.
5. Make application.
![Page 5: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/5.jpg)
Get to Know the Data
- How is it structured?
  - Single table/flat file
  - Multi-table – relationships
- Number of observations
- Number of dimensions (attributes)
- Compute summary statistics using a tool such as MS Excel
- Visually evaluate characteristics of the data
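
As a rough illustration of the "get to know the data" step, the sketch below computes the number of observations plus per-attribute summary statistics for a small flat-file table. The slides use MS Excel for this; the Python version, the sample rows, and the attribute names `cap_diameter` and `odor` are assumptions made only for the example.

```python
import statistics

# Hypothetical flat-file data: one dict per observation.
rows = [
    {"cap_diameter": 5.0, "odor": "none"},
    {"cap_diameter": 7.5, "odor": "foul"},
    {"cap_diameter": 6.0, "odor": "none"},
    {"cap_diameter": 9.5, "odor": "almond"},
]

def summarize(rows):
    """Return the observation count and per-attribute summary statistics."""
    summary = {"observations": len(rows), "attributes": {}}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric attribute: range, center, and spread.
            summary["attributes"][name] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
                "stdev": statistics.stdev(values),
            }
        else:
            # Discrete attribute: number of distinct values.
            summary["attributes"][name] = {"cardinality": len(set(values))}
    return summary

result = summarize(rows)
print(result)
```

The number of dimensions is simply the number of keys under `attributes`.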
![Page 6: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/6.jpg)
Visual Exploration
Tools developed:
- Correlation Matrix
- Scatter Plot
- Parallel Coordinate Plot
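
The numbers behind a correlation matrix display can be sketched as follows. This is not the authors' tool; it is a minimal Pearson correlation computation, and the column names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric attributes.
columns = {
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
    "age": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near +1 or -1 flag attribute pairs worth inspecting in a scatter plot.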
![Page 7: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/7.jpg)
Visual Exploration Objectives
- Distributions of data
  - Data ranges of numeric attributes
  - Cardinality of discrete attributes
  - Shape of distribution
    - Skewed
    - Multi-modal
- Location of outliers
- Identification of possible relationships between attributes
- Identification of subpopulations within the data
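
Outliers that stand out in a plot can also be located numerically. The sketch below uses the common 1.5 × IQR rule, which is an assumption here rather than anything prescribed by the slides; the sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical numeric attribute with one extreme observation.
values = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(values)
print(outliers)
```

A rule like this only complements the visual check: a strongly skewed or multi-modal attribute can produce many flagged points that are not errors.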
![Page 8: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/8.jpg)
The Data Mining Methodologies
- Microsoft Business Intelligence Tools
  - Association Analysis – aka market basket analysis
  - Classification
    - Decision Trees
    - Artificial Neural Network
    - Bayesian Analysis
  - Regression
  - Cluster Analysis
- Custom Tools with Embedded Visual Presentation
  - Artificial neural network for both classification and regression
  - Self-Organizing Map (SOM) for cluster analysis
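
To make the market basket idea concrete, the sketch below computes the two standard measures used to rate an association rule, support and confidence. It does not reproduce how the Microsoft tools work internally, and the baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in half of all baskets and in two of the three baskets that contain bread.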
![Page 9: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/9.jpg)
What do students need to know?
- Purpose of each methodology
- Steps of underlying algorithm
- Data types supported
- Issues in construction and application
  - Parameter settings
  - Results interpretation
![Page 10: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/10.jpg)
Issue - Overtraining
Does the model fit the training data too well?
Need to separate the available data into training and validation subsets.
Visual view of training progress valuable.
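
A minimal sketch of the idea, under stated assumptions: the data is split at random into training and validation subsets, and training is judged by the validation error rather than the training error. The error curves below are made-up numbers, and the helper names `split_train_validation` and `best_epoch` are hypothetical.

```python
import random

def split_train_validation(rows, validation_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

train, validation = split_train_validation(range(8))

# Hypothetical error curves recorded once per training epoch.
training_errors = [0.40, 0.25, 0.15, 0.08, 0.04, 0.02]
validation_errors = [0.42, 0.30, 0.22, 0.20, 0.24, 0.31]

stop_at = best_epoch(validation_errors)
print(stop_at)
```

The training error keeps falling while the validation error turns upward after epoch 3; that divergence is the sign of overtraining that a plot of the two curves makes visible.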
![Page 11: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/11.jpg)
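Classification Errors: What are the costs?
Mushroom edibility classifiers

Classifier A

| | Actual Edible | Actual Poisonous |
|---|---|---|
| Predicted Edible | 38% | 0% |
| Predicted Poisonous | 8% | 54% |

Classifier B

| | Actual Edible | Actual Poisonous |
|---|---|---|
| Predicted Edible | 44% | 1% |
| Predicted Poisonous | 2% | 53% |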
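
The two classifiers can be compared on accuracy and on error cost. In the sketch below the percentages come from the slide, while the cost figures (1000 for calling a poisonous mushroom edible, 1 for discarding an edible one) are assumptions chosen only to illustrate unequal costs.

```python
# Confusion matrices from the slide, in percent of all mushrooms.
# Keys are (predicted, actual).
classifier_a = {
    ("edible", "edible"): 38,
    ("edible", "poisonous"): 0,
    ("poisonous", "edible"): 8,
    ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44,
    ("edible", "poisonous"): 1,
    ("poisonous", "edible"): 2,
    ("poisonous", "poisonous"): 53,
}

# Hypothetical cost of each kind of error (correct predictions cost nothing).
costs = {("edible", "poisonous"): 1000, ("poisonous", "edible"): 1}

def accuracy(matrix):
    """Percentage of mushrooms classified correctly."""
    return sum(pct for (predicted, actual), pct in matrix.items() if predicted == actual)

def expected_cost(matrix, costs):
    """Total cost per 100 mushrooms classified."""
    return sum(pct * costs.get(cell, 0) for cell, pct in matrix.items())

for name, matrix in (("A", classifier_a), ("B", classifier_b)):
    print(name, accuracy(matrix), expected_cost(matrix, costs))
```

Classifier B is more accurate (97% against 92%), yet under these assumed costs Classifier A is far cheaper because it never labels a poisonous mushroom as edible.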
![Page 12: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/12.jpg)
Prediction Model Evaluation
Black Box: models built using sophisticated methodologies (ANNs, for example) perform very well, but gaining an understanding of the model itself is difficult.
- Contribution of individual input attributes
- Nature of contribution (shape of curve)
- Interaction between input attributes
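
One simple way to probe a black-box model is a one-at-a-time sensitivity sweep: vary a single input over a range, hold the others at a baseline, and record the output. The sketch below is an illustration under assumptions, not the authors' visualization tool; `black_box` is a hypothetical stand-in for a trained model.

```python
def black_box(inputs):
    """Hypothetical stand-in for a trained model's prediction function."""
    return inputs["x"] ** 2 + 3 * inputs["y"]

def sensitivity_curve(model, baseline, attribute, values):
    """Vary one attribute while holding the rest at their baseline values."""
    curve = []
    for value in values:
        probe = dict(baseline)
        probe[attribute] = value
        curve.append((value, model(probe)))
    return curve

baseline = {"x": 1.0, "y": 2.0}
x_curve = sensitivity_curve(black_box, baseline, "x", [0.0, 1.0, 2.0])
y_curve = sensitivity_curve(black_box, baseline, "y", [0.0, 1.0, 2.0])
print(x_curve)
print(y_curve)
```

Plotting each curve shows the shape of an attribute's contribution (curved for `x`, linear for `y` here). A sweep like this does not expose interactions between inputs; those require varying two attributes together.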
![Page 13: Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University.](https://reader036.fdocuments.in/reader036/viewer/2022072013/56649e535503460f94b48afa/html5/thumbnails/13.jpg)
See you tomorrow
For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning.
Saturday, 8-10 AM, Kachina A
- Microsoft SQL Server Business Intelligence Studio
- Visualization Tools