University of Arkansas Data Mining with Teradata TM Warehouse Miner Jim Kashner CTO Data Mining.
University of Arkansas
Data Mining with Teradata™ Warehouse Miner
Jim Kashner, CTO Data Mining
11/30/2004 Copyright 2004 Teradata, a division of NCR 2
The Empirical Method and Decision Support
• all of the information in this presentation is “jim’s opinions numbers 8 through 224” (for today) …
• a framework for making decisions in the presence of uncertainty
• seeks to shed light on the validity or plausibility of notions, suppositions, propositions, hypotheses
• is iterative and circular
  – don’t ever finish
  – just stop at some point
[Diagram of the empirical cycle; labels: Notion, Supposition, Proposition, Hypothesis, Refined Hypothesis; Data, Analysis, Interpretation]

Teradata Warehouse Miner: Technology Enablers for the Data Mining Process
• the various releases of Teradata Warehouse Miner are intended to serve as very powerful technology enablers for the Data Mining Process
• but, Tools Don’t Build Models, Thoughtful People Do
  – When a good tool between the ears drives the data mining process, good models are built
  – When too much is asked of analytical software, the risk of spurious and invalid models rises proportionately
• but thoughtful people who build models can also be helped by having a proven and generic process to follow
  – The formal Teradata Data Mining Method is one of several good processes used to conduct successful data mining projects
    • its foundation is the “tried and true” empirical method
    • it’s not a prescription, just a set of carefully constructed suggestions

Teradata Data Mining Method
[Process diagram: Business Issues → Architecture and Technology Preparation → Data Preparation → Analytical Modeling → Knowledge Delivery and Deployment; with Project Management and Knowledge Transfer]
• data mining is a very iterative process
  – the linear process depicted above serves as a guide, and identifies the chunky bits of the process

Data Mining with Teradata Warehouse Miner: Teradata’s Data Mining Method – Our Process
[Process diagram, labeled “Highly Iterative Process”: Business Question Identification and Qualification; Architecture and Technology Preparation; Data Preparation and PreProcessing; Model Construction and Evaluation; Model Deployment and Maintenance; Project Management -and- Knowledge Transfer]
• “Data Profiling” (TWM – Stats & ADS)
  – Data Exploration
  – Data Transformation
• Analytic Modeling (TWM – Analytics)
  – Multivariate Statistics
  – Machine Learning Algorithms
• Model Deployment (TWM – Deployment)
  – Scoring & Evaluation
  – Lifecycle Maintenance

Data Mining and the Empirical Method
• data mining is not automated discovery of hidden patterns in your data
• data mining is thoughtful and technology enabled discovery of hidden patterns in your data
• welcome to the empirical method

Teradata as an Analytic Engine
• Teradata is especially well-suited to perform complex aggregations and evaluations of sets according to conditional logic
  – native Teradata functions
  – expressed as SQL
  – where indexes cannot reasonably be expected to exist for any particular aggregation, set evaluation, or conditional logic
• analytical modeling algorithms require an engine that can perform complex aggregations and evaluations of sets according to conditional logic
• the very good fit of Teradata as an analytic engine is rather obvious after considering what analytical modeling algorithms actually do under the hood

Said another way ...
Given: The following notation is used in virtually all statistical, artificial intelligence, and machine learning algorithms to denote the equations that represent and calculate data mining models:
∫ f(x) - which means sum - and - Σ f(x) - which means sum
Π f(x) - which means multiply
∈ and ∉ - which mean is, and is not, an element of (set theory)
Question: What do they all have in common?
Answer: All of these are what Teradata does better than any other engine on this planet.
Note: f(x) are other supported functions, mathematical and other, either as native Teradata functions, or those that can be expressed in SQL with Teradata extensions very efficiently.
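To make the point concrete, the sketch below expresses a sum with conditional logic, and set membership and non-membership, as plain SQL aggregations. It uses Python's built-in sqlite3 module purely as a stand-in engine so the example can run anywhere; the `purchases` table, its columns, and the sample rows are invented for illustration, and nothing here is Teradata-specific syntax.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (an assumption;
# the slides are about Teradata, which is not needed to run this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "books", 10.0),
        (1, "music", 5.0),
        (2, "books", 20.0),
        (2, "games", 40.0),
        (3, "music", 15.0),
    ],
)

# SUM is the "sigma" of the slide; CASE supplies the conditional logic;
# IN and NOT IN play the role of "is" and "is not an element of".
query = """
SELECT customer_id,
       SUM(amount) AS total_spend,
       SUM(CASE WHEN category IN ('books', 'music') THEN amount ELSE 0 END) AS media_spend,
       SUM(CASE WHEN category NOT IN ('books', 'music') THEN 1 ELSE 0 END) AS other_purchases
FROM purchases
GROUP BY customer_id
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Each output row is one customer with three aggregates computed in a single pass over the set, which is the shape of work the slide attributes to analytical modeling algorithms.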

Teradata Warehouse Miner is an ongoing experiment
• TeraMiner™ Stats
  – June, 1999
• Teradata Warehouse Miner
  – Stats, Analytics, & Deployment
  – July, 2001
• Teradata Warehouse Miner
  – Stats, Analytics, Deployment, & ADS (Analytical Data Set generation)
  – June, 2004
• additional functionality is added continually in subsequent releases
  – to each of these components of Teradata Warehouse Miner
• because of our success with this “experimental approach”, we continue to ask: “Why not?”
  – Teradata continues to amaze us by what it can do
  – our Teradata Warehouse Miner Software Engineering Team is quite amazing too

What is Teradata Warehouse Miner?
• TWM includes a set of .NET Interfaces and a User Interface
  – generates and executes Teradata-specific SQL
    • ANSI SQL when possible
  – instantiated by User Interface
  – easily integrated into other applications (partners, custom)
  – all analysis parameters, model definition, and analysis results stored in metadata
  – select results or explain, or persist results in table, temporary table or view
• TWM includes several types of .NET Interfaces
  – Registry independent application extensions or plug-ins
  – Teradata Warehouse Miner Descriptive Statistics DLL
  – Teradata Warehouse Miner ADS DLL
  – Teradata Warehouse Miner Data Reorganization DLL
  – Teradata Warehouse Miner Analytic Algorithm & Scoring DLLs (4)
  – Teradata Warehouse Miner Matrix DLL
  – Teradata Warehouse Miner Statistical Test DLL
• TWM includes a GUI for the desktop
  – User interface to .NET Objects
  – Queries Teradata Data Dictionary to aid in parameterizing functions
    • directly using HELP syntax
    • optionally, MDS DIM (Metadata Services Database Information Model)
  – Interactive display of results: SQL, Data, Graphs, Reports

Teradata Warehouse Miner: High Level Architecture
• Windows Interface
  – build, maintain, and execute projects
  – explore and manipulate results
    • tabular and graphical
  – parameterize .NET APIs
• .NET APIs & ADO
  – .NET Interfaces (APIs)
    • documented for developers
  – ActiveX Data Objects
• DLL interface “plug-ins”
  – write all API parameters and all XML results in TWM metadata
    • stored in binary data type
  – generate & submit SQL
  – receive query results from Teradata and present them in user interface
  – read model definition and results stored in TWM metadata to display XML reports and graphs
  – read model definition in TWM metadata to score and evaluate
[Architecture diagram of Teradata Warehouse Miner. Layers: User Interface Services, Business Services, Data Services. Components: User Interface, Visualizations, Manager, Algorithms (COM), Algorithms (.NET), Data Access, Teradata ODBC, Metadata Access, Projects, Analyses, Teradata Metadata Services, Teradata RDBMS]
• Client Platform: Windows NT, 2000, XP, .NET 2003 Server
• Teradata Platform: Teradata RDBMS Version 2 Release 4.1 or later

Teradata Warehouse Miner: Data Description Functions
• Univariate Statistics
  – Count
  – Minimum, Maximum
  – Modes
  – Mean
  – Standard Deviation
  – Standard Error
  – Variance
  – Coefficient of Variation
  – Skewness
  – Kurtosis
  – Uncorrected Sum of Squares
  – Corrected Sum of Squares
• Quantiles and Ranks
  – Top 10/Bottom 10 Percentiles
  – Deciles
  – Quartiles
  – Tertiles
  – Top 5/Bottom 5 Ranked Values with Counts
• Scatter Plot Analysis
  – 2-D and 3-D Plots of Continuous Variables
• Correlation Analysis
  – Quickly view pair-wise correlations among ‘n’ variables
• Values Analysis (basic data quality analysis)
  – Data Types
  – Counts
  – # NULL Values
  – # Positive Values
  – # Negative Values
  – # Zeros
  – # Blanks
  – # Unique Values
• Frequency Analyses
  – Frequency of Discrete Variables
  – N-Way Cross-Tabulation
  – Pair-wise Cross-Tabs
• Histogram Analyses
  – Histograms of Continuous Variables
  – Options for
    • Even Width
    • User Defined Widths/Boundaries
    • Quantile
    • Adaptive Binning
    • Overlay columns
    • Statistics within bins
• Overlap Analysis
  – Index/Key Column Consistency
• Data Explorer
  – Performs basic statistical analysis on a set of tables and selected columns within any Teradata database
  – Intelligent decisions about which functions to perform
  – Most criteria for “Intelligent” decisions can be modified by user
    • Values Analysis - Every column in the set of input tables
    • Univariate Statistical Analysis - Every column of numeric or date type
    • Frequency Analysis - Every column that has no more than a threshold number of unique values
    • Histogram Analysis - Every numeric or date type column that has more than that threshold number of unique values
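The Data Explorer selection rules above can be sketched as a small decision function. This is only an illustration of the rules as stated on the slide, not TWM's implementation: the `ColumnProfile` class, the type names, and the default threshold of 20 are all hypothetical, since the slide says only "a number of unique values".

```python
from dataclasses import dataclass
from typing import List

# Hypothetical default; the slide does not give the actual threshold.
UNIQUE_VALUE_THRESHOLD = 20

# Illustrative type names standing in for "numeric or date type".
NUMERIC_OR_DATE = {"integer", "decimal", "float", "date"}


@dataclass
class ColumnProfile:
    name: str
    data_type: str
    unique_values: int


def choose_analyses(col: ColumnProfile,
                    threshold: int = UNIQUE_VALUE_THRESHOLD) -> List[str]:
    """Return the analyses the slide's rules would select for one column."""
    analyses = ["values"]  # every column gets a Values Analysis
    is_numeric_or_date = col.data_type in NUMERIC_OR_DATE
    if is_numeric_or_date:
        analyses.append("univariate_statistics")
    if col.unique_values <= threshold:
        analyses.append("frequency")   # few distinct values: count them
    elif is_numeric_or_date:
        analyses.append("histogram")   # many distinct numeric values: bin them
    return analyses


columns = [
    ColumnProfile("gender", "char", 3),
    ColumnProfile("income", "decimal", 5000),
    ColumnProfile("notes", "varchar", 9000),
]
plan = {c.name: choose_analyses(c) for c in columns}
print(plan)
```

Passing a different `threshold` mirrors the slide's remark that most of the criteria can be modified by the user.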
• Data Visualizations
  – 2D & 3D Histograms
  – 2D & 3D Frequency Bar Charts
  – Values Bar Charts & Circular Graphs
  – Box and Whisker Plots
  – Scatter Plots
  – Integrated Data Explorer Graphics

Teradata Warehouse Miner: Data Derivation and Transformation Functions
• Variable Creation
  – Aggregations: Count, Average, Sum, etc.
  – Windowed Aggregates/OLAP: Rank, Quantiles, Moving Sums, etc.
  – Arithmetic operators/functions
    • +, -, *, /, MOD, **
    • ABS, EXP, LN, LOG, SQRT, etc.
  – Trigonometric & Hyperbolic functions
    • COS, SIN, TAN, ACOS, etc.
    • COSH, SINH, TANH, ACOSH, etc.
  – CASE expressions and NULL operators
    • valued and searched types
    • NULLIF, COALESCE
  – Comparison operators: =, >, <, <>, <=, >=
  – Logical predicates: BETWEEN…AND…, IN (expression list), etc.
  – Calendar functions: day_of_week, day_of_calendar, quarter_of_year, etc.
  – String functions: LOWER, UPPER, TRIM, ||, etc.
  – Data Type conversion
  – SQL predicates: TRUE, FALSE, NULL
• Variable Dimensioning
  – Simple Dimensions
    • Specific values
    • Range of values
  – Combined Dimensions
  – Hierarchical Dimensions
    • SysCalendar, etc.
• Variable Transformation
  – Bin Coding
  – Design Coding
  – Recoding
  – Rescaling
  – Derive: Hook to Variable Creation
  – Statistical Transformations
    • Z-Score
    • Sigmoid
  – NULL Value Replacement
    • Literal value
    • Mean value
    • Median value
    • Mode
    • Imputed values
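As a rough illustration of three of the transformations listed above (NULL replacement with the mean, Z-Score, and Sigmoid), here is a minimal sketch in plain Python. The function names are invented, `None` stands in for SQL NULL, and the use of the population standard deviation is an assumption; the slides do not say which form TWM uses, and TWM itself would generate SQL rather than run Python.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional


def replace_nulls_with_mean(values: List[Optional[float]]) -> List[float]:
    """NULL value replacement using the mean of the non-NULL values."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def z_scores(values: List[float]) -> List[float]:
    """Z-score: (x - mean) / standard deviation (population form assumed).

    Raises ZeroDivisionError for a constant column, which has no spread.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def sigmoid(values: List[float]) -> List[float]:
    """Sigmoid rescaling: maps each value into (0, 1) via 1 / (1 + e^-x)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


raw = [2.0, None, 4.0, 6.0]
filled = replace_nulls_with_mean(raw)   # None becomes the mean of 2, 4, 6
standardized = z_scores(filled)         # centered on 0
squashed = sigmoid(standardized)        # every value now lies in (0, 1)
print(filled, standardized, squashed)
```

Replacing NULLs before standardizing matters here: the fill value lands exactly on the mean, so it standardizes to 0 and rescales to 0.5.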

Teradata Warehouse Miner: Data Reorganization, Build ADS, Matrix Functions
• Data Reorganization
  – Random Sample and Stratified Random
  – Partitioning
  – Denormalize/Pivoting
  – Joining
    • Inner
    • Left Outer
    • Right Outer
    • Full Outer
• Build ADS
  – Create Final ADS
  – Create Metadata for Refresh
• Matrix Functions
  – Correlation
  – Covariance
  – SSCP
  – Corrected SSCP
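The four matrix functions are closely related, which the following sketch tries to show: SSCP (sums of squares and cross-products) is built from plain sums of products, the corrected SSCP is the same thing on mean-centered columns, and covariance and correlation are rescalings of the corrected SSCP. This uses the standard textbook definitions in plain Python; the function names are invented, and the sample (n - 1) divisor for covariance is an assumption, since the slides do not specify it.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]


def sscp(rows: Matrix) -> Matrix:
    """Uncorrected sums of squares and cross-products (X'X)."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def corrected_sscp(rows: Matrix) -> Matrix:
    """SSCP of the mean-centered columns."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return sscp(centered)


def covariance(rows: Matrix) -> Matrix:
    """Sample covariance: corrected SSCP divided by (n - 1)."""
    n = len(rows)
    return [[v / (n - 1) for v in row] for row in corrected_sscp(rows)]


def correlation(rows: Matrix) -> Matrix:
    """Pearson correlation: corrected SSCP scaled by its diagonal."""
    c = corrected_sscp(rows)
    p = len(c)
    return [[c[i][j] / sqrt(c[i][i] * c[j][j]) for j in range(p)] for i in range(p)]


# Two perfectly linearly related columns (second = 2 * first).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(sscp(data))
print(corrected_sscp(data))
print(covariance(data))
print(correlation(data))
```

Every entry is a sum over rows of a product of two columns, so each matrix can be computed with aggregate queries inside the database, which fits the deck's argument about SQL aggregation.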

Teradata Warehouse Miner: Analytical Techniques, Scoring, Visualizations (1)
Analytic Algorithms (Multivariate Statistical Techniques)
• Linear Regression
  – model statistics
  – variable coefficients, standard errors, confidence intervals, etc.
  – incremental R²
  – step-wise variable selection options
    • forward & forward only
    • backward & backward only
• Factor Analysis
  – Principal Component Analysis
  – Principal Axis Factors
  – Maximum Likelihood Factors
  – Orthogonal & Oblique Rotations
• Logistic Regression
  – Logit Model Coefficients, Odds Ratios and Statistics
  – Model Success Analysis and Lift Tables
  – step-wise variable selection options
    • forward & forward only
    • backward & backward only
• Model Scoring
  – Linear Regression
  – Logistic Regression
  – Factor Analysis
  – SQL-based model scoring
    • all scoring SQL is provided
• Supporting Visualizations
  – Scatter Plot
  – Lift Chart
  – Regression Plots
  – Factor Pattern
  – Scree Plot
• Multivariate Diagnostics
  – Extensive Collinearity Diagnostics
  – Automated Identification of Constants
  – Row level diagnostics, and much more…
  – SQL-based model evaluation
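To give a feel for what "SQL-based model scoring" can look like, the sketch below turns a fitted logistic regression (an intercept plus per-column coefficients) into a single SELECT statement. It is a hypothetical illustration, not the SQL that TWM actually generates: the function names, the `customers` table, its columns, and the coefficient values are all made up, and identifiers are assumed to be trusted rather than user-supplied.

```python
from typing import Dict


def linear_score_expr(intercept: float, coefficients: Dict[str, float]) -> str:
    """Build the linear predictor b0 + b1*x1 + ... as a SQL expression."""
    terms = [repr(float(intercept))]
    terms += [f"({repr(float(b))} * {col})" for col, b in coefficients.items()]
    return " + ".join(terms)


def logistic_score_sql(table: str, key: str, intercept: float,
                       coefficients: Dict[str, float]) -> str:
    """Wrap the linear predictor in the logistic function 1 / (1 + exp(-z))."""
    z = linear_score_expr(intercept, coefficients)
    return (
        f"SELECT {key}, 1.0 / (1.0 + EXP(-({z}))) AS score "
        f"FROM {table}"
    )


# Hypothetical model: two predictors on a hypothetical customers table.
sql = logistic_score_sql("customers", "customer_id", -1.5,
                         {"income": 0.002, "tenure": 0.3})
print(sql)
```

Because the whole model collapses into one arithmetic expression per row, scoring can run where the data already lives, with no export step.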

Teradata Warehouse Miner: Analytical Techniques, Scoring, Visualizations (2)
Analytic Algorithms (AI and Machine Learning Techniques)
• Decision Tree/Rule Induction
  – gini / regression (i.e., CART)
  – Entropy (i.e., C4.5 / C5.0)
  – CHAID
  – pruning
    • gini algorithm pruning
    • gain ratio algorithm pruning
    • manual pruning
• Clustering
  – K-Means
  – Nearest Neighbor Linkage
  – Expectation Maximization
    • Gaussian Mixture Model
    • Poisson Mixture Model
  – variable importance report
• Affinity and Sequence Analyses
  – Feature Rich Implementations
  – Support
  – Confidence
  – Lift
  – z-Score
• Model Scoring
  – Decision Trees
  – Clustering
  – Affinity and Sequence Analyses
  – SQL-based model scoring
    • all scoring SQL is provided
• Supporting Visualizations
  – Graphical Tree Browser
    • Interactive Pruning
    • Text Rules
    • Distributions
  – Lift Charts
  – Cluster Sizes / Distance / Measures
  – Association Color Map
• Model Evaluation
  – truth table (confusion matrix)
  – model statistics & indices
  – SQL-based model evaluation
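For readers new to affinity analysis, the sketch below computes support, confidence, and lift for a single rule using the common textbook definitions. The slides name these measures without defining them, so the exact formulas TWM uses are an assumption here; the z-Score measure is left out because its definition is not given. The function name and the sample baskets are invented.

```python
from typing import FrozenSet, List, Tuple


def affinity_measures(transactions: List[FrozenSet[str]],
                      left: str, right: str) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule left -> right."""
    n = len(transactions)
    n_left = sum(1 for t in transactions if left in t)
    n_right = sum(1 for t in transactions if right in t)
    n_both = sum(1 for t in transactions if left in t and right in t)
    support = n_both / n                 # share of baskets with both items
    confidence = n_both / n_left         # share of left-baskets that also hold right
    lift = confidence / (n_right / n)    # confidence relative to right's base rate
    return support, confidence, lift


baskets = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "jam"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
support, confidence, lift = affinity_measures(baskets, "bread", "butter")
print(support, confidence, lift)
```

A lift above 1 means the two items occur together more often than their individual frequencies would suggest. All three measures reduce to counts over sets of transactions, again the kind of aggregation the deck argues belongs in the database.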

Teradata Warehouse Miner: Statistical Tests
• Binomial Tests
  – Binomial
  – Sign
• Rank Tests
  – Mann-Whitney (Kruskal-Wallis)
  – Wilcoxon
  – Friedman
• Contingency Table Tests
  – Chi-square
  – Median
• Parametric Tests
  – F (Two Way) Unequal Sample Size
  – F (N-Way) Equal Sample Size
  – T
• Normality/Equality Tests
  – Kolmogorov-Smirnov
  – Lilliefors Test
  – Shapiro-Wilk
  – D’Agostino & Pearson Omnibus
  – Smirnov
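As one worked example from this list, the sketch below computes the Pearson chi-square statistic for a contingency table of observed counts, using the standard formula (the sum of (observed - expected)² / expected over all cells). It is a plain-Python illustration with invented function names and sample counts, not TWM's implementation, and it stops at the statistic and degrees of freedom without computing a p-value.

```python
from typing import List


def chi_square_statistic(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def degrees_of_freedom(table: List[List[float]]) -> int:
    """(rows - 1) * (columns - 1) for a contingency table."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Invented 2 x 2 table, e.g. segment (rows) by response (columns).
observed = [[10, 20], [20, 10]]
stat = chi_square_statistic(observed)
dof = degrees_of_freedom(observed)
print(stat, dof)
```

The only inputs are cell counts and their row and column totals, so the table itself can be produced by a GROUP BY query before any statistics are applied.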

Why Did We Build Teradata Warehouse Miner? Integrated Data Mining Environment
• Other Technologies: Inefficient Environment
  – Elapsed and Execution Times
  – Continual Data Movement
  – Data Redundancy
  – Metadata Inconsistencies
  – “Many Versions of The Truth”
• Teradata and TWM: Efficiently Architected Environment
  – MPP Performance and Scalability
  – No Data Movement
  – No Data Redundancy
  – Shared Metadata
  – “One Version of The Truth”
[Diagram, shown for both environments: Modelers Build Models; Business Deploys Models]

Why are Integrated Analytics Important? Efficiency, Performance & Scalability
[Diagram: Modelers Build Models; Business Deploys Models; Source Data; Analytic Metadata; Analytic Data Set]
• Mine data in an integrated environment
  – Huge data volumes: leverages the parallelism of Teradata
  – Minimize data redundancy
  – Eliminate proprietary data structures
  – Simplify data & system management
  – Better results using larger amounts of detailed data
  – Eliminate potential errors during data movement & external sampling
  – Integrated model building and scoring
  – Reduced overall modeling time
• Many resulting elapsed and execution time improvements have been astronomical!

The Teradata Warehouse Miner Goal: Enable Entire Data Mining Process In Teradata
[Diagram, inside the Teradata Data Warehouse: Source Data, Data Pre-Processing, Analytic Data Set, Analytical Modeling, Model Deployment, Scored Data Set, Analytic Metadata]
• data starts and ends in the database
• open to accommodate 3rd party partner tools

Teradata Warehouse Miner: Projects and Analytic Modules
• Teradata Warehouse Miner Projects contain one or more tasks
• each task is called an Analytic Module
  – eight categories of analytic modules
    • ADS (Analytical Data Set generation)
      – Variable Creation
      – Variable Transformation
      – Build ADS
    • Analytics (Analytic Algorithms)
    • Descriptive Statistics
    • Matrix Functions (correlation, …)
    • Miscellaneous
      – free form SQL, …
    • Reorganization (Structure of Data)
    • Scoring (and Model Evaluation)
    • Statistical Tests
• Analytic Modules are the fundamental building blocks used to conduct data analysis in Teradata Warehouse Miner

Teradata Warehouse Miner: Elements in the Primary Window
[Annotated screenshot of the primary window, with these callouts]
• Main Menus
• Main Toolbar
• Open, Save, and Save All Icons
• Project Icon
• Analytic Module Icon
• ODBC Connection Icon
• Connection Properties Icon
• Run and Stop Icons
• Project Area
• Analysis Set-up and Results Viewing Area
  – hmmm… I wonder what else might fill this large gray area some day...
• Runtime Message Area
• Data Source Status

Teradata Warehouse Miner: The 7 Steps to Results
• there are 7 basic steps in the use of Teradata Warehouse Miner*
  – connect to an ODBC data source with appropriate permissions
  – create a new (or open an existing) Project
  – add at least one Analytic Module to the Project
  – set input and analytic options
    • select table(s) and column(s) to be analyzed
    • set Analytic Module parameters**
    • set other Analytic Module options as necessary**
  – set output and results options
  – execute the Analytic Module (using the run icon)
    • optionally, save the Project(s) and Analyses
  – examine, interpret, and use results of interest**
• that’s it
* use these steps after you or a system administrator has set up an ODBC Data Source (DSN) on your PC. The DSN must point to source, result, and metadata Teradata databases for which you have appropriate permissions
** setting Analytic Module options, and interpreting and using results appropriately, requires expertise specific to the Analytic Module chosen

Using Teradata Warehouse Miner: The 7 Steps to Results, An Example
[Screenshot walkthrough, one or more slides per step]
• Step 1 - connect to an ODBC data source
• Step 2 - create a new Project
• Step 3 - add an Analytic Module to the Project
• Step 4 - set input and analytic options
  – select table and columns to be analyzed
  – set Analytic Module parameters
  – set other Analytic Module options as necessary
• Step 5 - set output and results options
  – Note: this screen-shot is from a Scoring Module for the analytic algorithm module used in this example
• Step 6 - execute the Analytic Module
  – optionally, save the Project(s) and Analyses
• Step 7 - examine, interpret, and use results (two slides)

Tips for Navigating the Teradata Warehouse Miner Interface
• on-line help and user’s guide
  – very extensive and thorough
  – tutorials for each function
  – describes many of the analytical techniques in detail
  – many reference formulae are provided
  – use these liberally
• menus and toolbar
• runtime message area
• setting program options and preferences
  – global
  – run-time
• setting up Project Directories for files on PC client
  – optionally, for local HTML reports and associated graphics

Teradata Warehouse Miner Demo
TWM, an enabling technology to assist in addressing qualified business questions that are well suited to the processes of decision support and data mining
(data exploration – data transformation – exploratory modeling – model building and validation – scoring and evaluation – lifecycle maintenance – …)

University of Arkansas
Data Mining with Teradata™ Warehouse Miner
Questions and Discussion