Part II Tools for Knowledge Discovery. Knowledge Discovery in Databases Chapter 5.
A Kit For Knowledge Discovery
![Page 1: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/1.jpg)
A Kit For Knowledge Discovery
![Page 2: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/2.jpg)
Data, data everywhere, yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
![Page 3: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/3.jpg)
• There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data.
• Achieving a standardized process model
![Page 4: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/4.jpg)
What is KDD?
Knowledge Discovery in Data is the significant method of evaluating
1. Legitimate, innovative
2. Probably useful
3. Accurate, understandable
patterns in data.
![Page 5: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/5.jpg)
Knowledge Discovery Process
[Diagram: Raw Data → Integration → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Understanding → Knowledge]
![Page 6: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/6.jpg)
Outcomes of Data Mining
Forecasting Future
Clustering Based On Attributes
Events Correlation – Association
Classification on Recognizing patterns
Sequencing Events ~ Later Predictions
![Page 7: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/7.jpg)
Data Mining
Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
![Page 8: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/8.jpg)
Data Mining
Data + Interestingness criteria = Hidden patterns
![Page 9: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/9.jpg)
Data Mining
Data + Interestingness criteria = Hidden patterns
Type of patterns
![Page 10: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/10.jpg)
Data Mining
Data + Interestingness criteria = Hidden patterns
Type of data, type of interestingness criteria
![Page 11: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/11.jpg)
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
![Page 12: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/12.jpg)
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
Data
Information
![Page 13: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/13.jpg)
Data Mining Process
1. Problem Definition
2. Data Integration & Cleaning
3. Model Framing & Evaluation
4. Knowledge Discovery
![Page 14: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/14.jpg)
Basic Operations in DM
Data Mining Task
Predictive:
– Regression
– Classification
– Collaborative Filtering
Descriptive:
– Clustering / Similarity Matching
– Association rules
– Deviation detection
![Page 15: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/15.jpg)
Why Machine Learning
Growing flood of online data
Budding industry
Progress in algorithms and theory
• Data mining: using historical data to improve decisions
– medical records ⇒ medical knowledge
– log data to model user
• Software applications we can’t program by hand
– autonomous driving
– speech recognition
• Self-customizing programs
– newsreader that learns user interests
![Page 16: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/16.jpg)
Machine Learning
Supervised: data have a target attribute. Discover patterns in the data.
Unsupervised: data have no target attribute. Explore the data to find patterns.
![Page 17: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/17.jpg)
Applications Of Data Mining
![Page 18: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/18.jpg)
Applications of Data Mining
Fraud/Non-Compliance Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
Build profiles of customers likely to use which services
![Page 19: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/19.jpg)
Tools For Data Mining
LinkOut NCBI Sequin Rapid Miner LibSvm ADaM
etc….
![Page 20: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/20.jpg)
Why Weka
Weka is a collection of machine learning algorithms for data
mining tasks.
The algorithms can either be applied directly to a dataset or
called from your own Java code.
Weka contains tools for data pre-processing, classification,
regression, clustering, association rules, and visualization.
It is also well-suited for developing new machine learning
schemes.
![Page 21: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/21.jpg)
About WEKA
Waikato Environment for Knowledge Analysis (WEKA)
Developed by the Department of Computer Science, University of Waikato,
New Zealand
Machine learning/data mining software coded in Java
Used for research, education, and applications
Exclusively for KDD.
Various Versions are available such as Version 2.3, 1998; Version 3.0, 1999;
Version 3.4, 2003; Version 3.6, 2008.
![Page 22: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/22.jpg)
Weka GUI Chooser
![Page 23: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/23.jpg)
A Vital Part In Weka
Explorer
![Page 24: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/24.jpg)
![Page 25: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/25.jpg)
Weka’s Structural Layout
Explorer: an environment for exploring data with WEKA
Knowledge Flow: supports the same functions as the Explorer but with drag-and-drop
Experimenter: performing experiments and conducting statistical tests between learning schemes
Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands
![Page 26: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/26.jpg)
Algorithms
![Page 27: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/27.jpg)
WEKA File Format
Attribute-Relation File Format (ARFF)
WEKA stores data in flat files (ARFF format).
An Excel file is easy to convert to ARFF format.
An ARFF file consists of a header followed by a list of instances.
An ARFF file can be created in a plain-text editor such as Notepad, or in Word if saved as plain text.
The dataset name is declared with @relation.
Attribute information is declared with @attribute.
The data follows @data.
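To make the three ARFF keywords concrete, here is a minimal sketch in Python (standard library only) that reads a tiny ARFF document. The `parse_arff` helper and the toy weather data are assumptions made up for illustration. This is not WEKA's own loader, and it ignores quoted names, sparse data and other parts of the full format.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Simplified, hypothetical helper: attribute names must not contain spaces
    and values must not contain commas.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and comments
            continue
        lower = line.lower()                      # ARFF keywords are case-insensitive
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True                        # every later line is one instance
    return relation, attributes, rows


# Toy data invented for this example; "?" marks a missing value in ARFF.
SAMPLE = """\
% toy weather data
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute play {yes,no}
@data
sunny,85,no
overcast,83,yes
rainy,?,yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

The header (`@relation`, `@attribute`) describes the schema, and each line after `@data` is one instance with values in attribute order.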
![Page 28: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/28.jpg)
Sample ARFF
![Page 29: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/29.jpg)
Intrinsic Operations
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
![Page 30: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/30.jpg)
![Page 31: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/31.jpg)
Preprocessing
Changes the data format as needed.
The steps vary with the dataset being mined.
Some of the preprocessing steps:
Adding/removing attributes
Attribute value substitution
Discretization (MDL, Kononenko, etc.)
Time series filters (delta, shift)
Sampling, randomization
Missing value management
Normalization and other numeric transformations
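Three of the steps listed above (missing value management, normalization, discretization) can be sketched in plain Python. The function names and the temperature column are invented for illustration. The discretizer uses simple equal-width binning, which is a simpler technique than the MDL-based methods named on the slide, and none of this is WEKA's filter code.

```python
def replace_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width binning: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0    # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical numeric attribute with one missing value.
temperature = [64.0, None, 72.0, 85.0, 75.0]

filled = replace_missing(temperature)
scaled = normalize(filled)
binned = discretize(filled, 3)
print(filled)
print(scaled)
print(binned)
```

Order matters here: missing values are filled first so that the later numeric transformations see a complete column.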
![Page 32: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/32.jpg)
Algorithms
![Page 33: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/33.jpg)
Pre-Processing
Browse for the data file in the local filesystem.
Relations, Instances, Schema
Attributes, Filters
Opening Files, Current Relation, Operations
![Page 34: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/34.jpg)
Weka – Formulating Files
![Page 35: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/35.jpg)
Dataset -.txt Format
![Page 36: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/36.jpg)
Weka ~ Datasets
![Page 37: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/37.jpg)
Missing Values
![Page 38: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/38.jpg)
GenericObjectEditor
A property editor for objects that have themselves been defined as editable in the
GenericObjectEditor configuration file, which lists the possible
values that can be selected and then configured.
The configuration file is called "GenericObjectEditor.props"
and may live in either the location given by "user.home" or the
current directory (this last will take precedence), and a default
properties file is read from the weka distribution.
![Page 39: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/39.jpg)
Weka ~ GenericObjectEditor
This editor lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.
![Page 40: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/40.jpg)
Sample - Cluster
Attributes for Cluster
![Page 41: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/41.jpg)
Weka’s Viewer
![Page 42: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/42.jpg)
PCA Analysis
![Page 43: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/43.jpg)
Pre-Processing Retrievals
Before After
![Page 44: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/44.jpg)
Retrieving Significant Attributes
![Page 45: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/45.jpg)
![Page 46: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/46.jpg)
Algorithms
![Page 47: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/47.jpg)
Feature Selection
Some columns are noisy or redundant. This noise makes it more difficult to
discover meaningful patterns from the data.
To discover quality patterns on a high-dimensional data set, most data mining
algorithms require a much larger training data set.
Feature selection, also known as variable selection, feature
reduction, attribute selection or variable subset selection,
is the technique of selecting a subset of relevant features for building
robust learning models
![Page 48: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/48.jpg)
Attribute Selection
Attribute selection involves searching through all possible combinations of
attributes in the data to find which subset of attributes works best for
prediction.
To do this, two objects must be set up:
The evaluator determines what method is used to assign a worth to each
subset of attributes.
The search method determines what style of search is performed.
The Attribute Selection Mode box has two options:
1. Use full training set.
2. Cross-validation.
![Page 49: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/49.jpg)
Attribute Selection
Very flexible: arbitrary combination of search and evaluation methods
Both filtering and wrapping methods
Search methods: best-first, genetic, ranking, ...
Evaluation measures: Relief, information gain, gain ratio, ...
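As one concrete pairing of an evaluation measure with a search method, the sketch below scores each attribute by information gain and then ranks the attributes by that score. It is plain Python written for illustration, with made-up function names and a made-up six-row dataset. It is not WEKA's implementation, and it scores attributes one at a time rather than searching over subsets.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, target):
    """Evaluator = information gain, search = simple ranking (best first in the list)."""
    attributes = [a for a in rows[0] if a != target]
    scored = [(information_gain(rows, a, target), a) for a in attributes]
    return sorted(scored, reverse=True)


# Toy data invented for this example.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

ranking = rank_attributes(ROWS, "play")
for gain, name in ranking:
    print(f"{name}: {gain:.3f}")
```

On this toy data `outlook` ranks above `windy`, because knowing the outlook removes more uncertainty about `play`.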
![Page 50: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/50.jpg)
Applying Algorithm
![Page 51: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/51.jpg)
Best Attribute
![Page 52: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/52.jpg)
Algorithm……
![Page 53: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/53.jpg)
![Page 54: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/54.jpg)
Classification
Classification is a data mining function that assigns items in a collection to
target categories or classes.
The goal of classification is to accurately predict the target class for each
case in the data.
A classification task begins with a data set in which the class assignments
are known.
For example, a classification model that predicts credit risk could be
developed based on observed data for many loan applicants over a period of
time
![Page 55: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/55.jpg)
Classification ~ Naive Bayes classifier
A naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any
other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and
about 4" in diameter.
Even if these features depend on each other or upon the existence of the other
features, a naive Bayes classifier considers all of these properties to
independently contribute to the probability that this fruit is an apple.
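The fruit example can be sketched as a small categorical naive Bayes classifier in plain Python. The training rows, the `train` and `predict` names, and the use of add-one (Laplace) smoothing are assumptions made for this illustration. The slide does not specify them, and this is not WEKA's NaiveBayes class.

```python
from collections import Counter, defaultdict
from math import log


def train(rows, labels):
    """Count classes and, per class, how often each feature value occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(lambda: defaultdict(Counter))  # [label][feature][value]
    values_seen = defaultdict(set)                            # distinct values per feature
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[label][feature][value] += 1
            values_seen[feature].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest log posterior for one instance."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = log(count / total)                            # log prior
        for feature, value in row.items():
            seen = value_counts[label][feature][value]
            # Each feature contributes independently; add-one smoothing avoids log(0).
            score += log((seen + 1) / (count + len(values_seen[feature])))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data invented for this example.
FRUITS = [
    {"color": "red", "shape": "round", "size": "medium"},
    {"color": "red", "shape": "round", "size": "medium"},
    {"color": "green", "shape": "round", "size": "medium"},
    {"color": "yellow", "shape": "long", "size": "medium"},
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "yellow", "shape": "long", "size": "large"},
]
LABELS = ["apple", "apple", "apple", "other", "other", "other"]

model = train(FRUITS, LABELS)
print(predict(model, {"color": "red", "shape": "round", "size": "medium"}))
print(predict(model, {"color": "yellow", "shape": "long", "size": "large"}))
```

The per-feature terms are simply added in log space, which is the "naive" independence assumption described above: no term looks at any other feature.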
![Page 56: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/56.jpg)
Naive Bayes Classifier
![Page 57: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/57.jpg)
Confusion Matrix –Pervasive Role
![Page 58: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/58.jpg)
Confusion Matrix - Dataset
![Page 59: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/59.jpg)
Second Fold -Classification
![Page 60: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/60.jpg)
![Page 61: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/61.jpg)
Algorithms
![Page 62: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/62.jpg)
Clustering
Clustering is the task of assigning a set of objects into groups
(called clusters) so that the objects in the same cluster are more similar (in
some sense or another) to each other than to those in other clusters.
Clustering belongs to unsupervised learning.
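A minimal sketch of one common clustering algorithm, k-means, in plain Python. The points, the starting centroids and the fixed iteration count are assumptions chosen for illustration. WEKA's own k-means implementation has more options and is not reproduced here.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    # The returned clusters are the assignments made in the final iteration.
    return centroids, clusters


# Toy data invented for this example: two visually separate groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
START = [(1.0, 1.0), (8.0, 8.0)]

centroids, clusters = kmeans(POINTS, START)
print(centroids)
print(clusters)
```

No target attribute is used anywhere: the groups emerge only from the distances between points, which is what makes the task unsupervised.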
![Page 63: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/63.jpg)
Example ~ Weka
![Page 64: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/64.jpg)
Attributes Replacements
![Page 65: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/65.jpg)
Updations
![Page 66: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/66.jpg)
K- Means
![Page 67: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/67.jpg)
Visualizer
Open Saved File
Save File =>Will Store in ARFF
![Page 68: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/68.jpg)
Visualizer – Samples
![Page 69: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/69.jpg)
![Page 70: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/70.jpg)
Association rules
Association rules are if/then statements that help uncover relationships
between seemingly unrelated data in a relational database or other
information repository.
Association rule mining finds frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases.
An example of an association rule would be "If a customer buys a dozen
eggs, he is 90% likely to also purchase milk."
Market Basket Analysis
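The "90% likely" figure in a rule like the eggs and milk example is usually called the rule's confidence, and how often the items occur together is its support. A minimal Python sketch follows. The baskets are invented for illustration, so the numbers differ from the 90% quoted above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:                      # rule never applies in this data
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Toy market baskets invented for this example.
BASKETS = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk", "butter"},
    {"eggs", "bread"},
    {"milk", "bread"},
]

rule_support = support(BASKETS, {"eggs", "milk"})
rule_confidence = confidence(BASKETS, {"eggs"}, {"milk"})
print(f"support(eggs, milk) = {rule_support:.2f}")
print(f"confidence(eggs -> milk) = {rule_confidence:.2f}")
```

Rule-mining algorithms typically keep only rules whose support and confidence both exceed user-chosen thresholds.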
![Page 71: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/71.jpg)
Association
![Page 72: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/72.jpg)
Description
![Page 73: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/73.jpg)
Rules Framing
Rules Set
![Page 74: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/74.jpg)
Visualize
![Page 75: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/75.jpg)
Result Analysis
Weka
Result 2
Result 1
Concept
![Page 76: A Kit For Knowledge Discovery](https://reader036.fdocuments.in/reader036/viewer/2022062315/5681513c550346895dbf556c/html5/thumbnails/76.jpg)