Post on 11-May-2015
1
Introduction to Data Mining
Chapter 4
2
Chapter 4 Outline
– Background
– Information is Power
– Knowledge is Power
– Data Mining
3
Introduction
4
Information is Power
Relevant
Right Information
Globalised world
Vast amount of information available
5
What is Information?
A collection of data
The act of human analysis and interpretation of activities
Decomposing it into various components and tackling them
6
What is Knowledge?
The act of human synthesis and evaluation of information
Integration of the relevant components to form a relevant whole system.
7
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
8
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
9
Data Mining Definition I
The nontrivial extraction of hidden, previously unidentified, and potentially valuable knowledge from data
The use of a variety of techniques, such as neural networks, decision trees, or standard statistical techniques, to identify nuggets of information or decision-making knowledge in bodies of data and to extract them in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation.
10
Data Mining Definition II
Finding hidden information in a database
11
Hidden Information
Number of years of experience
Great secret recipes
Success factors
12
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems]
13
What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
14
Database Processing vs. Data Mining Processing
Database processing
Query
– Well defined
– SQL
Data
– Operational data
Output
– Precise
– Subset of database
Data mining processing
Query
– Poorly defined
– No precise query language
Data
– Not operational data
Output
– Fuzzy
– Not a subset of database
15
Query Examples
Database
– Find all credit applicants with the surname Lee.
– Identify customers who have purchased more than $100,000 in the last year.
– Find all customers who have purchased bread.
Data Mining
– Find all credit applicants who are good credit risks. (classification)
– Identify customers with similar eating habits. (clustering)
– Find all items which are frequently purchased with bread. (association rules)
16
Data Mining Models and Tasks
17
Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
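The split-then-validate procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten records are the ones from this chapter's classification example (Attrib3 in thousands), holding out the last three records is an arbitrary choice, and `majority_class_model` is a hypothetical baseline that ignores the attributes entirely.

```python
from collections import Counter

# Records as (attrib1, attrib2, attrib3_in_thousands, class_label).
records = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]


def split(data, n_test):
    """Hold out the last n_test records as the test set (assumed split rule)."""
    return data[:-n_test], data[-n_test:]


def majority_class_model(training_set):
    """Build a baseline model that always predicts the most common class."""
    most_common = Counter(r[-1] for r in training_set).most_common(1)[0][0]
    return lambda attributes: most_common


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(model(r[:-1]) == r[-1] for r in test_set)
    return correct / len(test_set)


training_set, test_set = split(records, n_test=3)
model = majority_class_model(training_set)
test_accuracy = accuracy(model, test_set)
print(f"accuracy on {len(test_set)} held-out records: {test_accuracy:.2f}")
```

A real classifier would replace the baseline with a model that actually uses the attributes; the split and the accuracy computation stay the same.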
18
Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
[Figure: Induction: a learning algorithm learns a model from the training set. Deduction: the model is applied to the test set.]
19
Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
20
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
21
Example of a Decision Tree
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
– Yes: NO
– No: MarSt
  – Married: NO
  – Single, Divorced: TaxInc
    – < 80K: NO
    – > 80K: YES
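As a rough sketch, the tree on this slide can be written as nested conditions and checked against the ten training records. The function name `classify` is illustrative, incomes are in thousands, and sending an income of exactly 80K to the YES branch is an assumption, since the slide labels only "< 80K" and "> 80K".

```python
# Training data from the slide: (refund, marital_status, taxable_income_k, cheat).
training_data = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def classify(refund, marital_status, taxable_income_k):
    """Follow the tree: split on Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumption: exactly 80K goes to the YES branch (the slide leaves it open).
    return "No" if taxable_income_k < 80 else "Yes"


predictions = [classify(r, m, inc) for r, m, inc, _ in training_data]
training_errors = sum(p != row[3] for p, row in zip(predictions, training_data))
print("training errors:", training_errors)
```

Zero training errors only shows that the tree fits the data it was built from; accuracy on unseen records still has to be measured on a test set.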
22
Another Example of Decision Tree
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)
Model: Decision Tree
MarSt
– Married: NO
– Single, Divorced: Refund
  – Yes: NO
  – No: TaxInc
    – < 80K: NO
    – > 80K: YES
There could be more than one tree that fits the same data!
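A quick way to see that differently shaped trees can encode the same decisions is to compare the Refund-first tree and the MarSt-first tree over a grid of attribute values. The sketch below is illustrative only: the function names and the probe incomes are hypothetical, and exactly 80K is assumed to fall on the YES side of the TaxInc split.

```python
from itertools import product


def tree_refund_first(refund, marital_status, income_k):
    """Tree with Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "No" if income_k < 80 else "Yes"  # 80K boundary handling is assumed


def tree_marst_first(refund, marital_status, income_k):
    """Tree with MarSt at the root, then Refund, then TaxInc."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income_k < 80 else "Yes"  # same assumed boundary handling


refund_values = ["Yes", "No"]
marital_values = ["Single", "Married", "Divorced"]
income_values = [60, 79, 80, 81, 125]  # hypothetical probes around the threshold

disagreements = [
    combo
    for combo in product(refund_values, marital_values, income_values)
    if tree_refund_first(*combo) != tree_marst_first(*combo)
]
print("disagreements:", disagreements)
```

These two particular trees agree on every probe; in general, two trees that both fit the training data may still disagree on unseen records.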
23
Decision Tree Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
[Figure: Induction: a tree induction algorithm learns a decision tree (the model) from the training set. Deduction: the decision tree is applied to the test set.]
24
Apply Model to Test Data
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Start from the root of the tree.
Refund
– Yes: NO
– No: MarSt
  – Married: NO
  – Single, Divorced: TaxInc
    – < 80K: NO
    – > 80K: YES
29
Apply Model to Test Data
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Refund
– Yes: NO
– No: MarSt
  – Married: NO
  – Single, Divorced: TaxInc
    – < 80K: NO
    – > 80K: YES
Path followed: Refund = No, then MarSt = Married, which ends at the leaf NO.
Assign Cheat to “No”
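The walk from the root to a leaf can be sketched as a small traversal routine. The nested-dictionary representation, the name `apply_model`, and the handling of exactly 80K at the TaxInc node are assumptions made for illustration; the slides show the tree only as a diagram.

```python
# Internal nodes are dicts holding an attribute name and a list of
# (test, branch_label, child) triples; leaves are plain class labels.
tree = {
    "attribute": "Refund",
    "branches": [
        (lambda v: v == "Yes", "Yes", "NO"),
        (lambda v: v == "No", "No", {
            "attribute": "MarSt",
            "branches": [
                (lambda v: v == "Married", "Married", "NO"),
                (lambda v: v in ("Single", "Divorced"), "Single, Divorced", {
                    "attribute": "TaxInc",
                    "branches": [
                        (lambda v: v < 80, "< 80K", "NO"),
                        # Assumption: exactly 80K is routed to the YES branch.
                        (lambda v: v >= 80, ">= 80K", "YES"),
                    ],
                }),
            ],
        }),
    ],
}


def apply_model(node, record):
    """Walk from the root to a leaf; return (leaf_label, path_taken)."""
    path = []
    while isinstance(node, dict):
        value = record[node["attribute"]]
        for test, label, child in node["branches"]:
            if test(value):
                path.append(f'{node["attribute"]} = {label}')
                node = child
                break
        else:
            raise ValueError(f"no branch for {node['attribute']}={value!r}")
    return node, path


test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = apply_model(tree, test_record)
print(" -> ".join(path), "->", label)
```

For this record the TaxInc node is never reached, so the 80K boundary assumption does not affect the result.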
31
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
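To make the two distance notions concrete, the sketch below compares the mean distance within each cluster to the mean distance between clusters. The points and their cluster assignments are hypothetical, and averaging pairwise Euclidean distances is just one of several possible ways to measure cluster quality.

```python
from itertools import combinations, product
from math import dist

# Hypothetical 2-D points, already assigned to two clusters for illustration.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_intra_cluster_distance(points):
    """Average distance between pairs of points in the same cluster."""
    return mean(dist(p, q) for p, q in combinations(points, 2))


def mean_inter_cluster_distance(points_a, points_b):
    """Average distance between points drawn from two different clusters."""
    return mean(dist(p, q) for p, q in product(points_a, points_b))


intra = {name: mean_intra_cluster_distance(pts) for name, pts in clusters.items()}
inter = mean_inter_cluster_distance(clusters["A"], clusters["B"])

for name, value in intra.items():
    print(f"intra-cluster distance for {name}: {value:.2f}")
print(f"inter-cluster distance A-B: {inter:.2f}")
```

A good clustering of these points keeps the intra-cluster values small relative to the inter-cluster value.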
32
Applications of Cluster Analysis
Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Summarization
– Reduce the size of large data sets
33
What is not Cluster Analysis?
Supervised classification
– Have class label information
Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
Results of a query
– Groupings are a result of an external specification
Graph partitioning
– Some mutual relevance and synergy, but areas are not identical
34
Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same set of points grouped as Two Clusters, Four Clusters, and Six Clusters]
35
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
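The relation between the two kinds of clustering can be illustrated with a small agglomerative procedure: merging the two closest clusters repeatedly yields a nested hierarchy, and stopping at any level yields a partition. The 1-D coordinates and the single-link distance are assumptions chosen for brevity; the slides do not prescribe a particular algorithm.

```python
from itertools import combinations

# Hypothetical 1-D data objects, named like the points in the slides' figures.
points = {"p1": 1.0, "p2": 1.5, "p3": 5.0, "p4": 5.4}


def single_link(cluster_a, cluster_b):
    """Distance between the closest pair of members (single-link, assumed)."""
    return min(abs(points[a] - points[b]) for a in cluster_a for b in cluster_b)


def agglomerate(names):
    """Return the clustering at every level, from singletons to one cluster."""
    clustering = [frozenset([n]) for n in names]
    levels = [list(clustering)]
    while len(clustering) > 1:
        a, b = min(combinations(clustering, 2), key=lambda pair: single_link(*pair))
        clustering = [c for c in clustering if c not in (a, b)] + [a | b]
        levels.append(list(clustering))
    return levels


levels = agglomerate(sorted(points))
for level in levels:
    print([sorted(c) for c in level])

# Cutting the hierarchy at the two-cluster level gives a partitional clustering:
# every object belongs to exactly one subset.
partition = levels[-2]
```

Each printed level is itself a partition, and each cluster at one level is contained in a cluster at the next, which is what makes the set of clusters nested.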
36
Partitional Clustering
[Figure: Original Points and A Partitional Clustering of them]
37
Hierarchical Clustering
[Figure: Traditional Hierarchical Clustering of points p1 to p4 with its Traditional Dendrogram; Non-traditional Hierarchical Clustering of the same points with its Non-traditional Dendrogram]
38
Association Rules
Association rules are a data mining technique and complement market basket analysis.
All association rules are unidirectional and take the following form:
Left-hand side rule IMPLIES Right-hand side rule
Both the left-hand side and the right-hand side of the rule may contain multiple items or combinations of items, such as the following:
Yellow Peppers IMPLIES Red Peppers, Bananas, and Bakery
Associations are written as A → B, where A is called the antecedent or left-hand side (LHS) and B is called the consequent or right-hand side (RHS).
– Ex: “If people buy a printer then they buy a cartridge”
» The antecedent is “buy printer” and the consequent is “buy cartridge”
39
Association Rules
Market Basket Analysis
– Necessary to have a list of transactions and what was purchased in each one.
– Ex:
Transaction 1: Frozen Pizza, Cola, Milk
Transaction 2: Milk, Potato Chips
Transaction 3: Cola, Frozen Pizza
Transaction 4: Milk, Pretzels
Transaction 5: Cola, Pretzels
40
Association Rules
              Frozen Pizza  Milk  Cola  Potato Chips  Pretzels
Frozen Pizza  2             1     2     0             0
Milk          1             3     1     1             1
Cola          2             1     3     0             1
Potato Chips  0             1     0     1             0
Pretzels      0             1     1     0             2
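The table above can be reproduced by counting, for every pair of items, the transactions that contain both. The sketch uses the five example transactions from the market basket slide; the variable names are illustrative.

```python
from itertools import product

items = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

# co_occurrence[a][b] = number of transactions containing both a and b.
# The diagonal entry co_occurrence[a][a] is the number of baskets containing a.
co_occurrence = {a: {b: 0 for b in items} for a in items}
for basket in transactions:
    for a, b in product(basket, repeat=2):
        co_occurrence[a][b] += 1

for a in items:
    print(f"{a:<13}", *(co_occurrence[a][b] for b in items))
```

The matrix is symmetric because co-occurrence does not depend on the order of the two items.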
41
Association Rules
Measures of Association
– Support
» The percentage of baskets in the analysis where the rule is true, that is, where both the left-hand side and the right-hand side of the association are found.
– Confidence
» The percentage of baskets having the left-hand side item that also contain the right-hand side item. Confidence differs from support in that it is the probability that the right-hand side item is present given that the left-hand side item is in the basket.
» Calculated as a ratio: (frequency of A and B) / (frequency of A)
42
Association Rules
Measures of Association
– The support measure
• for the rule
“Cola IMPLIES Frozen Pizza” is 40%
“Frozen Pizza IMPLIES Cola” is 40%
• for the single item
“Milk” is 60%
(Note: support considers only the combination and not the direction.)
43
Association Rules
Measures of Association
– Confidence
“Milk IMPLIES Potato Chips” (A = Milk, B = Potato Chips) has confidence:
= (frequency of A and B) / (frequency of A)
= 20% / 60%
= 33%
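Both measures can be computed directly from the five example transactions. The helper names `support` and `confidence` are illustrative, not part of any particular library.

```python
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(lhs, rhs):
    """Support of LHS and RHS together, divided by the support of LHS."""
    return support(set(lhs) | set(rhs)) / support(lhs)


print(f"support(Cola, Frozen Pizza) = {support({'Cola', 'Frozen Pizza'}):.0%}")
print(f"support(Milk) = {support({'Milk'}):.0%}")
print(f"confidence(Milk -> Potato Chips) = {confidence({'Milk'}, {'Potato Chips'}):.0%}")
```

Unlike support, confidence depends on direction: with these transactions, `confidence({'Frozen Pizza'}, {'Cola'})` is 2/2 while `confidence({'Cola'}, {'Frozen Pizza'})` is 2/3.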
44
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
45
KDD Process
Selection (Pre-Mining 1): Obtain data from various sources.
Preprocessing (Pre-Mining 2): Cleanse data.
Transformation (Pre-Mining 3): Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation (Post-Mining): Present results to user in meaningful manner.
Modified from [FPSS96C]
46
KDD Process Ex: Web Log
Selection:
– Select log data (dates and locations) to use
Preprocessing:
– Remove identifying URLs
– Remove error logs
Transformation:
– Sessionize (sort and group)
Data Mining:
– Identify and count patterns
– Construct data structure
Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
Potential User Applications:
– Cache prediction
– Personalisation
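The web log steps can be sketched end to end on a toy log. Everything concrete here is an assumption for illustration: the log format, the 30-minute session gap, treating HTTP status 400 and above as error entries, and counting consecutive page pairs as the "patterns". Selection and the removal of identifying URLs are omitted for brevity.

```python
from collections import Counter

# Hypothetical raw log entries: (user, timestamp_seconds, url, http_status).
raw_log = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 45, "/missing", 404),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 50, "/products", 200),
    ("u1", 4000, "/home", 200),
    ("u1", 4020, "/products", 200),
]

SESSION_GAP = 1800  # assumed: 30 minutes of inactivity starts a new session


def preprocess(log):
    """Preprocessing: remove error entries (status 400 and above)."""
    return [entry for entry in log if entry[3] < 400]


def sessionize(log):
    """Transformation: sort by user and time, then group into sessions."""
    sessions = []
    last_seen = {}
    current = {}
    for user, ts, url, _ in sorted(log, key=lambda e: (e[0], e[1])):
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            current[user] = []
            sessions.append(current[user])
        current[user].append(url)
        last_seen[user] = ts
    return sessions


def count_patterns(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(preprocess(raw_log))
pattern_counts = count_patterns(sessions)

# Interpretation/evaluation: display the most frequently accessed sequences.
for (src, dst), n in pattern_counts.most_common(2):
    print(f"{src} -> {dst}: {n}")
```

Frequent pairs like these are the kind of result that could feed cache prediction or personalisation.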
47
Data Mining Development
• Similarity Measures • Hierarchical Clustering • IR Systems • Imprecise Queries • Textual Data • Web Search Engines
• Bayes Theorem • Regression Analysis • EM Algorithm • K-Means Clustering • Time Series Analysis
• Neural Networks • Decision Tree Algorithms
• Algorithm Design Techniques • Algorithm Analysis • Data Structures
• Relational Data Model • SQL • Association Rule Algorithms • Data Warehousing • Scalability Techniques
48
Data mining: What it can’t do
Tell the value of the patterns to the organization
Replace skilled business analysts or managers
Automatically discover solutions without guidance