PowerPoint Presentation -Eco
Indian Institute of Management (IIM),Rohtak
Recap and Discussion For Mid Term Exam
20 objective questions: 20 minutes
Test: 120 minutes
Total duration: 140 minutes
Basic E-R notation
Relationship degrees specify number of entity types involved
Entity symbols
A special entity that is also a relationship
Relationship symbols
Relationship cardinalities specify how many of each entity type are allowed
Attribute symbols
A composite attribute
An attribute broken into component parts
Entity with multivalued attribute (Skill) and derived attribute (Years_Employed)
Multivalued: an employee can have more than one skill
Derived: from date employed and current date
-
Both outpatients and resident patients are cared for by a responsible physician
Only resident patients are assigned to a bed
Total specialization rule
A patient must be either an outpatient or a resident patient
Partial specialization rule
A vehicle could be a car, a truck, or neither
Disjoint rule
A patient can either be outpatient or resident, but not both
-
Overlap rule
A part may be both purchased and manufactured
Example of supertype/subtype hierarchy
Primary key
Foreign key (implements 1:N relationship between customer and order)
Combined, these are a composite primary key (uniquely identifies the order line); individually they are foreign keys (implement M:N relationship between order and product)
-
One-to-many relationship between original entity and new relation
Multivalued attribute becomes a separate relation with foreign key
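A minimal sketch of this mapping, using the Employee/Skill example from the E-R notation slide. The table and column names (employee, employee_skill, employee_id, skill) and the sample employee are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL
);
-- The multivalued attribute Skill becomes its own relation.
-- employee_id is a foreign key back to the original entity, and
-- (employee_id, skill) together form the composite primary key.
CREATE TABLE employee_skill (
    employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
    skill       TEXT NOT NULL,
    PRIMARY KEY (employee_id, skill)
);
""")

# Hypothetical sample data: one employee with two skills (one-to-many).
conn.execute("INSERT INTO employee VALUES (1, 'A. Example')")
conn.executemany(
    "INSERT INTO employee_skill VALUES (?, ?)",
    [(1, "SQL"), (1, "Python")],
)

skills = [
    row[0]
    for row in conn.execute(
        "SELECT skill FROM employee_skill WHERE employee_id = 1 ORDER BY skill"
    )
]
print(skills)
```

Each skill is one row in the new relation, so an employee can have any number of skills without repeating groups in the employee table.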
Manufactured ? Purchased ?
Identify Dependency and Relationship in Tabular form
Order_id | Order_date | Customer_id | Customer_Name | Customer_Address | Product_id | Product_details | Product_finish | Unit_price | Ordered_quantity
1006 | 10/24/2006 | 2 | Value Furniture | Plano, TX | 7 | Dining Table | Natural Ash | 800 | 2
1006 | 10/24/2006 | 2 | Value Furniture | Plano, TX | 5 | Writers Desk | Cherry | 325 | 2
1006 | 10/24/2006 | 2 | Value Furniture | Plano, TX | 4 | Entertainment Center | Natural Maple | 650 | 1
1007 | 10/25/2006 | 6 | Furniture Gallery | Boulder, CO | 11 | 4-Dr Dresser | Oak | 500 | 4
1007 | 10/25/2006 | 6 | Furniture Gallery | Boulder, CO | 4 | Entertainment Center | Natural Maple | 650 | 3
-
Functional Dependency Diagram
Customer_ID → Customer_Name, Customer_Address
Product_ID → Product_Description, Product_Finish, Unit_Price
Order_ID, Product_ID → Order_Quantity
Order_ID → Order_Date, Customer_ID, Customer_Name, Customer_Address
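The dependencies above can be checked against the sample order rows with a short sketch. The column names follow the table headers (Product_details, Ordered_quantity) rather than the diagram labels, and the helper name `holds` is an assumption made for illustration. Note that sample data can only refute a functional dependency; it cannot prove that one holds in general.

```python
COLUMNS = (
    "Order_id", "Order_date", "Customer_id", "Customer_Name",
    "Customer_Address", "Product_id", "Product_details",
    "Product_finish", "Unit_price", "Ordered_quantity",
)

# The five sample rows from the order table.
ROWS = [
    (1006, "10/24/2006", 2, "Value Furniture", "Plano, TX", 7, "Dining Table", "Natural Ash", 800, 2),
    (1006, "10/24/2006", 2, "Value Furniture", "Plano, TX", 5, "Writers Desk", "Cherry", 325, 2),
    (1006, "10/24/2006", 2, "Value Furniture", "Plano, TX", 4, "Entertainment Center", "Natural Maple", 650, 1),
    (1007, "10/25/2006", 6, "Furniture Gallery", "Boulder, CO", 11, "4-Dr Dresser", "Oak", 500, 4),
    (1007, "10/25/2006", 6, "Furniture Gallery", "Boulder, CO", 4, "Entertainment Center", "Natural Maple", 650, 3),
]


def holds(lhs, rhs, rows=ROWS, columns=COLUMNS):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        record = dict(zip(columns, row))
        key = tuple(record[c] for c in lhs)
        value = tuple(record[c] for c in rhs)
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(holds(["Customer_id"], ["Customer_Name", "Customer_Address"]))
print(holds(["Product_id"], ["Product_details", "Product_finish", "Unit_price"]))
print(holds(["Order_id", "Product_id"], ["Ordered_quantity"]))
print(holds(["Order_id"], ["Order_date", "Customer_id"]))
# Order_id alone does not determine the quantity: order 1006 has 2 and 1.
print(holds(["Order_id"], ["Ordered_quantity"]))
```

The last check fails on the sample rows, which is why Ordered_quantity depends on the composite key (Order_id, Product_id).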
Identify Dependency and Relationship in Tabular form
St_ID | L_Name | F_Name | Phone_No | St_Lic | Lic_No | Ticket# | Date | Code | Fine
38249 | Brown | Thomas | 111-7804 | FL | BRY 123 | 15634 | 10/17/08 | 2 | $25
38249 | Brown | Thomas | 111-7804 | FL | BRY 123 | 16017 | 11/13/08 | 1 | $15
82453 | Green | Sally | 391-1689 | AL | TRE 141 | 14987 | 10/05/08 | 3 | $100
82453 | Green | Sally | 391-1689 | AL | TRE 141 | 16293 | 11/13/08 | 1 | $15
82453 | Green | Sally | 391-1689 | AL | TRE 141 | 17892 | 12/13/08 | 2 | $25
-
Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture. Data sources (operational DBs, other sources) feed an extract, transform, load, and refresh step, supported by a monitor & integrator and metadata; data storage holds the data warehouse and data marts, which serve the OLAP engine (OLAP server); front-end tools provide analysis, query, and reports.]
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
A Sample Data Cube
[Figure: data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
-
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
Cube Aggregation: Roll-up
Example: computing sums

day 1     s1    s2    s3
p1        12          50
p2        11     8

day 2     s1    s2    s3
p1        44     4
p2

rollup, sum over days (drill-down is the reverse):
          s1    s2    s3
p1        56     4    50
p2        11     8

sum over p1, p2:
          s1    s2    s3
sum       67    12    50

sum over s1, s2, s3:
          sum
p1        110
p2         19

grand total: 129
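A small sketch of the roll-up sums in this example. The dimension names ("day", "product", "store") and the function name `roll_up` are assumptions made for illustration; the slide only labels the axes p1, p2 and s1 to s3. The cell values are those of the day 1 and day 2 tables.

```python
from collections import defaultdict

DIMENSIONS = ("day", "product", "store")

# Non-empty cells of the cube: (day, product, store, amount).
facts = [
    ("day 1", "p1", "s1", 12),
    ("day 1", "p1", "s3", 50),
    ("day 1", "p2", "s1", 11),
    ("day 1", "p2", "s2", 8),
    ("day 2", "p1", "s1", 44),
    ("day 2", "p1", "s2", 4),
]


def roll_up(rows, keep):
    """Sum the amounts, grouping only by the dimensions listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


by_product_store = roll_up(facts, ("product", "store"))  # days summed away
by_store = roll_up(facts, ("store",))
by_product = roll_up(facts, ("product",))
grand_total = roll_up(facts, ())

print(by_product_store)
print(by_store)
print(by_product)
print(grand_total)
```

Each call removes one or more dimensions, which is the "dimension reduction" form of roll-up; drill-down goes back toward the detailed cells.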
Data Mining: A KDD Process
Data mining: core of the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
Bread → Butter [support 0.5%, confidence 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation: decision-tree, classification rule, neural network
Predict some unknown or missing numerical values
-
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? No! Useful in fraud detection, rare events analysis
Other pattern-directed or statistical analyses
Association Rule: a concept of Mining
A 'rule' is something like this: If a basket contains Bread and Butter, then it also contains Milk
Any such rule has two associated measures:
1. confidence: when the 'if' part is true, how often is the 'then' part true? This is the same as accuracy.
Confidence(A → B) = P(A and B) / P(A)
2. coverage or support: how much of the database contains the items of the rule?
Support(A → B) = P(A and B)
Transaction ID | Items Bought
1 | Trouser, Shirt, Jacket
2 | Trouser, Jacket
3 | Trouser, Jeans
4 | Shirt, Sweatshirt

If the minimum support is 50%, then {Trouser, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset | Support
{Trouser} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Trouser, Jacket} | 50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Trouser → Jacket: Support = 50%, Confidence = 66%
Jacket → Trouser: Support = 50%, Confidence = 100%
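The support and confidence figures in this example can be reproduced with a short sketch. The function names `support` and `confidence` are assumptions made for illustration; support is taken as the fraction of transactions containing the whole itemset.

```python
from itertools import combinations

transactions = {
    1: {"Trouser", "Shirt", "Jacket"},
    2: {"Trouser", "Jacket"},
    3: {"Trouser", "Jeans"},
    4: {"Shirt", "Sweatshirt"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


all_items = sorted(set().union(*transactions.values()))
min_support = 0.5

# 2-itemsets that meet the minimum support.
frequent_pairs = [
    set(pair)
    for pair in combinations(all_items, 2)
    if support(set(pair)) >= min_support
]

print(frequent_pairs)
print(confidence({"Trouser"}, {"Jacket"}))   # 2 of the 3 Trouser baskets
print(confidence({"Jacket"}, {"Trouser"}))   # both Jacket baskets
```

The two rules share the same support but differ in confidence, because the denominators (baskets with Trouser versus baskets with Jacket) differ.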
The Apriori Algorithm: Basics
Computational Complexity Given d unique items:
Total number of possible association
-
Step 1: Generating 1-itemset Frequent Pattern
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The set of frequent 1-itemsets, L1 , consists of the candidate 1-itemsets satisfying minimum support.
Step 2: Generating 2-itemset Frequent Pattern [Cont.]
Step 2: Generating 2-itemset Frequent Pattern
Itemset | Sup. Count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2
-
Step 3: Generating 3-itemset Frequent Pattern
Step 3: Generating 3-itemset Frequent Pattern [Cont.]
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. The 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
BUT {I3, I5} is not a member of L2, so it is not frequent, and keeping {I2, I3, I5} would violate the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
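The join and prune steps described above can be sketched as follows. The L2 list is an assumption: the slide's L2 table did not survive in this transcript, so the six itemsets below are chosen to be consistent with the text ({I1, I2}, {I1, I3}, {I2, I3} and {I2, I5} in L2, {I3, I5} not in L2) and with the resulting C3. The function names `join` and `prune` are also illustrative.

```python
from itertools import combinations

# ASSUMED L2 (not shown in the transcript), consistent with the text above.
L2 = [
    ("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
    ("I2", "I3"), ("I2", "I4"), ("I2", "I5"),
]


def join(frequent):
    """Join step: merge two k-itemsets that share their first k-1 items."""
    frequent = sorted(frequent)
    candidates = []
    for i, first in enumerate(frequent):
        for second in frequent[i + 1:]:
            if first[:-1] == second[:-1]:
                candidates.append(first + (second[-1],))
    return candidates


def prune(candidates, frequent):
    """Prune step: keep a candidate only if all its k-item subsets are frequent."""
    frequent_set = set(frequent)
    return [
        candidate
        for candidate in candidates
        if all(
            subset in frequent_set
            for subset in combinations(candidate, len(candidate) - 1)
        )
    ]


joined = join(L2)
C3 = prune(joined, L2)

print(joined)  # six candidates from the join
print(C3)      # the two candidates that survive pruning
```

Pruning avoids counting candidates in the database when one of their subsets is already known to be infrequent; the surviving candidates still need a scan of D to decide membership in L3.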
Step 3: Generating 3-itemset Frequent Pattern
Step 4: Generating 4-itemset Frequent Pattern
-
Step 5: Generating Association Rules from Frequent Itemsets
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]
Best Found Result @WEKA
Best rules found:
1. I5=Y 2 ==> I1=Y 2 conf:(1)
2. I4=Y 2 ==> I2=Y 2 conf:(1)
3. I5=Y 2 ==> I2=Y 2 conf:(1)
4. I2=Y I5=Y 2 ==> I1=Y 2 conf:(1)
5. I1=Y I5=Y 2 ==> I2=Y 2 conf:(1)
6. I5=Y 2 ==> I1=Y I2=Y 2 conf:(1)
Criticism to Support and Confidence
Strong Rules Are Not Necessarily Interesting
Total transactions: 10,000 (C: computers, V: video)
V: 7,500; C: 6,000; C and V: 4,000
min_support: 0.3; min_conf: 0.60
Consider the rule: buy(X: computer) → buy(X: video)
Support: P(C and V) = 4000/10000 = 0.4
Confidence: P(C and V) / P(C) = 4000/6000 = 66%
The rule is strong, BUT the probability of buying a video is 0.75, so buying a computer reduces the probability of buying a video from 0.75 to 0.66.
Computer and video are negatively correlated.
-
Lift of A → B
Lift or corr(A, B) = P(A and B) / (P(A) * P(B))
Ratio of the probability of buying A and B together to the probability of buying A and B if they were independent
Or it can be interpreted as: the conditional probability of buying B given that A is purchased, divided by the unconditional probability of buying B, taking both P(A) and P(B) into consideration
P(A and B) = P(A) * P(B) if A and B are independent events, in which case the lift equals 1
A and B are negatively correlated if the value is less than 1; A and B are positively correlated if it is greater than 1
          C       not C    sum
V         4000    3500     7500
not V     2000     500     2500
sum       6000    4000    10000
From the table, we can see that the probability of purchasing a computer game is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and the probability of purchasing both is P({game, video}) = 0.40. The lift of the rule is P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
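As a quick check of these figures, here is a sketch that computes support, confidence, and lift from the counts in the contingency table. The variable names are assumptions made for illustration.

```python
# Counts from the contingency table (C: computer/game, V: video).
total = 10_000
count_c = 6_000
count_v = 7_500
count_c_and_v = 4_000

p_c = count_c / total
p_v = count_v / total
p_c_and_v = count_c_and_v / total

support = p_c_and_v                 # P(C and V)
confidence = p_c_and_v / p_c        # P(V | C)
lift = p_c_and_v / (p_c * p_v)      # below 1 means negative correlation

print(round(support, 2), round(confidence, 2), round(lift, 2))
```

The rule passes min_support (0.3) and min_conf (0.60), yet the lift is below 1, which is the point of the criticism: confidence alone ignores how common the consequent already is.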
Process (1): Model Construction
Training Data
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classification Algorithms
IF rank = professor OR years > 6 THEN tenured = yes
Classifier (Model)
Process (2): Using the Model in Prediction
Classifier
Testing Data
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
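A minimal sketch of the two-step process, applying the learned rule (IF rank = professor OR years > 6 THEN tenured = yes) to the testing data and to the unseen tuple. The function name `predict_tenured` is an assumption made for illustration.

```python
def predict_tenured(rank, years):
    """The classifier from the model construction step, written as a rule."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data: (name, rank, years, actual label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy on the testing data estimates how well the model generalizes.
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)

# Unseen data: (Jeff, Professor, 4).
jeff_prediction = predict_tenured("Professor", 4)

print(accuracy)
print(jeff_prediction)
```

The rule misclassifies Merlisa (7 years but not tenured), so the estimated accuracy on this small test set is below 100% even though the rule fits every training row.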
-
Decision Tree Induction: Training Dataset age income student credit_rating buys_computer
40 low yes fair yes>40 low yes excellent no3140 low yes excellent yes40 low yes excellent no3140 low yes excellent yes40 low yes excellent no3140 low yes excellent yes
-
age income student credit_rating buys_computer40 low yes fair yes>40 low yes excellent no3140 low yes excellent yes40 low yes excellent no
-
For 40 low yes excellent no40 low yes excellent no40
age income student credit_rating buys_computer40 low yes fair yes>40 low yes excellent no40
age income student credit_rating buys_computer40 low yes fair yes>40 low yes excellent no40
-
For >40
[Figure: partially built decision tree rooted at age?, with a student? node and yes/no leaves; the 31..40 branch leads to yes]
age income student credit_rating buys_computer40 low yes fair yes>40 low yes excellent no40,medium,no,fair,?
Many calculators and some programming languages do not have a log2 function. In that case, use a conversion factor: take the log of 2 in the available base and divide by it. Example: log10(2) = 0.301, so to get log2(n), divide log10(n) by 0.301: log2(3/5) = log10(3/5) / 0.301
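The conversion can be sketched in a few lines. The helper name `log2_via_log10` is an assumption made for illustration; Python's standard `math` module also provides `math.log2` directly, which is used here only as a cross-check.

```python
import math


def log2_via_log10(x):
    """Change of base: log2(x) = log10(x) / log10(2)."""
    return math.log10(x) / math.log10(2)


value = log2_via_log10(3 / 5)

print(round(math.log10(2), 3))  # the conversion factor
print(round(value, 3))
print(math.isclose(value, math.log2(3 / 5)))
```

Using the rounded factor 0.301 instead of the full value of log10(2) changes the result only in the later decimal places, which is usually acceptable for hand calculation of entropy terms.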
Bayes Classifier: Training Dataset age income studentcredit_rating_comp
40 low yes fair yes>40 low yes excellent no3140 low yes excellent yes
-
Bayes Classifier: An Example X = (age
-
So, we go to the second point (2, 5) and calculate the distance to each of the three means, using the distance function ρ(a, b) = |x2 − x1| + |y2 − y1|, where a = (x1, y1) is the point and b = (x2, y2) is the mean:
point (2, 5), mean1 (2, 10): ρ(point, mean1) = |2 − 2| + |10 − 5| = 0 + 5 = 5
point (2, 5), mean2 (5, 8): ρ(point, mean2) = |5 − 2| + |8 − 5| = 3 + 3 = 6
point (2, 5), mean3 (1, 2): ρ(point, mean3) = |1 − 2| + |2 − 5| = 1 + 3 = 4
Means: mean 1 = (2, 10), mean 2 = (5, 8), mean 3 = (1, 2)
Point | Dist Mean 1 | Dist Mean 2 | Dist Mean 3 | Cluster
A1 (2, 10) | 0 | 5 | 9 | 1
A2 (2, 5) | 5 | 6 | 4 | 3
A3 (8, 4) | | | |
A4 (5, 8) | | | |
A5 (7, 5) | | | |
A6 (6, 4) | | | |
A7 (1, 2) | | | |
A8 (4, 9) | | | |

We fill in the rest of the table, and place each point in one of the clusters:
Point | Dist Mean 1 | Dist Mean 2 | Dist Mean 3 | Cluster
A1 (2, 10) | 0 | 5 | 9 | 1
A2 (2, 5) | 5 | 6 | 4 | 3
A3 (8, 4) | 12 | 7 | 9 | 2
A4 (5, 8) | 5 | 0 | 10 | 2
A5 (7, 5) | 10 | 5 | 9 | 2
A6 (6, 4) | 10 | 5 | 7 | 2
A7 (1, 2) | 9 | 10 | 0 | 3
A8 (4, 9) | 3 | 2 | 10 | 2

Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)
Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster.
For Cluster 1, we only have one point, A1 (2, 10), which was the old mean, so the cluster center remains the same.
For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
The initial cluster centers are shown as red dots. The new cluster centers are shown as red x marks.
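One iteration of this procedure (assign each point to the nearest mean, then recompute the means) can be sketched as follows. The function names `manhattan`, `assign`, and `recompute` are assumptions made for illustration; the points, initial means, and distance function are those of the example.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
initial_means = [(2, 10), (5, 8), (1, 2)]


def manhattan(a, b):
    """Distance used in the example: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


def assign(points, means):
    """Place each point in the cluster of its nearest mean."""
    clusters = [[] for _ in means]
    for name, point in points.items():
        distances = [manhattan(point, mean) for mean in means]
        clusters[distances.index(min(distances))].append(name)
    return clusters


def recompute(points, clusters, old_means):
    """New mean of each cluster; an empty cluster keeps its old mean."""
    new_means = []
    for members, old_mean in zip(clusters, old_means):
        if not members:
            new_means.append(old_mean)
            continue
        xs = [points[name][0] for name in members]
        ys = [points[name][1] for name in members]
        new_means.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return new_means


clusters = assign(points, initial_means)
new_means = recompute(points, clusters, initial_means)

print(clusters)
print(new_means)
```

Repeating `assign` and `recompute` with the new means until the means stop changing gives the later iterations.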
That was Iteration1 (epoch1). Next, we go to Iteration2 (epoch2), Iteration3, and so on until the means do not change anymore. In Iteration2, we basically repeat the process from Iteration1 this time using the new means we computed.