PowerPoint Presentation -Eco

Indian Institute of Management (IIM),Rohtak

Recap and Discussion For Mid Term Exam


20 objective questions: 20 minutes

Test: 120 minutes. Total duration: 140 minutes


Basic E-R notation

Relationship degrees specify number of entity types involved

Entity symbols

A special entity that is also a relationship

Relationship symbols

Relationship cardinalities specify how many of each entity type are allowed

Attribute symbols

A composite attribute

An attribute broken into component parts

Entity with multivalued attribute (Skill) and derived attribute (Years_Employed)

Multivalued: an employee can have more than one skill

Derived: from date employed and current date


Both outpatients and resident patients are cared for by a responsible physician

Only resident patients are assigned to a bed


Total specialization rule

A patient must be either an outpatient or a resident patient


Partial specialization rule

A vehicle could be a car, a truck, or neither


Disjoint rule

A patient can either be outpatient or resident, but not both


Overlap rule

A part may be both purchased and manufactured

Example of supertype/subtype hierarchy



Primary Key

Foreign Key (implements 1:N relationship between customer and order)

Combined, these are a composite primary key (uniquely identifies the order line); individually they are foreign keys (implement M:N relationship between order and product)


One-to-many relationship between original entity and new relation

Multivalued attribute becomes a separate relation with foreign key


Manufactured ? Purchased ?


Order_id | Order_date | Customer_id | Customer_Name | Customer_Address | Product_id | Product_details | Product_finish | Unit_price | Ordered_quantity
1006 | 10/24/2006 | 2 | Value Furniture | Plano, TX | 7 | Dining Table | Natural Ash | 800 | 2
1006 | 10/24/2006 | 2 | Value Furniture | Plano, TX | 5 | Writer's Desk | Cherry | 325 | 2
1006 | 10/24/2006 | 2 | Value Furniture | Plano, TX | 4 | Entertainment Center | Natural Maple | 650 | 1
1007 | 10/25/2006 | 6 | Furniture Gallery | Boulder, CO | 11 | 4-Dr Dresser | Oak | 500 | 4
1007 | 10/25/2006 | 6 | Furniture Gallery | Boulder, CO | 4 | Entertainment Center | Natural Maple | 650 | 3

Identify Dependency and Relationship in Tabular form


Functional Dependency Diagram

Customer_ID → Customer_Name, Customer_Address

Product_ID → Product_Description, Product_Finish, Unit_Price

Order_ID, Product_ID → Order_Quantity

Order_ID → Order_Date, Customer_ID, Customer_Name, Customer_Address
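The dependencies listed above can be tested against the sample order rows. The sketch below is one possible check, not part of the original slides: the helper name `holds` is hypothetical, and the column names follow the dependency list rather than the table header. Note that sample data can only refute a functional dependency; agreement on five rows does not prove it holds in general.

```python
# Rows copied from the order table in these slides.
COLUMNS = ["Order_ID", "Order_Date", "Customer_ID", "Customer_Name",
           "Customer_Address", "Product_ID", "Product_Description",
           "Product_Finish", "Unit_Price", "Order_Quantity"]
ROWS = [
    (1006, "10/24/2006", 2, "Value Furniture", "Plano, TX",
     7, "Dining Table", "Natural Ash", 800, 2),
    (1006, "10/24/2006", 2, "Value Furniture", "Plano, TX",
     5, "Writer's Desk", "Cherry", 325, 2),
    (1006, "10/24/2006", 2, "Value Furniture", "Plano, TX",
     4, "Entertainment Center", "Natural Maple", 650, 1),
    (1007, "10/25/2006", 6, "Furniture Gallery", "Boulder, CO",
     11, "4-Dr Dresser", "Oak", 500, 4),
    (1007, "10/25/2006", 6, "Furniture Gallery", "Boulder, CO",
     4, "Entertainment Center", "Natural Maple", 650, 3),
]
INDEX = {name: i for i, name in enumerate(COLUMNS)}


def holds(determinant, dependent):
    """True if no two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in ROWS:
        key = tuple(row[INDEX[c]] for c in determinant)
        value = tuple(row[INDEX[c]] for c in dependent)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True


print(holds(["Customer_ID"], ["Customer_Name", "Customer_Address"]))
print(holds(["Order_ID", "Product_ID"], ["Order_Quantity"]))
print(holds(["Product_ID"], ["Order_Quantity"]))  # refuted by product 4
```

The last call shows why Order_Quantity needs the composite determinant: product 4 appears with quantities 1 and 3 in different orders.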


Functional Dependency Diagram


Identify Dependency and Relationship in Tabular form

St_ID | L_Name | F_Name | Phone_No | St_Lic | Lic_No | Ticket# | Date | Code | Fine
38249 | Brown | Thomas | 111-7804 | FL | BRY 123 | 15634 | 10/17/08 | 2 | $25
38249 | Brown | Thomas | 111-7804 | FL | BRY 123 | 16017 | 11/13/08 | 1 | $15
82453 | Green | Sally | 391-1689 | AL | TRE 141 | 14987 | 10/05/08 | 3 | $100
82453 | Green | Sally | 391-1689 | AL | TRE 141 | 16293 | 11/13/08 | 1 | $15
82453 | Green | Sally | 391-1689 | AL | TRE 141 | 17892 | 12/13/08 | 2 | $25




Data Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization's operational database

Supports information processing by providing a solid platform of consolidated, historical data for analysis.

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing: the process of constructing and using data warehouses


Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture. Data Sources (operational DBs, other sources) feed the Data Storage tier (data warehouse, data marts, metadata, monitor & integrator) through Extract / Transform / Load / Refresh. Data storage serves the OLAP Engine (OLAP server), which supports the Front-End Tools (analysis, query, reports).]


Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema: A fact table in the middle connected to a set of dimension tables

Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

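A Sample Data Cube

[Figure: three-dimensional data cube. Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Product axis: TV, VCR, PC, sum. Country axis: U.S.A, Canada, Mexico, sum. The highlighted cell is the total annual sales of TV in U.S.A.]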


Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes

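Cube Aggregation: Roll-up

Example: computing sums

day 1
   | s1 | s2 | s3
p1 | 12 |    | 50
p2 | 11 | 8  |

day 2
   | s1 | s2 | s3
p1 | 44 | 4  |
p2 |    |    |

. . .

Roll-up over days (drill-down goes the other way)
   | s1 | s2 | s3
p1 | 56 | 4  | 50
p2 | 11 | 8  |

Roll-up over p as well
    | s1 | s2 | s3
sum | 67 | 12 | 50

Roll-up over s instead
   | sum
p1 | 110
p2 | 19

Grand total: 129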
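The sums in this roll-up example can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes that p stands for product and s for store (the slide does not say), stores only the non-empty cells, and uses a hypothetical helper named `roll_up`.

```python
# cube[day][p][s] = value; empty cells from the slide are simply left out
cube = {
    "day 1": {"p1": {"s1": 12, "s3": 50}, "p2": {"s1": 11, "s2": 8}},
    "day 2": {"p1": {"s1": 44, "s2": 4}},
}


def roll_up(cube, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    totals = {}
    for day, by_p in cube.items():
        for p, by_s in by_p.items():
            for s, value in by_s.items():
                coords = {"day": day, "p": p, "s": s}
                key = tuple(coords[dim] for dim in keep)
                totals[key] = totals.get(key, 0) + value
    return totals


print(roll_up(cube, ["p", "s"]))  # days summed away
print(roll_up(cube, ["s"]))       # days and p summed away
print(roll_up(cube, ["p"]))       # days and s summed away
print(roll_up(cube, []))          # grand total
```

Drill-down is the reverse direction: it moves from a result with fewer kept dimensions back to one with more, which requires the detailed cells to still be available.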


Data Mining: A KDD Process

Data mining: core of knowledge discovery process

[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Data Mining Functionalities

Concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Association (correlation and causality)

Bread → Butter [support 0.5%, confidence 75%]

Classification and Prediction

Construct models (functions) that describe and distinguish classes or concepts for future prediction

E.g., classify countries based on climate, or classify cars based on gas mileage

Presentation: decision-tree, classification rule, neural network

Predict some unknown or missing numerical values


Data Mining Functionalities (2)

Cluster analysis

Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Maximizing intra-class similarity & minimizing interclass similarity

Outlier analysis

Outlier: a data object that does not comply with the general behavior of the data

Noise or exception? No! Useful in fraud detection, rare events analysis

Other pattern-directed or statistical analyses


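Association Rule: a concept of Mining

A "rule" is something like this: If a basket contains Bread and Butter, then it also contains Milk.

Any such rule has two associated measures:

1. confidence: when the "if" part is true, how often is the "then" part true? This is the same as accuracy.

confidence(A → B) = P(B | A) = (number of transactions containing both A and B) / (number of transactions containing A)

2. coverage or support: how much of the database contains the items of the rule?

support(A → B) = P(A and B) = (number of transactions containing both A and B) / (total number of transactions)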


Transaction ID | Items Bought
1 | Trouser, Shirt, Jacket
2 | Trouser, Jacket
3 | Trouser, Jeans
4 | Shirt, Sweatshirt

If the minimum support is 50%, then {Trouser, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset | Support
{Trouser} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Trouser, Jacket} | 50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:

Trouser → Jacket: Support = 50%, Confidence = 66.7%

Jacket → Trouser: Support = 50%, Confidence = 100%
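The support and confidence figures for the four clothing transactions can be recomputed directly. This is a minimal sketch: the function names `support` and `confidence` are illustrative, and support is taken as the fraction of transactions that contain every item in the itemset.

```python
from itertools import combinations

transactions = [
    {"Trouser", "Shirt", "Jacket"},
    {"Trouser", "Jacket"},
    {"Trouser", "Jeans"},
    {"Shirt", "Sweatshirt"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


# All 2-itemsets that reach the 50% minimum support
all_items = sorted(set().union(*transactions))
frequent_pairs = [pair for pair in combinations(all_items, 2)
                  if support(pair) >= 0.5]

print(frequent_pairs)
print(confidence({"Trouser"}, {"Jacket"}))
print(confidence({"Jacket"}, {"Trouser"}))
```

The asymmetry between the two confidences comes from the denominators: Trouser appears in three transactions, Jacket in only two.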


The Apriori Algorithm: Basics

Computational Complexity Given d unique items:

Total number of possible association


Step 1: Generating 1-itemset Frequent Pattern

In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.


Step 2: Generating 2-itemset Frequent Pattern [Cont.]


Step 2: Generating 2-itemset Frequent Pattern

Itemset | Sup. Count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2




Step 3: Generating 3-itemset Frequent Pattern [Cont.]

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?

For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. The 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.

BUT {I3, I5} is not a member of L2, and hence it is not frequent, violating the Apriori property. Thus we will have to remove {I2, I3, I5} from C3.

Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
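The join and prune steps described above can be sketched as follows. Two things here are assumptions: the slide listing L2 did not survive extraction, so the L2 below is taken from the standard textbook version of this example (it agrees with every membership fact stated in the text), and the function name `apriori_gen` is illustrative.

```python
from itertools import combinations

# ASSUMED L2: the frequent 2-itemsets of the standard textbook example.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]


def apriori_gen(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune.

    Join: two sorted k-itemsets are merged when they share their first
    k-1 items. Prune: a candidate is kept only if every k-item subset
    is itself frequent (the Apriori property).
    """
    frequent = {tuple(sorted(s)) for s in frequent_k}
    kept, pruned = [], []
    for a, b in combinations(sorted(frequent), 2):
        if a[:k - 1] != b[:k - 1]:
            continue  # prefixes differ, so the pair is not joinable
        candidate = a + (b[-1],)
        if all(sub in frequent for sub in combinations(candidate, k)):
            kept.append(candidate)
        else:
            pruned.append(candidate)
    return kept, pruned


C3, removed = apriori_gen(L2, 2)
print(C3)
print(removed)
```

Under the assumed L2, the join produces six candidates and the prune step removes the four latter ones, matching the C3 given in the text. Counting support for C3 still requires a scan of the transactions in D, which is not shown here.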






Step 5: Generating Association Rules from Frequent Itemsets


Step 5: Generating Association Rules from Frequent Itemsets [Cont.]


Best Found Result @WEKA

Best rules found:

1. I5=Y 2 ==> I1=Y 2 conf:(1)
2. I4=Y 2 ==> I2=Y 2 conf:(1)
3. I5=Y 2 ==> I2=Y 2 conf:(1)
4. I2=Y I5=Y 2 ==> I1=Y 2 conf:(1)
5. I1=Y I5=Y 2 ==> I2=Y 2 conf:(1)
6. I5=Y 2 ==> I1=Y I2=Y 2 conf:(1)


Criticism to Support and Confidence

Strong Rules Are Not Necessarily Interesting

Total transactions: 10,000 (C: computers, V: video)

V: 7,500; C: 6,000; C and V: 4,000

Min_support: 0.3; min_conf: 0.60

Consider the rule: buy(X: computer) → buy(X: video)

Support: 4000/10000 = 0.4

Confidence: P(C and V) / P(C) = 4000/6000 = 66%

The rule is strong, BUT the probability of buying a video is 0.75, so buying a computer reduces the probability of buying a video from 0.75 to 0.66.

Computer and video are negatively correlated.


Lift of A → B

Lift (or corr A,B) = P(A and B) / (P(A) × P(B))

The ratio of the probability of buying A and B together to the probability of buying A and B independently. Or it can be interpreted as: the conditional probability of buying B given that A is purchased, divided by the unconditional probability of buying B, taking both P(A) and P(B) into consideration.

P(A and B) = P(A) × P(B), if A and B are independent events (lift equals 1)

A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1.

      | C    | not C | sum
V     | 4000 | 3500  | 7500
not V | 2000 | 500   | 2500
sum   | 6000 | 4000  | 10000

From the table, we can see that the probability of purchasing a computer game is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and the probability of purchasing both is P({game, video}) = 0.40. The lift of the rule is P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
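The lift calculation is short enough to verify in code. The sketch below uses the counts from the 10,000-transaction example; the function name `lift` and its parameter names are illustrative.

```python
def lift(n_total, n_a, n_b, n_both):
    """P(A and B) / (P(A) * P(B)), computed from raw transaction counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# 10,000 transactions: 6,000 with A, 7,500 with video, 4,000 with both
value = lift(10_000, 6_000, 7_500, 4_000)
print(round(value, 2))  # 0.89

confidence = 4_000 / 6_000     # P(video | A), about 0.67
baseline = 7_500 / 10_000      # P(video), 0.75
print(confidence < baseline)   # True: the rule lowers the chance of video
```

A rule can therefore pass both the support and confidence thresholds while its lift is below 1, which is the point of the criticism.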


Process (1): Model Construction

Training Data

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classification Algorithms

IF rank = professor OR years > 6 THEN tenured = yes

Classifier (Model)


Process (2): Using the Model in Prediction

Classifier

Testing Data

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
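The two-step process can be illustrated by coding the learned rule (IF rank = professor OR years > 6 THEN tenured = yes) and applying it to the testing data and to the unseen tuple. This is a small sketch; the function name `predict_tenured` is illustrative.

```python
def predict_tenured(rank, years):
    """Rule from the slide: professor OR more than 6 years means tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, actual label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

hits = sum(predict_tenured(rank, years) == actual
           for _, rank, years, actual in testing_data)
accuracy = hits / len(testing_data)

# Unseen data: (Jeff, Professor, 4)
jeff = predict_tenured("Professor", 4)

print(accuracy, jeff)  # 0.75 yes
```

The rule misclassifies Merlisa (7 years, not tenured), so its accuracy on the testing data is 3 out of 4, and it predicts that Jeff is tenured.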


Decision Tree Induction: Training Dataset

age | income | student | credit_rating | buys_computer
>40 | low | yes | excellent | no
31..40 | low | yes | excellent | yes

[The remaining training rows and the per-branch partitions of this table did not survive extraction. One further row survives without a readable age value: (low, yes, fair, yes).]

[Figure: decision tree with "age?" at the root and a "student?" test (branches no / yes); leaves are labelled yes or no. The 31..40 branch and a partition for age >40 are visible. The tuple to classify is (age, medium, no, fair, ?), where the age value is not readable.]

Many calculators, and some programming languages, do not have a log2 function. Use a change of base instead: take the log of n in any available base and divide it by the log of 2 in the same base, i.e. log2(n) = log10(n) / log10(2). Example: log10(2) ≈ 0.301, so log2(3/5) = log10(3/5) / 0.301.
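The change-of-base rule can be checked numerically. The sketch below compares it with Python's built-in `math.log2` and then uses it in an entropy calculation; using it for entropy is an assumption about why the slide needs log2 (entropy is the usual quantity in decision tree induction), and the function names are illustrative.

```python
import math


def log2_via_log10(x):
    """log2(x) computed as log10(x) / log10(2)."""
    return math.log10(x) / math.log10(2)


def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2_via_log10(c / total)
                for c in counts if c > 0)


print(log2_via_log10(3 / 5))   # about -0.737
print(math.log2(3 / 5))        # built-in, agrees up to rounding
print(entropy([3, 2]))         # about 0.971 bits
```

Using the rounded constant 0.301 by hand gives log2(3/5) ≈ -0.2218 / 0.301 ≈ -0.737, which matches to three decimal places.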


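Bayes Classifier: Training Dataset

age | income | student | credit_rating | buys_computer
>40 | low | yes | excellent | no
31..40 | low | yes | excellent | yes

[The remaining training rows did not survive extraction. One further row survives without a readable age value: (low, yes, fair, yes).]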


Bayes Classifier: An Example X = (age


So, we go to the second point (2, 5) and we will calculate the distance to each of the three means, by using the distance function:

point = (x1, y1) = (2, 5)
mean1 = (x2, y2) = (2, 10)

ρ(a, b) = |x2 − x1| + |y2 − y1|

ρ(point, mean1) = |x2 − x1| + |y2 − y1| = |2 − 2| + |10 − 5| = 0 + 5 = 5



Means: Mean 1 = (2, 10), Mean 2 = (5, 8), Mean 3 = (1, 2)

Point | Dist Mean 1 | Dist Mean 2 | Dist Mean 3 | Cluster
A1 (2, 10) | 0 | 5 | 9 | 1
A2 (2, 5) | 5 | 6 | 4 | 3
A3 (8, 4) | | | |
A4 (5, 8) | | | |
A5 (7, 5) | | | |
A6 (6, 4) | | | |
A7 (1, 2) | | | |
A8 (4, 9) | | | |

We fill in the rest of the table, and place each point in one of the clusters:

Point | Dist Mean 1 | Dist Mean 2 | Dist Mean 3 | Cluster
A1 (2, 10) | 0 | 5 | 9 | 1
A2 (2, 5) | 5 | 6 | 4 | 3
A3 (8, 4) | 12 | 7 | 9 | 2
A4 (5, 8) | 5 | 0 | 10 | 2
A5 (7, 5) | 10 | 5 | 9 | 2
A6 (6, 4) | 10 | 5 | 7 | 2
A7 (1, 2) | 9 | 10 | 0 | 3
A8 (4, 9) | 3 | 2 | 10 | 2

Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)


Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster.

For Cluster 1, we only have one point, A1 (2, 10), which was the old mean, so the cluster center remains the same.

For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)

For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)

The initial cluster centers are shown as red dots. The new cluster centers are shown as red x marks.


That was Iteration1 (epoch1). Next, we go to Iteration2 (epoch2), Iteration3, and so on until the means do not change anymore. In Iteration2, we repeat the process from Iteration1, this time using the new means we computed.
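The assignment and update steps of this worked example can be sketched in code. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, cluster indices are 0-based (so 0, 1, 2 correspond to Clusters 1, 2, 3), ties go to the lower index, and the `max_iter` guard is an added safety limit that the slides do not mention.

```python
def manhattan(a, b):
    """Distance used in the slides: |x2 - x1| + |y2 - y1|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign(points, means):
    """Index of the nearest mean for each point."""
    return [min(range(len(means)), key=lambda j: manhattan(p, means[j]))
            for p in points]


def update(points, labels, means):
    """Mean of the points in each cluster; an empty cluster keeps its mean."""
    new_means = []
    for j, old in enumerate(means):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_means.append((sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members)))
        else:
            new_means.append(old)
    return new_means


def kmeans(points, means, max_iter=100):
    """Repeat assign/update until the means stop changing."""
    for _ in range(max_iter):
        labels = assign(points, means)
        new_means = update(points, labels, means)
        if new_means == means:
            break
        means = new_means
    return labels, means


points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
initial = [(2, 10), (5, 8), (1, 2)]

labels = assign(points, initial)               # Iteration1 assignment
first_means = update(points, labels, initial)  # Iteration1 new centers
print(labels)
print(first_means)
```

Calling `kmeans(points, initial)` continues with Iteration2 and beyond, each time using the means from the previous pass.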

