Data Mining



Page 1:

Data Mining

Page 2:

Jim's cows

Which cows should I breed?

Page 3:

Which cows should I breed?

Suppose I know the weight, age and health of each cow?

And suppose I know their behavior, preferred mating months, milk production, nutritional habits, immune system data…?

Suppose I have 50 cows. Now suppose I have 100,000 cows…

Page 4:

"Understanding" data

Trying to find patterns in data is not new: hunters seek patterns in animal migration, politicians in voting habits, people in their partner's behavior, etc.

However, the amount of available data is increasing very fast (exponentially?).

This gives greater opportunities to extract valuable information from the data.

But it also makes the task of "understanding" the data with conventional tools very difficult.

Page 5:

Data Mining

Data Mining: the process of discovering patterns in data, usually stored in a database. The patterns lead to advantages (economic or other).

Two extremes for the expression of the patterns:
1. "Black Box": "Breed cows Zehava, Petra and Paulina"
2. "Transparent Box" (Structural Patterns): "Breed cows with age < 4 and weight > 300, or cows with calm behavior and > 90 liters of milk production per month"

Data Mining is about techniques for finding and describing Structural Patterns in data.

The techniques (algorithms) are usually from the field of Machine Learning.

Page 6:

The weather example

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rainy     Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Rainy     Mild         High      True   No

Page 7:

The weather example, cont.

A set of rules learned from this data could be presented in a Decision List:

If outlook=sunny and humidity=high then play=no
ElseIf outlook=rainy and windy=true then play=no
ElseIf outlook=overcast then play=yes
ElseIf humidity=normal then play=yes
Else play=yes

This is an example of Classification Rules. We could also look for Association Rules:

If temperature=cool then humidity=normal
If windy=false and play=no then outlook=sunny and humidity=high
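A decision list is just an ordered chain of if/else tests, so it translates directly into code. A minimal sketch in Python (the function name and the dictionary-based record format are ours, not from the slides):

def play(r):
    # r is a record like {"outlook": "sunny", "humidity": "high", "windy": False}
    if r["outlook"] == "sunny" and r["humidity"] == "high":
        return "no"
    elif r["outlook"] == "rainy" and r["windy"]:
        return "no"
    elif r["outlook"] == "overcast":
        return "yes"
    elif r["humidity"] == "normal":
        return "yes"
    else:
        return "yes"

Applied to the first row of the table ({"outlook": "sunny", "humidity": "high", "windy": False}), it returns "no", matching the Play column.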

Page 8:

Example, cont.

The previous example is very simplified. Real databases will probably:

1. Contain numerical values as well.
2. Contain "noise" and errors.
3. Be a lot larger.

And the analysis we are asked to perform might not be of Association Rules, but rather Decision Trees, Neural Networks, etc.

Page 9:

Another Example

A classic example is a database which holds data concerning all purchases in a supermarket.

Each Shopping Basket is a list of items that were bought in a single purchase by some customer.

Such huge DBs, which are saved for long periods of time, are called Data Warehouses.

It is extremely valuable for the manager of the store to extract Association Rules from the huge Data Warehouse.

It is even more valuable if this information can be associated with the person buying, hence the Club Memberships…

Page 10:

Supermarket Example

For example, if Beer and Diapers are found to be bought together often, this might encourage the manager to give a discount for purchasing Beer, Diapers and a new product together.

Another example: if older people are found to be more "loyal" to a certain brand than young people, a manager might not promote a new brand of shampoo intended for older people.

Page 11:

Data Mining techniques in some HUJI courses

Technique             Course
Decision Trees        Artificial Intelligence
Perceptron, SVM, PCA… Intro. to Machine Learning
Neural Networks       Neural Networks 1, 2
K-Nearest Neighbor    Computational Geometry
Association Rules     Databases

Page 12:

The Purchases Relation

transid  item
111      pen
111      ink
111      milk
111      juice
112      pen
112      ink
112      milk
113      pen
113      milk
114      pen
114      ink
114      juice

Itemset: a set of items.

Support of an itemset: the fraction of transactions that contain all items in the itemset.

What is the support of:
1. {pen}?
2. {pen, ink}?
3. {pen, juice}?
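Reading these off the relation: {pen} appears in all 4 transactions (support 4/4 = 1.0), {pen, ink} in transactions 111, 112 and 114 (support 3/4 = 0.75), and {pen, juice} in 111 and 114 (support 2/4 = 0.5). A small Python sketch of the computation (the data layout and names are ours, not from the slides):

transactions = {
    111: {"pen", "ink", "milk", "juice"},
    112: {"pen", "ink", "milk"},
    113: {"pen", "milk"},
    114: {"pen", "ink", "juice"},
}

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    hits = sum(1 for basket in transactions.values() if itemset <= basket)
    return hits / len(transactions)

print(support({"pen"}))           # 1.0
print(support({"pen", "ink"}))    # 0.75
print(support({"pen", "juice"}))  # 0.5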

Page 13:

Frequent Itemsets

We would like to find items that are purchased together in high frequency: Frequent Itemsets.

We look for itemsets which have a support > minSupport.

If minSupport is set to 0.7, then the frequent itemsets in our example would be: {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}.

The A-Priori property of frequent itemsets: every subset of a frequent itemset is also a frequent itemset.
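Checking against the Purchases relation: support({pen}) = 1.0, support({ink}) = support({milk}) = 0.75, and support({pen, ink}) = support({pen, milk}) = 0.75, all above 0.7, while {juice} (0.5), every itemset containing juice, and {ink, milk} (0.5) fall below the threshold.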

Page 14:

Algorithm for finding Frequent Itemsets

The idea (based on the A-Priori property): first identify frequent itemsets of size 1, then try to expand them.

By considering only itemsets obtained by enlarging frequent itemsets, we greatly reduce the number of candidate frequent itemsets.

A single scan of the table is enough to determine which of the generated candidate itemsets are frequent.

The algorithm terminates when no new frequent itemsets are found in an iteration.

Page 15:

Algorithm for finding Frequent Itemsets

foreach item, check if it is a frequent itemset
  (appears in > minSupport of the transactions);
k = 1;
repeat
  foreach new frequent itemset Ik with k items:
    generate all itemsets Ik+1 with k+1 items, such that Ik is contained in Ik+1;
  scan all transactions once and add itemsets that have support > minSupport;
  k++;
until no new frequent itemsets are found
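A runnable Python sketch of this loop, reusing the transactions/support setup from Page 12 (variable and function names are ours; this is the plain algorithm above, without the refinement of Page 17):

def frequent_itemsets(transactions, min_support):
    items = {item for basket in transactions.values() for item in basket}

    def support(itemset):
        hits = sum(1 for basket in transactions.values() if itemset <= basket)
        return hits / len(transactions)

    # k = 1: frequent single-item itemsets
    current = {frozenset([i]) for i in items if support(frozenset([i])) > min_support}
    result = set(current)
    while current:
        # Generate candidates of size k+1 by enlarging frequent k-itemsets
        candidates = {s | {i} for s in current for i in items if i not in s}
        # One scan of the transactions decides which candidates are frequent
        current = {c for c in candidates if support(c) > min_support}
        result |= current
    return result

With minSupport = 0.7 on the Purchases relation, this returns exactly {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}, as worked out on the next slide.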

Page 16:

Finding Frequent Itemsets on table "Purchases", with minSupport = 0.7

In the first run, the following single itemsets are found to be frequent: {pen}, {ink}, {milk}.

Now we generate the candidates for k=2: {pen, ink}, {pen, milk}, {pen, juice}, {ink, milk}, {ink, juice} and {milk, juice}.

By scanning the relation, we determine that the following are frequent: {pen, ink}, {pen, milk}.

Now we generate the candidates for k=3: {pen, ink, milk}, {pen, milk, juice}, {pen, ink, juice}.

By scanning the relation, we determine that none of these are frequent, and the algorithm ends with: { {pen}, {ink}, {milk}, {pen, ink}, {pen, milk} }

Page 17:

Algorithm refinement

More complex algorithms use the same tools: iterative generation and testing of candidate itemsets.

One important refinement: after the candidate-generation phase, and before the scan of the relation, eliminate candidate itemsets in which there is a subset which is not frequent. This is justified by the A-Priori property.

In the second iteration, this means we would eliminate {pen, juice}, {ink, juice} and {milk, juice} as candidates, since {juice} is not frequent. So we only check {pen, ink}, {pen, milk} and {ink, milk}.

So only {pen, ink, milk} is generated as a candidate, but it is eliminated before the scan because {ink, milk} is not frequent.

So we don't perform the 3rd scan of the relation.

Page 18:

Association Rules

Up until now we discussed identification of frequent itemsets. We now wish to go one step further.

An association rule is of the structure {pen} => {ink}.

It should be read as: "if a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction".

It describes the data in the DB (the past). Extrapolation to future transactions should be done with caution.

More formally, an Association Rule is LHS => RHS, where both LHS and RHS are sets of items, and implies that if every item in LHS was purchased in a transaction, it is likely that the items in RHS are purchased as well.

Page 19:

Measures for Association Rules

1. Support of "LHS => RHS" is the support of the itemset (LHS U RHS). In other words: the fraction of transactions that contain all items in (LHS U RHS).

2. Confidence of "LHS => RHS": consider all transactions which contain all items in LHS. The fraction of these transactions that also contain all items in RHS is the confidence of the rule.

The confidence of a rule is an indication of the strength of the rule.

What is the support of {pen} => {ink}? And the confidence?

What is the support of {ink} => {pen}? And the confidence?
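Working these out on the Purchases relation: {pen, ink} appears in 3 of the 4 transactions, so both rules have support 3/4 = 0.75. {pen} appears in all 4 transactions, so the confidence of {pen} => {ink} is 3/4 = 0.75; {ink} appears in 3 transactions, all of which also contain pen, so the confidence of {ink} => {pen} is 3/3 = 1.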

Page 20:

Finding Association Rules

A user can ask for rules with minimum support minSup and minimum confidence minConf.

Firstly, all frequent itemsets with support > minSup are computed with the previous algorithm.

Secondly, rules are generated using the frequent itemsets, and checked for minConf.

Page 21:

Finding Association Rules

Find all frequent itemsets using the previous algorithm.

For each frequent itemset X with support S(X):
  For each division of X into two non-empty itemsets LHS and RHS:
    The confidence of LHS => RHS is S(X) / S(LHS).

We already computed S(LHS) in the previous algorithm (LHS is frequent, since X is frequent).
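A short Python sketch of this step, building on the frequent_itemsets and support sketches above (again, the names are ours, not from the slides):

from itertools import combinations

def association_rules(freq_itemsets, support, min_conf):
    rules = []
    for x in freq_itemsets:
        if len(x) < 2:
            continue
        # Every way to split x into a non-empty LHS and the remaining RHS
        for r in range(1, len(x)):
            for lhs in combinations(x, r):
                lhs = frozenset(lhs)
                rhs = x - lhs
                conf = support(x) / support(lhs)
                if conf >= min_conf:
                    rules.append((lhs, rhs, support(x), conf))
    return rules

Each returned tuple is (LHS, RHS, support, confidence). In principle no new scan of the relation is needed, since every support involved was already computed while finding the frequent itemsets; the sketch simply recomputes them for brevity.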

Page 22:

Generalized Association Rules

transid  date     item
111      1.5.99   pen
111      1.5.99   ink
111      1.5.99   milk
111      1.5.99   juice
112      10.5.99  pen
112      10.5.99  ink
112      10.5.99  milk
113      15.5.99  pen
113      15.5.99  milk
114      1.6.99   pen
114      1.6.99   ink
114      1.6.99   juice

We would like to know if the rule {pen} => {juice} is different on the first day of the month compared to other days. How?

What are its support and confidence generally?

And on the first days of the month?
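From the table: over all four transactions, {pen, juice} appears in 111 and 114, so {pen} => {juice} has support 2/4 = 0.5 and, since pen appears in every transaction, confidence 2/4 = 0.5. Restricted to the first days of the month (transactions 111 on 1.5.99 and 114 on 1.6.99), both transactions contain pen and juice, so support and confidence are both 1: the rule is much stronger on the first of the month.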

Page 23:

Generalized Association Rules

By specifying different attributes to group by (date in the last example), we can come up with interesting rules which we would otherwise miss.

Another example would be to group by location, and check if the same rules apply for customers from Jerusalem compared to Tel Aviv.

By comparing the support and confidence of the rules, we can observe differences in the data under different conditions.

Page 24:

Caution in prediction

When we find a pattern in the data, we wish to use it for prediction (that is, in many cases, the point).

However, we have to be cautious about this.

For example: suppose {pen} => {ink} has high support and confidence. We might give a discount on pens in order to increase sales of pens, and therefore also sales of ink.

However, this assumes a causal link between {pen} and {ink}.

Page 25:

Caution in prediction

Suppose pens and pencils are always sold together (for example, because customers tend to buy writing instruments together).

We would then also get the rule {pencil} => {ink}, with the same support and confidence as {pen} => {ink}.

However, it is clear there is no causal link between buying pencils and buying ink.

If we promoted pencils, it would not cause an increase in sales of ink, despite the high support and confidence.

The chance to infer "wrong" rules (rules which are not causal links) decreases as the DB size increases, but we should keep in mind that such rules do come up.

Therefore, the generated rules are only a good starting point for identifying causal links.

Page 26:

Classification and Regression Rules

Consider the following relation: InsuranceInfo(age integer, carType string, highRisk bool).

The relation holds information about current customers.

The company wants to use the data in order to predict if a new customer, whose age and carType are known, is at high risk (and therefore, of course, charge a higher insurance fee).

Such a rule could be, for example: "if age is between 18 and 23, and carType is either 'sports' or 'truck', then the risk is high".
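As a tiny sketch, that example rule as a Python predicate (hypothetical function, using the relation's attribute names; it returns True only for the cases the rule covers):

def high_risk(age, car_type):
    # "if age is between 18 and 23, and carType is either 'sports' or 'truck', the risk is high"
    return 18 <= age <= 23 and car_type in ("sports", "truck")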

Page 27:

Classification and Regression Rules

Such rules, where we are only interested in predicting one attribute, are special.

The attribute which we predict is called the Dependent attribute.

The other attributes are called the Predictor attributes.

If the dependent attribute is categorical, we call such rules classification rules.

If the dependent attribute is numerical, we call such rules regression rules.

Page 28:

Regression in a nutshell

Jim's cows (training set):

Name    Blood pres. (BP)  Milk Average (MA)  AGE  NOC  Rating
Mona    72                66                 5    2    9
Lisa    79                44                 6    1    7
Marry   89                88                 3    4    3
Quirri  56                66                 5    2    9
Paula   77                22                 6    4    7
Abdul   90                10                 7    8    3

New cow (test set):

Vicky   69                44                 5    3    ?

Page 29:

Regression in a nutshell

Assume that the Rating (Rate) is a linear combination of the other attributes:

Rate = w0 + w1*BP + w2*MA + w3*AGE + w4*NOC

Our goal is thus to find w0, w1, w2, w3, w4 (which actually means how strongly each attribute affects the Rate).

We thus want to minimize the squared difference between each cow's real Rate and its prediction using w0-w4, summed over the cows:

Σ_i ( Rate(i) - [w0 + w1*BP(i) + w2*MA(i) + w3*AGE(i) + w4*NOC(i)] )²

where i is the cow number.
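A minimal numpy sketch of this fit on Jim's training cows (variable names are ours; np.linalg.lstsq solves exactly this least-squares minimization):

import numpy as np

# Columns: BP, MA, AGE, NOC (rows: Mona, Lisa, Marry, Quirri, Paula, Abdul)
X = np.array([
    [72, 66, 5, 2],
    [79, 44, 6, 1],
    [89, 88, 3, 4],
    [56, 66, 5, 2],
    [77, 22, 6, 4],
    [90, 10, 7, 8],
], dtype=float)
rate = np.array([9, 7, 3, 9, 7, 3], dtype=float)

# Prepend a column of ones so the intercept w0 is learned too
A = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(A, rate, rcond=None)

# Predict the Rating of the new cow Vicky: BP=69, MA=44, AGE=5, NOC=3
vicky = np.array([1, 69, 44, 5, 3], dtype=float)
print(vicky @ w)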

Page 30:

Regression in a nutshell

This minimization is pretty straightforward (though outside the scope of this course).

It will give better coefficients the larger the "training set" is.

The assumption that the relation is linear is wrong in many cases; hence the use of SVMs.

Notice this only deals with the case of all attributes being numerical.

All this and more in the Intro. to Machine Learning course.