
Data Mining

Jim: Which cow should I buy?

Jim's cows:

Name     Milk Avg. (MA)   Age   Rating
Mona     6                5     Good
Lisa     4                6     Bad
Mary     8                3     Good
Quirri   6                5     Bad
Paula    2                6     Good
Abdul    10               7     Bad

Cows on sale:

Name      Milk Avg. (MA)   Age
Phil      5                3
Collins   3                2
Larry     9                5
Bird      2                5

Which cow should I buy?

And suppose I also know each cow's:
- Behavior
- Preferred mating months
- Milk production
- Nutritional habits
- Immune system data
- …

Now suppose I have 10,000 cows…

"Understanding" data

Looking for patterns is not new:
- Hunters seek patterns in animal migration
- Politicians seek patterns in voting habits
- …

- Available data is increasing very fast (exponentially?)
- Greater opportunities to extract valuable information
- But "understanding" the data becomes more difficult

Data Mining

Data Mining: the process of discovering patterns in data, usually stored in a database. The patterns lead to advantages (economic or other).

A very fast-growing area of research, because of:
- Huge databases (Walmart: ~20 million transactions/day)
- Automatic data capture of transactions (bar codes, satellites, scanners, cameras, etc.)
- Large financial advantage
- Evolving analytical methods

Data Mining techniques in some HUJI courses

Technique                   Course
Decision Trees              Artificial Intelligence
EM, Perceptron, SVM, PCA…   Intro. to Machine Learning;
                            Intro. to Information Processing and Learning
Neural Networks             Neural Networks 1, 2
K-Nearest Neighbor          Computational Geometry

Data Mining

Two extremes for the expression of the patterns:

1. "Black box": "Buy cows Zehava, Petra and Paulina."
2. "Transparent box" (structural patterns): "Buy cows with age < 4 and weight > 300, or cows with calm behavior and > 90 liters of milk production per month."

The weather example

Outlook    Temp.   Humidity   Windy   Play
Sunny      Hot     High       False   No
Sunny      Hot     High       True    No
Overcast   Hot     High       False   Yes
Rainy      Mild    High       False   Yes
Rainy      Cool    Normal     False   Yes
Rainy      Cool    Normal     True    No
Overcast   Cool    Normal     True    Yes
Sunny      Mild    High       False   No
Sunny      Cool    Normal     False   Yes

Today is overcast, with mild temperature, high humidity, and wind. Will we play?

Questions one can ask

A set of rules learned from this data could be presented as a Decision List:

If outlook = sunny and humidity = high then play = no
ElseIf outlook = rainy and windy = true then play = no
ElseIf outlook = overcast then play = yes
ElseIf humidity = normal then play = yes
Else play = yes
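Such a decision list maps directly onto code. A minimal Python sketch (the function and attribute names are ours, for illustration only):

```python
def play(outlook, temp, humidity, windy):
    """The decision list above, checked top to bottom;
    the first matching rule decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    elif outlook == "rainy" and windy:
        return "no"
    elif outlook == "overcast":
        return "yes"
    elif humidity == "normal":
        return "yes"
    else:
        return "yes"

# Today: overcast, mild temperature, high humidity, windy.
print(play("overcast", "mild", "high", True))  # -> "yes"
```

So by these rules, today we play: the third rule (outlook = overcast) fires.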

This is an example of Classification Rules. We could also look for Association Rules, such as:

If temperature = cool then humidity = normal
If windy = false and play = no then outlook = sunny and humidity = high

Example, cont.

The previous example is very simplified. Real databases will probably:

1. Contain numerical values as well.
2. Contain "noise" and errors (stochastic).
3. Be a lot larger.

And the analysis we are asked to perform might not be Association Rules, but rather Decision Trees, Neural Networks, etc.

Caution

- David Rhine was a parapsychologist in the 1930s to 1950s.
- He hypothesized that some people have Extra-Sensory Perception (ESP).
- He asked people to say whether 10 hidden cards were red or blue.
- He discovered that almost 1 in every 1,000 people had ESP! (Which is just what chance predicts: guessing all 10 cards has probability 1/2^10 ≈ 1/1000.)
- He told these people that they had ESP and called them in for another test.
- He discovered that almost all of them had lost their ESP!
- He concluded that… you shouldn't tell people they have ESP, because it makes them lose it.

[Source: J. Ullman]

Another example

- A classic example: a database of purchases in a supermarket.
- Such huge DBs, saved for long periods of time, are called Data Warehouses.
- It is extremely valuable for the manager of the store to extract Association Rules from the Data Warehouse.
- It is even more valuable if this information can be associated with the person buying; hence the club memberships…
- Each Shopping Basket is a list of items that were bought in a single purchase by some customer.

Supermarket example

Beer and diapers were found to often be bought together by men, so they were placed in the same aisle.

[Data-mining urban legend]

The Purchases relation

transid   item
111       pen
111       ink
111       milk
111       juice
112       pen
112       ink
112       milk
113       pen
113       milk
114       pen
114       ink
114       juice

Itemset: a set of items.

Support of an itemset: the fraction of transactions that contain all items in the itemset.

What is the support of:
1. {pen}?
2. {pen, ink}?
3. {pen, juice}?
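These supports can be checked mechanically. A small Python sketch over the rows above (variable names are ours):

```python
# The Purchases relation, grouped into one item set per transid.
transactions = {
    111: {"pen", "ink", "milk", "juice"},
    112: {"pen", "ink", "milk"},
    113: {"pen", "milk"},
    114: {"pen", "ink", "juice"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(support({"pen"}))           # 1.0  (in all 4 transactions)
print(support({"pen", "ink"}))    # 0.75 (in 111, 112, 114)
print(support({"pen", "juice"}))  # 0.5  (in 111, 114)
```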

Frequent Itemsets

- We would like to find items that are purchased together at high frequency: Frequent Itemsets.
- We look for itemsets whose support > minSupport.
- If minSupport is set to 0.7, the frequent itemsets in our example would be: {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}.
- The A-priori property of frequent itemsets: every subset of a frequent itemset is also a frequent itemset.

Algorithm for finding Frequent Itemsets

- Suppose we have n items.
- The naïve approach: for every subset of items, check whether it is frequent. Very expensive (there are 2^n subsets).
- Improvement (based on the A-priori property): first identify frequent itemsets of size 1, then try to expand them. This greatly reduces the number of candidate frequent itemsets.
- A single scan of the table is enough to determine which candidate itemsets are frequent.
- The algorithm terminates when no new frequent itemsets are found in an iteration.

Algorithm for finding Frequent Itemsets

foreach item, check if it is a frequent itemset
    (i.e., it appears in > minSupport of the transactions)
k = 1
repeat
    foreach new frequent itemset I_k with k items:
        generate all itemsets I_{k+1} with k+1 items such that I_k is contained in I_{k+1}
    scan all transactions once and add the itemsets that have support > minSupport
    k++
until no new frequent itemsets are found
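As one concrete (unoptimized) rendering of this pseudocode, here is a minimal Python sketch; the function name and representation are ours:

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining, following the pseudocode above.
    `transactions` is a list of sets of items."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    # k = 1: keep the single items whose support clears minSupport.
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) > min_support}
    result = set(frequent)

    while frequent:
        # Expand each frequent k-itemset I_k by one item into I_{k+1} candidates.
        candidates = {f | {i} for f in frequent for i in items if i not in f}
        # One scan of the transactions keeps the candidates that clear minSupport.
        frequent = {c for c in candidates if support(c) > min_support}
        result |= frequent

    return result
```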

transid   item
111       pen
111       ink
111       milk
111       juice
112       pen
112       ink
112       milk
113       pen
113       milk
114       pen
114       ink
114       juice

Finding Frequent Itemsets on the table "Purchases", with minSupport = 0.7

- In the first pass, the following single itemsets are found to be frequent: {pen}, {ink}, {milk}.
- Now we generate the candidates for k = 2: {pen, ink}, {pen, milk}, {pen, juice}, {ink, milk}, {ink, juice} and {milk, juice}.
- By scanning the relation, we determine that the following are frequent: {pen, ink}, {pen, milk}.
- Now we generate the candidates for k = 3: {pen, ink, milk}, {pen, milk, juice}, {pen, ink, juice}.
- By scanning the relation, we determine that none of these are frequent, and the algorithm ends with: { {pen}, {ink}, {milk}, {pen, ink}, {pen, milk} }.
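Running the apriori() sketch from above on these rows reproduces this walkthrough:

```python
purchases = [
    {"pen", "ink", "milk", "juice"},  # 111
    {"pen", "ink", "milk"},           # 112
    {"pen", "milk"},                  # 113
    {"pen", "ink", "juice"},          # 114
]
for itemset in apriori(purchases, 0.7):
    print(set(itemset))
# {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}
```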

Algorithm refinement

- One important refinement: after the candidate-generation phase, and before the scan of the relation, eliminate candidate itemsets that contain a subset which is not frequent. This is justified by the A-priori property.
- In the second iteration, this means we would eliminate {pen, juice}, {ink, juice} and {milk, juice} as candidates, since {juice} is not frequent. So we only check {pen, ink}, {pen, milk} and {ink, milk}.
- Then only {pen, ink, milk} is generated as a k = 3 candidate, but it is eliminated before the scan because {ink, milk} is not frequent.
- So we don't perform the 3rd scan of the relation at all.
- More complex algorithms use the same tools: iterative generation and testing of candidate itemsets.
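In the earlier sketch, this refinement is one extra filter between candidate generation and the scan (assuming the same names as before):

```python
from itertools import combinations

def prune(candidates, frequent):
    """A-priori pruning: keep a (k+1)-candidate only if every subset
    obtained by removing one item is itself a frequent k-itemset."""
    return {c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))}

# Inside apriori(), before the scan:
#     candidates = prune(candidates, frequent)
```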

Association Rules

- Up until now we discussed identification of frequent itemsets. We now wish to go one step further.
- An association rule has the structure {pen} => {ink}, meaning: "if a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction".
- It describes the data in the DB (the past). Extrapolation to future transactions should be done with caution.
- More formally, an Association Rule is LHS => RHS, where both LHS and RHS are sets of items. It states that if every item in LHS was purchased in a transaction, it is likely that the items in RHS were purchased as well.

Measures for Association Rules

1. Support of "LHS => RHS" is the support of the itemset LHS ∪ RHS; in other words, the fraction of transactions that contain all items in LHS ∪ RHS.

2. Confidence of "LHS => RHS": consider all transactions that contain all items in LHS. The fraction of these transactions that also contain all items in RHS is the confidence of the rule:

   confidence(LHS => RHS) = S(LHS ∪ RHS) / S(LHS)

The confidence of a rule is an indication of the strength of the rule.
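In the same style as the earlier support() sketch, both measures are one-liners:

```python
def rule_support(lhs, rhs):
    """Support of LHS => RHS: the support of the itemset LHS ∪ RHS."""
    return support(lhs | rhs)

def confidence(lhs, rhs):
    """Of the transactions containing LHS, the fraction that also
    contain RHS: S(LHS ∪ RHS) / S(LHS)."""
    return support(lhs | rhs) / support(lhs)
```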

transid   item
111       pen
111       ink
111       milk
111       juice
112       pen
112       ink
112       milk
113       pen
113       milk
114       pen
114       ink
114       juice

What is the support of {pen} => {ink}? And its confidence?
What is the support of {ink} => {pen}? And its confidence?
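With the helpers sketched earlier (over the same four transactions):

```python
print(rule_support({"pen"}, {"ink"}))  # 0.75 -- {pen, ink} is in 111, 112, 114
print(confidence({"pen"}, {"ink"}))    # 0.75 -- S({pen, ink}) / S({pen}) = 0.75 / 1.0
print(rule_support({"ink"}, {"pen"}))  # 0.75 -- same itemset, same support
print(confidence({"ink"}, {"pen"}))    # 1.0  -- S({pen, ink}) / S({ink}) = 0.75 / 0.75
```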

Finding Association Rules

- A user can ask for rules with minimum support minSup and minimum confidence minConf.
- First, all frequent itemsets with support > minSup are computed with the previous algorithm.
- Second, rules are generated from the frequent itemsets and checked against minConf.

Finding Association Rules

- Find all frequent itemsets using the previous algorithm.
- For each frequent itemset X with support S(X):
    For each division of X into two itemsets, LHS and RHS:
        Calculate the confidence of LHS => RHS as S(X) / S(LHS).
- We already computed S(LHS) in the previous algorithm (LHS is frequent, since X is frequent).
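A minimal Python sketch of this generation step, reusing the apriori() function from before (the function name and output format are ours):

```python
from itertools import combinations

def association_rules(transactions, min_sup, min_conf):
    """Generate all LHS => RHS rules whose itemset has support > min_sup
    and whose confidence is at least min_conf."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for x in apriori(transactions, min_sup):
        # Every division of X into a non-empty LHS and a non-empty RHS.
        for k in range(1, len(x)):
            for lhs in map(frozenset, combinations(x, k)):
                rhs = x - lhs
                conf = support(x) / support(lhs)  # = S(X) / S(LHS)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules
```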

Generalized Association Rules

transid   date      item
111       1.5.99    pen
111       1.5.99    ink
111       1.5.99    milk
111       1.5.99    juice
112       10.5.99   pen
112       10.5.99   ink
112       10.5.99   milk
113       15.5.99   pen
113       15.5.99   milk
114       1.6.99    pen
114       1.6.99    ink
114       1.6.99    juice

We would like to know whether the rule {pen} => {juice} behaves differently on the first day of the month compared to other days. How?
- What are its support and confidence in general?
- And on the first days of the month?

Generalized Association Rules

- By specifying different attributes to group by (date in the last example), we can come up with interesting rules that we would otherwise miss.
- Another example would be to group by location and check whether the same rules apply to customers from Jerusalem compared to Tel Aviv.
- By comparing the support and confidence of the rules, we can observe differences in the data under different conditions.
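A sketch of such a group-by comparison on the dated Purchases rows above (dates in d.m.yy format; names are ours):

```python
# Dated transactions from the generalized Purchases relation.
dated = [
    ("1.5.99",  {"pen", "ink", "milk", "juice"}),  # 111
    ("10.5.99", {"pen", "ink", "milk"}),           # 112
    ("15.5.99", {"pen", "milk"}),                  # 113
    ("1.6.99",  {"pen", "ink", "juice"}),          # 114
]

def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence) of LHS => RHS over `transactions`."""
    n = len(transactions)
    both = sum(1 for t in transactions if (lhs | rhs) <= t)
    left = sum(1 for t in transactions if lhs <= t)
    return both / n, (both / left if left else None)

all_days = [t for _, t in dated]
first_of_month = [t for d, t in dated if d.split(".")[0] == "1"]

print(rule_stats(all_days, {"pen"}, {"juice"}))        # (0.5, 0.5) in general
print(rule_stats(first_of_month, {"pen"}, {"juice"}))  # (1.0, 1.0) on the 1st
```

On this toy data the rule is much stronger on the first of the month, which is exactly the kind of difference that grouping exposes.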

Caution in prediction

- When we find a pattern in the data, we wish to use it for prediction (in many cases, that is the point).
- However, we have to be cautious about this.
- For example: suppose {pen} => {ink} has high support and confidence. We might give a discount on pens in order to increase sales of pens, and therefore also sales of ink.
- However, this assumes a causal link between {pen} and {ink}.

Caution in prediction

- Suppose pens and pencils are sold together a lot.
- We would then also get the rule {pencil} => {ink} with high support and confidence.
- However, it is clear that there is no causal link between buying pencils and buying ink.
- If we promoted pencils, it would not cause an increase in sales of ink, despite the high support and confidence.
- The chance of inferring "wrong" rules (rules which are not causal links) decreases as the DB size increases, but we should keep in mind that such rules do come up.
- Therefore, the generated rules are only a good starting point for identifying causal links.