Business Intelligence & Data Mining-9


  • Association Rules

  • Market Basket Analysis

  • Association Rules

    Usually applied to market baskets, but other applications are possible.

    Useful rules contain novel and actionable information: e.g., on Thursdays grocery customers are likely to buy diapers and beer together.

    Trivial rules contain already-known information: e.g., people who buy maintenance agreements are the ones who have also bought large appliances.

    Some novel rules may not be useful: e.g., new hardware stores most commonly sell toilet rings.

  • Association Rule: Basic Concepts

    Given: (1) a set of transactions, where (2) each transaction is a set of items (e.g., purchased by a customer in a visit).

    Find: (all?) rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

    Applications:

    Retailing (what other products should the store stock up on?)

    Attached mailing in direct marketing

    Market basket analysis (what do people buy together?)

    Catalog design (which items should appear next to each other?)

  • What Is Association Rule Mining?

    Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

    Examples. Rule form: Body ⇒ Head [support, confidence].

    buys(x, diapers) ⇒ buys(x, beers) [0.5%, 60%]

    major(x, CS) ∧ takes(x, DB) ⇒ grade(x, CS, A) [1%, 75%]

  • Rule Measures: Support and Confidence

    Find all rules X & Y ⇒ Z with minimum confidence and support:

    support, s: the probability that a transaction contains {X, Y, Z}

    confidence, c: the conditional probability that a transaction having {X, Y} also contains Z

    Transaction ID | Items Bought
    2000           | A, B, C
    1000           | A, C
    4000           | A, D
    5000           | B, E, F

    With minimum support 50% and minimum confidence 50%, we have:

    A ⇒ C (50%, 66.6%)
    C ⇒ A (50%, 100%)
    [Venn diagram: customers who buy beer, customers who buy diapers, and their overlap, the customers who buy both]
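    These two measures translate directly into code. Below is a minimal sketch (function names are my own) that reproduces the two rules above from the four-transaction table:

```python
# Toy transaction database from the table above.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # 0.5   -> A => C support 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> A => C confidence 66.6%
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> C => A confidence 100%
```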

  • Mining Association Rules: An Example

    For the rule A ⇒ C:

    support = support({A, C}) = 50%

    confidence = support({A, C}) / support({A}) = 66.6%

    The Apriori principle (Agrawal & Srikant, 1994): any subset of a frequent itemset must also be frequent.

    Transaction ID | Items Bought
    2000           | A, B, C
    1000           | A, C
    4000           | A, D
    5000           | B, E, F

    Min. support 50%, min. confidence 50%.

    Frequent Itemset | Support
    {A}              | 75%
    {B}              | 50%
    {C}              | 50%
    {A, C}           | 50%

  • Mining Frequent Itemsets: the Key Step

    Find the frequent itemsets: the sets of items that have minimum support.

    A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

    Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).

    Use the frequent itemsets to generate association rules.

  • The Apriori Algorithm

    Generate C1: all unique 1-itemsets.

    Generate L1: all unique 1-itemsets with minimum support.

    Join step: Ck is generated by forming the Cartesian product of Lk-1 with L1, since any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

    Prune step: Lk is generated by selecting from Ck those itemsets with minimum support. (A sketch of this loop appears below.)
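    As a minimal Python sketch of this generate-and-test loop (function and variable names are my own; the join uses the Lk-1 × L1 variant described above, plus subset-based pruning), run on the four-transaction database of the next slide with minimum support 2 of 4 transactions (50%), it reproduces L1 through L3:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frequent itemset: support count} using the join/prune loop above."""
    def scan(candidates):
        # Scan D: count how many transactions contain each candidate
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # C1 -> L1: all unique 1-itemsets with minimum support
    C1 = {frozenset([item]) for t in transactions for item in t}
    L = {c: s for c, s in scan(C1).items() if s >= min_count}
    L1, frequent = set(L), dict(L)

    while L:
        # Join step: Cartesian product of L(k-1) with L1
        Ck = {a | b for a in L for b in L1 if not b <= a}
        # Drop candidates having an infrequent (k-1)-subset (Apriori principle)
        Ck = {c for c in Ck
              if all(frozenset(s) in L for s in combinations(c, len(c) - 1))}
        # Prune step: keep candidates with minimum support to form Lk
        L = {c: s for c, s in scan(Ck).items() if s >= min_count}
        frequent.update(L)
    return frequent

# Example database from the next slide; min. support = 2 of 4 transactions
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # ends with [2, 3, 5] 2, matching L3
```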

  • The Apriori Algorithm Example

    Database D (minimum support = 2 transactions, i.e., 50%):

    TID | Items
    100 | 1 3 4
    200 | 2 3 5
    300 | 1 2 3 5
    400 | 2 5

    Scan D → C1:

    itemset | sup.
    {1}     | 2
    {2}     | 3
    {3}     | 3
    {4}     | 1
    {5}     | 3

    L1:

    itemset | sup.
    {1}     | 2
    {2}     | 3
    {3}     | 3
    {5}     | 3

    C2:

    itemset
    {1 2}
    {1 3}
    {1 5}
    {2 3}
    {2 5}
    {3 5}

    Scan D → C2 with counts:

    itemset | sup
    {1 2}   | 1
    {1 3}   | 2
    {1 5}   | 1
    {2 3}   | 2
    {2 5}   | 3
    {3 5}   | 2

    L2:

    itemset | sup
    {1 3}   | 2
    {2 3}   | 2
    {2 5}   | 3
    {3 5}   | 2

    C3: {1 3 5}, {2 3 5} ({1 3 5} is pruned, since its subset {1 5} is not frequent)

    Scan D → L3:

    itemset | sup
    {2 3 5} | 2

  • Is Apriori Fast Enough?

    Performance Bottlenecks

    The core of the Apriori algorithm: use frequent (k-1)-itemsets to generate candidate frequent k-itemsets, and use database scans and matching to collect counts for the candidate itemsets.

    The bottleneck of Apriori: candidate generation.

    Huge candidate sets: 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets. To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates.

    Multiple scans of the database: Apriori needs (n + 1) scans, where n is the length of the longest pattern.

  • Methods to Improve Apriori's Efficiency

    Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent.

    Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.

    Sampling: mine on a subset of the given data, with a lowered support threshold plus a method to determine completeness.

    Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

  • How to Count Supports of Candidates?

    Why is counting supports of candidates a problem? The total number of candidates can be very large, and one transaction may contain many candidates.

    Method:

    Candidate itemsets are stored in a hash-tree.

    A leaf node of the hash-tree contains a list of itemsets and counts; an interior node contains a hash table.

    Subset function: finds all the candidates contained in a transaction (a simplified sketch follows).
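    As a simplified sketch of the subset function (a flat hash table stands in for the hash-tree here, and the names are my own): enumerate the k-subsets of each transaction and look them up among the candidates.

```python
from itertools import combinations
from collections import Counter

def count_candidates(db, candidates, k):
    """Hash-based subset counting: for each transaction, enumerate its
    k-subsets and increment those that are candidate itemsets.
    (A real hash-tree avoids materializing subsets that match no candidate.)"""
    candidate_set = set(candidates)   # O(1) hash lookups
    counts = Counter()
    for t in db:
        for subset in combinations(sorted(t), k):
            if frozenset(subset) in candidate_set:
                counts[frozenset(subset)] += 1
    return counts

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [frozenset(c) for c in ({1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5})]
print(count_candidates(D, C2, 2))    # {2,5}: 3, {1,3}: 2, {2,3}: 2, {3,5}: 2, ...
```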

  • Criticism of Support and Confidence

    Example 1 (Aggarwal & Yu, PODS '98): among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal.

    play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

    not play basketball ⇒ eat cereal [35%, 87.5%] has lower support but higher confidence!

    play basketball ⇒ not eat cereal [20%, 33.3%] is more informative, although it has lower support and confidence.

               | basketball | not basketball | sum(row)
    cereal     | 2000       | 1750           | 3750
    not cereal | 1000       | 250            | 1250
    sum(col.)  | 3000       | 2000           | 5000
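    A quick arithmetic check of these figures (a throwaway sketch; the variable names are my own):

```python
n, basketball, cereal, both = 5000, 3000, 3750, 2000

# play basketball => eat cereal: support 40%, confidence 66.7%
print(both / n, both / basketball)
# not play basketball => eat cereal: support 35%, confidence 87.5%
print((cereal - both) / n, (cereal - both) / (n - basketball))
# play basketball => not eat cereal: support 20%, confidence 33.3%
print((basketball - both) / n, (basketball - both) / basketball)
# overall fraction of cereal eaters: 75%, higher than 66.7%
print(cereal / n)
```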

  • Criticism of Support and Confidence

    Example 2: X and Y are positively correlated, while X and Z, and Y and Z, are negatively correlated. Yet the support and confidence of X ⇒ Z dominate.

    We need a measure of dependent or correlated events: P(B|A)/P(B) is called the lift of the rule A ⇒ B.

    X | 1 1 1 1 0 0 0 0
    Y | 1 1 0 0 0 0 0 0
    Z | 0 1 1 1 1 1 1 1

    Rule  | Support | Confidence
    X ⇒ Y | 25%     | 50%
    X ⇒ Z | 37.5%   | 75%
    Y ⇒ Z | 12.5%   | 50%

  • Other Interestingness Measures: lift

    Lift = P(B|A) / P(B) = P(A ∧ B) / (P(A) · P(B))

    Lift takes both P(A) and P(B) into consideration:

    P(A ∧ B) = P(A) · P(B) if A and B are independent events

    A and B are negatively correlated if lift < 1

    A and B are positively correlated if lift > 1

    X | 1 1 1 1 0 0 0 0
    Y | 1 1 0 0 0 0 0 0
    Z | 0 1 1 1 1 1 1 1

    Rule  | Support | Lift
    X ⇒ Y | 25%     | 2
    X ⇒ Z | 37.5%   | 0.86
    Y ⇒ Z | 12.5%   | 0.57
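    The lift table can be recomputed directly from the eight observations (a minimal sketch; names are my own):

```python
# Recompute the lift table from the eight observations above.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]
n = len(X)

def p(v):                  # marginal probability P(V = 1)
    return sum(v) / n

def p_and(a, b):           # joint probability P(A = 1 and B = 1)
    return sum(x & y for x, y in zip(a, b)) / n

def lift(a, b):            # P(A and B) / (P(A) * P(B))
    return p_and(a, b) / (p(a) * p(b))

for name, (a, b) in {"X=>Y": (X, Y), "X=>Z": (X, Z), "Y=>Z": (Y, Z)}.items():
    print(name, f"support={p_and(a, b):.1%}", f"lift={lift(a, b):.2f}")
# X=>Y support=25.0% lift=2.00
# X=>Z support=37.5% lift=0.86
# Y=>Z support=12.5% lift=0.57
```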

  • Extensions

  • Multiple-Level Association Rules

    Items often form a hierarchy, and items at the lower levels are expected to have lower support.

    Rules regarding itemsets at the appropriate levels could be quite useful.

    A transaction database can be encoded based on dimensions and levels (a sketch of one such encoding follows the hierarchy below).

    [Item hierarchy: food → milk and bread; milk → skim and 2%; bread → wheat and white; brands such as Sunset and Fraser at the lowest level]
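    As a hedged illustration of such an encoding (the digit scheme below is an assumption for illustration, not taken from the slides), each item can be coded with one position per hierarchy level, so that an ancestor's code is a prefix pattern of its descendants':

```python
# Hypothetical level-wise item codes: one character per level, '*' = "any".
CODES = {
    "milk": "1**", "2% milk": "11*", "skim milk": "12*",
    "bread": "2**", "wheat bread": "21*", "white bread": "22*",
}

def is_ancestor(a, b):
    """True if code `a` generalizes code `b` (same prefix, more wildcards)."""
    return all(x in ("*", y) for x, y in zip(a, b)) and a.count("*") > b.count("*")

print(is_ancestor(CODES["milk"], CODES["2% milk"]))   # True
print(is_ancestor(CODES["bread"], CODES["2% milk"]))  # False
```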

  • Mining Multi-Level Associations

    A top-down, progressive deepening approach:

    First find high-level strong rules (ancestors): milk ⇒ bread [20%, 60%].

    Then find their lower-level weaker rules (descendants): 2% milk ⇒ wheat bread [6%, 50%].

    Variations of mining multiple-level association rules:

    Level-crossed association rules: 2% milk ⇒ Wonder wheat bread

    Association rules with multiple, alternative hierarchies: 2% milk ⇒ Wonder bread

  • Multi-level Association: Uniform Support vs. Reduced Support

    Uniform support: the same minimum support for all levels.

    + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.

    − Lower-level items do not occur as frequently, so if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations.

    Reduced support: reduced minimum support at lower levels. This needs modification to the basic algorithm.

  • Uniform Support

    Multi-level mining with uniform support:

    Level 1 (min_sup = 5%): Milk [support = 10%] is frequent.

    Level 2 (min_sup = 5%): 2% Milk [support = 6%] is frequent; Skim Milk [support = 4%] is not.


  • Multi-level Association: Redundancy Filtering

    Some rules may be redundant due to ancestor relationships between items.

    Example:

    milk ⇒ wheat bread [support = 8%, confidence = 70%]

    2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

    We say the first rule is an ancestor of the second rule.

    A rule is redundant if its support is close to the expected value based on the rule's ancestor. (Here, if roughly a quarter of the milk sold is 2% milk, the expected support of the second rule is 8% × 1/4 = 2%, which matches its actual support; its confidence is also close to 70%, so the rule adds little information.)

  • Multi-Level Mining: Progressive Deepening

    A top-down, progressive deepening approach:

    First mine high-level frequent items: milk (15%), bread (10%).

    Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%).

    Different min-support thresholds across levels lead to different algorithms (two acceptance tests are sketched below):

    If adopting the same min-support across all levels, then reject an itemset t if any of t's ancestors is infrequent.

    If adopting reduced min-support at lower levels, then examine only those descendants whose ancestors are frequent and whose own support is at least the reduced min-support.
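    A minimal sketch of the two acceptance tests (the `support` and `ancestors` helpers are assumed to exist; all names are my own):

```python
def accept_uniform(t, support, ancestors, min_sup):
    """Same min-support at all levels: reject t if t or any ancestor of t
    fails the single threshold."""
    return (support(t) >= min_sup and
            all(support(a) >= min_sup for a in ancestors(t)))

def accept_reduced(t, support, ancestors, min_sup_at, level):
    """Reduced min-support at lower levels: examine t only if its ancestors
    are frequent at their level and t meets its own level's threshold."""
    return (all(support(a) >= min_sup_at(level - 1) for a in ancestors(t)) and
            support(t) >= min_sup_at(level))
```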

  • Sequence pattern mining

    A sequence/event database consists of sequences of values or events changing with time, with data recorded at regular intervals.

    Characteristic behaviors of event sequences: trend, cycle, seasonal, irregular.

    Examples:

    Financial: stock prices, inflation

    Biomedical: blood pressure

    Meteorological: precipitation

  • Two types of sequence data

    Event series: record events that happen at certain times, e.g., network logins.

    Time series: record changes of certain (typically numeric) values over time, e.g., stock price movements, blood pressure.

  • Event series

    A series can be represented in two ways (a small example of both follows):

    As a sequence (string) of events, with an empty space where no event occurs at a given time. This makes it hard to represent multiple simultaneous events.

    As a set of tuples {(time, event)}, which allows for multiple events at the same time.
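    A small illustration of the two representations (the data is invented):

```python
# String representation: one slot per time step, '-' marks "no event".
string_form = "A-B--CA"

# Tuple representation: a set of (time, event) pairs; several events may
# share the same timestamp, which the string form cannot express.
tuple_form = {(0, "A"), (2, "B"), (5, "C"), (6, "A"), (6, "B")}

# The string form converts to tuples directly:
as_tuples = {(t, e) for t, e in enumerate(string_form) if e != "-"}
print(sorted(as_tuples))   # [(0, 'A'), (2, 'B'), (5, 'C'), (6, 'A')]
```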

  • Types of interesting info

    Which events happen often? (not too interesting)

    Which groups of events happen often? E.g., people who rent Star Wars also rent Star Trek.

    Which sequences of events happen often? E.g., renting Star Wars, then Empire Strikes Back, then Return of the Jedi, in that order.

    Associations of events within a time window: e.g., people who rent Star Wars tend to rent Empire Strikes Back within one week.

  • Similarity/Difference with Association Rules

    Similarities:

    Groups of events correspond to frequent itemsets.

    Associations correspond to association rules.

    Differences:

    Notion of (time) windows: people who rent Star Wars tend to rent Empire Strikes Back within one week.

    Ordering of events is important.

  • Episodes

    A partially ordered sequence of events. Three episode types:

    Parallel: A and B both occur (B follows A OR A follows B).

    Serial: A → B (B follows A).

    General: the order between A and B is unknown or immaterial, but both A and B precede C.

  • Sub-episode / super-episode

    If A, B & C occur within a time window:

    A & B is a sub-episode of A, B & C.

    A, B & C is the super-episode of A, B, C, A & B, A & C, and B & C.

  • Frequent episodes / Episode Rules

    Frequent episodes: find episodes that appear often.

    Episode rules: used to capture the effect of events on episodes, with support/confidence defined as in association rules.

    Example event sequence (window size = 11): AB-C-DEABE-F-A-DFECDAABBCDE

  • Episode Rules: Example

    Meaning: given that the episode on the left appears, the episode on the right appears 80% of the time. This essentially says that when (A, B) appears, C then appears (within a given window size).

    [Figure: episode rule from the parallel episode (A, B) to the episode in which A and B precede C.]

    Window size 10: support 4%, confidence 80%.

  • Mining episode rules

    Apriori principle for episodes: if an episode is frequent, then all of its sub-episodes are frequent.

    Thus an Apriori-based algorithm can be applied; however, there are a few tricky issues.

  • Mining episode rules

    Recognizing episodes in sequences:

    Parallel episodes: standard association-rule techniques.

    Serial/general episodes: finite-state-machine-based construction.

    Alternative: count parallel episodes first, then use them to generate candidate episodes of the other types.

    Counting the number of windows (sketched below):

    One event occurrence appears in w windows for window size w.

    This is acceptable if the sequence is long, as the ratios even out; however, when the sequence is short, the edge windows can dominate.
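    A minimal sketch of window counting for a parallel episode (names are my own; windows are allowed to overhang the ends of the sequence, which is one common formulation and is what makes a single event fall into exactly w windows):

```python
def count_windows(sequence, episode, w):
    """Count width-w windows that contain every event of `episode`.
    Window start positions run from -w+1 to len(sequence)-1, so windows
    overhang both ends; on short sequences these edge windows dominate."""
    n = len(sequence)
    hits = 0
    for start in range(-w + 1, n):
        window = sequence[max(start, 0):min(start + w, n)]
        if all(e in window for e in episode):
            hits += 1
    return hits

seq = "AB-C-DE"                           # '-' marks empty time slots
print(count_windows(seq, {"A"}, 3))       # 3: one occurrence, w = 3 windows
print(count_windows(seq, {"A", "B"}, 3))  # 2 windows contain both A and B
```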