Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data...

18
Outline • Knowledge discovery in databases. • Data warehousing. • Data mining. • Different types of data mining. • The Apriori algorithm for generating association rules .

Transcript of Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data...

Page 1: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Outline

• Knowledge discovery in databases.

• Data warehousing.

• Data mining.

• Different types of data mining.

• The Apriori algorithm for generating association rules .

Page 2: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Knowledge Discovery in Databases

•What is it?

•Why do we need it?

•How does data mining fit in?

Page 3: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Steps of a KDD Process

• Learn the application domain.

• Create a target dataset.

• Data cleaning and preprocessing.

• Choose the type of data mining to perform.

• Pick an algorithm.

• Data mining!

• Interpretation.

Page 4: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Data Warehousing• How does it differ from a database?

– Databases provide support for:• Queries over current data

• Persistent storage

• Atomic updates

– Data warehouses provide support for:• Storage of all data, current or not

• Details or summaries

• Metadata

• Integrated data (to reduce time in cleaning data)

Page 5: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Types of Data Mining

•Classification Rules

•Clustering

•Sequence similarity

•Sampling/summarizing

•Association Rules

Page 6: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Classification Rules

• Rules that partition data into separate groups.

• Used to classify people as good/bad credit risks, etc.

• A variation is the BestN problem; decide the best N of a set for a given problem (such as find the best N people to mail ski vacation packages to).

Page 7: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Clustering

• Goal is to put data tuples into a class.

• Map each data tuple to a point in n dimensional space, and identify clusters based on spatial proximity.

• Differs from classification rules because here the idea is to group based on similarity overall, not to find the ones that lead to a similar outcome.

Page 8: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Sequence Similarity

• Used where there are time series or ordered data.

• The goal is generally to give an example and look for “similar” patterns.

• Example: “When AT&T stock goes up on 2 consecutive days and DEC stock does not fall during this period, IBM stock goes up the next day 75% of the time.”

Page 9: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Sampling/Summarization

• Sampling: finding samples of the data to carry out more detailed analysis on. Goal is to get the best sample.

• Summarization: finding summaries of all of the data. Goal is to help people figure out what to do with their data, or to prepare reports.

Page 10: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Association Rules

•Rules that express when two or more items are found in the same “basket” of data.

•Used to try to find when certain members of the data cause other members: example, people who buy diapers tend to buy beer.

Page 11: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Support and Confidence

• Association rules are measured in terms of

– support = a and b occur together in at lest s% of the n baskets

– confidence = of all of the baskets containing a, at least c% also contain b.

Page 12: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Association Rules Algorithms

• Focuses on measuring support for “itemsets” (the number of transactions that contain the data set). Confidence is an easier problem and is figured out later.

• The naïve method: Start with all the itemsets of size 1. Find the “large” itemsets. Combine to find itemsets of size 2, etc.

• Apriori algorithm: Tests to make sure that not only does each step combine only large itemsets, but that every subset of the set is also a large itemset.

Page 13: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Apriori Algorithm

• In general, the set of large itemsets of size k is referred to as Lk and Ck is the candidate set of size k

• For each itemset in Lk-1, if all of the items in the set are the same except the last item, then the two itemsets are combined and this is put into the set Ck, which is the list of candidates for large itemsets of size k.

Page 14: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

• Check to see if for each itemset in Ck all of the subsets are in L k-1. This allows you to discard some itemsets without having to check for them in the dataset

• Count the support of the remaining items in Ck and remove those without enough support. This is now the set Lk.

• Bump k and repeat steps 2 through 4 until there is only one set remaining in Lk. When there is only one set in Lk you have found all of the large itemsets.

Page 15: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

ExampleFind all Itemsets with support of 2

Items

1 3 4

2 3 5

1 2 3 5

2 5

Page 16: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Itemsets of size 1

Items

1 3 4

2 3 5

1 2 3 5

2 5

Itemset Support

{1}

{2}

{3}

{4}

{5}

Page 17: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

C2

Itemset{1 2}{1 3}{1 5}{2 3}{2 5}{3 5}

Items

1 3 4

2 3 5

1 2 3 5

2 5

Page 18: Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.

Finding C3

Itemset Support

{1 3} 2

{2 3} 2

{2 5} 3

{3 5} 2