Data Mining Jim King. What is Data Mining? A.k.a. knowledge discovery The search for previously...

Page 1: Data Mining Jim King. What is Data Mining? A.k.a. knowledge discovery. The search for previously unknown relationships in large data sets.

Data Mining

Jim King

Page 2:

What is Data Mining?

A.k.a. knowledge discovery
• The search for previously unknown relationships in large data sets

Why?
• Improved technology allows vast quantities of data to be gathered
• Those relationships can perhaps be used to make future decisions and strategies

Page 3:

How do we Data Mine?

Three considerations to be made
• Classification
• Association
• Sequential

Page 4:

Classification

Generate grouping rules
• Future data can then be classified quickly

Example: Disease classification based on symptoms may lead to better treatments

Page 5:

Association

Two conditions occur together

Presumptive Objective

With some probability (confidence)

Cond1 => Cond2
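The confidence attached to a rule can be written out explicitly. The block below is a sketch of the usual definition, not something stated on the slide: N(...) is an assumed notation for the number of records satisfying a condition.

```latex
\[
\mathrm{confidence}(\mathit{Cond}_1 \Rightarrow \mathit{Cond}_2)
  = P(\mathit{Cond}_2 \mid \mathit{Cond}_1)
  = \frac{N(\mathit{Cond}_1 \wedge \mathit{Cond}_2)}{N(\mathit{Cond}_1)}
\]
```

Read this way, a confidence of 80% means that 80% of the records satisfying Cond1 also satisfy Cond2.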

Page 6:

Sequential

Event B follows Event A

Ex. In e-commerce, what links do people follow?
• After following links to a product, how often do they buy?

Page 7:

Classification Algorithms

Hard clustering vs. Soft clustering
• Collection of classes { C1, C2, ..., Cn }
• Arbitrary Object O
• Soft Clustering: Classes may overlap, so an object may belong to multiple classes
• Hard Clustering: Every object may belong to only one class. No overlap
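The difference between the two kinds of membership can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than part of the slides: each class is represented by a single center value, hard assignment picks the nearest center, and soft assignment admits every class within a hypothetical `radius`.

```python
def hard_assign(obj, centers, distance):
    """Hard clustering: the object gets exactly one class, the nearest center."""
    return min(centers, key=lambda name: distance(obj, centers[name]))


def soft_assign(obj, centers, distance, radius):
    """Soft clustering: the object joins every class whose center is within `radius`."""
    return {name for name, center in centers.items() if distance(obj, center) <= radius}


def gap(a, b):
    """Assumed distance function: absolute difference of two numbers."""
    return abs(a - b)


# Hypothetical classes C1 and C2, each summarized by one center value.
centers = {"C1": 0.0, "C2": 10.0}
print(hard_assign(4.0, centers, gap))
print(sorted(soft_assign(4.0, centers, gap, radius=6.0)))
```

With these made-up values, the object at 4.0 lands in a single class under hard assignment but in both classes under soft assignment, because it lies within the radius of each center.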

Page 8:

Classification

One way: Agglomerative
• Every object starts as its own cluster
• Find the two clusters with the least distance
• Combine them into one cluster
• Repeat; stop when only one cluster remains
• Returns hierarchy of the clustering

Need to decide on some distance function
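The agglomerative steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes single linkage (the distance between two clusters is the distance between their closest members), and the function name `agglomerative` and the sample points are hypothetical.

```python
from itertools import combinations


def agglomerative(points, distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(p,) for p in points]  # every object starts as its own cluster
    history = []

    def linkage(pair):
        # Single linkage (an assumption): closest pair of members across the two clusters.
        a, b = pair
        return min(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)   # combine into one cluster
        history.append((a, b))   # the sequence of merges is the hierarchy
    return history


def gap(a, b):
    """Assumed distance function: absolute difference of two numbers."""
    return abs(a - b)


for left, right in agglomerative([1, 2, 6, 7, 20], gap):
    print(left, "+", right)
```

The returned list of merges, read from first to last, is the hierarchy: early entries join nearby objects, and the final entry joins everything into one cluster. Swapping in a different `distance` or linkage rule changes which merges happen first.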

Page 9:

Classification

Another way: Division method
• Everything initially in one cluster
• Split into two clusters
• Split each new cluster into two more clusters
• Stop when can’t divide any more

Requires more computational power, yet usually gives worse results
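A rough sketch of the division method is given below. The slides do not say how a cluster is split, so the rule used here is an assumption: the two most distant members become seeds, and every object goes to the nearer seed. The names `split` and `divisive` are hypothetical.

```python
from itertools import combinations


def split(cluster, distance):
    """Split one cluster in two around its two most distant members (assumed rule)."""
    seed_a, seed_b = max(combinations(cluster, 2), key=lambda pair: distance(*pair))
    left = [x for x in cluster if distance(x, seed_a) <= distance(x, seed_b)]
    right = [x for x in cluster if distance(x, seed_a) > distance(x, seed_b)]
    return left, right


def divisive(cluster, distance):
    """Return a nested-tuple hierarchy; leaves are clusters that cannot be divided."""
    if len(cluster) < 2:
        return tuple(cluster)
    left, right = split(cluster, distance)
    if not right:  # all members coincide, so no further division is possible
        return tuple(cluster)
    return (divisive(left, distance), divisive(right, distance))


def gap(a, b):
    """Assumed distance function: absolute difference of two numbers."""
    return abs(a - b)


print(divisive([1, 2, 6, 7], gap))
```

The recursion stops when a cluster holds a single object or when all of its members coincide, which is one way to read "can't divide any more".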

Page 10:

Association Algorithms

Given constraints, minimize the criteria needed for a condition

Bought cereal & eggs -> Bought milk
• 80% confidence

Bought cereal -> Bought milk
• 90% confidence

Page 11:

Association

Pruning conditions that fall below a minimum improvement yields simplifications

Other constraints:
• Minimum confidence (30% of those with A include B)
• Minimum support (2% have both A and B)
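Support, confidence, and improvement can be computed directly from a list of transactions. The sketch below is illustrative only: the basket data is invented to reproduce the 80% and 90% figures from the cereal example, and "improvement" is taken to mean a rule's confidence minus the best confidence of any simpler rule with the same consequent, which is one common reading and an assumption here.

```python
from itertools import combinations


def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction also containing `consequent`."""
    covered = count(transactions, antecedent)
    if covered == 0:
        return 0.0
    return count(transactions, set(antecedent) | set(consequent)) / covered


def improvement(transactions, antecedent, consequent):
    """Confidence gained over the best rule that uses a proper subset of `antecedent`."""
    antecedent = sorted(antecedent)
    simpler = [
        confidence(transactions, subset, consequent)
        for size in range(len(antecedent))
        for subset in combinations(antecedent, size)
    ]
    return confidence(transactions, antecedent, consequent) - max(simpler, default=0.0)


# Invented baskets: 10 contain cereal, 9 of those contain milk;
# 5 contain cereal and eggs, 4 of those contain milk.
baskets = (
    [{"cereal", "eggs", "milk"}] * 4
    + [{"cereal", "eggs"}]
    + [{"cereal", "milk"}] * 5
)
print(confidence(baskets, {"cereal"}, {"milk"}))
print(confidence(baskets, {"cereal", "eggs"}, {"milk"}))
print(improvement(baskets, {"cereal", "eggs"}, {"milk"}))
```

On this data the longer rule has negative improvement: adding "eggs" to the condition lowers confidence, so the simpler rule "cereal -> milk" would be kept. Minimum support and minimum confidence can be applied the same way, by comparing the results of `support` and `confidence` against chosen thresholds.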

Page 12:

Sequential Algorithms

People buy basic camping equipment, and later buy other related items

Starting with basic item sets, try to concatenate them and look for the resulting set in customer behavior

Page 13:

Sequential

If the resulting item set is not supported (at all, or above a threshold), drop it

Sequences do not have to be contiguous
• e.g. A customer buys A then B then C, sequence A then C is valid
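The idea of concatenating item sets and dropping unsupported results can be sketched like this. It is a simplified illustration under stated assumptions: each customer history is a tuple of single items in purchase order, and the function names, the camping items, and the 0.5 threshold are all hypothetical.

```python
def is_subsequence(pattern, history):
    """True if `pattern` appears in `history` in order; gaps between items are allowed."""
    remaining = iter(history)
    # `item in remaining` advances the iterator, so later items must come later.
    return all(item in remaining for item in pattern)


def sequence_support(pattern, histories):
    """Fraction of customer histories that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, h) for h in histories) / len(histories)


def extend(frequent, items, histories, min_support):
    """Concatenate one more item onto each frequent sequence; drop unsupported results."""
    candidates = (seq + (item,) for seq in frequent for item in items)
    return [c for c in candidates if sequence_support(c, histories) >= min_support]


# Invented purchase histories for three customers.
histories = [
    ("tent", "stove", "lantern"),
    ("tent", "lantern"),
    ("stove", "tent"),
]
print(is_subsequence(("A", "C"), ("A", "B", "C")))
print(extend([("tent",)], ["stove", "lantern"], histories, min_support=0.5))
```

Because `is_subsequence` allows gaps, a customer who buys A, then B, then C counts toward the sequence "A then C". Calling `extend` repeatedly on its own output grows the sequences one item at a time until no candidate meets the threshold.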

Page 14:

Case Study - SchulWeb

Search site for schools in Germany
How to improve performance and user satisfaction?

Use the log to track user navigation patterns (i.e. which URLs were requested, and in what order?)

Extract information from these patterns
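One simple way to turn a request log into navigation patterns is to count which URL follows which for each visitor. The sketch below assumes a log that is already parsed into (visitor, url) pairs in request order; that format, the visitor ids, and the URLs are invented for illustration and are not taken from the SchulWeb study.

```python
from collections import Counter, defaultdict


def transition_counts(log):
    """Count how often one URL is requested directly after another by the same visitor."""
    sessions = defaultdict(list)
    for visitor, url in log:  # the log is assumed to be in request order
        sessions[visitor].append(url)
    counts = Counter()
    for urls in sessions.values():
        counts.update(zip(urls, urls[1:]))  # consecutive pairs within one visitor's path
    return counts


# Hypothetical log entries: (visitor id, requested URL).
log = [
    ("u1", "/search"),
    ("u2", "/search"),
    ("u1", "/state"),
    ("u1", "/results"),
    ("u2", "/results"),
]
for (src, dst), n in transition_counts(log).most_common():
    print(src, "->", dst, n)
```

Frequent transitions show which paths visitors actually take through the site, which is the kind of information the sequential techniques above are meant to surface.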

Page 15:

Interpretations of Mining

Users don’t like to type text
Prefer to select from available choices

What were they looking for?
• Schools close to some region
• Used option to specify a state (for location)
• Used option to specify a school type (to limit search size)

Page 16:

Changes Made

Made “Near Town” Default
• Made the option obvious, so people started to use it
• Limited region size further, so shorter lists were produced
• Shorter lists are less intimidating, so more people found what they needed

Page 17:

Conclusions

Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks

Can benefit business, medicine, science

More efficient algorithms needed to speed up data mining process

Page 18:

Conclusions

Making Data mining easier to use
• Data with rich descriptions (more fields)
• More Data/Records
• Controlled/Reliable Data Collection (automated vs. manual)
• Way to evaluate results
• Integrate information gained back into system

Page 19:

Final Questions?

www.cs.unr.edu/~king