Data Mining
Jim King
What is Data Mining?
A.k.a. knowledge discovery
• The search for previously unknown relationships in large data sets
Why?
• Improved technology allows for vast quantities of data to be gathered
• Those relationships can perhaps be used to make future decisions and strategies
How do we Data Mine?
Three considerations to be made
• Classification
• Association
• Sequential
Classification
Generate grouping rules
• Future data can then be classified quickly
Example: Disease classification based on symptoms may lead to better treatments
Association
Two conditions occur together
With some probability (confidence)
Cond1 (presumptive) => Cond2 (objective)
Sequential
Event B follows Event A
Ex. In e-commerce, what links do people follow?
• After following links to a product, how often do they buy?
Classification Algorithms
Hard clustering vs. Soft clustering
• Collection of classes { C1, C2, ..., Cn }
• Arbitrary Object O
• Soft Clustering: Classes may overlap where an object belongs to multiple classes
• Hard Clustering: Every object may belong to only one class. No overlap
Classification
One way: Agglomerative
• Every object starts as its own cluster
• Find the two clusters with the least distance between them
• Combine them into one cluster
• Repeat; stop when only one cluster remains
• Returns a hierarchy of the clustering
Need to decide on some distance function
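
The agglomerative steps above can be sketched as follows. This is a minimal illustration, not the method from the talk: Euclidean distance and single linkage (closest pair of members) are assumed choices, the sample points are made up, and the merge history stands in for the hierarchy.

```python
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two points (an assumed distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(c1, c2):
    """Single linkage (assumed): smallest distance between any two members."""
    return min(distance(a, b) for a in c1 for b in c2)

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merges in order."""
    clusters = [(p,) for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Pick the pair of cluster indices with the least distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Made-up sample data: two tight pairs far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
history = agglomerate(points)
for left, right in history:
    print(left, "+", right)
```

Reading the merge history from last to first gives the hierarchy from the top down. Comparing every pair of clusters on every pass is slow for large data sets, so this is only suitable for small examples.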
Classification
Another way: Division method
• Everything is initially in one cluster
• Split into two clusters
• Split each new cluster into two more clusters
• Stop when the clusters can’t be divided any more
Requires more computational power, yet usually gives worse results
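
A rough sketch of the division method, under stated assumptions: the slides do not say how a cluster is split, so the rule here (seed the two halves with the two most distant members, then assign every point to the nearer seed) is a hypothetical choice, as are the Euclidean distance and the sample points.

```python
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two points (an assumed distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def split(cluster):
    """Split around the two most distant members (a hypothetical splitting rule)."""
    seed_a, seed_b = max(combinations(cluster, 2), key=lambda pair: distance(*pair))
    left = [p for p in cluster if distance(p, seed_a) <= distance(p, seed_b)]
    right = [p for p in cluster if distance(p, seed_a) > distance(p, seed_b)]
    return left, right

def divide(cluster):
    """Return a nested hierarchy: a list of points at a leaf, else a pair of subtrees."""
    if len(set(cluster)) <= 1:
        return list(cluster)  # cannot be divided any more
    left, right = split(cluster)
    return (divide(left), divide(right))

# Made-up sample data: two tight pairs far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
tree = divide(points)
print(tree)
```

The first split separates the two far-apart pairs, and the next level separates the members of each pair.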
Association Algorithms
Given constraints, minimize the criteria needed for a condition
Bought cereal & eggs -> Bought milk
• 80% confidence
Bought cereal -> Bought milk
• 90% confidence
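
The two confidence figures above can be reproduced on a small set of baskets. The baskets below are invented so that the numbers come out at 80% and 90%; "improvement" is taken here to mean the longer rule's confidence minus the simpler rule's confidence, which is one common reading and an assumption about what the talk intends.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    matching = [t for t in transactions if antecedent <= t]
    if not matching:
        return 0.0
    return sum(1 for t in matching if consequent <= t) / len(matching)

# Hypothetical baskets, chosen to match the 80% and 90% figures.
transactions = (
    [{"cereal", "eggs", "milk"}] * 4
    + [{"cereal", "eggs"}] * 1
    + [{"cereal", "milk"}] * 5
)

full = confidence(transactions, {"cereal", "eggs"}, {"milk"})
simple = confidence(transactions, {"cereal"}, {"milk"})

# Assumed definition: how much the extra condition (eggs) raises confidence.
improvement = full - simple
MIN_IMPROVEMENT = 0.0  # hypothetical threshold
keep_full_rule = improvement >= MIN_IMPROVEMENT

print(f"cereal & eggs -> milk: {full:.0%}")
print(f"cereal -> milk: {simple:.0%}")
print("keep longer rule:", keep_full_rule)
```

Adding "eggs" lowers confidence, so the longer rule is dropped in favor of the simpler one.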
Association
Pruning conditions that fall below a minimum improvement yields simplifications
Other constraints:
• Minimum confidence (e.g. 30% of records with A also include B)
• Minimum support (e.g. 2% of records have both A and B)
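
A small sketch of the two constraints, assuming records are sets of items. The function names and the made-up records are illustrative only; the 2% and 30% defaults are the example thresholds from the slide.

```python
def support(transactions, items):
    """Share of all transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

def rule_passes(transactions, a, b, min_support=0.02, min_confidence=0.30):
    """Check the rule a => b against minimum support and minimum confidence."""
    both = support(transactions, a | b)
    only_a = support(transactions, a)
    if only_a == 0:
        return False  # a never occurs, so the rule cannot be evaluated
    return both >= min_support and both / only_a >= min_confidence

# Made-up records: 3 have both A and B, 5 have only A, 92 have only C.
transactions = [{"A", "B"}] * 3 + [{"A"}] * 5 + [{"C"}] * 92

print(rule_passes(transactions, {"A"}, {"B"}))  # support 3%, confidence 37.5%
print(rule_passes(transactions, {"A"}, {"C"}))  # A and C never occur together
```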
Sequential Algorithms
People buy basic camping equipment
Later buy other related items
Starting with basic item sets, try to concatenate and find the resulting set among customer behavior
Sequential
If the resulting item set is not supported (at all, or not above a threshold), drop it
Sequences do not have to be contiguous
• i.e. A customer buys A then B then C, sequence A then C is valid
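
The concatenate-and-drop idea, including non-contiguous matching, might look like this. It is a simplified sketch: the purchase histories are invented, and the function names and the 50% threshold are assumptions for illustration.

```python
def contains_in_order(history, pattern):
    """True if the pattern's items appear in the history in order, gaps allowed."""
    remaining = iter(history)
    # `item in remaining` advances the iterator, so later items must come later.
    return all(item in remaining for item in pattern)

def sequence_support(histories, pattern):
    """Share of customer histories that contain the pattern."""
    return sum(contains_in_order(h, pattern) for h in histories) / len(histories)

def extend(frequent, items, histories, min_support):
    """Append each item to each frequent sequence; drop unsupported results."""
    candidates = [seq + (item,) for seq in frequent for item in items]
    return [c for c in candidates if sequence_support(histories, c) >= min_support]

# Invented purchase histories, one tuple per customer, in time order.
histories = [("A", "B", "C"), ("A", "C"), ("B", "A"), ("A", "B", "C")]

print(contains_in_order(("A", "B", "C"), ("A", "C")))  # A then C, skipping B
print(sequence_support(histories, ("A", "C")))
print(extend([("A",)], ["A", "B", "C"], histories, min_support=0.5))
```

Starting from the one-item sequence A, only "A then B" and "A then C" stay above the threshold; "A then A" is dropped because no customer supports it.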
Case Study - SchulWeb
Search site for schools in Germany
How to improve performance and user satisfaction?
Use log to track user navigation patterns (i.e. what URLs were requested, in what order?)
Extract information from these
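
As an illustration of extracting navigation patterns from a log, the sketch below counts how often one URL is requested directly after another within a session. The log format, session ids, and URLs are all hypothetical; the actual SchulWeb log layout is not given in the slides.

```python
from collections import Counter

# Hypothetical log records: (session id, requested URL), already in time order.
log = [
    ("s1", "/search"), ("s1", "/results"), ("s1", "/school"),
    ("s2", "/search"), ("s2", "/results"),
    ("s3", "/search"), ("s3", "/help"),
]

def transitions(log):
    """Count (previous URL, next URL) pairs within each session."""
    last_url = {}
    counts = Counter()
    for session, url in log:
        if session in last_url:
            counts[(last_url[session], url)] += 1
        last_url[session] = url
    return counts

counts = transitions(log)
for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n}")
```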
Interpretations of Mining
Users don’t like to type text
Prefer to select from available choices
What were they looking for?
• Schools close to some region
• Used option to specify a state (for location)
• Used option to specify a school type (to limit search size)
Changes Made
Made “Near Town” Default
• Made option obvious, people started to use it
• Limited region size further, short lists produced
• Shorter lists less intimidating, more people found what they needed
Conclusions
Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks
Can benefit business, medicine, science
More efficient algorithms needed to speed up data mining process
Conclusions
Making Data mining easier to use
• Data with rich descriptions (more fields)
• More Data/Records
• Controlled/Reliable Data Collection (automated vs. manual)
• Way to evaluate results
• Integrate information gained back into system
Final Questions?
www.cs.unr.edu/~king