http://www.intelliware.ca — © 2006 Intelliware Development Inc.

Page 1:

An Introduction to Data Mining Concepts

Tim Eapen and B.C. Holmes

Intelliware Development

Page 2:

Agenda

• Introduction to data mining
  • The typical steps
  • What we were trying to accomplish
• Bayesian categorization
  • An example
• Data clustering
  • k-means clustering
  • Interesting conclusions
• Other stuff
  • Java and data mining

Page 3:

What is Data Mining?

• Data mining is the discovery of useful information from data
• Data mining touches on many of the same problems as machine learning and artificial intelligence
• This is a huge topic, and we can't hope to do more than touch on it today

Page 4:

Some Crazy Examples

Here are some interesting examples of useful information gleaned from data:

• "Diapers and beer": people who buy diapers are also likely to buy beer. Put potato chips between them and the sales of all three items go up
• Google AdWords: "digital cameras" is worth more than "digital camera"
• Airline traveller behaviours
• Amazon.ca: "other people who bought this DVD liked such-and-such"

Page 5:

The Data Mining Process

Gather the Data → Cleanse → Extract the "Good Stuff" → Identify Patterns → Vet the Results

Page 6:

What We Were Trying to Accomplish

• Tim, Tom, and I were working on the WhatAmITaking.com project
• WhatAmITaking.com is a wiki/repository that collects information about medications
• The data is all available from public sources, including:
  • Government drug reference databases
  • Wikipedia
  • Open-licence publications available through the (U.S.) National Institutes of Health
  • News articles
• Concept: we want to use data mining techniques on publications and news
• First steps: we wanted to try to emulate the Google News-style categorization and "topic" correlation

Page 7:

But Along the Way…

We learned some interesting things about the field of Data Mining

Page 8:

News: Obtaining News

• How do we get news? We need to build a "bot" or "web crawler" that goes out to a large number of web sites and GETs the interesting content
  • Nice addition: look for links to other pieces of news
• Some complications:
  • There's a "Good Internet Citizen" standard (the robots.txt standard) that should be respected: if a site's robots.txt file says "bots keep out", you shouldn't crawl that site (a minimal sketch of the check follows below)
  • How do you determine what's a story and what's not? That's a hard problem: too big a topic for this presentation

Page 9:

Data Cleansing

• You would not believe how bad some news sites are with respect to their content:
  • Poor formatting
  • Bad encoding problems
  • Clear problems related to converting the content from another format (e.g. Word)
• Two interesting word-related cleansing problems (a sketch follows below):
  • The "US spelling" versus "British spelling" problem
  • Root words
• Some of it looks deliberately obfuscated
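A toy illustration of the two word-related cleansing problems. The spelling map and suffix rules here are invented examples, not the project's actual rules; a real system would use a full spelling table and a proper stemmer such as Porter's algorithm.

```java
import java.util.Map;

/** Illustrative word normalization: a crude suffix-stripping "root word" step
 *  followed by a tiny British-to-US spelling map. */
public class WordNormalizer {

    private static final Map<String, String> BRITISH_TO_US = Map.of(
            "colour", "color",
            "behaviour", "behavior",
            "paediatric", "pediatric",
            "oesophagus", "esophagus");

    public static String normalize(String word) {
        String w = word.toLowerCase();
        // crude root-word reduction: strip a couple of common suffixes
        if (w.endsWith("ing") && w.length() > 5) {
            w = w.substring(0, w.length() - 3);
        } else if (w.endsWith("s") && w.length() > 3) {
            w = w.substring(0, w.length() - 1);
        }
        // normalize British spellings to US spellings
        return BRITISH_TO_US.getOrDefault(w, w);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Behaviours"));  // prints "behavior"
    }
}
```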

Page 10:

Extracting Interesting Stuff

• Your typical web page news article has a lot of extra stuff on it: banner ads, menus, links to "related stories", navigation widgets, etc.
• Almost all discussions of word manipulation talk about "stop words": words that are so common they provide no significant meaning when analyzing text: the, he, she, said, it, etc. (a small filtering sketch follows below)
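A minimal stop-word filter along these lines; the stop list is illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Minimal stop-word filter. A real system would use a much longer, curated list. */
public class StopWordFilter {

    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "he", "she", "it", "said", "and", "of", "to", "in");

    public static List<String> contentWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(contentWords("He said the hockey puck crossed the line"));
        // [hockey, puck, crossed, line]
    }
}
```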

Page 11:

Two Interesting Topics

• Categorization: I know what the groups are, and I want to assign a group to any particular data point
  • E.g.: news is categorized as Sports, Health, Finance, World News, National, etc.
• Data clustering: I have a lot of data, and I want to find some mechanism for finding meaningful groups
  • E.g.: news "events"

Page 12:

Bayesian Analysis

A Delightful Example

Page 13:

The Problem

(Diagram: news categories — Health, Sports, Technology, Business News, Entertainment)

• Given a random news article, how can we determine what category it belongs to?

Page 14:

In Light of New Evidence…

• Do some detective work!
  • Start off with a hypothesis
  • Collect evidence
  • The evidence will be either consistent or inconsistent with a given hypothesis
  • As more evidence is accumulated, the degree of belief in the initial hypothesis will change
  • A hypothesis with a very high degree of belief may be accepted as true
  • Likewise, a hypothesis with a very low degree of belief may be considered false
• How do we measure this degree of belief?

Page 15:

Bayes’ Theorem
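The formula image from this slide (and the two that follow) isn't in the transcript; the standard statement of Bayes' Theorem, in the notation used later in the deck, is:

    P(H|E) = P(E|H) × P(H) / P(E)

where H is the hypothesis, E is the evidence, P(H) is the prior degree of belief, and P(H|E) is the degree of belief after seeing the evidence.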

Page 16:

Bayes’ Theorem

Page 17:

Bayes’ Theorem

Page 18:

An Edible Example

• Bowl #1: 10 chocolate chip cookies, 30 oatmeal cookies
• Bowl #2: 20 chocolate chip cookies, 20 oatmeal cookies

Page 19:

State a Hypothesis

• Little Johnny picks a bowl at random
• Little Johnny picks a cookie at random
• The cookie turns out to be an oatmeal cookie
• How probable is it that Johnny picked the cookie out of bowl #1?

Page 20:

Consider the Evidence

• Probability of selecting an oatmeal cookie given Johnny chooses bowl #1: P(E|H1) = 30/40 = 0.75
• Probability of selecting an oatmeal cookie given Johnny chooses bowl #2: P(E|H2) = 20/40 = 0.5

Page 21:

An Edible Example

• Bayes' Theorem gives the following result (worked through below)
• Notice that initially the prior probability that the cookie came from bowl #1 was P(H1) = 0.5
• In light of evidence E, the probability that the cookie came from bowl #1 increased to P(H1|E) = 0.6
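The slide's worked formula isn't in the transcript; plugging the numbers from the two bowls into Bayes' Theorem gives:

    P(H1|E) = P(E|H1) × P(H1) / [P(E|H1) × P(H1) + P(E|H2) × P(H2)]
            = (0.75 × 0.5) / (0.75 × 0.5 + 0.5 × 0.5)
            = 0.375 / 0.625
            = 0.6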

Page 22:

Back to Our Problem…

• Given a random news article, how can we determine what category it belongs to?

OF COURSE WE CAN! USE BAYESIAN ANALYSIS

Page 23:

Naïve Bayes Classifier

• To categorize a news article, use a Naïve Bayes classifier
  • A simple probabilistic classifier based on some naïve independence assumptions
  • Can be "trained"

Naïve Probabilistic Model

• The probability model for a classifier is conditional:
  • Given a news article with n words…
  • Let C represent a category of news (e.g. Health)
  • Let Fi represent the frequency with which the ith word appears in articles from category C (i = 1…n)

Page 24:

Naïve Probabilistic Model

• We can express our probability model using Bayes' Theorem (the model is written out below)
• Solving this is difficult, so we make some simplifying assumptions:
  • The denominator is constant
  • Naïvely assume that each feature (word frequency) Fi is conditionally independent of every other feature Fj (i ≠ j)
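The model itself isn't shown in the transcript; in the notation of these slides it is:

    p(C | F1, …, Fn) = p(C) × p(F1, …, Fn | C) / p(F1, …, Fn)

The two assumptions let us drop the denominator and factor the remaining term into per-word frequencies.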

Page 25:

Naïve Probabilistic Model

• Problems with our assumptions:
  • Words have context
  • Assuming that the frequency (Fi) of word i is independent of the frequency (Fj) of word j is untrue
  • For example, the words "War" and "Afghanistan" are more likely to appear in the same article than the words "War" and "Tuna"
• Benefits of our assumptions:
  • They greatly simplify the math and the algorithm

Page 26:

Naïve Probabilistic Model

    p(C | F1, …, Fn) ≈ p(C) × p(F1 | C) × p(F2 | C) × … × p(Fn | C)

• We can approximate the probability that an article belongs to category C as a "prior" probability that the article belongs to that category, multiplied by the product of the individual word frequencies for that category

Page 27:

A Simple Algorithm for Classifying An Article

Given a random article with n words, to classify the article into one of several possible categories do the following:

• For each possible category, calculate the probability that the article belongs to that category by considering the prior probability and the word frequencies
• Classify the article as belonging to the category with the highest probability (see the sketch below)
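A minimal sketch of this loop in Java. The class and method names are hypothetical, not from the project's code; the priors and per-word frequencies are assumed to come from a trained word data source, and unseen words are given a small floor value so a single missing word can't zero out the product. Running main with the hockey/puck frequencies from the next two slides prints "Sports".

```java
import java.util.List;
import java.util.Map;

/** Naive Bayes classification: score = prior × product of per-word frequencies;
 *  pick the category with the highest score. */
public class SimpleNaiveBayes {

    /** frequencies.get(category).get(word) = p(word | category); priors.get(category) = p(category). */
    public static String classify(List<String> articleWords,
                                  Map<String, Double> priors,
                                  Map<String, Map<String, Double>> frequencies) {
        String best = null;
        double bestScore = -1.0;
        for (String category : priors.keySet()) {
            double score = priors.get(category);
            for (String word : articleWords) {
                // unseen words get a small non-zero frequency (arbitrary smoothing value)
                score *= frequencies.get(category).getOrDefault(word, 0.001);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> priors = Map.of("Sports", 0.5, "News", 0.5);
        Map<String, Map<String, Double>> freqs = Map.of(
                "Sports", Map.of("hockey", 0.98, "puck", 0.96),
                "News",   Map.of("hockey", 0.02, "puck", 0.04));
        System.out.println(classify(List.of("hockey", "puck"), priors, freqs));  // Sports
    }
}
```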

Page 28:

A Simple Example

Consider this very simple article, containing the words "hockey" and "puck"…

• For simplicity, consider that there are only two possible categories:
  • Sports
  • News

Page 29:

A Simple Example…

Consider the following word frequencies:

Word Category Frequency

hockey Sports 98%

puck Sports 96%

hockey News 2%

puck News 4%

1. Let C = Sports: p(C) = 0.5, p(F1|C) = 0.98 and p(F2|C) = 0.96
   p(C|F1,F2) = 0.5 × 0.98 × 0.96 = 0.4704
2. Let C = News: p(C) = 0.5, p(F1|C) = 0.02 and p(F2|C) = 0.04
   p(C|F1,F2) = 0.5 × 0.02 × 0.04 = 0.0004

The Sports score is much higher, so the article is classified as Sports.

Page 30:

Gathering the Evidence

• So where do the frequencies we use come from?
• To perform Bayesian analysis, it is important to have a large "corpus" of articles
  • This corpus is what we use to determine the word frequencies used in categorizing a given article
  • This corpus would grow over time
  • This corpus is what we use to "train" our Bayesian classifier

Page 31:

What We Actually Did

• First step was to gather a "corpus" of articles
  • This corpus would be used to train our Bayesian classifier
  • Initially started by gathering 5,000 articles
  • The number of articles in the corpus would grow over time
• Built a simple little "NewsFinder" utility that would regularly go to http://news.google.ca/ and gather articles
• Google has seven categories of news: World, Canada, Health, Business, Science, Sports, Entertainment

Page 32:

Bayesian Classifier

• Started with an open-source package from SourceForge called Classifier4J, available at http://classifier4j.sourceforge.net/
• Created a SimpleClassifier
  • This classifier has an instance of our Bayesian classifier, which does all the Bayesian analysis for us
  • The classifier also has a WordDataSource: a simple map that correlates a frequency with a given word in a given category (a sketch of the idea is below)
• Used our corpus of articles to train our classifier (i.e. fill up our word data source)
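A sketch of what such a word data source might look like. This is an illustration of the idea only, not the actual Classifier4J API or the project's code; the classifier sketch shown earlier would look up frequency(category, word) for each word in an article.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Illustrative word data source: counts how often each word appears in training
 *  articles for each category, and reports per-category word frequencies. */
public class WordDataSource {

    // category -> (word -> number of training articles in that category containing the word)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // category -> number of training articles seen for that category
    private final Map<String, Integer> articleCounts = new HashMap<>();

    public void train(String category, Collection<String> articleWords) {
        articleCounts.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : new HashSet<>(articleWords)) {   // count each word once per article
            counts.merge(word, 1, Integer::sum);
        }
    }

    /** Approximate p(word | category): the fraction of training articles in the category containing the word. */
    public double frequency(String category, String word) {
        int articles = articleCounts.getOrDefault(category, 0);
        if (articles == 0) {
            return 0.001;                                    // arbitrary small default for an untrained category
        }
        int count = wordCounts.getOrDefault(category, Map.of()).getOrDefault(word, 0);
        return Math.max(count / (double) articles, 0.001);   // floor at a small value to avoid exact zeros
    }
}
```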

Page 33:

Issues To Consider

• Making sure that the corpus was clean
  • This was part of "cleansing" the data as we gathered it
• Had to actually tweak Classifier4J because the algorithm wasn't implemented correctly

Page 34:

Clustering

What is a Cluster, anyway?

Page 35:

Data Clustering

• Data clustering is the process of taking "points" in some n-dimensional space and grouping them into understandable groups
• That's kind of "math-y" sounding. How does that relate to news?
  • This is the fundamental question: deciding on good "measures" is the key success criterion
  • I want to defer the answer for now
• There are two fundamental approaches:
  • Centroid: guess certain centres of clusters, and iteratively refine them
  • Hierarchical: assume that each point is a cluster, and iteratively merge them until "good" clusters emerge

Page 36:

Another Key Consideration

• The field of data mining spends a lot of time thinking about one special problem: often there's too much data to fit into memory, so any algorithm that tries to "cluster" information must account for the data not fitting into memory
• I'm not going to say too much about this problem

Page 37:

k-Means Algorithm

• One of the fundamental centroid-based algorithms is called the "k-means" algorithm
• Assume you have a number of data points and you want to cluster them into some number of clusters (k)
  • You don't really need to know what the clusters represent; k is just some arbitrary number of clusters

Page 38:

Step One: Pick k=3 objects

Page 39:

Step Two: Create initial Groupings

Groups are based on distance from initial points

Page 40:

Step Three: Find the “centres”/means

Page 41:

Step Four: Re-jig the clusters

Page 42:

Repeat until the Clusters don’t change
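Pulling the steps together, a minimal k-means sketch for 2-D points using Euclidean distance (the presentation itself contains no code; this is only an illustration).

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal k-means for 2-D points: pick k initial centres, assign each point to its
 *  nearest centre, recompute the centres as cluster means, and repeat until the
 *  assignments stop changing. */
public class KMeans {

    public static int[] cluster(double[][] points, int k, long seed) {
        Random random = new Random(seed);
        // Step one: pick k of the points as the initial centres
        double[][] centres = new double[k][];
        for (int c = 0; c < k; c++) {
            centres[c] = points[random.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {                                   // repeat until the clusters don't change
            changed = false;
            // Steps two and four: (re-)assign each point to its nearest centre
            for (int i = 0; i < points.length; i++) {
                int nearest = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[i], centres[c]) < distance(points[i], centres[nearest])) {
                        nearest = c;
                    }
                }
                if (nearest != assignment[i]) {
                    assignment[i] = nearest;
                    changed = true;
                }
            }
            // Step three: move each centre to the mean of its cluster
            for (int c = 0; c < k; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        sumX += points[i][0];
                        sumY += points[i][1];
                        count++;
                    }
                }
                if (count > 0) {
                    centres[c] = new double[] { sumX / count, sumY / count };
                }
            }
        }
        return assignment;
    }

    private static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {0.5, 1.5}, {9, 8.5} };
        System.out.println(Arrays.toString(cluster(points, 2, 42L)));
    }
}
```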

Page 43:

But How Do You Decide on k?

• A key question to ask is "how many clusters is the right number?"
• Try a bunch of different values of k and plot the resulting total distance against k

(Chart on the slide: total distance for k = 1 through 5)

Page 44:

Converting from Words to Points

One idea:
• There are about 100,000,000 English words
• Consider an n-dimensional space, where n = 100,000,000
• The frequency of a particular word in an article can be considered a distance along one dimension of that n-dimensional space (a sketch follows below)
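A sketch of that idea, using sparse maps rather than literal 100,000,000-element arrays: words that don't appear in an article simply contribute 0 to the distance. The class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Treat each article as a sparse vector of word frequencies and measure the
 *  Euclidean distance between two articles. */
public class ArticleVectors {

    /** word -> frequency of that word within the article. */
    public static Map<String, Double> toVector(String[] words) {
        Map<String, Double> vector = new HashMap<>();
        for (String word : words) {
            vector.merge(word.toLowerCase(), 1.0 / words.length, Double::sum);
        }
        return vector;
    }

    public static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> allWords = new HashSet<>(a.keySet());
        allWords.addAll(b.keySet());
        double sum = 0;
        for (String word : allWords) {
            double diff = a.getOrDefault(word, 0.0) - b.getOrDefault(word, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> a = toVector("hockey puck hockey goal".split(" "));
        Map<String, Double> b = toVector("puck goal net".split(" "));
        System.out.println(distance(a, b));
    }
}
```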

Page 45:

Unintuitive Conclusions

When dealing with points in an n-dimensional space where n is very large (say, > 100), most pairs of points end up roughly the same distance apart: about the average distance.

Page 46:

Determining a Good Measuring Stick

• So how do you deal with the problem of large-dimensional spaces? Try to determine a smaller set of "interesting" dimensions
• Try this:
  • Pick an article
  • In that article, try to find 25 "interesting" words. What's "interesting"?
    • 10 of the most common words in the article (excluding stop words)
    • 10 of the most significant "classification" words (e.g. certain words are strongly correlated with health articles; find the 10 most strongly correlated that also occur frequently in the article)
    • 5 unusual words
  • Now you've got a measuring stick
  • Measure other articles against this measuring stick and figure out distance (a sketch follows below)
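A rough sketch of building the measuring stick. The inputs (per-article word counts, per-category correlation scores, corpus document frequencies) are assumed to be computed elsewhere, and the selection heuristics are only an illustration of the slide's recipe, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Pick ~25 "interesting" dimensions for one article: 10 common words, 10 words strongly
 *  correlated with a category, and 5 unusual (rare-in-the-corpus) words. */
public class MeasuringStick {

    public static Set<String> interestingWords(Map<String, Integer> articleWordCounts,
                                               Map<String, Double> categoryCorrelation,
                                               Map<String, Double> corpusDocumentFrequency,
                                               Set<String> stopWords) {
        List<String> words = new ArrayList<>(articleWordCounts.keySet());
        words.removeAll(stopWords);
        Set<String> chosen = new LinkedHashSet<>();

        // 10 most common words in the article (excluding stop words)
        words.sort((a, b) -> articleWordCounts.get(b) - articleWordCounts.get(a));
        chosen.addAll(words.subList(0, Math.min(10, words.size())));

        // 10 words (also in the article) most strongly correlated with the target category
        words.sort((a, b) -> Double.compare(categoryCorrelation.getOrDefault(b, 0.0),
                                            categoryCorrelation.getOrDefault(a, 0.0)));
        chosen.addAll(words.subList(0, Math.min(10, words.size())));

        // 5 "unusual" words: rarest across the whole corpus
        words.sort((a, b) -> Double.compare(corpusDocumentFrequency.getOrDefault(a, 0.0),
                                            corpusDocumentFrequency.getOrDefault(b, 0.0)));
        chosen.addAll(words.subList(0, Math.min(5, words.size())));

        return chosen;  // up to 25 distinct words; overlaps simply collapse
    }
}
```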

Page 47:

Java and Data Mining

• There are a few (but not many) Java initiatives relating to data mining
• Bayesian classifier: Classifier4J
  • Used this initially, and discovered that the algorithm wasn't correctly implemented
• Weka
  • Created by a number of data mining professors
  • The same group has published a data mining book with some references to Weka (but it's a heavy math book)
• YALE ("Yet Another Learning Environment")
• There's a Java Community Process effort around coming up with a consistent Java API for data mining: JSR 73 and JSR 247 (javax.datamining)

Page 48:

Other Topics (Use Wikipedia)

• w-shingling
• Concept mining

Page 49:

Crazy Ideas that Might Make Interesting Experiments

• Could you perform data mining on code?
  • What if you parsed camel-case variable and class names and performed text clustering on classes?
  • Could you find interesting relationships between classes? In different projects?
• What could you learn if you tried to perform clustering on a bunch of open-source web frameworks? How much similarity and/or difference do they have?