The Magical Art of Extracting Meaning From Data
Transcript of The Magical Art of Extracting Meaning From Data
8/3/2019 The Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From Data
Data Mining For The Web
Luis Rei
[email protected]
http://luisrei.com/
Outline
Introduction
Recommender Systems
Classification
Clustering
The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all. (W.H. Auden)
The key in business is to know something that nobody else knows. (Aristotle Onassis)
DATA
Luis Rei, 25, codebits, 4

MEANING
NAME, PERSON: Luis Rei
AGE: 25
PHOTO
WEBSITE: http://luisrei.com/
Tools
Python vs C or C++
feedparser, BeautifulSoup (scrape web pages)
NumPy, SciPy
Weka
R
Libraries
http://mloss.org/software/
Down The Rabbit Hole
In 2006, Google's search crawler used 850TB of data. Total web history is around 3PB
Think of all the audio, photos & videos
That's a lot of data
Open formats (HTML, RSS, PDF, ...)
Everyone + their dog has an API
facebook, twitter, flickr, last.fm, delicious, digg, gowalla, ...
Think about:
news articles published every day
status updates / day
Recommendations
The Netflix Prize
In October 2006 Netflix launched an open competition for the bestcollaborative filtering algorithm
at least 10% improvement over Netflix's own algorithm
Predict user ratings for films based on previous ratings (by all users)
US$1,000,000 prize won in Sep 2009
The Three Acts
I: The Pledge
The magician shows you something ordinary. But of course... it probably isn't.
II: The Turn
The magician takes the ordinary something and makes it do something extraordinary. Now you're looking for the secret...
III: The Prestige
But you wouldn't clap yet. Because making something disappear isn't enough; you have to bring it back.
Collaborative Filtering
I. Collect Preferences
II. Find Similar Users
or Items
III. Recommend
I. Collecting Preferences
yes/no votes
Ratings in stars
Purchase history
Who you follow / who's your friend
The music you listen to or the movies you watch
Comments (Bad, Great, Lousy, ...)
II. Similarity
Euclidean Distance: sum over shared items of (a - b)²
Pearson Correlation
Olsen Twins - notice the similarity!
> 0.0 (positive correlation), < 1.0 (not equal)
Same eyes, nose, ... Different hair color, dress, earrings, ...
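The two measures can be sketched in Python over a `prefs` dict mapping each user to their item ratings (the function names and data layout are illustrative, not the talk's actual code):

```python
from math import sqrt

def euclidean_similarity(prefs, a, b):
    """Similarity in (0, 1] derived from Euclidean distance over shared ratings."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist_sq = sum((prefs[a][it] - prefs[b][it]) ** 2 for it in shared)
    return 1.0 / (1.0 + sqrt(dist_sq))

def pearson_similarity(prefs, a, b):
    """Pearson correlation in [-1, 1] over shared ratings."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][it] for it in shared]
    ys = [prefs[b][it] for it in shared]
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = sum_xy - (sum_x * sum_y / n)
    den = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den if den else 0.0
```

Euclidean distance is mapped into (0, 1] so that, like the correlation, bigger means more similar.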
III. Recommend
Users Vs Items
Find similar items instead of similar users! Same recommendation process:
just switch users with items & vice versa (conceptually)
Why?
Works for new users
Might be more accurate (might not)
It can be useful to have both
Cross-Validation
How good are the recommendations?
Partitioning the data: Training set vs Test set
Size of the sets? 95/5
Variance
Multiple rounds with different partitions
How many rounds? 1? 2? 100?
Measure of goodness (or rather, the error): Root Mean Square Error
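The split-and-score loop above can be sketched like this (a minimal sketch; the 95/5 default and function names are illustrative):

```python
import random
from math import sqrt

def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def split_ratings(ratings, test_fraction=0.05, seed=None):
    """Shuffle the ratings and split them into a training set and a test set."""
    pool = list(ratings)
    random.Random(seed).shuffle(pool)
    cut = int(len(pool) * (1 - test_fraction))
    return pool[:cut], pool[cut:]
```

Running several rounds with different seeds and averaging the RMSE gives an idea of the variance.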
Case Study: Francesinhas.com
Django project by 1 programmer
Users give ratings to restaurants
0 to 5 stars (0-100 internally)
Challenge: recommend users restaurants they will probably like
User Similarity
normalize
Restaurant Similarity
Allows you to show similar restaurants on a restaurant's page
Recommend (based on user similarity)
(based on restaurant similarity)
restaurant recommendations can be based on user or restaurant similarity
(here: restaurant similarity)
Case Study: Twitter Follow
Recommend users to follow
Users don't have ratings; implied rating: follow (binary)
Recommend users that the people the target user follows also follow (but that the target user doesn't)
this was stuff I presented @ codebits in 2008, before twitter had follow recommendations
(code was rewritten)
Similarity
Scoring
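The scoring idea described on the previous slide (rank candidates by how many of the target's followees also follow them, skipping accounts already followed) can be sketched like this; `follows` maps each user to the set of accounts they follow, and the function name is mine, not the original code's:

```python
def recommend_follows(follows, target, top_n=5):
    """Score candidate accounts by counting how many of the target's
    followees also follow them; exclude the target and anyone the
    target already follows."""
    scores = {}
    for followee in follows.get(target, set()):
        for candidate in follows.get(followee, set()):
            if candidate != target and candidate not in follows[target]:
                scores[candidate] = scores.get(candidate, 0) + 1
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

The count itself acts as both similarity and score: the more of your followees follow someone, the stronger the recommendation.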
A KNN in 1 minute
Calculate the nearest neighbors (similarity)
e.g. the other users with the highest number of equal ratings to the customer
For the k nearest neighbors:
nbp = neighbor base predictor (e.g. avg rating for neighbor)
s += sim * (rating - nbp)
d += sim
prediction = cbp + s/d
(cbp = customer base predictor, e.g. average customer rating)
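The pseudo-code above as a hedged Python sketch (argument names are mine, not the original code's):

```python
def knn_predict(sims, ratings, neighbor_avgs, customer_avg, k=5):
    """Predict a rating: the customer base predictor (cbp) plus a
    similarity-weighted average of how far each neighbor's rating sits
    from that neighbor's own base predictor (nbp).

    sims          -- {neighbor: similarity to the customer}
    ratings       -- {neighbor: neighbor's rating for the item}
    neighbor_avgs -- {neighbor: neighbor's average rating (nbp)}
    customer_avg  -- the customer's average rating (cbp)
    """
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    s = d = 0.0
    for nb in nearest:
        if nb in ratings:
            s += sims[nb] * (ratings[nb] - neighbor_avgs[nb])
            d += sims[nb]
    return (customer_avg + s / d) if d else customer_avg
```

With no usable neighbors it falls back to the customer's base predictor alone.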
Classifying
Assign an item into a category:
An email as spam (document classification)
A set of symptoms to a particular disease
A signature to an individual (biometric identification)
An individual as credit worthy (credit scoring)
An image as a particular letter (Optical Character Recognition)
Common Algorithms
Supervised:
Neural Networks
Support Vector Machines
Genetic Algorithms
Naive Bayes Classifier
Unsupervised: usually done via clustering (clustering hypothesis)
i.e. similar contents => similar classification
Naive Bayes Classifier
I. Train
II. Calculate Probabilities
III. Classify
Case Study: A Spam Filter
The item (document) is an email message
2 Categories: Spam and Ham
What do we need?
fc: {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}
cc: {'ham': 6, 'spam': 6}
Feature Extraction
Input data can be way too large
Think every pixel of an image
It can also be mostly useless
A signature is the same regardless of color (B&W will suffice)
And incredibly redundant (lots of data, little info)
The solution is to transform the input into a smaller representation - a feature vector!
A feature is either present or not
Get Features
Word Vector: features are words (the basis for document classification)
An item (document) is an email message and can:
contain a word (feature is present)
not contain a word (feature is absent)
['date', 'don', 'mortgage', 'taint', 'you', 'how', 'delay', ...]
Other ideas: use capitalization, stemming, tf-idf
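A minimal word-vector extractor along these lines (the regex and the length limits are one common choice, not necessarily the talk's):

```python
import re

def get_words(doc):
    """Extract a set of word features: lowercased runs of letters,
    keeping only words between 3 and 19 characters long."""
    words = re.findall(r"[a-z]+", doc.lower())
    return {w for w in words if 2 < len(w) < 20}
```

Returning a set makes each feature simply present or absent, as the slide describes.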
I. Training
For every training example (item, category):
1. Extract the item's features
2. For each feature: increment the count for this (feature, category) pair
3. Increment the category count (+1 example)
fc: {'feature': {'category': count, ...}, ...}
cc: {'category': count, ...}
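The three steps above as a sketch; `fc` and `cc` match the dictionaries on the spam-filter slide, while the function name is illustrative:

```python
def train(examples):
    """Build feature-per-category counts (fc) and category counts (cc)
    from (features, category) training examples."""
    fc, cc = {}, {}
    for features, category in examples:
        for f in features:  # step 2: count this (feature, category) pair
            fc.setdefault(f, {}).setdefault(category, 0)
            fc[f][category] += 1
        cc[category] = cc.get(category, 0) + 1  # step 3: one more example
    return fc, cc
```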
II. Probabilities
P(word | category): the probability that a word is in a particular category (classification)
P(w | c) = P(w ∧ c) / P(c)
Assumed Probability
a weight of 1 means the assumed probability is weighted the same as one word
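A sketch of the conditional probability plus the assumed-probability smoothing described above: an initial guess of 0.5, weighted like a single observed word so rare words don't swing the classifier (function names are illustrative):

```python
def fprob(fc, cc, f, cat):
    """P(f | cat): times the feature appeared in the category,
    divided by the number of documents in the category."""
    if not cc.get(cat):
        return 0.0
    return fc.get(f, {}).get(cat, 0) / cc[cat]

def weighted_prob(fc, cc, f, cat, weight=1.0, assumed=0.5):
    """Blend the observed probability with an assumed prior;
    weight=1 counts the prior as much as one observed occurrence."""
    basic = fprob(fc, cc, f, cat)
    total = sum(fc.get(f, {}).values())  # occurrences across all categories
    return (weight * assumed + total * basic) / (weight + total)
```

An unseen word thus gets probability 0.5 in every category instead of zeroing out the whole product.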
P(Document | Category): probability that a given doc belongs in a particular category
= P(w1 | c) P(w2 | c) ... P(wn | c) for every word in the document
Yeah, that's nice... but what we want is P(Category | Document)!
*note: Decimal vs float - multiplying many small probabilities can underflow a float
III. Bayes Theorem
P(c | d) = P(d | c) x P(c) / P(d)
P(d | c) = P(w1 | c) P(w2 | c) ... P(wn | c)
P(d) can be ignored (it is the same for every category)
If you're thinking of filtering spam, go with Akismet
If you really want to do your own Bayesian spam filter, a good start is Wikipedia
Training datasets are available online - for spam and pretty much everything else
http://en.wikipedia.org/wiki/Bayesian_spam_filter
http://akismet.com/
http://spamassassin.apache.org/publiccorpus/
Clustering
Find structure in datasets:
Groups of things, people, concepts
Unsupervised (i.e. there is no training)
Common algorithms:
Hierarchical clustering
K-means
Non-Negative Matrix Approximation
e.g. A, B, C, D, F, G, I, J -> clusters {A, C}, {B, D, G}, {F}, {I, J}
Non-Negative Matrix Approximation (or Factorization)
I. Get the data (in matrix form!)
II. Factorize the matrix
III. Present the results
yeah, the matrix is kind of magic
Case Study: News Clustering
I. The Data
A matrix of items (articles) x properties (words); each value is the word's frequency in that article.

word vector:    [sapo, codebits, haiti, iraq, ...]
article vector: [A, B, C, D, ...]

[[7, 8, 1, 10, ...],
 [2, 0, 16, 1, ...],
 [22, 3, 0, 0, ...],
 [9, 12, 5, 4, ...],
 ...]

Article D contains the word iraq 4 times
II. Factorize
data matrix = features matrix x weights matrix

[[7, 8],     [[1, 0],     [[23, 24],
 [2, 0]]  x   [2, 3]]  =   [2, 0]]

features matrix: the importance of each word to a feature
weights matrix: how much each feature applies to an article
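The talk used py_nnma for this step (next slide). As a rough stand-in, here is a NumPy sketch using Lee & Seung's multiplicative update rules; it follows one common convention, data ≈ weights x features, and is an assumption of mine, not the talk's actual code:

```python
import numpy as np

def factorize(data, k, iterations=100):
    """Approximate data (articles x words) as weights (articles x k)
    times features (k x words), all entries non-negative, using
    Lee & Seung's multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = data.shape
    weights = rng.random((n, k))
    features = rng.random((k, m))
    eps = 1e-9  # avoid division by zero
    for _ in range(iterations):
        # update each factor while holding the other fixed
        features *= (weights.T @ data) / (weights.T @ weights @ features + eps)
        weights *= (data @ features.T) / (weights @ features @ features.T + eps)
    return weights, features
```

The top entries in each row of `features` give a feature's most important words; the top entries in each column of `weights` give the articles where that feature applies most.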
py_nnma: http://public.procoders.net/nnma/
k: the number of features to find (i.e. the number of clusters)
III. The Results
For every feature:
Display the top X words (from the features matrix)
Display the top Y articles for this feature (from the weights matrix)
['adobe', 'flash', 'platform', 'acrobat', 'software', 'reader']
(0.0014202284481846406, u"Apple, Adobe, and Openness: Let's Get Real")
(0.00049914481067248734, u'Piggybacking on Adobe Acrobat and others')
(0.00047202214371591086, u'CVE-2010-3654 - New dangerous 0-day authplay library adobe products')

['macbook', 'hard', 'only', 'much', 'drive', 'screen']
(0.0017976618817123543, u'The new MacBook Air')
(0.00067015549607138966, u'Revisiting Solid State Hard Drives')
(0.00035732495413261966, u"The new MacBook Air's SSD performance")

['apps', 'mobile', 'business', 'other', 'good', 'application']
(0.0013598162030796167, u'Which mobile apps are making good money?')
(0.00054549656743046277, u'An open enhancement request to the Mobile Safari team for sane bookmarklet installation or alternatives')
(0.00040802131970223176, u'Google Apps highlights \u2013 10/29/2010')

['quot', 'strike', 'operations', 'forces', 'some', 'afghan']
(0.002464522414843272, u'Kandahar diary: Watching conventional forces conduct a successful COIN')
(0.00027058999725999285, u'How universities can help in our wars - By Tom Ricks')
(0.00026940637538539202, u'This Weekend\u2019s News: Afghanistan\u2019s Long-Term Stability')

note: this was created using an OPML file exported from my Google Reader (260 subscriptions)
Food for the Brain
Machine Learning
Neural Networks: A Comprehensive Foundation
Programming Collective Intelligence: Building Smart Web 2.0 Applications (Toby Segaran)
Data Mining: Practical MachineLearning Tools and Techniques