The Magical Art of Extracting Meaning From Data
Transcript of The Magical Art of Extracting Meaning From Data
8/3/2019 The Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From Data
Data Mining For The Web
Luis Rei
[email protected]
http://luisrei.com/
Outline
Introduction
Recommender Systems
Classification
Clustering
The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all. (W.H. Auden)
The key in business is to know something that nobody else knows. (Aristotle Onassis)
DATA
Luis Rei, 25, codebits, 4

MEANING
NAME, PERSON: Luis Rei
AGE: 25
PHOTO
WEBSITE: http://luisrei.com/
Tools
Python vs C or C++
feedparser, BeautifulSoup (scrape web pages)
NumPy, SciPy
Weka
R
Libraries
http://mloss.org/software/
Down The Rabbit Hole
In 2006, Google's search crawler used 850TB of data. Total web history is around 3PB
Think of all the audio, photos & videos
That's a lot of data
Open formats (HTML, RSS, PDF, ...)
Everyone + their dog has an API
facebook, twitter, flickr, last.fm, delicious, digg, gowalla, ...
Think about:
news articles published every day
status updates / day
Recommendations
The Netflix Prize
In October 2006 Netflix launched an open competition for the bestcollaborative filtering algorithm
at least 10% improvement over Netflix's own algorithm
Predict user ratings for films based on previous ratings (by all users)
US$1,000,000 prize won in Sep 2009
The Three Acts
I: The Pledge
The magician shows you something ordinary. But of course... it probably isn't.
II: The Turn
The magician takes the ordinary something and makes it do something extraordinary. Now you're looking for the secret...
III: The Prestige
But you wouldn't clap yet. Because making something disappear isn't enough; you have to bring it back.
Collaborative Filtering
I. Collect Preferences
II. Find Similar Users
or Items
III. Recommend
I. Collecting Preferences
yes/no votes
Ratings in stars
Purchase history
Who you follow / who's your friend
The music you listen to or the movies you watch
Comments (Bad, Great, Lousy, ...)
II. Similarity
Euclidean Distance: sum over shared items of (a - b)²
Pearson Correlation
Olsen Twins - notice the similarity!
> 0.0 (positive correlation), < 1.0 (not equal)
Same eyes, nose, ... Different hair color, dress, earrings, ...
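The two measures can be sketched in Python over a `prefs` dict mapping each user to their item ratings (the function names and data layout are illustrative, not the talk's actual code):

```python
from math import sqrt

def euclidean_similarity(prefs, a, b):
    """Similarity in (0, 1] derived from Euclidean distance over shared ratings."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist_sq = sum((prefs[a][it] - prefs[b][it]) ** 2 for it in shared)
    return 1.0 / (1.0 + sqrt(dist_sq))

def pearson_similarity(prefs, a, b):
    """Pearson correlation in [-1, 1] over shared ratings."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][it] for it in shared]
    ys = [prefs[b][it] for it in shared]
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = sum_xy - (sum_x * sum_y / n)
    den = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den if den else 0.0
```

Euclidean distance is mapped into (0, 1] so that, like the correlation, bigger means more similar.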
III. Recommend
Users Vs Items
Find similar items instead of similar users! Same recommendation process:
just switch users with items & vice versa (conceptually)
Why?
Works for new users
Might be more accurate (might not)
It can be useful to have both
Cross-Validation
How good are the recommendations?
Partitioning the data: Training set vs Test set
Size of the sets? 95/5
Variance
Multiple rounds with different partitions
How many rounds? 1? 2? 100?
Measure of goodness (or rather, the error): Root Mean Square Error
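The split-and-score loop above can be sketched like this (a minimal sketch; the 95/5 default and function names are illustrative):

```python
import random
from math import sqrt

def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def split_ratings(ratings, test_fraction=0.05, seed=None):
    """Shuffle the ratings and split them into a training set and a test set."""
    pool = list(ratings)
    random.Random(seed).shuffle(pool)
    cut = int(len(pool) * (1 - test_fraction))
    return pool[:cut], pool[cut:]
```

Running several rounds with different seeds and averaging the RMSE gives an idea of the variance.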
Case Study: Francesinhas.com
Django project by 1 programmer
Users give ratings to restaurants
0 to 5 stars (0-100 internally)
Challenge: recommend users restaurants they will probably like
User Similarity
normalize
Restaurant Similarity
Allows you to show similar restaurants on a restaurant's page
Recommend (based on user similarity)
(based on restaurant similarity)
restaurant recommendations can be based on user or restaurant similarity
(here: restaurant similarity)
Case Study: Twitter Follow
Recommend users to follow
Users don't have ratings; implied rating: follow (binary)
Recommend users that the people the target user follows also follow (but that the target user doesn't)
this was stuff I presented @ codebits in 2008, before twitter had follow recommendations
(code was rewritten)
Similarity
Scoring
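The scoring idea described on the previous slide (rank candidates by how many of the target's followees also follow them, skipping accounts already followed) can be sketched like this; `follows` maps each user to the set of accounts they follow, and the function name is mine, not the original code's:

```python
def recommend_follows(follows, target, top_n=5):
    """Score candidate accounts by counting how many of the target's
    followees also follow them; exclude the target and anyone the
    target already follows."""
    scores = {}
    for followee in follows.get(target, set()):
        for candidate in follows.get(followee, set()):
            if candidate != target and candidate not in follows[target]:
                scores[candidate] = scores.get(candidate, 0) + 1
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

The count itself acts as both similarity and score: the more of your followees follow someone, the stronger the recommendation.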
A KNN in 1 minute
Calculate the nearest neighbors (similarity)
e.g. the other users with the highest number of equal ratings to the customer
For the k nearest neighbors:
nbp = neighbor base predictor (e.g. avg rating for neighbor)
s += sim * (rating - nbp)
d += sim
prediction = cbp + s/d
(cbp = customer base predictor, e.g. average customer rating)
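The pseudo-code above as a hedged Python sketch (argument names are mine, not the original code's):

```python
def knn_predict(sims, ratings, neighbor_avgs, customer_avg, k=5):
    """Predict a rating: the customer base predictor (cbp) plus a
    similarity-weighted average of how far each neighbor's rating sits
    from that neighbor's own base predictor (nbp).

    sims          -- {neighbor: similarity to the customer}
    ratings       -- {neighbor: neighbor's rating for the item}
    neighbor_avgs -- {neighbor: neighbor's average rating (nbp)}
    customer_avg  -- the customer's average rating (cbp)
    """
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    s = d = 0.0
    for nb in nearest:
        if nb in ratings:
            s += sims[nb] * (ratings[nb] - neighbor_avgs[nb])
            d += sims[nb]
    return (customer_avg + s / d) if d else customer_avg
```

With no usable neighbors it falls back to the customer's base predictor alone.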
Classifying
Assign an item into a category:
An email as spam (document classification)
A set of symptoms to a particular disease
A signature to an individual (biometric identification)
An individual as credit worthy (credit scoring)
An image as a particular letter (Optical Character Recognition)
Common Algorithms
Supervised:
Neural Networks
Support Vector Machines
Genetic Algorithms
Naive Bayes Classifier
Unsupervised: usually done via clustering (clustering hypothesis)
i.e. similar contents => similar classification
Naive Bayes Classifier
I. Train
II. Calculate Probabilities
III. Classify
Case Study: A Spam Filter
The item (document) is an email message
2 Categories: Spam and Ham
What do we need?
fc: {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}
cc: {'ham': 6, 'spam': 6}
Feature Extraction
Input data can be way too large
Think every pixel of an image
It can also be mostly useless
A signature is the same regardless of color (B&W will suffice)
And incredibly redundant (lots of data, little info)
The solution is to transform the input into a smaller representation - a feature vector!
A feature is either present or not
Get Features
Word Vector: features are words (the basis for document classification)
An item (document) is an email message and can:
contain a word (feature is present)
not contain a word (feature is absent)
['date', 'don', 'mortgage', 'taint', 'you', 'how', 'delay', ...]
Other ideas: use capitalization, stemming, tf-idf
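A minimal word-vector extractor along these lines (the regex and the length limits are one common choice, not necessarily the talk's):

```python
import re

def get_words(doc):
    """Extract a set of word features: lowercased runs of letters,
    keeping only words between 3 and 19 characters long."""
    words = re.findall(r"[a-z]+", doc.lower())
    return {w for w in words if 2 < len(w) < 20}
```

Returning a set makes each feature simply present or absent, as the slide describes.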
I. Training
For every training example (item, category):
1. Extract the item's features
2. For each feature: increment the count for this (feature, category) pair
3. Increment the category count (+1 example)
fc: {'feature': {'category': count, ...}, ...}
cc: {'category': count, ...}
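The three steps above as a sketch; `fc` and `cc` match the dictionaries on the spam-filter slide, while the function name is illustrative:

```python
def train(examples):
    """Build feature-per-category counts (fc) and category counts (cc)
    from (features, category) training examples."""
    fc, cc = {}, {}
    for features, category in examples:
        for f in features:  # step 2: count this (feature, category) pair
            fc.setdefault(f, {}).setdefault(category, 0)
            fc[f][category] += 1
        cc[category] = cc.get(category, 0) + 1  # step 3: one more example
    return fc, cc
```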
II. Probabilities
P(word | category): the probability that a word is in a particular category (classification)
P(w | c) = P(w ∧ c) / P(c)
Assumed Probability
a weight of 1 means the assumed probability is weighted the same as one word
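A sketch of the conditional probability plus the assumed-probability smoothing described above: an initial guess of 0.5, weighted like a single observed word so rare words don't swing the classifier (function names are illustrative):

```python
def fprob(fc, cc, f, cat):
    """P(f | cat): times the feature appeared in the category,
    divided by the number of documents in the category."""
    if not cc.get(cat):
        return 0.0
    return fc.get(f, {}).get(cat, 0) / cc[cat]

def weighted_prob(fc, cc, f, cat, weight=1.0, assumed=0.5):
    """Blend the observed probability with an assumed prior;
    weight=1 counts the prior as much as one observed occurrence."""
    basic = fprob(fc, cc, f, cat)
    total = sum(fc.get(f, {}).values())  # occurrences across all categories
    return (weight * assumed + total * basic) / (weight + total)
```

An unseen word thus gets probability 0.5 in every category instead of zeroing out the whole product.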
P(Document | Category): probability that a given doc belongs in a particular category
= P(w1 | c) P(w2 | c) ... P(wn | c) for every word in the document
Yeah, that's nice... but what we want is P(Category | Document)!
*note: Decimal vs float - multiplying many small probabilities can underflow a float
III. Bayes Theorem
P(c | d) = P(d | c) x P(c) / P(d)
P(d | c) = P(w1 | c) P(w2 | c) ... P(wn | c)
P(d) can be ignored (it is the same for every category)
If you're thinking of filtering spam, go with Akismet
If you really want to do your own Bayesian spam filter, a good start is Wikipedia
Training datasets are available online - for spam and pretty much everything else
http://en.wikipedia.org/wiki/Bayesian_spam_filter
http://akismet.com/
http://spamassassin.apache.org/publiccorpus/
Clustering
Find structure in datasets:
Groups of things, people, concepts
Unsupervised (i.e. there is no training)
Common algorithms:
Hierarchical clustering
K-means
Non-Negative Matrix Approximation
e.g. A, B, C, D, F, G, I, J -> clusters {A, C}, {B, D, G}, {F}, {I, J}
Non-Negative Matrix Approximation (or Factorization)
I. Get the data (in matrix form!)
II. Factorize the matrix
III. Present the results
yeah, the matrix is kind of magic
Case Study: News Clustering
I. The Data
A matrix of items (articles) x properties (words); each value is the word's frequency in that article.

word vector:    [sapo, codebits, haiti, iraq, ...]
article vector: [A, B, C, D, ...]

[[7, 8, 1, 10, ...],
 [2, 0, 16, 1, ...],
 [22, 3, 0, 0, ...],
 [9, 12, 5, 4, ...],
 ...]

Article D contains the word iraq 4 times
II. Factorize
data matrix = features matrix x weights matrix

[[7, 8],     [[1, 0],     [[23, 24],
 [2, 0]]  x   [2, 3]]  =   [2, 0]]

features matrix: the importance of each word to a feature
weights matrix: how much each feature applies to an article
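The talk used py_nnma for this step (next slide). As a rough stand-in, here is a NumPy sketch using Lee & Seung's multiplicative update rules; it follows one common convention, data ≈ weights x features, and is an assumption of mine, not the talk's actual code:

```python
import numpy as np

def factorize(data, k, iterations=100):
    """Approximate data (articles x words) as weights (articles x k)
    times features (k x words), all entries non-negative, using
    Lee & Seung's multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = data.shape
    weights = rng.random((n, k))
    features = rng.random((k, m))
    eps = 1e-9  # avoid division by zero
    for _ in range(iterations):
        # update each factor while holding the other fixed
        features *= (weights.T @ data) / (weights.T @ weights @ features + eps)
        weights *= (data @ features.T) / (weights @ features @ features.T + eps)
    return weights, features
```

The top entries in each row of `features` give a feature's most important words; the top entries in each column of `weights` give the articles where that feature applies most.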
py_nnma: http://public.procoders.net/nnma/
k: the number of features to find (i.e. the number of clusters)
III. The Results
For every feature:
Display the top X words (from the features matrix)
Display the top Y articles for this feature (from the weights matrix)
['adobe', 'flash', 'platform', 'acrobat', 'software', 'reader']
(0.0014202284481846406, u"Apple, Adobe, and Openness: Let's Get Real")
(0.00049914481067248734, u'Piggybacking on Adobe Acrobat and others')
(0.00047202214371591086, u'CVE-2010-3654 - New dangerous 0-day authplay library adobe products')

['macbook', 'hard', 'only', 'much', 'drive', 'screen']
(0.0017976618817123543, u'The new MacBook Air')
(0.00067015549607138966, u'Revisiting Solid State Hard Drives')
(0.00035732495413261966, u"The new MacBook Air's SSD performance")

['apps', 'mobile', 'business', 'other', 'good', 'application']
(0.0013598162030796167, u'Which mobile apps are making good money?')
(0.00054549656743046277, u'An open enhancement request to the Mobile Safari team for sane bookmarklet installation or alternatives')
(0.00040802131970223176, u'Google Apps highlights \u2013 10/29/2010')

['quot', 'strike', 'operations', 'forces', 'some', 'afghan']
(0.002464522414843272, u'Kandahar diary: Watching conventional forces conduct a successful COIN')
(0.00027058999725999285, u'How universities can help in our wars - By Tom Ricks')
(0.00026940637538539202, u'This Weekend\u2019s News: Afghanistan\u2019s Long-Term Stability')

note: this was created using an OPML file exported from my Google Reader (260 subscriptions)
Food for the Brain
Machine Learning
Neural Networks: A Comprehensive Foundation
Programming Collective Intelligence: Building Smart Web 2.0 Applications (Toby Segaran)
Data Mining: Practical MachineLearning Tools and Techniques