The Magical Art of Extracting Meaning From Data


Transcript of The Magical Art of Extracting Meaning From Data

Luis Rei
[email protected]
http://luisrei.com

Data Mining For The Web
Outline

Introduction

Recommender Systems

Classification

Clustering

The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all. (W.H. Auden)

The key in business is to know something that nobody else knows. (Aristotle Onassis)

DATA: Luis Rei; 25; codebits; 4; a photo; http://luisrei.com/

MEANING: NAME/PERSON = Luis Rei; AGE = 25; PHOTO; WEBSITE = http://luisrei.com/
Tools

Python vs C or C++

feedparser, BeautifulSoup (scrape web pages)

NumPy, SciPy

Weka

R

Libraries: http://mloss.org/software/
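A quick taste of two of those tools - a minimal sketch (the feed URL is just a placeholder) that pulls a feed with feedparser and strips the HTML from each entry with BeautifulSoup:

    import feedparser
    from bs4 import BeautifulSoup

    # parse an RSS/Atom feed (placeholder URL)
    feed = feedparser.parse("http://example.com/rss")
    for entry in feed.entries:
        # strip HTML tags from the summary to get plain text
        text = BeautifulSoup(entry.get("summary", ""), "html.parser").get_text()
        print(entry.get("title", ""), "->", text[:80])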

Down The Rabbit Hole

In 2006, Google's search crawler used 850TB of data. Total web history is around 3PB.

Think of all the audio, photos & videos.

That's a lot of data.

Open formats (HTML, RSS, PDF, ...)

Everyone + their dog has an API: facebook, twitter, flickr, last.fm, delicious, digg, gowalla, ...

Think about:

news articles published every day

status updates / day


    Recommendations

The Netflix Prize

In October 2006, Netflix launched an open competition for the best collaborative filtering algorithm:

at least 10% improvement over Netflix's own algorithm

Predict user ratings for films based on previous ratings (by all users)

US$1,000,000 prize, won in Sep 2009

The Three Acts

I: The Pledge

The magician shows you something ordinary. But of course... it probably isn't.

II: The Turn

The magician takes the ordinary something and makes it do something extraordinary. Now you're looking for the secret...

III: The Prestige

But you wouldn't clap yet. Because making something disappear isn't enough; you have to bring it back.

Collaborative Filtering

I. Collect Preferences

II. Find Similar Users or Items

III. Recommend

I. Collecting Preferences

yes/no votes

Ratings in stars

Purchase history

Who you follow / who's your friend

The music you listen to or the movies you watch

Comments (Bad, Great, Lousy, ...)

II. Similarity

Euclidean Distance: d(a, b) = sqrt(sum_i (a_i - b_i)^2)

Pearson Correlation

Olsen Twins - notice the similarity!

> 0.0 (positive correlation), < 1.0 (not equal)

Same eyes, nose, ... Different hair color, dress, earrings, ...
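Both measures as a minimal sketch over two equal-length rating vectors (plain Python; squashing the Euclidean distance into (0, 1] with 1/(1+d) is one common choice, not the only one):

    from math import sqrt

    def euclidean_similarity(a, b):
        # distance between rating vectors, squashed into (0, 1]
        d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        return 1.0 / (1.0 + d)

    def pearson_correlation(a, b):
        # 1 = identical tastes, 0 = unrelated, -1 = opposite
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        if var_a == 0 or var_b == 0:
            return 0.0
        return cov / sqrt(var_a * var_b)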


    III. Recommend


Users Vs Items

Find similar items instead of similar users! Same recommendation process: just switch users with items & vice versa (conceptually).

Why?

Works for new users

Might be more accurate (might not)

It can be useful to have both

Cross-Validation

How good are the recommendations?

Partitioning the data: training set vs test set

Size of the sets? 95/5

Variance: multiple rounds with different partitions

How many rounds? 1? 2? 100?

Measure of goodness (or rather, of the error): Root Mean Square Error (RMSE)
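One evaluation round as a minimal sketch (hold out a random 5% of the ratings, predict them, report RMSE; `predict` is a stand-in for whatever recommender is under test):

    import random
    from math import sqrt

    def rmse(predicted, actual):
        # root mean square error over paired predicted/actual ratings
        return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    def one_round(ratings, predict, test_fraction=0.05):
        # ratings: list of (user, item, rating) triples
        shuffled = ratings[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        # ... fit the recommender on `train` here ...
        return rmse([predict(u, i) for u, i, _ in test], [r for _, _, r in test])

Run several rounds with different random partitions and look at the mean and variance of the error.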

Case Study: Francesinhas.com

Django project by 1 programmer

Users give ratings to restaurants: 0 to 5 stars (0-100 internally)

Challenge: recommend users restaurants they will probably like


    User Similarity

    normalize
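The slide showed the site's own code; as a stand-in, a sketch of user similarity over the 0-100 internal ratings, normalized to 0..1 before comparing (the shape of the `ratings` dict is an assumption for illustration):

    from math import sqrt

    def user_similarity(ratings, user_a, user_b):
        # ratings: {user: {restaurant: score 0..100}} -- assumed shape
        shared = set(ratings[user_a]) & set(ratings[user_b])
        if not shared:
            return 0.0
        # normalize to 0..1, then squash Euclidean distance into (0, 1]
        d = sqrt(sum((ratings[user_a][r] / 100.0 - ratings[user_b][r] / 100.0) ** 2
                     for r in shared))
        return 1.0 / (1.0 + d)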

Restaurant Similarity

Allows you to show similar restaurants on a restaurant's page

Recommend (based on user similarity)

Recommend (based on restaurant similarity)

Restaurant recommendations can be based on user or restaurant similarity (here, restaurant similarity).

Case Study: Twitter Follow

Recommend users to follow

Users don't have ratings; implied rating: follow (binary)

Recommend users that the people the target user follows also follow (but that the target user doesn't)

this was stuff I presented @codebits in 2008, before twitter had follow recommendations (code was rewritten)


    Similarity

    Scoring
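The original slide showed code for these two steps; the idea as a sketch (the `follows` dict shape is assumed): score each candidate by how many of the target's friends also follow them.

    def follow_recommendations(follows, target, top_n=10):
        # follows: {user: set of users they follow} -- assumed shape
        scores = {}
        for friend in follows.get(target, set()):
            for candidate in follows.get(friend, set()):
                if candidate != target and candidate not in follows[target]:
                    # one point per friend of the target who follows the candidate
                    scores[candidate] = scores.get(candidate, 0) + 1
        return sorted(scores, key=scores.get, reverse=True)[:top_n]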

A KNN in 1 minute

Calculate the nearest neighbors (similarity)

e.g. the other users with the highest number of equal ratings to the customer

For the k nearest neighbors:

neighbor base predictor, nbp (e.g. avg rating for neighbor)

s += sim * (rating - nbp)

d += sim

prediction = cbp + s/d (cbp = customer base predictor, e.g. average customer rating)
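The same accumulation as runnable Python (a sketch; `neighbors` holds precomputed (similarity, neighbor rating, neighbor base predictor) triples for the item in question):

    def knn_predict(cbp, neighbors):
        # cbp: customer base predictor, e.g. the customer's average rating
        # neighbors: list of (sim, rating, nbp) for the k nearest neighbors
        s = sum(sim * (rating - nbp) for sim, rating, nbp in neighbors)
        d = sum(sim for sim, _, _ in neighbors)
        return cbp if d == 0 else cbp + s / d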

Classifying

Assign an item to a category:

An email as spam (document classification)

A set of symptoms to a particular disease

A signature to an individual (biometric identification)

An individual as creditworthy (credit scoring)

An image as a particular letter (Optical Character Recognition)

Common Algorithms

Supervised:

Neural Networks

Support Vector Machines

Genetic Algorithms

Naive Bayes Classifier

Unsupervised: usually done via clustering (the clustering hypothesis, i.e. similar contents => similar classification)

Naive Bayes Classifier

I. Train

II. Calculate Probabilities

III. Classify

Case Study: A Spam Filter

The item (document) is an email message

2 categories: Spam and Ham

What do we need?

fc (feature counts): {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}

cc (category counts): {'ham': 6, 'spam': 6}

Feature Extraction

Input data can be way too large: think every pixel of an image

It can also be mostly useless: a signature is the same regardless of color (B&W will suffice)

And incredibly redundant (lots of data, little info)

The solution is to transform the input into a smaller representation - a feature vector!

A feature is either present or not

Get Features

Word Vector: features are words (basic for doc classification)

An item (document) is an email message and can:

contain a word (feature is present)

not contain a word (feature is absent)

['date', 'don', 'mortgage', 'taint', 'you', 'how', 'delay', ...]

Other ideas: use capitalization, stemming, tf-idf
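A minimal word-vector extractor in that spirit (one common recipe, not the talk's exact code): split on non-letters, lowercase, keep unique mid-length words.

    import re

    def get_features(document):
        # unique lowercase words, ignoring very short and very long tokens
        words = re.split(r"[^a-zA-Z]+", document)
        return {w.lower() for w in words if 2 < len(w) < 20}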

I. Training

For every training example (item, category):

1. Extract the item's features

2. For each feature: increment the count for this (feature, category) pair

3. Increment the category count (+1 example)

fc: {'feature': {'category': count, ...}, ...}

cc: {'category': count, ...}
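Those three steps as a sketch, reusing the get_features extractor above:

    fc = {}  # {'feature': {'category': count}}
    cc = {}  # {'category': count}

    def train(document, category):
        for feature in get_features(document):
            fc.setdefault(feature, {}).setdefault(category, 0)
            fc[feature][category] += 1
        cc[category] = cc.get(category, 0) + 1

    # e.g. train("cheap mortgage rates inside", "spam")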

II. Probabilities

P(word | category): the probability that a word appears in a particular category (classification)

P(w | c) = fc[w][c] / cc[c] (count of category-c examples containing w, over the number of category-c examples)

Assumed Probability: start from an assumed value so rarely-seen words don't get extreme probabilities;

a weight of 1 means the assumed probability is weighted the same as one word
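Both probabilities as a sketch (with weight=1 and an assumed probability of 0.5, a word seen once lands halfway between the assumption and the observed frequency):

    def fprob(w, c):
        # fraction of category-c training examples containing word w
        if cc.get(c, 0) == 0:
            return 0.0
        return fc.get(w, {}).get(c, 0) / cc[c]

    def weighted_prob(w, c, weight=1.0, assumed=0.5):
        basic = fprob(w, c)
        total = sum(fc.get(w, {}).values())  # occurrences of w across all categories
        return (weight * assumed + total * basic) / (weight + total)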

P(Document | Category): probability that a given doc belongs in a particular category

= P(w1 | c) x P(w2 | c) x ... x P(wn | c), for every word in the document

Yeah that's nice... but what we want is P(Category | Document)!

*note: Decimal vs float - multiplying many probabilities below 1 underflows floats; Decimal (or summing logs) avoids that

III. Bayes' Theorem

P(c | d) = P(d | c) x P(c) / P(d)

P(d | c) = P(w1 | c) x P(w2 | c) x ... x P(wn | c)

P(d) is the same for every category, so it can be ignored when comparing
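Putting the pieces together, a sketch of the classifier (Decimal keeps the long product from underflowing, per the note above):

    from decimal import Decimal

    def classify(document):
        features = get_features(document)
        total = sum(cc.values())
        best, best_score = None, Decimal(0)
        for c in cc:
            score = Decimal(cc[c]) / Decimal(total)  # P(c)
            for w in features:
                score *= Decimal(str(weighted_prob(w, c)))  # P(w | c)
            if score > best_score:  # P(d) omitted: same for every category
                best, best_score = c, score
        return best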


If you're thinking of filtering spam, go with Akismet: http://akismet.com/

If you really want to do your own bayesian spam filter, a good start is Wikipedia: http://en.wikipedia.org/wiki/Bayesian_spam_filter

Training datasets are available online - for spam and pretty much everything else: http://spamassassin.apache.org/publiccorpus/
Clustering

Find structure in datasets: groups of things, people, concepts

Unsupervised (i.e. there is no training)

Common algorithms:

Hierarchical clustering

K-means

Non-Negative Matrix Approximation

(diagram: items A, B, C, D, F, G, I, J grouped into clusters, e.g. {A, C}, {B, D, G}, {F}, {I, J})
Non-Negative Matrix Approximation (or Factorization)

I. Get the data - in matrix form!

II. Factorize the matrix

III. Present the results

yeah the matrix is kind of magic


    Case Study: News Clustering


I. The Data

A matrix of word frequencies per article:

one row per article: [A, B, C, D, ...] (item = article)

one column per word: [sapo, codebits, haiti, iraq, ...] (property = word)

value = word frequency in that article

[[7, 8, 1, 10, ...],
[2, 0, 16, 1, ...],
[22, 3, 0, 0, ...],
[9, 12, 5, 4, ...],
...]

e.g. article D (last row shown) contains the word iraq 4 times
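Building that matrix from tokenized articles, as a sketch (plain lists here; NumPy arrays in practice):

    def make_matrix(articles, vocabulary):
        # articles: {name: list of words}; vocabulary: ordered list of words
        names = sorted(articles)
        return [[articles[name].count(w) for w in vocabulary] for name in names]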


II. Factorize

data matrix = features matrix x weights matrix

e.g. features [[7, 8], [2, 0]] x weights [[1, 0], [2, 3]] = data [[23, 24], [2, 0]]

features matrix (word, feature): importance of the word to the feature

weights matrix (feature, article): how much the feature applies to the article

py_nnma: http://public.procoders.net/nnma/

k - the number of features to find (i.e. number of clusters)

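py_nnma is what the talk used; as a stand-in, scikit-learn's NMF does the same factorization today (note scikit-learn's convention: data ~ weights x features):

    import numpy as np
    from sklearn.decomposition import NMF

    # articles x words matrix from the earlier slide
    data = np.array([[7, 8, 1, 10],
                     [2, 0, 16, 1],
                     [22, 3, 0, 0],
                     [9, 12, 5, 4]], dtype=float)

    model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
    weights = model.fit_transform(data)   # articles x features
    features = model.components_          # features x words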
III. The Results

For every feature:

Display the top X words (from the features matrix)

Display the top Y articles for this feature (from the weights matrix)
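Continuing the sketch above (word and article names are the slide's examples):

    words = ["sapo", "codebits", "haiti", "iraq"]
    articles = ["A", "B", "C", "D"]
    for f in range(features.shape[0]):
        top_words = [words[i] for i in features[f].argsort()[::-1][:3]]
        top_articles = [articles[i] for i in weights[:, f].argsort()[::-1][:2]]
        print("feature", f, "words:", top_words, "articles:", top_articles)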


['adobe', 'flash', 'platform', 'acrobat', 'software', 'reader']
(0.0014202284481846406, u"Apple, Adobe, and Openness: Let's Get Real")
(0.00049914481067248734, u'Piggybacking on Adobe Acrobat and others')
(0.00047202214371591086, u'CVE-2010-3654 - New dangerous 0-day authplay library adobe products')

['macbook', 'hard', 'only', 'much', 'drive', 'screen']
(0.0017976618817123543, u'The new MacBook Air')
(0.00067015549607138966, u'Revisiting Solid State Hard Drives')
(0.00035732495413261966, u"The new MacBook Air's SSD performance")

['apps', 'mobile', 'business', 'other', 'good', 'application']
(0.0013598162030796167, u'Which mobile apps are making good money?')
(0.00054549656743046277, u'An open enhancement request to the Mobile Safari team for sane bookmarklet installation or alternatives')
(0.00040802131970223176, u'Google Apps highlights \u2013 10/29/2010')

['quot', 'strike', 'operations', 'forces', 'some', 'afghan']
(0.002464522414843272, u'Kandahar diary: Watching conventional forces conduct a successful COIN')
(0.00027058999725999285, u'How universities can help in our wars - By Tom Ricks')
(0.00026940637538539202, u'This Weekend\u2019s News: Afghanistan\u2019s Long-Term Stability')

note: this was created using an OPML file exported from my Google Reader (260 subscriptions)

Food for the Brain

Machine Learning

Neural Networks: A Comprehensive Foundation

Programming Collective Intelligence: Building Smart Web 2.0 Applications (Toby Segaran)

Data Mining: Practical Machine Learning Tools and Techniques