FSharp dojo: Ham or Spam

An introduction to the Naive Bayes machine learning classifier, using F#.

© Mathias Brandewinder, 2013. Use freely, attributions appreciated

Ham or Spam?

The goal for tonight

»Take a classic Machine Learning problem
»Write some code and have fun
»Write a classifier, from scratch, using F#
»Learn some Machine Learning concepts

Imagine 20% of your email is Spam…

… your default guess should be Ham
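
A minimal sketch of that default guess, assuming only the 20% figure from the slide (the names `pSpam`, `pHam` and `defaultGuess` are made up for illustration):

```fsharp
// Proportion of Spam taken from the slide; everything else is derived from it.
let pSpam = 0.2
let pHam = 1.0 - pSpam

// With no other information, guess the more frequent class.
let defaultGuess = if pSpam > pHam then "Spam" else "Ham"

// Always answering with the default guess is right this often.
let accuracy = max pSpam pHam

printfn "Default guess: %s (right %.0f%% of the time)" defaultGuess (accuracy * 100.0)
```

Any classifier worth building has to beat this baseline.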

What if I told you the Subject was…

Subject: Nigerian Diamonds!!!
From: [email protected]

Dear friend,
Based on the further explicit investment information about your country from my research i wish to invest in your country under your supervision.

[Figure: the Subject line broken into pieces: “Nigerian”, “Diamonds”, “!!!”]

[Figure: diagram with the labels Ham, Spam, “Nigeria” and 100%]

Bayes Theorem

Proba (email is Spam, if contains “Nigeria”) = P (email contains “Nigeria”, if Spam) x P (email is Spam) / P (email contains “Nigeria”)
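
As a rough worked example of the formula, here it is applied to invented counts. The numbers below (100 emails, 20 of them Spam, 5 Spam and 1 Ham mentioning “Nigeria”) are hypothetical and only there to make the arithmetic concrete.

```fsharp
// Hypothetical counts, invented for illustration only.
let total = 100.0
let spam = 20.0
let spamWithNigeria = 5.0
let hamWithNigeria = 1.0

// P (email is Spam)
let pSpam = spam / total
// P (email contains "Nigeria", if Spam)
let pNigeriaIfSpam = spamWithNigeria / spam
// P (email contains "Nigeria")
let pNigeria = (spamWithNigeria + hamWithNigeria) / total

// Bayes Theorem: P (email is Spam, if contains "Nigeria")
let pSpamIfNigeria = pNigeriaIfSpam * pSpam / pNigeria

printfn "P(Spam | \"Nigeria\") = %.3f" pSpamIfNigeria
```

With these counts, 5 of the 6 emails mentioning “Nigeria” are Spam, so the result is 5/6, which is what the formula returns.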

This can be used to classify text

P(Spam|“Nigeria”) = P(“Nigeria”|Spam) x P(Spam) / P(“Nigeria”)
P(Ham|“Nigeria”) = P(“Nigeria”|Ham) x P(Ham) / P(“Nigeria”)

If P(Spam|“Nigeria”) > P(Ham|“Nigeria”), it’s “Crazy Tasty” Spam

Note: we can actually ignore P(“Nigeria”) to make a decision, because both expressions are divided by it
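
The decision rule can be sketched as a comparison of the two numerators only. The `Label` type, the function names and the probabilities are assumptions made up for this example, not part of the dojo material.

```fsharp
type Label = Ham | Spam

// Score = P(token | label) x P(label); the shared denominator P(token) is left out.
let score (pTokenIfLabel: float) (pLabel: float) = pTokenIfLabel * pLabel

// Pick whichever label has the larger score.
let classify pTokenIfSpam pSpam pTokenIfHam pHam =
    if score pTokenIfSpam pSpam > score pTokenIfHam pHam then Spam else Ham

// Hypothetical numbers: the token is 20 times more likely in Spam than in Ham.
let strongToken = classify 0.25 0.2 0.0125 0.8

// Hypothetical numbers: the token is only twice as likely in Spam,
// which is not enough to overcome the 80% share of Ham.
let weakToken = classify 0.02 0.2 0.01 0.8

printfn "%A %A" strongToken weakToken
```

The second call shows why both parts of the score matter: a token that leans towards Spam can still lose to a large enough share of Ham.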

Bayes Theorem weights 2 components

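P(Spam|“Nigeria”) ∝ P(“Nigeria”|Spam) x P(Spam)
(∝ means “proportional to”: the denominator P(“Nigeria”) has been dropped)

»P(“Nigeria”|Spam): how likely is it that I observe the word “Nigeria” in a Spam email?
»P(Spam): how likely is it that an email is Spam, “in general”?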

Naïve Bayes Classifier

»Break text into Tokens (“Nigeria”, “Diamond”, …)
»Compute the probability that text is Ham or Spam, given presence/absence of each Token
»Combine probabilities into one number

P(Spam|Tokens) = P(T1|Spam) x P(T2|Spam) x … x P(Tn|Spam) x P(Spam) / P(Tokens)
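
The three steps can be sketched in a few lines of F#. This is an illustration, not the dojo's implementation. The training set is invented, only the Tokens present in the message are scored, and two things are added that the slide does not mention: Laplace smoothing, so that an unseen Token does not force the product to zero, and sums of logarithms instead of products. As on the earlier slide, P(Tokens) is ignored.

```fsharp
type Label = Ham | Spam

// Tiny made-up training set, for illustration only.
let training =
    [ (Spam, "nigerian diamonds for you")
      (Spam, "free diamonds call now")
      (Ham, "see you at lunch")
      (Ham, "call me when you are free")
      (Ham, "lunch tomorrow") ]

// Step 1: break text into a set of lower-case Tokens.
let tokenize (text: string) =
    text.ToLowerInvariant().Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Set.ofArray

// P(label): share of training examples carrying that label.
let prior label =
    let count = training |> List.filter (fun (l, _) -> l = label) |> List.length
    float count / float (List.length training)

// Step 2: P(token | label), the share of examples with that label containing the token.
// The +1 / +2 is Laplace smoothing (an addition, not from the slides).
let likelihood label token =
    let docs = training |> List.filter (fun (l, _) -> l = label)
    let hits =
        docs
        |> List.filter (fun (_, text) -> Set.contains token (tokenize text))
        |> List.length
    float (hits + 1) / float (List.length docs + 2)

// Step 3: log of P(T1|label) x ... x P(Tn|label) x P(label).
let score label text =
    tokenize text
    |> Seq.sumBy (fun token -> log (likelihood label token))
    |> (+) (log (prior label))

let classify text =
    if score Spam text > score Ham text then Spam else Ham

printfn "%A" (classify "free nigerian diamonds")
printfn "%A" (classify "lunch at noon")
```

On this toy data the first message is classified as Spam and the second as Ham.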

Why “Naïve”?

»Assumes that the impacts of the Tokens are independent
»Suppose “Nigerian Diamonds” always shows up together
»[Nigerian], [Diamonds] will be “double-counted”
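
A small numeric sketch of the double-counting, assuming made-up frequencies: the two Tokens always appear together, in 50% of Spam and 10% of Ham. All names and numbers are hypothetical.

```fsharp
// Hypothetical frequencies of the pair "Nigerian Diamonds".
let pPairIfSpam = 0.5
let pPairIfHam = 0.1

// Evidence actually carried by the pair: since the Tokens always come together,
// seeing both says no more than seeing one of them.
let actualRatio = pPairIfSpam / pPairIfHam

// Naive Bayes multiplies one factor per Token, as if they were independent.
let naiveRatio = (pPairIfSpam * pPairIfSpam) / (pPairIfHam * pPairIfHam)

printfn "actual: %.1f, naive: %.1f" actualRatio naiveRatio
```

The naive ratio is the square of the actual one (25 versus 5): the same piece of evidence has been counted twice.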

Your mission

»Figure out if SMS is Spam or Ham given a Token
»Use existing implementation to build a basic classifier
»Use your brains to make a better classifier
»Project/guided script available at www.github.com/c4fsharp/dojo-ham-or-spam