Is that spam in my ham? - EuroPython 2016
Is that spam in my ham?
A novice’s inquiry into classification.
Lorena Mesa | EuroPython 2016 @loooorenanicole
bit.ly/europython2016-lmesa
Hi, I’m Lorena Mesa.
Have you seen this before? (You’re not alone.)
Subject: De-junk And Speed Up Your Slow PC!!!
From:
Theme:
Promises of “free” item(s).
Several images in the email itself.
How I’ll approach today’s chat.
1. What is machine learning?
2. How is classification a part of this world?
3. How can I use Python to solve a classification problem like spam detection?
Machine Learning
is a subfield of computer science [that] stud[ies] pattern recognition and computational learning [in] artificial intelligence. [It] explores the construction and study of algorithms that can learn from and make predictions on data.
Put another way
A computer program is said to learn from experience (E) with respect to some task (T) and some performance measure (P), if its performance on T, as measured by P, improves with experience E.
(Ch. 1 - Machine Learning, Tom Mitchell)
Human Experience
Recorded Experience
Classification in machine learning
Task: Classify a piece of data
Is an email Spam or Ham?
Experience: Labeled training data
Email 1 | Ham
Email 2 | Spam
Performance Measurement: Is the label correct?
Verify if the email is Spam or Ham
Naive Bayes is a type of probabilistic classifier.
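Naive Bayes in stats theory
The math for Naive Bayes is based on Bayes’ theorem, which relates the probability of a class given some evidence to the probability of that evidence given the class. On top of that, Naive Bayes assumes that the features (here, the words in an email) are independent of one another given the class.
Naive Bayes classifiers make use of this “naive” assumption.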
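A minimal sketch of what the “naive” assumption buys you: the likelihood of a whole email given a class becomes a plain product of per-word likelihoods. The function name and the numbers are illustrative assumptions, not taken from the talk's code.

```python
from math import prod

def naive_likelihood(words, word_likelihoods):
    """P(words|class), assuming the words are independent given the class."""
    return prod(word_likelihoods[w] for w in words)

# Hypothetical per-word likelihoods for the spam class.
spam_word_likelihoods = {"free": 0.56, "junk": 0.25}

p_email_given_spam = naive_likelihood(["free", "junk"], spam_word_likelihoods)
print(p_email_given_spam)
```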
Independent vs. Dependent Events
Assumption: Independent Events
Naive Bayes in Spam Classifiers
Q: What is the probability of an email being Spam or Ham?
P(c|x) = P(x|c)P(c) / P(x)
P(x|c): likelihood of predictor in the class, e.g. 28 out of 50 spam emails have the word “free”
P(c): prior probability of class, e.g. 50 of all 150 emails are spam
P(x): prior probability of predictor, e.g. 72 of 150 emails have the word “free”
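The example counts can be plugged straight into the formula. A minimal sketch, assuming c is “spam” and x is “the email contains the word free”; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the example: 150 emails, 50 spam, 28 spam emails with "free",
# 72 emails overall with "free".
total_emails = 150
spam_emails = 50
spam_with_free = 28
emails_with_free = 72

p_x_given_c = Fraction(spam_with_free, spam_emails)   # P(free|spam)
p_c = Fraction(spam_emails, total_emails)             # P(spam)
p_x = Fraction(emails_with_free, total_emails)        # P(free)

p_c_given_x = p_x_given_c * p_c / p_x                 # P(spam|free)
print(p_c_given_x, float(p_c_given_x))
```

With these counts, an email containing “free” is spam with probability 28/72, a little under 39%.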
Picks category with MAP
MAP: maximum a posteriori probability
label = argmax P(x|c)P(c)
P(x) identical for all classes; don’t use it
Q: Is P(c|x) bigger for ham or spam?
A: Pick the MAP!
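Picking the MAP can be sketched in a few lines. The function name is an illustrative assumption, and the ham figures are derived from the earlier example counts (72 emails with “free” minus 28 spam leaves 44 of 100 ham emails).

```python
def map_label(likelihoods, priors):
    """Return the class c with the largest P(x|c) * P(c); P(x) is skipped."""
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

# x = "the email contains the word free"
likelihoods = {"spam": 28 / 50, "ham": 44 / 100}   # P(x|c)
priors = {"spam": 50 / 150, "ham": 100 / 150}      # P(c)

label = map_label(likelihoods, priors)
print(label)
```

Here the word “free” alone is not enough to tip the email into spam, because ham is twice as common in the example data.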
Why Naive Bayes?
There are other classifier algorithms you could explore, but the math behind Naive Bayes is much simpler and suits what we need to do just fine.
So how do I use Python to detect spam?
Task: Spam Detection
Training data contains 2500 emails: Ham (1721), labelled as 1, and Spam (779), labelled as 0.
Tools: What we’ll use.
email: parse emails into Message objects
lxml: transform email messages into plain text
nltk: filter out “stop” words
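A rough sketch of the first two steps, using only the standard library so it runs anywhere. The raw message and sender address are made up for illustration, and the regular expression is a crude stand-in for the lxml step, not how lxml itself works.

```python
import email
import re

# Hypothetical raw email; the sender address is invented for this sketch.
raw = (
    "From: sender@example.com\n"
    "Subject: De-junk And Speed Up Your Slow PC!!!\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><p>Get your <b>free</b> scan today!</p></body></html>\n"
)

message = email.message_from_string(raw)   # an email.message.Message object
subject = message["Subject"]
html_body = message.get_payload()          # body string of a non-multipart message

# Stand-in for lxml: replace anything that looks like a tag with a space,
# then collapse the leftover whitespace.
text = " ".join(re.sub(r"<[^>]+>", " ", html_body).split())
print(subject)
print(text)
```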
Task: Training the spam filter
Training the Python Naive Bayes classifier
Stemming words - treat words like “shop” and “shopping” alike.
Tokenize text into a bag of words
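Stemming and tokenizing together might look roughly like this. The stop list and the suffix-stripping stemmer are deliberately tiny stand-ins for what nltk provides, and every name here is illustrative.

```python
import re
from collections import Counter

# Tiny hand-picked stop list; a real one would be much longer.
STOP_WORDS = {"a", "and", "for", "is", "of", "the", "to", "your"}

def crude_stem(word):
    """Strip a trailing "ing" and undo a doubled consonant (shopping -> shop)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count what is left."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

bag = bag_of_words("Shop for free! Free shopping is the best shopping.")
print(bag)
```

The bag keeps word counts but throws away word order, which is all the classifier needs.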
Zero-Word Frequency
What happens if a new email contains a word that was not seen in the training data?
P(free|spam) * P(your|spam) * … * P(junk|spam)
0/150 * 50/150 * … * 25/150
A single zero factor makes the whole product zero. Laplace smoothing adds a small positive number (e.g. 1) to all counts to prevent this.
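One common form of Laplace smoothing, sketched under the assumption that likelihoods are estimated from per-class word counts. The counts, the vocabulary, and the function name are hypothetical.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocabulary_size, alpha=1):
    """P(word|class) with alpha added to every word count."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocabulary_size)

# Hypothetical spam word counts; "free" was never seen in spam.
spam_counts = Counter({"your": 50, "junk": 25})
vocabulary = {"free", "your", "junk"}

p_free = smoothed_likelihood("free", spam_counts, len(vocabulary))
p_your = smoothed_likelihood("your", spam_counts, len(vocabulary))
print(p_free, p_your)
```

The unseen word now gets a small non-zero likelihood instead of wiping out the whole product.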
Task: Classifying emails
Floating Point Underflow
Smoothing
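The slide only names the underflow problem. A common remedy, assumed here rather than confirmed from the talk's code, is to add log probabilities instead of multiplying raw ones. The likelihood values are made up.

```python
import math

# Hypothetical per-word likelihoods for a 100-word email.
probabilities = [1e-5] * 100

# Multiplying many small numbers underflows to exactly 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Summing logs keeps the score finite, and comparisons between
# classes still work because log is monotonic.
log_score = sum(math.log(p) for p in probabilities)
print(product, log_score)
```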
Performance Measurement: 90/10 Split
Classify the unseen examples.
Measure performance on 10% of data
Train on 90% of training data
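A 90/10 split and an accuracy check can be sketched as follows. The placeholder examples stand in for the 2500 labelled emails, and the function names and fixed seed are illustrative assumptions.

```python
import random

def split_90_10(examples, seed=0):
    """Shuffle a copy of the data, then cut it 90% train / 10% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of predicted labels that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Placeholder (text, label) pairs standing in for the labelled emails.
examples = [("email %d" % i, i % 2) for i in range(2500)]
train, test = split_90_10(examples)
print(len(train), len(test))
```

The classifier never sees the held-out 10% during training, so accuracy on it is a fairer estimate of how unseen email will be labelled.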
False Positives
I signed up to receive promotional deals from Patagonia.
“Typically used in spam”: implementation may be flawed? (e.g. too naive?)
Google spam → report as spam (or not!)
Naive Bayes limitations & challenges
- Independence assumption is a simplistic model of the world
- Overestimates the probability of the label ultimately selected
- Inconsistent labeling of data (e.g. same email has both spam label and ham label)
Improve Performance
More & better feature extraction
Other possible features:
- Subject
- Images
- Sender
MORE DATA!
Want to learn more?
Kaggle for toy machine learning problems!
Introduction to Machine Learning With Python by Sarah Guido
Your local Python user group!
Thank you! bit.ly/europython2016-lmesa | @loooorenanicole