How does a computer know what is spam and what is ham?
Date posted: 22-Dec-2015
Attempt 1:
(define (spam? email)
  (cond ((email from known sender) False)
        ((email contains "viagra") True)
        ((email begins with "Dear Mr/Mrs.") True)
        ((email contains URL) True)
        ((email contains attachment) True)
        ( ...
Problem: (email contains URL) is an indication, NOT a PROOF.
| Features | Score |
|---|---|
| email from known sender | -50 |
| email contains "viagra" | 75 |
| email begins with "Dear Mr/Mrs." | 70 |
| email contains URL | 10 |
| email contains attachment | 5 |
| ... | ... |

If Total Sum > 100, classify as spam.
Problems:
- How to determine the score?
- How to combine the score?
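The scoring scheme above can be sketched in Scheme as follows. This is only an illustration: representing an email as a list of feature symbols, and the symbol names themselves, are assumptions made here, while the scores and the threshold of 100 come from the slide.

```scheme
;; Hypothetical representation: an email is the list of feature
;; symbols it exhibits, e.g. '(contains-url contains-viagra).
;; The scores are the ones listed on the slide.
(define feature-scores
  '((known-sender . -50)
    (contains-viagra . 75)
    (dear-mr-mrs . 70)
    (contains-url . 10)
    (contains-attachment . 5)))

;; Score of one feature; features missing from the table count as 0.
(define (feature-score feature)
  (let ((entry (assq feature feature-scores)))
    (if entry (cdr entry) 0)))

;; Sum the scores of all features present in the email.
(define (total-score email)
  (if (null? email)
      0
      (+ (feature-score (car email))
         (total-score (cdr email)))))

;; Classify as spam when the total exceeds the slide's threshold.
(define (spam? email)
  (> (total-score email) 100))
```

For example, `(spam? '(contains-viagra dear-mr-mrs))` sums to 145 and is classified as spam, while adding `known-sender` to the same email drops the total to 95, below the threshold. The sketch makes both problems visible: the numbers in the table and the choice of plain addition are hand-picked.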
Key Idea:
Learn which features are important through examples
Training Set: lots of emails with correct labels (both spam and ham)
The Naive Bayes Algorithm:
Step 1. Gather Statistics inside Training Set:
- Count percentage of spams in Training Set: P(spam)
- Count percentage of hams in Training Set: P(ham)
- For every feature F_1, F_2, F_3, ...:
  - Count percentage of spams with feature F_i: P(F_i | spam)
  - Count percentage of hams with feature F_i: P(F_i | ham)
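Step 1 amounts to counting. A minimal sketch, assuming a training set stored as a list of examples of the form `(label feature ...)`; the four examples, the feature symbols, and the procedure names `count-if`, `prior` and `likelihood` are all made up for illustration:

```scheme
;; Hypothetical training set: each example is (label feature ...).
(define training-set
  '((spam contains-viagra dear-mr-mrs)
    (spam dear-mr-mrs)
    (spam contains-url)
    (ham contains-url)))

;; Count the elements of lst that satisfy pred.
(define (count-if pred lst)
  (cond ((null? lst) 0)
        ((pred (car lst)) (+ 1 (count-if pred (cdr lst))))
        (else (count-if pred (cdr lst)))))

;; P(label): fraction of training examples carrying the label.
(define (prior label)
  (/ (count-if (lambda (ex) (eq? (car ex) label)) training-set)
     (length training-set)))

;; P(feature | label): among examples with the label, the fraction
;; that have the feature. Assumes at least one example has the label.
(define (likelihood feature label)
  (/ (count-if (lambda (ex)
                 (and (eq? (car ex) label)
                      (memq feature (cdr ex))))
               training-set)
     (count-if (lambda (ex) (eq? (car ex) label)) training-set)))
```

On this toy data `(prior 'spam)` is 3/4 and `(likelihood 'dear-mr-mrs 'spam)` is 2/3. Scheme's exact rationals keep the counts readable; a real training set would be far larger.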
Say,
- F_1 = email contains "viagra"
- F_2 = email begins with "Dear Mr/Mrs."

From Training Set, we discovered:
- P(spam) = 0.85, P(ham) = 0.15
- P(F_1 | spam) = 0.2, P(NOT F_1 | spam) = 0.8
- P(F_1 | ham) = 0.001, P(NOT F_1 | ham) = 0.999
- P(F_2 | spam) = 0.99, P(NOT F_2 | spam) = 0.01
- P(F_2 | ham) = 0.0001, P(NOT F_2 | ham) = 0.9999
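As a worked illustration with these numbers, consider an email that has both F_1 and F_2. The calculation below is a sketch that assumes the features are independent given the label, which is the "naive" part of Naive Bayes:

```latex
\[
P(\text{spam})\,P(F_1 \mid \text{spam})\,P(F_2 \mid \text{spam})
  = 0.85 \times 0.2 \times 0.99 = 0.1683
\]
\[
P(\text{ham})\,P(F_1 \mid \text{ham})\,P(F_2 \mid \text{ham})
  = 0.15 \times 0.001 \times 0.0001 = 1.5 \times 10^{-8}
\]
\[
P(\text{spam} \mid F_1, F_2)
  = \frac{0.1683}{0.1683 + 1.5 \times 10^{-8}} \approx 0.9999999
\]
```

The spam product is about ten million times larger than the ham product, so such an email would be labelled spam.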
Step 2. On a new Instance:
- Find what features the new instance has
- Use Bayes' Rule to compute probability
- Take the most probable label
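Step 2 can be sketched in Scheme using the example statistics for F_1 and F_2. The table layout, the symbols `f1` and `f2`, and the procedure names are assumptions made for this illustration, and an email is again taken to be the list of features it has:

```scheme
;; Statistics from the example; the table layout is an assumption.
;; Each entry is (feature P(feature|spam) P(feature|ham)).
(define p-spam 0.85)
(define p-ham 0.15)
(define likelihoods
  '((f1 0.2 0.001)
    (f2 0.99 0.0001)))

;; Unnormalised score of one label: the prior times, for every known
;; feature, P(F|label) if the email has F and 1 - P(F|label) if not.
;; `select` picks the right column of a table entry.
(define (label-score prior select email)
  (let loop ((table likelihoods) (score prior))
    (if (null? table)
        score
        (let ((p (select (car table))))
          (loop (cdr table)
                (* score
                   (if (memq (car (car table)) email) p (- 1 p))))))))

;; Bayes' rule: normalise the spam score by the sum of both scores.
(define (spam-probability email)
  (let ((s (label-score p-spam cadr email))
        (h (label-score p-ham caddr email)))
    (/ s (+ s h))))

;; Take the most probable label.
(define (classify email)
  (if (> (spam-probability email) 0.5) 'spam 'ham))
```

With these numbers `(classify '(f1 f2))` gives `spam`, while `(classify '())` gives `ham`: an email with neither feature is judged ham even though 85% of the training set is spam, because nearly all spam in the example begins with "Dear Mr/Mrs.".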
Example: Optical Character Recognition
GOAL: recognize scanned hand-written numbers
[Figure: a scanned hand-written digit rendered as a grid of '.', '+' and '#' characters]
............................
............................
............+#..............
..........+###..............
.........+####+.............
.........+######+...........
........+###+####+..........
........+##..+####..........
........+#+...+##+..........
........+#+...###+..........
........+##+++####+.........
.........#####++##+.........
.........+###+..+##+........
..........+++....+#+........
.................+##........
..................+#+.......
..................+##+......
...................+#+......
...................+##+.....
....................+#+.....
.....................+#+....
......................#+....
............................
Instance – scanned image of hand-written number
Labels – 1,2,3,4,5,6,7,8,9
Features – (for project) every 2x2 pixel square
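One way to enumerate such features is sketched below. It assumes an image is a list of equal-length strings, one per pixel row, and that the squares overlap (every possible top-left corner is used); neither detail is stated on the slide, and the project's Image abstract data type may differ.

```scheme
;; Hypothetical image representation: a list of equal-length strings,
;; one per pixel row, using the characters '.', '+' and '#'.
(define tiny-image
  '("..+"
    ".##"
    "+#."))

;; The 2x2 square whose top-left corner is at (row, col), returned as
;; a 4-character string: top-left, top-right, bottom-left, bottom-right.
(define (square-at image row col)
  (let ((top (list-ref image row))
        (bottom (list-ref image (+ row 1))))
    (string (string-ref top col)
            (string-ref top (+ col 1))
            (string-ref bottom col)
            (string-ref bottom (+ col 1)))))

;; All overlapping 2x2 squares, scanned row by row.
(define (all-squares image)
  (let ((rows (length image))
        (cols (string-length (car image))))
    (let loop ((r 0) (c 0))
      (cond ((>= r (- rows 1)) '())
            ((>= c (- cols 1)) (loop (+ r 1) 0))
            (else (cons (square-at image r c)
                        (loop r (+ c 1))))))))
```

For the 3x3 `tiny-image`, `(all-squares tiny-image)` returns the four squares `("...#" ".+##" ".#+#" "###.")`. Each square, together with its position, would then play the role that F_1, F_2, ... played for email.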
Steps.
- Turn image-file into a stream of Images (Abstract Data Type) (done for you)
- Gather feature statistics from Training File (mostly done for you)
- Implement Bayes' Rule (mostly your own work)
- Evaluate your OCR by guessing labels on Validation File (mostly done for you)
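The evaluation step boils down to measuring how often the guessed label matches the correct one. A minimal sketch, assuming validation examples are pairs `(instance . correct-label)` held in a plain list rather than the project's stream; the name `accuracy` and this representation are illustrative only:

```scheme
;; Fraction of validation examples whose guessed label matches the
;; correct one. `classify` is any procedure from instance to label;
;; each example is a pair (instance . correct-label).
;; Returns 0 for an empty list of examples.
(define (accuracy classify examples)
  (let loop ((rest examples) (correct 0) (total 0))
    (cond ((null? rest)
           (if (= total 0) 0 (/ correct total)))
          ((equal? (classify (car (car rest))) (cdr (car rest)))
           (loop (cdr rest) (+ correct 1) (+ total 1)))
          (else (loop (cdr rest) correct (+ total 1))))))
```

For instance, a classifier that always guesses 1 scores `(accuracy (lambda (instance) 1) '((a . 1) (b . 2) (c . 1) (d . 1)))`, which is 3/4.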