Talk (II): E-mail Spam: The Problem, Solutions and Potential Industrial Standards
Jenq-Haur Wang
Academia Sinica
Nov. 16-17, 2006
Nov. 16-17, 2006 E-mail Spam 2
Outline
• Introduction
• Existing Solutions
  – Regulatory Solutions
  – Technical Solutions
• Potential Industrial Standards
Introduction
• What is spam?
  – E-mail, netnews, instant messaging (“spim”), “Google-spam”, guestbook spam, Weblog comment spam, VoIP (“spit”), …
  – Unsolicited messages flooded to uninterested receivers, usually sent in bulk
• What is e-mail spam?
  – Junk e-mail
  – Unsolicited bulk e-mail (UBE)
  – Unsolicited commercial e-mail (UCE)
Spam Statistics
• Jan. 2001
  – 8% of all e-mail traffic in the US is spam [Brightmail Inc.]
• Jan. 2003
  – 42% [Brightmail Inc.]
• Jul. 2004
  – 65% [Symantec (Brightmail) Inc.]
• In 2002
  – 3 pieces/day/user (average) [Ferris Research]
• By 2005
  – 10 pieces/day/user (average) [Ferris Research]
Costs of Spam
• Enterprises
  – > US$10 billion for US organizations in 2003 [Ferris Research]
  – US$245,000/year for a company with 14,000 employees [IDC]
• End users
  – 5 spam messages/day, 30 seconds each -> 15 hours/year [Ferris Research]
  – Loss of productivity
• Burden on ISPs
  – System resource consumption on servers
  – Waste of network bandwidth
  – User complaints
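The end-user figure above is simple arithmetic. A minimal check in Python, assuming spam arrives on all 365 days of the year (the slide does not state the day count, and the function name is only illustrative):

```python
def hours_lost_per_year(spam_per_day, seconds_each, days_per_year=365):
    """Time spent handling spam, in hours per year."""
    return spam_per_day * seconds_each * days_per_year / 3600

# 5 messages/day at 30 seconds each
print(round(hours_lost_per_year(5, 30), 1))  # 15.2
```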
Latest Spam Statistics
• Email considered spam: 40%
• Daily spam emails sent: 12.4 billion
• Daily spam received per person: 6
• Annual spam received per person: 2,200
• Spam cost to all non-corp. Internet users: $255 million
• Spam cost to all US corporations in 2002: $8.9 billion
• States with anti-spam laws: 26
[source: Spam Statistics 2006, by Don Evett, TopTenReviews, Inc.]
Latest Spam Statistics (cont.)
• Email address changes due to spam: 16%
• Estimated spam increase by 2007: 63%
• Annual spam in 1,000 employee company: 2.1 million
• Users who reply to spam email: 28%
• Users who purchase from spam email: 8%
• Corporate email that is considered spam: 15-20%
• Wasted corporate time per spam email: 4-5 sec
Email Statistics
• Daily emails sent: 31 billion
• Daily emails sent per email address: 56
• Daily emails sent per person: 174
• Daily emails sent per corporate user: 34
• Daily emails received per person: 10
• Email addresses per person: 3.1 average
• Cost to all Internet users: $255 million
Spam Categories
• Products: 25%
• Financial: 20% ↑
• Adult: 19% ↑
• Scams: 9%
• Health: 7%
• Internet: 7%
• Leisure: 6%
• Spiritual: 4%
• Other: 3%
(Source: http://www.brightmail.com/spamstats.html, Jun. 2004 & http://spam-filter-review.toptenreviews.com/spam-statistics.html, 2006)
Origins of Spam
• Where does the spam come from? [Sophos, “Dirty Dozen” spam producing countries, Apr. 2005]
  – 35.7% (43%): from the US
  – 25.0% ↑ (16%): from South Korea
  – 9.7% (11%): from China
• …
Major Factors
• Simple SMTP mail relaying mechanism
  – Cannot verify the identity of the sender
    • Forged IP address/sender e-mail address
  – Open mail relay/proxy
• Low cost for sending bulk e-mails
  – Low cost for e-mail address harvesting
    • Web, mailing list, …
  – Bulk mailer programs
  – Low cost for obtaining “free” e-mail address
Lifecycle of E-mails
[Diagram: sender → MUAs → (SMTP) → MTAs → (SMTP) → MTAr → mailbox → (POP3/IMAP4) → MUAr → recipient. MUAs and MTAs are in the sender domain; MTAr and MUAr are in the receiver domain. Also shown: DNS (MX records) and queues.]
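As a rough sketch of the DNS step in the diagram: the sending MTA looks up the MX records of the receiver domain and tries the listed hosts in order of increasing preference value. The records and function name below are made up for illustration; a real MTA would obtain the records from a DNS query and would randomize among hosts of equal preference.

```python
def delivery_order(mx_records):
    """Return MX hosts in the order a sending MTA would try them.

    mx_records: list of (preference, hostname) pairs; a lower preference
    value is tried first. Ties are broken alphabetically here for simplicity.
    """
    return [host for _, host in sorted(mx_records)]

# Hypothetical MX records for a receiver domain.
records = [(20, "backup.example.org"), (10, "mail.example.org")]
print(delivery_order(records))
```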
Existing Solutions
• Regulatory solutions
  – Anti-spam laws
  – Limitations
• Technical solutions
  – Filtering
  – Postage
  – Disposable e-mail address
Regulatory Solutions
• Anti-spam laws
  – http://www.spamlaws.com/
  – Ex: US federal law CAN-SPAM Act (S.877), in effect since Jan. 1, 2004
• Limitations
  – Dependence on evidence in technical information
  – Slow and costly process
Current Status of Anti-Spam Laws
• In the US:
  – Enacted federal laws: CAN-SPAM Act of 2003 (Pub. L. 108-187, S. 877)
  – Enacted state laws: Arkansas, California, Colorado, Connecticut, Delaware, Idaho, Illinois, Indiana, Iowa, Kansas, Louisiana, Maryland, Minnesota, Missouri, Nevada, New Mexico, North Carolina, Ohio, Oklahoma, Pennsylvania, Rhode Island, South Dakota, Tennessee, Utah, Virginia, Washington, West Virginia, Wisconsin, Wyoming, …
• In Europe:
  • European Union, Austria, Belgium, Czech Republic, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, United Kingdom, …
• In other countries:
  • Argentina, Australia, Brazil, Canada, India, Japan, Panama, Peru, Russia, South Korea, Yugoslavia, …
  • Taiwan: “Anti-Hacker” laws in the Martial Law (Jun. 3, 2003)
Technical Solutions
• Filtering: to separate bad from good
  – Heuristic-based
  – Classification-based: machine learning
  – Others: peer-to-peer, honeypot
• Postage: to increase the cost of sending e-mails
• Hiding email address
  – Encoding (text to image, JavaScript, …)
  – Disposable email address: separate e-mail address for different correspondence
• Enhancing SMTP mechanism
  – Email path verification
  – Authenticated SMTP
Filtering Technique: Heuristic-based
• Black/White/Grey lists
  – Blacklist: lists of IP addresses that send spam
    • RBLs (Real-time Blackhole Lists), open mail relays, open proxies, …
  – Whitelist: lists of trusted senders
    • Challenge-response mechanism
  – Greylisting: temporary delay of e-mail from unknown senders
• Problems
  – Easy to make mistakes
    • Forged IP address/sender e-mail address
  – Lists need to be updated frequently
    • Changing spammer e-mail addresses
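Greylisting, mentioned above, can be sketched in a few lines. This is an illustrative toy, not any particular mail server's implementation: the class name, the 300-second delay, and the use of the (client IP, sender, recipient) triplet as the key are assumptions. The idea is that a legitimate MTA retries after a temporary failure, while many bulk mailers do not.

```python
import time

class Greylist:
    """Temporarily reject mail from unknown (client IP, sender, recipient) triplets."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.first_seen = {}        # triplet -> time of its first delivery attempt

    def check(self, client_ip, sender, recipient, now=None):
        """Return 'accept', or 'tempfail' (an SMTP 4xx reply asking the client to retry)."""
        now = time.time() if now is None else now
        triplet = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"

gl = Greylist(min_delay=300)
args = ("192.0.2.1", "a@sender.example", "b@receiver.example")
print(gl.check(*args, now=0))    # tempfail: first attempt from this triplet
print(gl.check(*args, now=600))  # accept: retried after the delay
```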
Filtering Technique: Heuristic-based (cont.)
• Keyword-matching rules (ex. MS Outlook)
  – Look for similar messages based on their subject or content
• Problems
  – Exact rules are difficult to formulate and maintain
    • Spam is always changing
  – Chinese menu (madlibs) attack
    • “Make thousands of dollars working at home !!!”
    • “Earn lots of money in the comfort of your own house.”
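The two example sentences above show why exact rules are brittle. A minimal sketch, assuming a single hand-written phrase rule (the rule itself is hypothetical, not taken from any mail client):

```python
import re

# Hypothetical hand-written rule: flag messages containing this exact phrase.
RULE = re.compile(r"make thousands of dollars", re.IGNORECASE)

def is_spam_by_rule(text):
    return RULE.search(text) is not None

original = "Make thousands of dollars working at home !!!"
variant = "Earn lots of money in the comfort of your own house."

print(is_spam_by_rule(original))  # True
print(is_spam_by_rule(variant))   # False: same offer, different words
```

Each phrase slot can be filled from a menu of synonyms, so the number of variants grows multiplicatively while every new rule covers only one of them.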
Filtering Technique: Classification-based
• Machine learning
  – Text classification methods: TF-IDF, Naïve Bayes, SVM (Support Vector Machine), …
  – Learn spam vs. good
  – Adapt to changing spam
• Problems
  – Need lots of training data
    • Diverse contents in e-mail spam
  – Spammers are learning too
    • Images, synonyms, misspellings, …
  – “One man’s spam is another man’s ham”
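Of the methods listed, Naïve Bayes is the easiest to sketch. The following is a minimal multinomial Naïve Bayes with add-one smoothing, written for illustration only; the class name, the whitespace tokenizer, and the four training messages are assumptions, and a real filter would need far more training data, as the slide notes.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Toy multinomial Naive Bayes; needs at least one training message per class."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, words, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        # Log prior of the class, then add-one smoothed log likelihood of each word.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in words:
            score += math.log((self.word_counts[label][w] + 1) / (total_words + len(vocab)))
        return score

    def classify(self, text):
        words = text.lower().split()
        return max(("spam", "ham"), key=lambda label: self._log_score(words, label))

nb = NaiveBayesFilter()
nb.train("make money fast", "spam")
nb.train("free money offer", "spam")
nb.train("meeting agenda for tomorrow", "ham")
nb.train("lunch tomorrow with the team", "ham")
print(nb.classify("free money"))        # spam
print(nb.classify("meeting tomorrow"))  # ham
```

Retraining on newly labeled messages is what lets such a filter adapt to changing spam.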
Filtering Techniques -- Others
• Distributed (peer-to-peer, collaborative) spam filtering
  – To share the knowledge of spam features
  – SpamNet: Cloudmark
  – SpamWatch: UC Berkeley
• Problems
  – Efficacy
  – Efficiency
Distributed Spam Filtering
• Cloudmark’s SpamNet
[Diagram: each recipient's MUAr retrieves mail from MTAr via POP3/IMAP4. A client-side add-in checks incoming messages against SpamNet, and client-side add-ins report spam to SpamNet.]
Discussions on Filtering-based Approach
• False-positive vs. false-negative
  – Cost-sensitive e-mail classification
• Incoming vs. outgoing e-mail filtering
  – Ex. corporate mail filtering might focus on preventing confidential data from leaking out
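One common way to make classification cost-sensitive is to block a message only when the expected cost of blocking is lower than the expected cost of delivering it. If a false positive (losing a good message) is assumed to cost λ times as much as a false negative, that gives the rule P(spam) > λ / (1 + λ). The sketch below assumes this formulation and an arbitrary λ of 9; neither is specified in the talk.

```python
def spam_threshold(cost_ratio):
    """Minimum P(spam) needed to block a message, when a false positive
    costs `cost_ratio` times as much as a false negative."""
    return cost_ratio / (1.0 + cost_ratio)

def should_block(p_spam, cost_ratio=9.0):
    return p_spam > spam_threshold(cost_ratio)

print(should_block(0.80))  # False: not confident enough when good mail is 9x as valuable
print(should_block(0.95))  # True
```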
Postage
• Postage: to increase the cost of sending e-mails
  – Money: payment
  – Computation: time
  – Turing tests: challenge-response
• Problems
  – Requires multiple monetary transactions for each e-mail delivery
  – Who pays for infrastructure?
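The "computation" form of postage can be illustrated with a proof-of-work stamp in the spirit of Hashcash. This is a simplified sketch, not the real Hashcash stamp format: the `recipient:counter` layout, SHA-256, and the 12-bit difficulty are all assumptions chosen for brevity. The sender must try many hashes per message, while the receiver verifies with one.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Number of leading zero bits in a bytes digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def mint_stamp(recipient, bits=12):
    """Sender side: search for a counter whose stamp hash starts with `bits` zero bits."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= bits:
            return stamp

def verify_stamp(stamp, recipient, bits=12):
    """Receiver side: a single hash confirms the sender did the work for this recipient."""
    if not stamp.startswith(recipient + ":"):
        return False
    return leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= bits

stamp = mint_stamp("bob@example.org")
print(verify_stamp(stamp, "bob@example.org"))    # True
print(verify_stamp(stamp, "alice@example.org"))  # False: stamp is bound to one recipient
```

Each extra bit of difficulty doubles the sender's expected work, which is negligible for one message but adds up across a bulk mailing.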
Disposable E-mail Address
• Disposable e-mail address
  – Separate e-mail address for each correspondence
• Channelized e-mail system [R. Hall]
  – Sort incoming mails according to sender address
  – Terminate the address with spam
• Problems
  – How do new senders get your address?
  – What’s the sender address for multiple receivers?
  – Difficult to remember
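A toy model of the disposable-address idea, under stated assumptions: the class name, the `user+token@domain` address format, and the random token are illustrative choices, not the design of the channelized system cited above. Each correspondent gets its own address, and an address that starts receiving spam is simply closed.

```python
import secrets

class ChannelizedMailbox:
    def __init__(self, user, domain):
        self.user, self.domain = user, domain
        self.channels = {}  # disposable address -> correspondent it was issued to

    def new_address(self, correspondent):
        """Issue a fresh address for one correspondent (address format is an assumption)."""
        address = f"{self.user}+{secrets.token_hex(4)}@{self.domain}"
        self.channels[address] = correspondent
        return address

    def accepts(self, address):
        return address in self.channels

    def terminate(self, address):
        """Close a channel that has started receiving spam."""
        self.channels.pop(address, None)

box = ChannelizedMailbox("alice", "example.org")
shop_addr = box.new_address("online shop")
box.terminate(shop_addr)        # the shop's address leaked to spammers
print(box.accepts(shop_addr))   # False
```

The random tokens also make the "difficult to remember" problem listed above concrete.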
Enhancing SMTP Mechanism
• Email path verification
  – To trace the real origin of e-mail (sender)
  – Problem: accounting is needed for packet network
• Authenticated SMTP
  – Trusted environment
    • SMTP authentication (RFC 2554), SMTP over SSL/TLS (RFC 3207), digital signatures (PGP, …)
  – Problem: need client-server cooperation
Other Techniques (cont.)
• Reputation-based approach
  – Based on HITS (Hyperlink Induced Topic Search) algorithm
  – Ranking on email sending/receiving reputation
• Problem
  – Bad reputation for volume senders (mailing lists, newsletters, …)
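The talk does not say how HITS is applied to e-mail, so the following is only a sketch under one plausible assumption: treat each (sender, recipient) pair as a directed link, and read the hub score as sending activity and the authority score as receiving activity. The graph and function name are made up.

```python
import math

def hits(edges, iterations=50):
    """edges: list of (sender, recipient) pairs. Returns (hub, authority) score dicts."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes that send to this node.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of the nodes this node sends to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

edges = [("bulk", "a"), ("bulk", "b"), ("bulk", "c"), ("a", "b")]
hub, auth = hits(edges)
print(max(hub, key=hub.get))  # bulk
```

The link structure alone cannot tell whether "bulk" is a spammer or a newsletter, which is the problem noted above for volume senders.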
Existing Anti-Spam Tools
• Open Source Filters
  – SpamAssassin
  – ifile
  – bogofilter
  – POPfile
  – SpamBayes
  – CRM114
• Commercial Products
  – BrightMail
  – SurfControl
  – Anti-virus
Spammers’ Tricks
• Images: MIME
• Invisible ink (hidden text): color
• Misspelling
  – o -> 0
  – i -> l -> 1 -> !
  – S -> 5
• F R E E, g-i=r-l, …
• Ref: John Graham-Cumming: The Spammers’ Compendium, http://www.jgc.org/tsc/index.htm
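A filter can undo some of these tricks by canonicalizing text before matching. The sketch below folds only the substitutions listed above and strips separators; the function names are hypothetical. Because "l" is folded to "i", the keyword is canonicalized the same way as the message.

```python
import re

# Undo the substitutions listed above: 0 -> o, l/1/! -> i, 5 -> s.
_LOOKALIKES = str.maketrans({"0": "o", "1": "i", "l": "i", "!": "i", "5": "s"})

def canonical(text):
    """Lowercase, fold look-alike characters, and drop everything that is not a letter."""
    folded = text.lower().translate(_LOOKALIKES)
    return re.sub(r"[^a-z]", "", folded)

def mentions(text, keyword):
    # Canonicalize both sides so that folding "l" to "i" affects them equally.
    # Dropping spaces can also match across word boundaries, so false positives are possible.
    return canonical(keyword) in canonical(text)

print(mentions("Get it F R E E today", "free"))       # True
print(mentions("hot g-i=r-l", "girl"))                # True
print(mentions("Quarterly report attached", "free"))  # False
```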
Potential Industrial Standards
• Sender/Domain authentication for e-mails
  – Sender ID Framework (Microsoft)
  – DKIM (Yahoo, Cisco)
    • DomainKeys (Yahoo)
    • Identified Internet Mail (Cisco)
  – SPF
    • Sender Permitted From (AOL)
Structures of E-mails
• Envelope: SMTP (RFC 2821)
• Header & body: RFC 2822
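The distinction matters for authentication because the envelope and the header are carried separately and need not agree. A small illustration using Python's standard `email` package; all addresses are made up, and no mail is actually sent.

```python
from email.message import EmailMessage

# Header and body (RFC 2822): what the recipient's mail program displays.
msg = EmailMessage()
msg["From"] = "Trusted Bank <service@bank.example>"
msg["To"] = "customer@example.org"
msg["Subject"] = "Account notice"
msg.set_content("Please verify your account.")

# Envelope (RFC 2821): the addresses given in the SMTP MAIL FROM and RCPT TO
# commands. They travel outside the message and need not match the headers.
envelope_from = "bulk@sender.example"
envelope_to = ["customer@example.org"]

print("MAIL FROM:  ", envelope_from)
print("From header:", msg["From"])
```

Python's `smtplib.SMTP.sendmail(from_addr, to_addrs, msg)` reflects the same split: the envelope addresses are passed as separate arguments from the message text.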
Sender ID Framework (MS)
DomainKeys
IIM – Authentication/Authorization Model
Messages must pass two tests before they are authenticated
• Authenticate the message: the receiving domain verifies that the message was not altered in any consequential manner prior to reaching the receiving domain
• Authorize the sender: the receiving domain asks the sending domain to confirm that whoever signed the message was authorized to do so (without having to identify the sender)
Identified Internet Mail
DomainKeys Identified Mail (DKIM)
• Derived from Yahoo DomainKeys and Cisco Identified Internet Mail
  – IETF Working Group formed
  – IETF Internet draft
• Message header authentication
  – DNS identifiers
  – Public keys in DNS
• End-to-end
  – Between origin/receiver administrative domains
  – Not path-based
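"Public keys in DNS" can be made concrete: in DKIM, the signature header names a signing domain (`d=`) and a selector (`s=`), and the verifier fetches the public key from a TXT record at `<selector>._domainkey.<domain>`. The sketch below only builds that query name; the signature value is a made-up, abbreviated example, and the DNS lookup and signature check are omitted.

```python
def parse_tags(header_value):
    """Parse a 'tag=value; tag=value' list as used in DKIM-Signature headers."""
    tags = {}
    for part in header_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            tags[name.strip()] = value.strip()
    return tags

def key_query_name(signature_header):
    """DNS name of the TXT record that holds the signer's public key."""
    tags = parse_tags(signature_header)
    return f"{tags['s']}._domainkey.{tags['d']}"

# Made-up, abbreviated signature header value for illustration.
sig = "v=1; a=rsa-sha256; d=example.com; s=mail2006; h=from:to:subject; b=..."
print(key_query_name(sig))  # mail2006._domainkey.example.com
```

Since the key is found through the signing domain rather than through the relaying hosts, the check works end-to-end and is not path-based.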
SPF
• Sender Policy Framework
  – Derived from Sender Permitted From (SPF, AOL)
  – By Meng Weng Wong, CTO of Pobox
  – Current specification: SPFv1 (RFC 4408)
  – Reverse MX records
  – Adopted by many mail server implementations
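In SPF, a domain publishes a DNS record listing the hosts allowed to send mail on its behalf, and the receiving server compares the connecting client's IP address against it. The sketch below evaluates a deliberately reduced subset of SPFv1: only the `ip4`, `ip6`, and `all` mechanisms with their qualifiers. The record is a made-up example, and a real checker must also handle `a`, `mx`, `include`, `redirect`, and the DNS lookup itself.

```python
import ipaddress

def check_spf(record, client_ip):
    """Evaluate a simplified SPF record; only ip4/ip6 mechanisms and 'all' are handled."""
    ip = ipaddress.ip_address(client_ip)
    results = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        return "none"
    for term in terms[1:]:
        qualifier = "+"  # default qualifier when none is written
        if term[0] in results:
            qualifier, term = term[0], term[1:]
        if term == "all":
            return results[qualifier]
        if term.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip.version == network.version and ip in network:
                return results[qualifier]
    return "neutral"  # no mechanism matched

record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf(record, "192.0.2.10"))    # pass
print(check_spf(record, "198.51.100.7"))  # fail
```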
Tips for End Users (1/2)
• Never give out your personal e-mail address to strangers
• Use separate e-mail addresses for business and public use (“disposable”)
• Never respond to unsolicited e-mail
• Do not click on links within unsolicited e-mail, including deceptive unsubscribe links
Tips for End Users (2/2)
• Read the subject line of every e-mail carefully, and use the preview feature in mail programs
• If your e-mail address appears on a Web site, ask the site's manager to do some encoding
• Use e-mail service providers that filter spam
• Install an anti-spam program on your computer
Conclusion
• Anti-spam is a battle
  – “Every time we discover a feature to catch spam, spammers will find a work-around”
• Some advice
  – Filtering is just one part of the solution
  – Try to make the costs for spammers higher
  – Be nice to your e-mail address
  – Mail delivery has to be improved
References
• IRTF ASRG: http://asrg.sp.am/
• Sender ID: http://www.microsoft.com/mscorp/safety/technologies/senderid/technology.mspx
• DKIM: http://dkim.org/
• DomainKeys: http://antispam.yahoo.com/domainkeys
• Identified Internet Mail: http://www.identifiedmail.com/
• SPF Project: http://www.openspf.org/
• RFCs and Internet Drafts
References for Research
• MIT Spam Conference (2003-2006)
  – http://www.spamconference.org/
• Conference on Email and Anti-Spam (CEAS) (2004-2006)
  – http://www.ceas.cc/
• TREC (Text REtrieval Conference) Spam Track (2005-2006)
  – http://trec.nist.gov/data/spam.html