Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal...

30
Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists

Transcript of Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal...

Page 1: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Andrew G. West and Insup LeeCEAS `11 – September 1, 2011

Towards the Effective Spatio-Temporal Mining of Spam Blacklists

Page 2: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Big Idea / Outline

2

BIG IDEA: Identify IP addresses that have temporally correlated spam behavior; harness this info. predictively

•Related work; motivations•Blacklists as ground truth

• Data collection• Measurement study

•Temporal association mining• Technique• Parameterization• Negative results; discussion

Page 3: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Usage Example

3

IPx

IPy

Blacklist history (time)

20 min. 20 min. tnow

What to do “now”?•Assume IPy will be blacklisted

•Start blocking; decrease listing latency

Page 4: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Motivations

4

Mail

IP

AS

BLOCK

IP

AS-REP

BLK-REP

IP-REP

SpatialFunctions

Plot into3-D SpaceClassify (SVM)

SPAM or HAM

REPALG

Time

History

Recent research leveraging group behaviors [1—5]:•Overcome “cold-start”•Grouping functions: subnets, rDNS hosts, AS, etc.

Botnets a driving force•Non-contiguous in IP space•“Campaigns” should give rise to temporal correlations•Can we calculate grouping function; use for reputation?

Page 5: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Related Work

“How to determine botnet membership?”• Parsing P2P communication graphs

– Issues: Unproven, reqs. expansive view (BotGrep [6])– Blacklists have inherent global view

• Similarity algs. over email bodies/URLs– Issues: Privacy, complexity (Botnet Judo [7])– Mining uses only IP addresses in computation

• Law enforcement infiltrations– Data only useful in ex post facto fashion

5

Page 6: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

BLACKLIST MEASUREMENT STUDY

6

Page 7: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Blacklists

• Why blacklists?– Global compilation; aggregate; low false-positives– We have tons of data

• Spamhaus blacklists [8]1.PBL (Policy Block List) – Dynamic IP ranges2.SBL (Spamhaus Block List) – Static ranges

belonging to spam gangs3.XBL (Exploits Block List) – IPs spamming due to

malware, Trojans (i.e., botnet nodes)

7

Page 8: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Blacklist Ops

8

IPx Blacklist history (time)

listednot- listedlisted

de-listing listing duration (d)listing re-listing

Page 9: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Blacklist Size

• Why?: Desirable to show that blacklists are a reasonable proxy for the spam problem

9

#1 #2• #1: Spike typical of

holiday seasons

• #2: Shutdowns of Spamit.com affiliate and Rustock

• Small spikes: Evidence of campaigns

Page 10: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Listing Duration (d)

10

• Why?: Re-listings (basis for patterns/ correlations) limited by de-listing speed

• Almost universal d=7.5 days

• Speaks to static TTL delisting policy

• Must only correlate listings, not overlapping durations

Page 11: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

DHCP Issues

11

PBLXBL

“possiblydynamic”11%

79%10%

“knowndynamic”

≈18.4% of all IP space is on the PBL

• Why?: Dynamic IPs may not be able to accumulate enough history for mining, or produce stale predictions

• A large percentage (80%+) of IPs are dynamic

• More important, is how dynamic they are [9]

• This fact supports narrow learning windows

Page 12: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Relisting Quantity

12

• Why?: Central issue: do some IPs have histories extensive enough to be mined?

#1 #2 • #1: 50% of IPs have only 1 listing. Discard. Trim problem space.

• #2: 20% of IPs have 5+ listings, yet these account for 66% of all listings (non-trivial).

Page 13: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Relisting Rates

13

• Why?: Dynamism supports tight learning, thus we want all re-listings well clustered temporally.

• Media re-listing time is 18 days

• Far from a uniform distribution

• Also speaks to infection lifetimes

Page 14: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

TEMPORAL ASSOCIATION MINING

14

Page 15: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Association Rules

• Developed for “market basket” data– “Beer and diapers” example– Apriori and FP-Growth algs.

• Example rule– {DIAPERS} →{BEER}– Interest measures [10]:

– lift(DIAPERS → BEER) = (3/5) / (4/5) * (4/5) = 0.94– Ratio of actual support, to expected rand. support

15

ID BEER MILK DIAPERS

1 Y N Y

2 N Y N

3 Y Y Y

4 Y N Y

5 N N Y

Page 16: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Correlations

16

• Previous: discrete, unordered, and transactional data• Spam data defies these– Continuously distributed– Bi-directionally ordered

• Define “correlation radius” (r) to make binary associations

• Symmetric but non-associative

• Radius enables probabilistic lift and support equivalents

Page 17: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Best Pairs

17

For every IP address, produce a finite “best pairs” list for persistent storage, where ordering determined by “lift”

Page 18: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Implementation

• 232 × 232 = Scalability issues• Prune search space with “minimum support”

– M=3 produces a 54.3 trillion entry matrix

– But 98% sparse

• Multi-threaded runtime= 3 days; we used EC2

18

Page 19: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Free Variables

• Correlation radius (r)– Try to capture campaigns with minimal noise– r = 2 hours (4 hour diameter)

• Training window length (length(h’))– Narrow: Infection lifetimes [11], DHCP addresses– Broad: Need for re-listings, bot-to-campaign map– length(h’) = 3 months

• Minimum support (m)– Derived based on scalability needs (m=3)

19

Page 20: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

RESULTS AND DISCUSSION

1. “Best pairs” significance2. Botnet membership capture

3. Blacklist prediction

20

Page 21: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Rule Significance

21

• Intuition: Lift matrix should have values higher than random chance would suggest

#1

#2

• #1: Flip expected; About 0.6% of all pairs correlate more than random

• #2: Even at lift=120, 36% chance the correlation is rand. AGGREGATE

Page 22: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Botnet Membership

22

• Intuition: Given a set of botnet IPs, shared member/ pair lifts should exceed member/non-member pairs

• Actual dumps: Kraken + Cutwail

• 70-80% of IPs are XBL listed, 40% at min. support

• 6.0% of shared have non-zero lift, compared to 2.8%

Page 23: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Blacklist Prediction (1)

23

Page 24: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Blacklist Prediction (2)• Prediction criteria

– No ballot stuffing; can’t re-guess

• Experiment with different thresholds• Same story: Outperforming random, but too

minimal to be of any consequence

24

Page 25: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Discussion (1)

• Scalability issues × minor performance increments don’t warrant production

• Focus on acute areas of improvement:

1. DHCP research– 90%+ of IPs at min. support are dynamic, how?– Need reliable IP classification; churn rates

2. Refining windows/correlations– Non-binary correlations. Gaussian weights.– Time-decay of events in training windows

25

Page 26: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

Discussion (2)3. Appropriateness of blacklist data

– Desirable conciseness (500 million listings = 12GB)– Blacklists inherently latent. Their aggregate, opaque,

and binary triggers may blur campaign-level data. – Install on an email server? Collect other metadata

• Takeaway; Utility in negative result– Measurement study builds on prior research– Our model serves as foundation for future efforts– Lessons learned about botnet dynamics– Identified poorly understood dynamism areas

26

Page 27: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

References

27

[1] A. Ramachandran and N. Feamster. Understanding the network-level behavior of spammers. In SIGCOMM, 2006.

[2] F. Li and M.-H. Hsieh. An empirical study of clustering behavior of spammers and group-based anti-spam strategies. In CEAS, 2006.

[3] S. Hao, et al. Detecting spammers with SNARE: Spatio- temporal network-level automated reputation engine. In USENIX Security, 2009.

[4] Z. Qian, et al. On network-level clusters for spam detection. In NDSS, 2010[5] A. G. West, et al. Spam mitigation using spatio-temporal reputations from

blacklist history. In ACSAC, 2010.[6] S. Nagaraja, et al. BotGrep: Finding P2P bots with structured graph

analysis. In USENIX Security, 2010.[7] A. Pitsillidis, et al. Botnet judo: Fighting spam with itself. In NDSS, 2010.[8] Spamhaus Project. http://www.spamhaus.org/[9] Y. Xie, et al. How dynamic are IP addresses? In SIGCOMM, 2007.[10] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A

survey. ACM Comp. Surveys, 38(9), 2006.[11] J. E. Dunn. Botnet PCs stay infected for years. Tech World, 2009.

Page 28: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

28

Backup Slides (1)

Page 29: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

29

Backup Slides (2)

Above: Lift distributions as a consequence of altering minimum support.

Above: Lift distributions as a consequence of altering correlation radius and minimum support

Page 30: Andrew G. West and Insup Lee CEAS `11 – September 1, 2011 Towards the Effective Spatio- Temporal Mining of Spam Blacklists.

30

Backup Slides (3)