![Page 1: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/1.jpg)
Real Time classification of malicious URLs
Daniyar Mukhanov, Chandan Gowda
![Page 2: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/2.jpg)
Introduction
- Malicious software in Online Social Networks (OSNs)
- Malicious web sites are a top 3 threat to enterprise security
- Koobface virus: the name is an anagram of the word “Facebook”
![Page 3: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/3.jpg)
Koobface
![Page 4: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/4.jpg)
Cyber criminals can piggyback on events to share malicious URLs
![Page 5: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/5.jpg)
Aim of paper
Develop a real-time machine classification system to distinguish between malicious
and benign URLs within seconds of the URL being clicked
Train several machine classification models on data collected during two large
sporting events:
- Super Bowl
- Cricket World Cup
![Page 6: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/6.jpg)
Related Work
- Malware propagation and Social networks
- Classifying malicious web pages
![Page 7: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/7.jpg)
Malware propagation and Social networks
- Low degree of connections is not an obstacle
- Highly clustered networks slow propagation
- Large-scale events are ideal for spreading malware
![Page 8: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/8.jpg)
Classifying malicious webpages
Static analysis of scripts embedded within a Web page
Static code analysis to detect evasive malware
Honeypots to interact with malicious content and anti-virus to analyse the
malicious content
Static code vs. run-time analysis
![Page 9: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/9.jpg)
Data collection
American Super Bowl: training data
Cricket World Cup: test data
- #superbowlXLIX - 122,542 URL-containing tweets
- #CWC15 - 7,961 URL-containing tweets
![Page 10: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/10.jpg)
Identifying malicious URLs
- Client-side honeypot system
- Low interaction honeypots and high interaction honeypots
- The Capture HPC toolkit
- Each visit lasts 5 minutes
![Page 11: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/11.jpg)
Architecture for suspicious URL annotation
- Capture HPC operates in a VM
- Users can specify their own omission or inclusion rules
![Page 12: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/12.jpg)
Sampling and Feature Identification
• Data was collected from Twitter with the help of Tweepy.
• Data from one event is used to train a classifier, and data from another event is
used to test the model’s generalizability.
• The Super Bowl training data contained 1,000 malicious and 1,000 benign URLs.
• The Cricket World Cup testing data contained 891 malicious and 1,100
benign URLs.
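The balanced training sample described above can be sketched in a few lines. This is a minimal illustration, assuming the URLs have already been labelled by the honeypot; the function name `balanced_sample`, the label strings and the `example.test` URLs are hypothetical and do not come from the slides.

```python
import random

def balanced_sample(labelled_urls, per_class, seed=0):
    """Return `per_class` malicious and `per_class` benign URLs, shuffled."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    malicious = [u for u, label in labelled_urls if label == "malicious"]
    benign = [u for u, label in labelled_urls if label == "benign"]
    if len(malicious) < per_class or len(benign) < per_class:
        raise ValueError("not enough URLs in one of the classes")
    sample = [(u, "malicious") for u in rng.sample(malicious, per_class)]
    sample += [(u, "benign") for u in rng.sample(benign, per_class)]
    rng.shuffle(sample)
    return sample

# Hypothetical labelled URLs standing in for honeypot-annotated event data.
superbowl = [(f"http://example.test/m{i}", "malicious") for i in range(5)]
superbowl += [(f"http://example.test/b{i}", "benign") for i in range(20)]
train = balanced_sample(superbowl, per_class=3)
```

With the real data, `per_class` would be 1000 for the Super Bowl training set; the test set from the other event is kept entirely separate so that it measures generalizability.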
![Page 13: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/13.jpg)
Sampling and Feature Identification
- 80% of URLs from Cricket World Cup found to be malicious
Metrics:
- CPU
- Connection established
- Port Number
- Process ID
- Remote IP
- Network Interface
- Bytes sent/received
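One way to picture a single machine-activity observation is as a record holding the metrics listed above. The sketch below is an assumption about how such a record might look; the field names, the types, and the choice of which fields to treat as numeric are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class ActivitySnapshot:
    """One observation of machine activity during a URL visit (hypothetical layout)."""
    seconds_elapsed: int          # time since the URL was opened
    cpu_percent: float
    connection_established: bool
    port_number: int
    process_id: int
    remote_ip: str
    network_interface: str
    bytes_sent: int
    bytes_received: int

def numeric_features(snapshot):
    """Pick the fields that can be fed to a classifier as numbers (an assumption)."""
    return [
        snapshot.cpu_percent,
        float(snapshot.connection_established),
        float(snapshot.port_number),
        float(snapshot.bytes_sent),
        float(snapshot.bytes_received),
    ]

# Illustrative values only.
snap = ActivitySnapshot(30, 12.5, True, 443, 4242, "203.0.113.7", "eth0", 2048, 65536)
vector = numeric_features(snap)
```

Identifiers such as the process ID and remote IP are left out of the numeric vector here because they are categorical; how they are actually encoded is not stated in the slides.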
![Page 14: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/14.jpg)
Baseline Model Selection
Data modelling activity is intended for:
• Extracting features from machine activity that would help predict malicious behaviour during
an interaction with a URL
• Connecting the dots between machine activity and malicious behaviour
• Generative vs. discriminative models
• Data acquired can include logs of machine activity even during idle system state.
• Hence it is likely there is noise as well as malicious behaviour recorded in those logs.
![Page 15: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/15.jpg)
Statistics for Trained and Test Datasets
● High variance in mean recorded values
for CPU usage, bytes/packets
sent/received and ports used.
● But Standard Deviation is very similar for
both the data sets.
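The comparison above (different means, similar standard deviations) can be reproduced with the standard library. The numbers below are invented purely to show the pattern and are not the values from the slide; `summarise` and the feature names are hypothetical.

```python
from statistics import mean, pstdev

def summarise(rows, feature_names):
    """Return {feature: (mean, population standard deviation)} for a dataset."""
    summary = {}
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        summary[name] = (mean(column), pstdev(column))
    return summary

features = ["cpu_percent", "bytes_received"]
# Illustrative rows: the second dataset is shifted, so the means differ
# while the spread around each mean stays the same.
train_rows = [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0]]
test_rows = [[110.0, 1100.0], [120.0, 1200.0], [130.0, 1300.0]]
train_stats = summarise(train_rows, features)
test_stats = summarise(test_rows, features)
```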
![Page 16: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/16.jpg)
Baseline Model Selection
• The datasets contained a well-balanced number of malicious and benign activity logs, but
the activity recorded within the logs is largely benign.
• This could have an impact on the effectiveness of a discriminative classifier.
• Decision boundaries are hard to identify where the inputs may not be linearly separable.
• So in this case, a generative model may be better suited.
![Page 17: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/17.jpg)
Choosing classifiers
Generative Models
1. Bayesian Classifier
2. Naïve Bayesian Classifier
Discriminative Models
1. J48 Decision Tree
2. Multi-Layer Perceptron (MLP)
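To make the generative side of this list concrete, here is a minimal Gaussian Naive Bayes classifier written from scratch. It is a sketch under stated assumptions: the Gaussian likelihood, the variance smoothing constant and the toy feature values (CPU percentage, bytes received) are illustrative choices, not the toolkit or configuration the authors used.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            # Log of the Gaussian density, treating features as independent.
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Toy training data: [cpu_percent, bytes_received], invented for illustration.
rows = [[5.0, 100.0], [6.0, 120.0], [60.0, 5000.0], [70.0, 5200.0]]
labels = ["benign", "benign", "malicious", "malicious"]
model = fit_gaussian_nb(rows, labels)
high_activity = predict(model, [65.0, 5100.0])
low_activity = predict(model, [4.0, 90.0])
```

The "naive" part is the independence assumption inside `log_posterior`: each feature contributes its own term. A full Bayesian classifier would instead model dependencies between the variables.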
![Page 18: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/18.jpg)
Baseline Model Results- Generative Models
The low error rates at t=60 in the Bayesian model during the training phase suggest:
1. The features that we’re using to build the models are predictive of malicious activities.
2. Malicious activities are occurring within the first 60 seconds of interaction.
3. There are conditional dependencies between variables.
![Page 19: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/19.jpg)
Baseline Model Results- Discriminative Models
• The MLP has a precision of 0.720 at t=30, only slightly below its optimum level. This demonstrates the model’s ability to
reduce false positives early on.
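Precision is the share of URLs flagged as malicious that really are malicious, which is why a high value means few false positives. A minimal sketch follows; the counts are illustrative and were chosen only to reproduce a precision of 0.72, they are not the confusion-matrix values from the study.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was flagged."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        return 0.0
    return true_positives / predicted_positive

# Illustrative counts only: 72 correct alarms and 28 false alarms.
example = precision(true_positives=72, false_positives=28)
```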
![Page 20: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/20.jpg)
Classifier Performance over time
● This chart depicts correctly classified instances incrementally over a period of time.
● Discriminative models outperform generative
models.
● This suggests that certain malicious activities
are linearly separable from benign behaviour.
● The Naive Bayesian model fails to perform
well.
● The MLP model outperformed the rest of the
classifiers.
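Measuring correctly classified instances over time amounts to re-scoring each visit using only the activity observed up to a given checkpoint. The sketch below assumes each visit is a list of `(seconds, feature_vector)` pairs with a known label; `accuracy_over_time`, the threshold rule and the data are all hypothetical stand-ins for a trained model and real logs.

```python
def accuracy_over_time(visits, classify, checkpoints):
    """Return {t: fraction of visits classified correctly using data up to t seconds}."""
    results = {}
    for t in checkpoints:
        correct = 0
        for snapshots, label in visits:
            # Only the activity observed so far is visible to the classifier.
            seen = [features for seconds, features in snapshots if seconds <= t]
            if classify(seen) == label:
                correct += 1
        results[t] = correct / len(visits)
    return results

def threshold_rule(seen):
    """Toy classifier: flag a visit once any snapshot exceeds 1000 bytes received."""
    return "malicious" if any(f[0] > 1000 for f in seen) else "benign"

# Illustrative visits: the malicious one only shows a burst after 30 seconds.
visits = [
    ([(10, [100]), (50, [5000])], "malicious"),
    ([(10, [100]), (50, [200])], "benign"),
]
scores = accuracy_over_time(visits, threshold_rule, checkpoints=[30, 60])
```

In this toy run the malicious visit is missed at t=30 and caught at t=60, which mirrors the general trade-off between early warning and accuracy.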
![Page 21: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/21.jpg)
Model Analysis
● The MLP produced 9 hidden nodes; the table
shows the weights given to each
class (Benign/Malicious).
● Here node 9 stands out with a higher weight
for malicious behaviour.
NODE WEIGHTS BY CLASS
![Page 22: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/22.jpg)
Model Analysis
● Node 9 holds the highest value for the bytes received
variable.
● Compare it with Node 3 for Bytes sent/received
and Packets sent/received.
● This is an interesting finding, as we know Node 9
was associated with malicious links.
● The most important discovery is in the connection
attribute, which is weighted highly for Node 1.
● Remote IP and Bytes Sent also rise sharply,
which is suggestive of an attack.
MLP ANALYSIS
![Page 23: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/23.jpg)
Sampled learning
Correctly classified instances with sampled
training data
![Page 24: Real time classification of malicious urls.pptx 2](https://reader031.fdocuments.in/reader031/viewer/2022030402/588499d91a28ab26058b603d/html5/thumbnails/24.jpg)
Conclusion
- The endpoint of a URL is not clear from the tweet
- The MLP model performed best on unseen data (72%)
- The Bayesian approach performed best in the early stages of interaction (66%)
- Twitter recently introduced new policies to protect users from harm.