Transcript of "Machine Learning applied to Security", Steve Poulson, 25th Feb 2010.

Page 1:

Machine Learning applied to Security

Steve Poulson 25th Feb 2010

Page 2:

Security Threats – Drivers and Trends

[Chart: Complexity of Threat over time, 0–100 scale; source: ScanSafe Threat Center]

Time spent online is creating a larger user base to steal from

Value of information transmitted: online banking, personal data theft

Communication is fragmenting; alternative platforms are growing rapidly beyond email: webmail, IM, RSS, wikis, VoIP

New platforms are less mature and less well protected, so vulnerable to attack; the more popular the application, the more attention it draws

Hackers/cyber criminals are working faster; AV signatures are failing; threats are becoming more complex; attackers are looking for new vectors to exploit

“Email security is mature and successful, threats are migrating from inbox to browser”

Zero-Day Threats

Social Engineering

Email Viruses

Hybrid Worms

Spam

OS Vulnerabilities

Mobile Attacks

Identity Theft

Phishing

IM Threats

Web Viruses

DDoS Attacks

Spyware/Adware

Page 3:

Web Security: Trusted Sites Under Attack

Worldwide open web

Dynamic and dangerous internet

Over 127m active websites (Netcraft survey)

Graphics

Webmail

New Web Pages

Blogs

Ad Links

Links

Comments

Banner Ads

Backdoors

Rootkits

Trojan Horses

Keyloggers

Worms

Samsung site hijacked as malware host

Page 4:

Web Security: Risks of Unfiltered Content

Up to 40% of time spent online is non-business related (IDC)

Productivity Bandwidth congestion Legal liability threat

37% of users visited an X-rated web site from work (Gartner)

Web Filtering Blocks by Category (%)

The Facebook Effect

“32% of our customers now blocking social networking sites, up from 18% last year”

Page 5:

Effortless Management

Manage Granular Policies
– Directory and custom grouping
– Web usage quotas
– Schedules
– 50+ URL categories
– 60+ content types
– Custom block/allow lists
– Email and browser alerts

Generate Reports
– Summary
– Scheduled
– Forensic audit
– Blocked/allowed traffic

Page 6:

Ease of Management + Unrivalled SLAs

Ease and speed of deployment

Management portal

Reporting across multiple locations

Database management built in

Dedicated expert 24x7x365 support

Zero maintenance

Automated continuous updates

No patching

1. Availability

Time our service is available to scan traffic

99.999% guaranteed availability

2. Latency

Additional load time attributable to services

Evaluated by 3rd party analysis

3. False Positives

Pages that were blocked but should not have been

False positive rate < 0.0004%

4. False Negatives

Pages that were not blocked, but should have been

False negative rate < 0.0004%

The most comprehensive Service Level Agreements available for Web security

Page 7:

Proactive Security

[Chart: search results by category – Acceptable, Uncategorized, Prohibited, Malicious]

1 in 5 searches yields Malware or Inappropriate Content

Over 90% of new sites are visited as the result of an Internet search

Trojan-Download.Win32.IstBar.jl Case Study

Provides protection in the ‘zero-hour’: proactive threat detection

The most effective scanner sits at the heart of all web traffic and analyses the largest amount of web traffic, generating the most accurate heuristics in the fastest time

Outbreak Intelligence SearchAhead

Page 8:

Outbreak Intelligence

• Users are protected by several anti-virus engines at once

• However this is not sufficient due to the variety of exploits, and their ability to disguise themselves (polymorphism)

• Outbreak Intelligence harnesses machine-learning techniques and ScanSafe’s dataset to develop novel techniques to detect zero-hour attacks

• Uses advanced techniques such as code emulation

• However we must always meet our maximum false-positive rate of 1/250,000 – just 0.000004!

• Solutions must scale to millions of requests in real-time

Page 9:

Industrial Development Constraints

• Getting FP / FN right – customer expectation

• Deadlines :(

• Solutions must scale to 250 million requests per day (and growing)
– Involves lots of approximations
– Lookup tables in place of actual functions
– Fast data structures
– Constrains the choice of algorithms
• E.g. neural nets or naïve Bayes instead of SVMs
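As an illustrative sketch of the lookup-table approximation mentioned above (not ScanSafe's actual code; the table size and clamping range are invented), a sigmoid used in a scoring hot path could be precomputed once and looked up:

```python
import math

# Hypothetical sketch: replace repeated math.exp calls in a scoring hot path
# with a precomputed lookup table, trading a little accuracy for speed.
TABLE_SIZE = 1024
X_MIN, X_MAX = -8.0, 8.0
STEP = (X_MAX - X_MIN) / (TABLE_SIZE - 1)

# Precompute sigmoid values once at start-up.
_SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(X_MIN + i * STEP)))
                for i in range(TABLE_SIZE)]

def fast_sigmoid(x: float) -> float:
    """Approximate sigmoid via table lookup (clamped outside [X_MIN, X_MAX])."""
    if x <= X_MIN:
        return _SIGMOID_LUT[0]
    if x >= X_MAX:
        return _SIGMOID_LUT[-1]
    return _SIGMOID_LUT[int((x - X_MIN) / STEP)]
```

The accuracy loss is bounded by the table step, which is usually acceptable when the score only feeds a thresholded decision.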

Page 10:

Industrial Development Constraints

• The dataset is continually changing, as the nature of interests across the web and the vectors targeted by attackers constantly change
– E.g. the latest Quicktime vulnerability targets in-request headers, bypassing virus detectors entirely

• Hence a preference for online models which can be continually updated, rather than those which have to be trained in batch.

Page 11:

Dataset

• Scan approx. 250 million web-requests every day

• From 45 different countries

• All traffic is logged for several months

• We can also archive traffic as it travels through our servers
– Which means we can replay hacks several days after the event to investigate them

Page 12:

Techniques Employed

• Supervised Learning
– Support Vector Machines for classification and anomaly detection
– Some use of Neural Networks
– Various probabilistic models such as naïve Bayes variations

• Unsupervised Learning
– HMMs and more complex variations thereof
– Various clustering algorithms: MoG, KNNs
– Dimensionality reduction algorithms (KPCA)

• Other
– AdaBoost, mixtures of experts

• Disclaimer
– Not all are used in end products, and unfortunately we cannot say which techniques are used in which applications.

Page 13:

Applications of Machine Learning

• Inappropriate Web Content

• Drive-by attacks (the first step in an attack)
– Malicious JavaScript and other scripts
– Malicious Non-Executable Files

• Actual attacks
– Malicious Executable Files

• Phishing
– Use third-party databases
– Use models that generate a probability of a phishing attack based on URL, request and time

• Reputation
– Use the history of blocks for a URL, the probability of it being a phished URL, and other information, to derive a prior probability of it hosting malware, which governs the decision model generating actions from the results of other classifiers

Page 14:

Inappropriate Content

• Basically just document classification

• Want to stop bad sites by content – porn, hate, ...

• A good classifier is naïve Bayes – multivariate Bernoulli → multinomial → mixture models

• These have problems; in practice add IR techniques such as TF-IDF

• SVM approaches are better

• Also topic-based – LSA / LDA
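A minimal sketch of the multinomial naïve Bayes classifier mentioned above, working in log space with Laplace smoothing; the toy training data is invented for illustration:

```python
from collections import Counter
import math

def train(docs_by_class):
    """docs_by_class: {label: [list of token lists]} -> model."""
    priors, token_counts, totals, vocab = {}, {}, {}, set()
    n_docs = sum(len(d) for d in docs_by_class.values())
    for label, docs in docs_by_class.items():
        priors[label] = len(docs) / n_docs
        counts = Counter(tok for doc in docs for tok in doc)
        token_counts[label] = counts
        totals[label] = sum(counts.values())
        vocab.update(counts)
    return priors, token_counts, totals, vocab

def classify(model, tokens):
    priors, token_counts, totals, vocab = model
    best_label, best_score = None, -math.inf
    for label in priors:
        # Log-space with Laplace smoothing to avoid zero probabilities.
        score = math.log(priors[label])
        for tok in tokens:
            count = token_counts[label].get(tok, 0)
            score += math.log((count + 1) / (totals[label] + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train({
    "blocked": [["casino", "poker", "win"], ["poker", "jackpot"]],
    "allowed": [["quarterly", "report", "meeting"], ["project", "meeting"]],
})
print(classify(model, ["poker", "win"]))      # blocked
print(classify(model, ["meeting", "report"])) # allowed
```

TF-IDF weighting or an SVM would slot in at the same point: the token counts become weighted features instead of raw frequencies.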

Page 15:

Malicious JavaScript

• Normal document classification works on the presence of “words” in files

• It’s also possible to encapsulate other information in models
– E.g. naïve Bayes classifiers for email use pseudo-words like “sender-tld:info”, “sender-tld:com” and “address-known:false”, “address-known:true” to improve accuracy

• We use similar methods with JavaScript
– We extract words (though not all words)
– And other features of interest
– And feed these to a model

Page 16:

Malicious JavaScript

• Complications arise due to the extreme use of obfuscation techniques by attackers
– And also by legitimate vendors (e.g. Google)
– And by large Web 2.0 libraries

function v46f658f5e2260(v46f658f5e3226){
  function v46f658f5e4207(){return 16;}
  return(parseInt(v46f658f5e3226,v46f658f5e4207()));
}
function v46f658f5e61f4(v46f658f5e7174){
  function v46f658f5ea0cd(){return 2;}
  var v46f658f5e813e='';
  for(v46f658f5e9105=0; v46f658f5e9105<v46f658f5e7174.length; v46f658f5e9105+=v46f658f5ea0cd()){
    v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260(v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));
  }
  return v46f658f5e813e;
}
document.write(v46f658f5e61f4('3C5343524950543E77696E646F772E7374617475733D2'));

• The above is JavaScript, but where are the features?
– An exercise for the reader!

Page 17:

function startAudioFile(){
  try {
    var mmed = document.createElement("object");
    mmed.setAttribute("classid", "clsid:77829F14-D911-40FF-A2F0-D11DB8D6D0BC");
    var mms = '';
    for (var i = 0; i < 4120; i++) { mms += "A"; }
    setSpId(3);
    mmed.SetFormatLikeSample(mms);
  } catch(e) { }
};

Generating features (canonicalizing): tokenize and count frequencies to construct a |V|-dimensional vector

foo: 0, alert: 0, createElement: 1, document: 1, mmed: 1, var: 1, try: 1, startAudioFile: 1, function: 1
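The tokenize-and-count step can be sketched as follows; the vocabulary and snippet below are illustrative, producing the |V|-dimensional vector described above:

```python
import re

# Sketch: tokenize a script and count token occurrences against a fixed
# vocabulary, producing one dimension per vocabulary entry.
VOCAB = ["foo", "alert", "createElement", "document", "mmed",
         "var", "try", "startAudioFile", "function"]

def to_vector(source, vocab=VOCAB):
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    counts = {tok: 0 for tok in vocab}
    for tok in tokens:
        if tok in counts:
            counts[tok] += 1
    return [counts[tok] for tok in vocab]

snippet = 'function startAudioFile(){try{var mmed=document.createElement("object");}catch(e){}}'
print(to_vector(snippet))  # [0, 0, 1, 1, 1, 1, 1, 1, 1]
```

In practice the vocabulary would be learned from a corpus and extended with pseudo-word features, as described on the previous slide.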

Page 18:

Case Study: India Times Malware Cocktail

25 Oct 07: first malware detected for ScanSafe customers.

STAT team investigating OI blocks from certain pages on the India Times website, ranked 483rd by traffic.

Impacted pages contain a script pointing to a remote site containing more iframes, which point to two further sites.

One iframe points to an encrypted script which exploits multiple vulnerabilities.

Successful exploit results in massive download of malware and assorted other files - over 434 files.

Installed malware includes a cocktail of downloader/dropper Trojans, malicious binaries, scripts, cookies and other non-binaries.

STAT tested binaries through VirusTotal; overall detection among signature-based AV vendors was low.

India Times was notified immediately by STAT to prevent further infection; ScanSafe customers continued to be protected.

“This starts an automatic chain of exploit, and all of it was invisible to the user” – Mary Landesman, senior researcher

Page 19:

Malicious Non-Executable Files

• Almost no-one opens executables from odd sources any more

• So instead people use drive-by attacks
– They serve a normal file (JavaScript, JPEG, Quicktime movie, animated cursor)
– Which is crafted to exploit a vulnerability in a viewer (Internet Explorer, Quicktime, a system library that a viewer depends on)
– Which causes code embedded within the file to be executed
– Which then downloads
• The actual executable
• Or another program to download the main payload.

Page 20:

Malicious Non-Executable Files

• We’ve already covered JavaScript

• But there are a lot of file formats out there

• It’s not feasible for us to figure out the formats for all these files ourselves
– So we have to write an application that can learn a file format

• In the case of zero-day attacks, we have no data to compare against
– So we can’t just create and train a simple binary classifier

Page 21:

Malicious Non-Executable Files

• We’ll deal with the second element first

• If we can’t train a binary classifier
– We have to train a unary classifier

• Basically this is anomaly detection
– Already used in business to help detect fraud
– Typically define (sometimes implicitly) a probability distribution over all possible data
• And so generate a probability of a particular datum being “normal”
• Use some decision function based on this probability to decide whether or not to block
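A minimal sketch of that idea, assuming independent Gaussian features fitted to "normal" data and an invented threshold rule (real systems choose the threshold from a held-out FP budget):

```python
import math

# Sketch of unary (one-class) anomaly detection: fit an independent Gaussian
# per feature on "normal" data, then flag points whose log-likelihood falls
# below a threshold. The data and threshold are illustrative.
def fit(normal_points):
    dims, n = len(normal_points[0]), len(normal_points)
    means, variances = [], []
    for d in range(dims):
        col = [p[d] for p in normal_points]
        mu = sum(col) / n
        var = sum((x - mu) ** 2 for x in col) / n + 1e-6  # avoid zero variance
        means.append(mu)
        variances.append(var)
    return means, variances

def log_likelihood(model, point):
    means, variances = model
    ll = 0.0
    for x, mu, var in zip(point, means, variances):
        ll += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return ll

normal = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.2]]
model = fit(normal)
threshold = min(log_likelihood(model, p) for p in normal) - 1.0

def is_anomaly(point):
    return log_likelihood(model, point) < threshold

print(is_anomaly([1.0, 2.0]))   # False: looks like the training data
print(is_anomaly([9.0, -5.0]))  # True: improbable under the fitted model
```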

Page 22:

Malicious Non-Executable Files

• However we also have to automatically extract features from the file
– Could use kernel methods (1-class SVM, bounding hypersphere)
– But developing a kernel to capture the latent structure is not easy
– And may be expensive to execute

• Could use probabilistic methods
– HMMs are good for sequences
• But poor at capturing long-range correlations
– Algorithms exist for capturing grammars probabilistically
• But are difficult to implement
• And may also be expensive in terms of runtime.

• Another exercise for the reader!

Page 23:

Malicious Executable Files

• The final stage of an attack is downloading an executable

• Typically blocked using signatures
– Effectively quite advanced regular expressions

• Virus writers now release several variations of their virus over its life-time

• And release viruses that change themselves as they propagate

Page 24:

Malicious Executable Files

• This all makes signature-based approaches increasingly infeasible
– F-Secure now checks every file against 500,000 signatures
– McAfee now checks every file against 360,000 signatures

• The rate at which variants of viruses come out is growing rapidly
– The Storm worm launched separate Christmas and New Year versions of its attack within days of each other

• Vendors are struggling to develop techniques to detect variants using their existing technologies
– But continuing to add separate signatures for each new variant is not feasible

Page 25:

Malicious Executable Files

• We seek to investigate machine learning techniques to address this

• Several approaches have been used in the past
– Typically binary classifiers using existing virus samples
– Techniques include decision trees, self-organising maps, naïve Bayes classifiers, neural networks, SVMs and others
– Features are usually library includes, strings, or hex-sequences selected using information-theoretic techniques (e.g. information gain)
– Some break the executable into a graph, where nodes correspond to blocks of code (most of which are identical between variations), and perform analysis on the graphs to determine similarity.

Page 26:

Malicious Executable Files

Windows Portable Executable (PE) is a rich format; it starts with the magic number ‘MZ’, so it is easy to detect. This means we can quickly extract features without resorting to disassembly or flow-graph construction. Some notable features:

• 60% of recent malware is obfuscated. We determined that if an executable is obfuscated, there is a greater than 95% probability that it is malware.

• An executable consists of sections, such as header, text, code and so on. There are generally fewer sections in malicious files than in non-malicious ones. In our analysis, more than 70% of the malware samples consisted of two or three sections, while more than 70% of non-malicious files consisted of four or five sections.

• Another notable feature relates to peculiarities in the executable structure – for example, some sections in the executable may not be aligned properly. In our analysis, more than 78% of malware revealed an anomaly in the executable structure, while only 5% of non-malicious samples had an anomaly in their structure. If an anomaly exists, there is a more than 93% chance that the sample is malicious.

• As part of our investigations we also calculated statistics relating to the importing of DLL files. For example, if an executable imports system32.dll, then the sample has a more than 77% chance of being malware and if it imports kernel32.dll, then the sample has a more than 67% chance of being malware.
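The cheap structural checks above can be sketched without disassembly. This is an illustrative fragment, not the production feature extractor; it uses the documented PE layout (e_lfanew at offset 0x3C, NumberOfSections 6 bytes after the 'PE\0\0' signature):

```python
import struct

# Hypothetical sketch: pull two cheap structural features out of a PE file
# without disassembly: the 'MZ' magic number and the section count from the
# COFF header (NumberOfSections sits 6 bytes after the 'PE\0\0' signature).
def pe_features(data: bytes):
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None  # not a PE file
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < e_lfanew + 8 or data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return None
    (num_sections,) = struct.unpack_from("<H", data, e_lfanew + 6)
    return {"num_sections": num_sections,
            # Per the analysis above, few sections is weak evidence of malware.
            "few_sections": num_sections <= 3}

# Build a minimal fake header for illustration (not a valid executable).
hdr = bytearray(0x48)
hdr[:2] = b"MZ"
struct.pack_into("<I", hdr, 0x3C, 0x40)  # e_lfanew -> PE header at 0x40
hdr[0x40:0x44] = b"PE\0\0"
struct.pack_into("<H", hdr, 0x46, 3)     # NumberOfSections = 3
print(pe_features(bytes(hdr)))  # {'num_sections': 3, 'few_sections': True}
```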

Page 27:

Malicious Executable Files

As discussed, there are many classification algorithms at our disposal. Currently we are using the naïve Bayes classification algorithm, as it is both accurate and simple to implement. The simplified algorithm (assuming that there are only two classes: malware and non-malware) is given in Equation (1):

P(c|x) = P(x|c) P(c) / P(x), with P(x|c) = P(x1|c) · P(x2|c) · · · P(xn|c)   (1)

where x = [x1, x2, · · · , xn] is an array of selected features from an executable, P(c|x) is the a posteriori probability that the executable with feature set x is in class c, and P(x|c) is the probability of x occurring in class c.
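Equation (1) can be sketched directly for the two-class case; the per-feature likelihood tables and priors below are invented for illustration:

```python
import math

# Sketch of the two-class naive Bayes decision: the posterior P(c|x) is
# proportional to P(c) times the product of the per-feature P(x_i|c).
def posterior(likelihoods, priors, x):
    """likelihoods[c][i] maps feature value x[i] to P(x_i|c)."""
    scores = {}
    for c in priors:
        log_score = math.log(priors[c])
        for i, value in enumerate(x):
            log_score += math.log(likelihoods[c][i][value])
        scores[c] = log_score
    # Normalise so the posteriors sum to one (this is the P(x) divisor).
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {c: math.exp(s - norm) for c, s in scores.items()}

# Invented likelihoods: feature 0 = "is obfuscated", feature 1 = "few sections".
likelihoods = {
    "malware":     [{True: 0.95, False: 0.05}, {True: 0.70, False: 0.30}],
    "non-malware": [{True: 0.05, False: 0.95}, {True: 0.30, False: 0.70}],
}
priors = {"malware": 0.5, "non-malware": 0.5}
post = posterior(likelihoods, priors, [True, True])
print(round(post["malware"], 3))  # 0.978
```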

Page 28:

Malicious Executable Files

We used one group of non-malware and 28 released malware groups that had been detected by our analysis team in recent months. Each group contained around 150 to 300 samples. We plotted the results of our experiment; a smooth, dashed curve shows the recognition rate.

We are consistently getting more than 90% accuracy in detection of malware. The FPR of our system is around 10%, and we are trying to reduce this by extracting new features and by developing a new feature selection algorithm.

Page 29:

Control flow Graph

Can be matched by a graph edit distance and nearest neighbour classifier – slow :(

Page 30:

Malicious Executable Files

Much like early attempts to classify email using naïve Bayes
– Which concentrated only on text
– Until someone thought to use the entire context of the email, such as when it was sent, from whom, the domain and TLD of the email address, etc.
– Which brings us to...

Page 31:

Website Reputation Classifier

• Gather information from context
– Time of request
– TLD, domain of server
– Type of URL (IP address, domain name)
– Geographic location of server
– Details of request (drive-bys may not simulate a browser)
– Details of response (server may be misconfigured)
– And any other information

• And use it to alter the prior probability of malware from the default 0.5

• Which may help control the FP rate.
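A sketch of how a contextual prior changes the decision, using Bayes' rule in odds form; the reputation priors and content likelihood ratio below are invented:

```python
# Sketch: use contextual reputation to move the malware prior away from the
# default 0.5 before combining it with a content classifier's output.
def combined_posterior(prior_malware, likelihood_ratio):
    """likelihood_ratio = P(content | malware) / P(content | clean)."""
    odds = (prior_malware / (1 - prior_malware)) * likelihood_ratio
    return odds / (1 + odds)

# The same weak content signal (likelihood ratio 3) on two different sites:
trusted = combined_posterior(prior_malware=0.01, likelihood_ratio=3.0)
shady = combined_posterior(prior_malware=0.60, likelihood_ratio=3.0)
print(round(trusted, 3), round(shady, 3))  # 0.029 0.818
```

The same borderline content score stays below a block threshold on a well-reputed host but crosses it on a poorly-reputed one, which is exactly how the prior helps control the FP rate.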

Page 32:

Any Questions?

Page 33:

Problem Overview

• Attackers no longer rely on users launching executables

• Rely on drive-by download techniques to launch an executable without user involvement

• Examples include
– JavaScript exploiting browser vulnerabilities to launch remote executables
– Website content (ANIs, WMFs, etc.) exploiting browser and/or operating system vulnerabilities to launch remote executables

Page 34:

Problem Overview

• Things to look out for
– Buffer overflows: extraordinarily long field values
– Integer overflows: value encoded in 4 bytes is very large
• Hard to spot!
• But could be found by the absence of leading zeros in e.g. 4-byte length fields
– Exploit code
• May not resemble expected data
• However raw data in some formats (JPEG, MP3) may be relatively indistinguishable from machine-code.
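The leading-zeros heuristic for 4-byte length fields can be sketched as follows (big-endian is assumed here purely for illustration; real formats vary):

```python
import struct

# Sketch of the leading-zero heuristic for integer overflows: a legitimate
# 4-byte length field is usually small, so its high-order bytes are zero.
def suspicious_length_field(field: bytes) -> bool:
    # High-order byte non-zero means the value is >= 2**24 (16,777,216),
    # implausibly large for most length fields.
    return len(field) == 4 and field[0] != 0

print(suspicious_length_field(struct.pack(">I", 4096)))        # False
print(suspicious_length_field(struct.pack(">I", 0xDEADBEEF)))  # True
```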

Page 35:

Problem Specification

• Examine first 300 or so bytes of file

• Detect if it’s normal
– If not normal, it’s an exploit

• System should infer file structure itself to determine normalcy
– It is unfeasible for us to manually break down every file format into individually interesting features

Page 36:

Anomaly Detection

• Techniques used in machine learning and statistics to detect “outliers”: data-points (such as file content) which aren’t probable (normal)

• Two broad approaches
– Non-Probabilistic Discriminative Classifiers
• Learn a function that spits out positive or negative depending on some version of the data
– Probabilistic Generative Classifiers
• Find a way of estimating the actual probability of the file being what it appears to be, and use that to make a decision

Page 37:

Anomaly Detection :: Techniques

• Non-Probabilistic Classifiers
– One-Class Support-Vector Machine (SVM) using Sequence Kernels [Trialed, not implemented]

• Probability Density Estimation (PDE)
– Hidden Markov Model (HMM) [Implemented]
– Hierarchical Hidden Markov Model (HHMM) [Not implemented, not planned]
– Factorial Hidden Markov Model (FHMM) [Not implemented, but an avenue for future work]

Page 38:

Anomaly Detection :: Classifiers

• One-Class Support-Vector Machine (SVM)
– If a binary (2-class) classifier draws a line between two classes
– A unary classifier draws a circle around the data – everything outside the circle is weird.

• SVMs try to find the best place to place the line, and can work around errors in the dataset

• They store the line in terms of the inputs it crosses (the “support-vectors”)

• They minimise the number of support-vectors they have to store to represent this line.

• SVMs use “kernels” to find a way of representing the data such that it’s easy to figure out where to place the line

Page 39:

Anomaly Detection :: Classifiers

• Kernels can also be used to convert symbolic data (such as strings and sequences) into a tractable numeric form.

• Kernels can also be chained together to help figure out “where to put the line”

ÿØÿà..JFIF.....H.H..ÿá.§Exif..MM.*........

[10, 23, 34, 0, 0, 0, 23, 0, 23, …, 0, 0, 2, 1]

Page 40:

Anomaly Detection :: Classifiers

• In testing, performance (using a string kernel) was quite poor
– Needed to store a large number of support vectors to remember where the line was
– Only detected buffer overflows, not integer overflows
– Couldn’t be re-trained on the go

• Arguably all these problems could be solved by a more complex kernel function
– But that would increase run-time

Page 41:

Anomaly Detection: PDE

• Probability Density Estimation

• Return a probability for each file-header indicating how typical it is

• The approach implemented is a simple Hidden Markov Model (HMM), using various heuristics to help it fit the file types.

• What is an HMM?

Page 42:

Anomaly Detection :: HMMs

• Implementation Issues:
– How to jointly determine the probabilities of
• Certain characters appearing in each stage
• Moving from one stage to another, for all stages
• The answer is the Expectation Maximisation (EM) algorithm
– How to figure out the structure of the model in advance
• The “structural learning” problem is a major open problem in machine learning
• In our case we use heuristics based on the reg-exp idea
– Variable-length sequences
• Multiply the result by a constant multiple of the probability of the file size (normally distributed)
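The pieces above, a forward-algorithm score from a discrete HMM plus a Gaussian term for the file length, can be sketched with toy parameters (every probability here is invented; a real model would be fitted with EM):

```python
import math

# Sketch: score a file header with a discrete HMM's forward probability,
# then add (in log space) a Gaussian log-probability for the file length.
def forward_log_prob(obs, start, trans, emit):
    """Log P(obs) under a discrete HMM via the forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emit[s].get(obs[0], 1e-9) for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states))
            * emit[s].get(symbol, 1e-9)
            for s in range(n_states)
        ]
    return math.log(sum(alpha))

def length_log_prob(length, mean, std):
    return -0.5 * (math.log(2 * math.pi * std ** 2)
                   + ((length - mean) / std) ** 2)

# Two-state toy model: state 0 favours "magic" bytes, state 1 favours padding.
start = [0.9, 0.1]
trans = [[0.5, 0.5], [0.1, 0.9]]
emit = [{"M": 0.5, "Z": 0.5}, {"\x00": 0.9, "A": 0.1}]

def normality_score(header, length):
    return forward_log_prob(list(header), start, trans, emit) \
           + length_log_prob(length, mean=50_000, std=20_000)

# A plausible header with a typical size outscores an odd one at an odd size.
print(normality_score("MZ\x00\x00", 48_000) > normality_score("AAAA", 5_000_000))
```

For longer observation sequences the forward recursion would be run in log space (or with scaling) to avoid underflow; the four-byte toy example above does not need it.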