Jim Geovedi - Machine Learning for Cybersecurity
Transcript of Jim Geovedi - Machine Learning for Cybersecurity
GVDJ #IDSECCONF2016
machine learning for cybersecurity
USA SOUTH KOREA
NORTH KOREA INDONESIA
security goals
▸ security goals
▸ confidentiality of information and resources
▸ integrity of information and resources
▸ availability of information and resources
▸ basic definitions
▸ threat: potential violation of a security goal
▸ security: protection from intentional threats
▸ attack: intentional violation of a security goal
security mechanisms
▸ security policies and mechanisms
▸ policy: statement of what is and what is not allowed
▸ mechanism: method or tool enforcing a security policy
▸ security is a process, not a product!
▸ strategies for security mechanisms
▸ prevention of attacks, e.g. encryption
▸ detection of attacks, e.g. virus scanner
▸ analysis of attacks, e.g. forensics
prevention is a hard task
▸ continuous discovery of vulnerabilities
▸ insecure software and hardware
▸ developers' lack of awareness
goto fail; goto fail;
goto fail (february 2014)
heartbleed (april 2014)
shellshock (september 2014)
attacks against services
▸ numerous security breaches at popular web services
▸ identities often include real names, addresses, emails, passwords, etc.
';--have i been pwned?
142 pwned websites
1,444,567,928 pwned accounts
39,842 pastes
31,108,929 paste accounts
imbalance of security cycle
▸ increasing imbalance of security cycle
▸ increasing number of vulnerabilities
▸ high amount of novel attacks
▸ high diversity of malicious software
▸ bottleneck: human analyst in the loop
▸ manual discovery of vulnerabilities
▸ manual generation of attack signatures
▸ manual analysis of malicious software
conventional detection
▸ conventional attack detection using signatures
▸ ineffective against novel and unknown attacks
▸ inherent delay until novel signatures become available
▸ analysis obstructed by polymorphism and obfuscation
[figure: packet layout with HEADER (... IP TCP) and APPLICATION PAYLOAD (GET /scripts/ ..%c1%9c.. /system32/cmd.exe); the NIMDA WORM signature matches TCP and ..%c1%9c..]
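As a minimal sketch of the signature approach and its blind spot, the snippet below matches fixed byte patterns against a request payload. The signature name, the `match_signatures` helper, and the variant request are illustrative assumptions; the first request is the one shown on the slide with the highlight spacing removed.

```python
# Hypothetical signature database: signature name -> byte pattern in the payload.
SIGNATURES = {
    "nimda-unicode-traversal": b"..%c1%9c..",
}

def match_signatures(payload: bytes) -> list[str]:
    """Return the names of all signatures whose pattern occurs in the payload."""
    lowered = payload.lower()
    return [name for name, pattern in SIGNATURES.items() if pattern in lowered]

# Request from the slide (spacing removed, which is an assumption).
known = b"GET /scripts/..%c1%9c../system32/cmd.exe"
# Illustrative variant using a different overlong encoding of the same traversal.
variant = b"GET /scripts/..%c0%af../system32/cmd.exe"

print(match_signatures(known))
print(match_signatures(variant))
```

The known request matches, while the re-encoded variant produces no match until someone writes a new signature for it, which is the delay the slide refers to.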
intelligent defence
▸ construction of intelligent security systems
▸ combining computer security and machine learning
▸ minimal human intervention in prevention, detection, and analysis
▸ challenge in practice
▸ effectiveness, efficiency, and robustness
▸ transparency and controllability
machine learning for cybersecurity
MACHINE LEARNING
PREDICTION PLATFORM
HUMAN INTUITION
attack mitigation issues
▸ supervised
▸ rules driven (limited by experience and expertise)
▸ high rate of undetected attacks (false negatives)
▸ delayed response (between detection and prevention)
▸ unsupervised
▸ statistics driven (improved detection of new attacks)
▸ substantial investigative effort (false positives)
▸ alarm fatigue and distrust (reversion to supervised methods)
implementation challenges
▸ lack of data: limited or no history of previous attacks (required by supervised learning models)
▸ evolving attacks: attackers constantly change their behaviour, making current models obsolete
▸ limited resources: relying on security analysts to investigate attacks can be costly and time-consuming
components
[diagram: THREAT PREDICTION PLATFORM, with labels RAW DATA, EVENTS MODELLING, CONTEXTUAL MODELLING, FEATURE, MODEL, PREDICTION, ANALYSTS, ACTION]
components
▸ big data processing system: quantifying features from raw data
▸ outlier detection system: learning a descriptive model from those features through unsupervised learning
▸ feedback mechanism and continuous learning: incorporating analyst input
data characteristics
data characteristics 0.1 data sources
▸ common sources: network device and application logs
▸ router, switch, firewall, ids, ips, and load balancer devices
▸ web, database, and micro services
▸ frontend and backend applications
▸ delivered in realtime from widely distributed systems
data characteristics 0.2 data dimensions and unique entities
▸ volume of raw data: measured in size (GB/TB) or number of lines (tens of millions or more per day)
▸ unique entities specific to behavioural analytics: IP addresses, users, sessions, etc.
data characteristics 0.3 malicious activity prevalence
▸ under normal circumstances, malicious activities are extremely rare (generally ≤ 0.1%)
▸ resulting in extreme class imbalance for supervised learning
▸ increasing the difficulty of detection processes
▸ unknown and/or unreported attacks introduce noise into data
▸ attack vectors can take a wide variety of shapes
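A short worked calculation shows why this rarity makes detection hard. The sketch below applies Bayes' rule to the 0.1% prevalence from the slide; the detector's 99% detection rate and 1% false positive rate are hypothetical figures chosen only for illustration.

```python
def alert_precision(prevalence: float,
                    true_positive_rate: float,
                    false_positive_rate: float) -> float:
    """Fraction of raised alerts that are truly malicious (Bayes' rule)."""
    true_alerts = prevalence * true_positive_rate
    false_alerts = (1.0 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)

# Assumed figures: 0.1% prevalence (the slide's upper bound) and a hypothetical
# detector that catches 99% of attacks while flagging 1% of benign events.
precision = alert_precision(0.001, 0.99, 0.01)
print(f"{precision:.3f}")
```

Under these assumptions only about 9% of alerts point at real attacks, so roughly ten out of eleven investigations end as false positives.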
big data analytics
[figure: RAW DATA is turned into AGGREGATED DATA on a DAILY, WEEKLY, and MONTHLY basis; example FEATURES row for entity JIM with columns IS NEW USER?, LAST CHANGED PASSWORD, LAST IP ADDRESS, LAST SESSION LENGTH, ....., NUMBER OF FAILED LOGIN]
big data analytics 0.1 behavioural signatures
▸ quantifying signatures (often comprising a series of attack steps) from raw data
▸ quantitative values can be defined by security analysts
▸ extracting features on a per-entity and per-time-segment basis
big data analytics 0.2 design requirements
▸ capable of analysing ≥ 10 million entities on a daily basis
▸ capable of updating and retrieving signatures of active entities, on demand and/or in realtime
big data analytics 0.3.1 process: activity tracking
▸ absorbing the log stream: identifying the entities and updating corresponding records
▸ in short temporal windows: 30 minutes, 1 hour, 12 hours, or 24 hours
▸ focus on efficient retrieval for feature computation
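The tracking step above can be sketched as a small in-memory store keyed by entity and time window. The `ActivityTracker` class, the event names, and the use of Unix timestamps are assumptions for illustration; a production system would sit on a distributed store rather than a Python dictionary.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # one of the window sizes named on the slide

def window_start(timestamp: int, window: int = WINDOW_SECONDS) -> int:
    """Round a Unix timestamp down to the start of its window."""
    return timestamp - (timestamp % window)

class ActivityTracker:
    """Keeps per-entity, per-window event counts for fast feature lookup."""

    def __init__(self):
        # (entity, window start) -> {event type -> count}
        self.records = defaultdict(lambda: defaultdict(int))

    def absorb(self, timestamp: int, entity: str, event: str) -> None:
        """Identify the entity's current window and update its record."""
        self.records[(entity, window_start(timestamp))][event] += 1

    def lookup(self, entity: str, timestamp: int) -> dict:
        """Return the counts for the window containing `timestamp`."""
        return dict(self.records.get((entity, window_start(timestamp)), {}))

# Hypothetical log stream: (timestamp, entity, event type).
tracker = ActivityTracker()
for ts, entity, event in [(0, "jim", "failed_login"),
                          (60, "jim", "failed_login"),
                          (1800, "jim", "login")]:
    tracker.absorb(ts, entity, event)

print(tracker.lookup("jim", 100))
print(tracker.lookup("jim", 1900))
```

Keying records by (entity, window start) keeps each update and lookup to a single dictionary access, which is the kind of efficient retrieval the last bullet asks for.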
big data analytics 0.3.2 process: activity aggregation
▸ computing behavioural features over an interval of time
▸ retrieving all activity records within given interval
▸ aggregating smaller time units (minutes, hours, days, weeks) as the feature demands
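The aggregation step can be sketched as summing the counts of smaller windows into a larger interval. The record layout, the `aggregate` helper, and the sample data are hypothetical; the linear scan is for brevity only and would be replaced by an index on entity and time in a real system.

```python
from collections import Counter

def aggregate(records: dict, entity: str, start: int, end: int) -> Counter:
    """Sum the event counts of all windows of `entity` that start in [start, end)."""
    total = Counter()
    for (name, window_start), counts in records.items():
        if name == entity and start <= window_start < end:
            total.update(counts)
    return total

# Hypothetical hourly records: (entity, window start in seconds) -> event counts.
hourly = {
    ("jim", 0): {"failed_login": 2},
    ("jim", 3600): {"failed_login": 1, "login": 1},
    ("jim", 90000): {"failed_login": 5},   # falls on the next day
    ("ana", 0): {"login": 1},
}

# Daily feature for "jim": number of failed logins in the first 24 hours.
daily = aggregate(hourly, "jim", 0, 86400)
print(daily["failed_login"])
```

Here the two hourly windows inside the first day are combined, while the record from the following day and the record of the other entity are left out.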
algorithm selection
outlier detection
▸ matrix decomposition-based outlier analysis
▸ replicator neural networks
▸ density-based outlier analysis
▸ score interpretation
▸ transforming score to probabilities
▸ detection ensembles
MATRIX DECOMPOSITION
REPLICATOR NEURAL NETWORKS
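To make the scoring and score-to-probability bullets concrete, here is a minimal sketch. It uses the distance to the k-th nearest neighbour as a simple stand-in for density-based outlier analysis, not the matrix decomposition or replicator network methods named on the slide, and it maps scores to empirical percentile ranks, which are not calibrated probabilities. The feature vectors and function names are assumptions.

```python
import math

def knn_scores(points, k=2):
    """Outlier score of each point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q)
                           for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

def to_rank_probabilities(scores):
    """Map each score to the fraction of scores less than or equal to it."""
    n = len(scores)
    return [sum(s <= score for s in scores) / n for score in scores]

# Hypothetical per-entity feature vectors; the last one sits far from the rest.
features = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 9.0)]
scores = knn_scores(features)
probabilities = to_rank_probabilities(scores)

most_unusual = scores.index(max(scores))
print(most_unusual, probabilities[most_unusual])
```

Putting every detector's output on a common 0 to 1 scale like this is one simple way to let several detectors be averaged into an ensemble.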
continuous learning
▸ overcomes limited analyst bandwidth
▸ overcomes weaknesses of unsupervised learning
▸ actively adapts and synthesises new models
PREDICT ACT
TRAIN
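The predict, act, train cycle can be sketched as a loop in which only the highest-scoring entities reach an analyst and the resulting labels train a supervised model. Everything here is a deliberately tiny assumption: the scores, the `analyst_review` stand-in, the review budget, and the one-dimensional threshold used in place of a real classifier.

```python
def select_for_review(scores, budget):
    """Predict: pick the indices of the `budget` highest-scoring entities."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

def analyst_review(index):
    """Act: hypothetical stand-in for an analyst's verdict (True means attack)."""
    return index in {3, 7}

def train_threshold(labelled):
    """Train: choose the score threshold with the fewest errors on the labels."""
    candidates = sorted(score for score, _ in labelled)

    def errors(threshold):
        return sum((score >= threshold) != is_attack
                   for score, is_attack in labelled)

    return min(candidates, key=errors)

# Hypothetical outlier scores for ten entities on one day.
scores = [0.1, 0.2, 0.15, 0.9, 0.3, 0.25, 0.05, 0.8, 0.7, 0.12]

labelled = []
for index in select_for_review(scores, budget=4):
    labelled.append((scores[index], analyst_review(index)))

threshold = train_threshold(labelled)
print(threshold)
```

The budget caps how many cases the analysts see per cycle, and each round of labels refines the supervised model that filters the next round of unsupervised alerts.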
example: open network insight (leveraging insights from flow and packet analysis)
example: open network insight (advantages)
example: open network insight (how it works)
example: entrada (network data analytics platform)
summary
▸ current problems of security
▸ automation of attacks
▸ massive amount of novel malicious code
▸ defences involving manual actions (often ineffective)
▸ machine learning in security
▸ adaptive defences using learning algorithms
▸ automatic detection and analysis of threats
QUESTIONS?