Tavolo 2 - Big Data Adaptive Monitoring
Transcript of Tavolo 2 - Big Data Adaptive Monitoring
CIS, UNINA, UNICAL, UNIFI
First results
• The first results of this working group were published in the article
Big Data for Security Monitoring: Challenges and Opportunities in Critical Infrastructures Protection
L. Aniello2, A. Bondavalli3, A. Ceccarelli3, C. Ciccotelli2, M. Cinque1, F. Frattini1, A. Guzzo4, A. Pecchia1, A. Pugliese4, L. Querzoni2, S. Russo1
(1) UNINA (2) CIS (3) UNIFI (4) UNICAL
• Presented at the BIG4CIP workshop @ EDCC (12 May 2014)
Data-Driven Security Framework
[Figure: architecture of the data-driven security framework. Raw data collection gathers environmental data, node resource data, application/system logs, IDS alerts, network audit data, etc. from the critical infrastructure into a knowledge base; data processing includes a monitoring adapter and attack modeling; data analysis applies invariant-based mining, Bayesian inference, fuzzy logic, conformance checking, and adaptive monitoring, and drives protection actions on the infrastructure.]
Scenario
• Problem: the need to analyze more data coming from distinct sources in order to improve the capability to detect faults/cyber attacks
  o Excessively large volumes of information to transfer and analyze
  o Negative impact on the performance of monitored systems
• Proposed solution: dynamically adapt the granularity of monitoring
  o Normal case: coarse-grained monitoring (low overhead)
  o Upon anomaly detection: fine-grained monitoring (higher overhead)
• Two distinct scenarios
  o Fault detection (current CIS research direction)
  o Cyber attack detection
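The coarse/fine-grained switch described above can be sketched as follows; the class name, period values, and callbacks are illustrative assumptions, not part of the actual prototype:

```python
# Sketch of the adaptive-granularity idea: sample rarely in the normal
# case, switch to frequent sampling when an anomaly is detected.
# COARSE_PERIOD_S / FINE_PERIOD_S are invented example values.

COARSE_PERIOD_S = 60   # low-overhead sampling in the healthy case
FINE_PERIOD_S = 5      # higher-overhead sampling after an anomaly

class AdaptiveMonitor:
    def __init__(self):
        self.period_s = COARSE_PERIOD_S
        self.fine_grained = False

    def on_anomaly(self):
        """Switch to fine-grained monitoring when an anomaly is detected."""
        self.fine_grained = True
        self.period_s = FINE_PERIOD_S

    def on_recovery(self):
        """Fall back to coarse-grained monitoring once the system is healthy."""
        self.fine_grained = False
        self.period_s = COARSE_PERIOD_S

m = AdaptiveMonitor()
m.on_anomaly()    # anomaly detected -> fine-grained, higher overhead
m.on_recovery()   # system healthy again -> back to low overhead
```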
Anomaly Detection
• Metrics selection
  o Find correlated metrics (invariants) to be used as anomaly signals
  o Learn which invariants hold when the system is healthy, i.e., profile the healthy behavior of the monitored system
• Anomaly detection
  o Monitor the health of the system by looking at a few metrics: how should these metrics be chosen?
  o When an invariant stops holding, adapt the monitoring: the aim is detecting the root cause of the problem, while accounting for possible false positives
[1] J., M., R., W., "Information-Theoretic Modeling for Tracking the Health of Complex Software Systems", 2008
[2] J., M., R., W., "Detection and Diagnosis of Recurrent Faults in Software Systems by Invariant Analysis", 2008
[3] M., J., R., W., "Filtering System Metrics for Minimal Correlation-Based Self-Monitoring", 2009
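The metrics-selection step above can be illustrated with a toy miner that looks for strongly correlated metric pairs in healthy-run samples; the Pearson correlation measure, the 0.95 threshold, and all metric names are illustrative assumptions:

```python
# Hypothetical sketch of invariant mining: pairs of metrics whose healthy
# samples are strongly correlated become anomaly signals.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mine_invariants(metrics, threshold=0.95):
    """Return metric-name pairs whose healthy samples are strongly correlated."""
    names = sorted(metrics)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(metrics[a], metrics[b])) >= threshold]

healthy = {
    "req_rate":  [10, 20, 30, 40, 50],
    "cpu_util":  [11, 19, 31, 41, 49],   # tracks req_rate -> invariant
    "disk_free": [90, 70, 85, 60, 95],   # uncorrelated
}
print(mine_invariants(healthy))  # -> [('cpu_util', 'req_rate')]
```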
Adapt the Monitoring
• Two dimensions in adapting the monitoring
  o Change the set of monitored metrics
  o Change the frequency of metrics retrieval
• How to choose the way of adapting the monitoring on the basis of the detected anomaly?
• Additional issue
  o The goal of the adaptation is discovering the root cause of the problem
  o Need to zoom in on specific portions of the system
    Very likely to increase the amount of data to transfer/analyze
    Risk of a negative impact on system performance
    Possible solution: keep the volume of monitored data limited by zooming out on other portions of the system
[4] M., R., J., A., W., "Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis", 2008
[5] M., W., "Leveraging Many Simple Statistical Models to Adaptively Monitor Software Systems", 2014
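The zoom-in/zoom-out trade-off above can be sketched as a rate rebalancer: when one component is zoomed in, the others are scaled down so the total stays within a budget. The component names, budget, and zoom factor are all invented for illustration:

```python
# Illustrative sketch: keep the overall monitored-data rate bounded by
# zooming out the non-suspect components when one component is zoomed in.

BUDGET = 100  # max metric samples per second across the whole system

def rebalance(rates, suspect, zoom_factor=2):
    """Increase the suspect component's sampling rate, then scale the
    others down proportionally so the overall budget is respected."""
    new = dict(rates)
    new[suspect] = rates[suspect] * zoom_factor  # zoom in on the suspect
    spare = BUDGET - new[suspect]                # what is left for the rest
    others = [c for c in rates if c != suspect]
    total_others = sum(rates[c] for c in others)
    for c in others:
        new[c] = spare * rates[c] / total_others  # proportional zoom-out
    return new

rates = {"web": 25, "app": 25, "db": 25, "cache": 25}
adapted = rebalance(rates, "db")  # db doubled, others shrunk to compensate
```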
Fault Localization
• Goal: given a set of alerts, determine which fault occurred and which component originated it
• Problems
  o The same alert may be due to different faults (ambiguity)
  o A single fault may cause several alerts (domino effect)
  o Concurrent alerts may be generated by concurrent unrelated faults
  o Tradeoff: monitoring granularity vs. precision of fault identification
• Approaches
  o Probabilistic models (e.g., HMMs, Bayesian networks)
  o Machine learning techniques (e.g., neural networks, decision trees)
  o Model-based techniques (e.g., dependency graphs, causality graphs)
[6] S., S., "A survey of fault localization techniques in computer networks", 2004
[7] D., G., B., C., "Hidden Markov Models as a Support for Diagnosis: ...", 2006
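The ambiguity and domino-effect problems can be made concrete with a toy signature-matching localizer; this is not one of the cited approaches, just a minimal stand-in, and every fault name, alert name, and signature below is invented:

```python
# Toy illustration: several faults can explain overlapping alert sets, so a
# localizer ranks candidate faults by how well their (assumed) alert
# signatures match what was actually observed.

FAULT_SIGNATURES = {
    "disk_full":     {"io_error", "slow_response"},
    "memory_leak":   {"slow_response", "oom_kill"},
    "net_partition": {"timeout", "slow_response"},
}

def localize(alerts):
    """Return faults ranked by Jaccard overlap with the observed alerts."""
    def score(sig):
        return len(sig & alerts) / len(sig | alerts)
    return sorted(FAULT_SIGNATURES,
                  key=lambda f: score(FAULT_SIGNATURES[f]), reverse=True)

observed = {"slow_response", "oom_kill"}  # "slow_response" alone is ambiguous
print(localize(observed)[0])  # most plausible fault: memory_leak
```

Note how "slow_response" alone cannot discriminate between the three faults (ambiguity); only the extra "oom_kill" alert breaks the tie.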
Prototype - Work in Progress
[Figure: prototype architecture — monitoring of a JBoss cluster using Ganglia. Four hosts each run JBoss AS and a gmond daemon; a monitoring host runs gmetad and the Adaptive Monitoring component. Monitored metrics flow from the hosts to the monitoring host, which sends back monitoring adaptations.]
Prototype - Goals
• Identify a small set of metrics to monitor on a JBoss cluster to detect possible faults
  o Find existing correlations
  o Profile healthy behavior
• Inject faults on JBoss with Byteman (http://byteman.jboss.org/)
• For each fault, identify the set of additional metrics to monitor
• Implement the prototype in order to evaluate
  o The effectiveness of the approach
  o The reactivity of the adaptation
  o The overhead of the adaptation
OPERATING SYSTEMS AND APPLICATION SERVERS MONITORING
Data collection and processing
• Collects a selection of attributes from the OS and the application server, through probes installed on the machines
  – The current implementation observes Tomcat 7 and CentOS 6
• Executes the Statistical Prediction and Safety Margin algorithm on the data collected
• The Esper CEP engine is used to apply rules on events (it performs the detection of anomalies)
• Work partially done within the context of the Secure! Project (see later today)
High level view
INVARIANTS MINING
Why invariants?
• Invariants are properties of a program that are guaranteed to hold for all executions of the program
  – If those properties are broken at runtime, it is possible to raise an alarm for immediate action
• Invariants can be useful to
  – detect transient faults, silent errors and failures
  – report performance issues
  – avoid SLA violations
  – help operators to understand the runtime behavior of the app
• Pretty natural properties for apps performing batch work
An example of flow intensity invariant
• A platform for the batch processing of files: the processing time is proportional to the file size
• Measuring the file size and the time spent in a stage, I(x) and I(y) (the flow intensities), the equation

  I(y) = k · I(x)

  is an invariant relationship characterising the expected behaviour of the batch system.
  – If there is an execution problem (e.g., the file processing hangs), the equation no longer holds (broken invariant)
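A minimal sketch of how such a flow-intensity invariant could be profiled and checked; the least-squares fit, the relative tolerance, and the sample data are assumptions for illustration, not the paper's actual mining procedure:

```python
# Sketch of the flow-intensity invariant I(y) = k * I(x): estimate k from
# healthy executions, then flag samples whose residual exceeds a tolerance.

def fit_k(xs, ys):
    """Least-squares slope for y = k * x (regression through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def invariant_broken(x, y, k, tol=0.2):
    """True if y deviates from k*x by more than a relative tolerance tol."""
    return abs(y - k * x) > tol * k * x

# Healthy profile: processing time roughly proportional to file size.
sizes = [10, 20, 40, 80]          # I(x): file size (MB), invented samples
times = [1.1, 2.0, 3.9, 8.2]      # I(y): processing time (s)
k = fit_k(sizes, times)

invariant_broken(50, 5.1, k)      # consistent with the profile -> False
invariant_broken(50, 60.0, k)     # processing hang -> broken invariant
```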
Research questions
RQ1: How to discover invariants out of the hundreds of properties observable from an application log?
RQ2: How to detect broken invariants at runtime?
Our contribution
AUTOMATED MINING
A framework and a tool for mining invariants automatically from application logs
• tested on 9 months of logs collected from a real-world Infosys CPG SaaS application
• able to automatically select 12 invariants out of 528 possible relationships

IMPROVED DETECTION
An adaptive threshold scheme defined to significantly shrink the number of broken invariants
• from thousands to tens of broken invariants w.r.t. static thresholds on our dataset
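One plausible shape for an adaptive threshold is a bound that tracks the recent residual distribution instead of a fixed constant; the sliding-window mean + n·std rule, window size, and multiplier below are assumptions, not the scheme actually defined in this work:

```python
# Hypothetical adaptive threshold: flag an invariant as broken only when
# its residual is an outlier w.r.t. the recent residual window, instead of
# comparing against a static bound.
from collections import deque
from statistics import mean, stdev

class AdaptiveThreshold:
    def __init__(self, window=50, n_std=3.0):
        self.residuals = deque(maxlen=window)  # sliding residual history
        self.n_std = n_std

    def is_broken(self, residual):
        """True if the residual exceeds mean + n_std * std of the window."""
        if len(self.residuals) >= 2:
            bound = mean(self.residuals) + self.n_std * stdev(self.residuals)
            broken = abs(residual) > bound
        else:
            broken = False  # not enough history to judge yet
        self.residuals.append(residual)
        return broken

t = AdaptiveThreshold()
for r in [0.1, -0.2, 0.15, -0.05, 0.12]:
    t.is_broken(r)      # normal jitter: no alarms raised
t.is_broken(10.0)       # large deviation: alarm
```

Because the bound widens with noisy periods and tightens with quiet ones, far fewer residuals trip it than with a single static threshold, which matches the reduction the slide reports.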
BAYESIAN INFERENCE
Data-driven Bayesian Analysis
• Security monitors may produce a large number of false alerts
• A Bayesian network can be used to correlate alerts coming from different sources and to filter out false notifications
• This approach has been successfully used to detect credential stealing attacks
  – Raw alerts generated during the progression of an attack (e.g., user-profile violations and IDS notifications) are correlated
  – The approach was able to remove around 80% of false positives (i.e., non-compromised users being declared compromised) without missing any compromised user
Data-driven Bayesian Analysis
• Vector extraction starting from raw data:
  – each vector represents a security event, e.g., attack, compromised user, etc.
  – suitable for post-mortem forensics and runtime analysis
  – sources: event logs, network audit, environmental sensors
[Figure: vector extraction — each event is mapped to a vector (v1 … vN) of binary features (0/1). The Bayesian network has a hypothesis variable (the user is compromised) and information variables, i.e., alerts such as unknown address, multiple logins, suspicious download, …]
Bayesian network
• Allows estimating the probability of the hypothesis variable (attack event), given the evidence in the raw data:
[Figure: Bayesian network with hypothesis node C and alert nodes A1, A2, …]
Network parameters: the a-priori probability P(C) and a conditional probability table (CPT) for each alert Ai.
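Under the simplifying assumption that the alerts Ai are conditionally independent given C (a naive-Bayes structure), the inference reduces to multiplying CPT entries; the prior, CPT values, and alert names below are invented for illustration:

```python
# Sketch of the Bayesian inference: estimate P(C | observed alerts) from
# the prior P(C) and one CPT entry pair per alert. All numbers are invented.

P_C = 0.01  # a-priori probability that a user is compromised

# CPT per alert: (P(alert fires | C), P(alert fires | not C))
CPT = {
    "unknown_address":     (0.7, 0.05),
    "multiple_logins":     (0.6, 0.10),
    "suspicious_download": (0.5, 0.02),
}

def posterior(fired):
    """P(C | observed alerts), via Bayes' rule over all alerts."""
    like_c, like_not_c = P_C, 1.0 - P_C
    for alert, (p_given_c, p_given_not_c) in CPT.items():
        if alert in fired:
            like_c *= p_given_c
            like_not_c *= p_given_not_c
        else:  # an alert that did NOT fire is also evidence
            like_c *= 1.0 - p_given_c
            like_not_c *= 1.0 - p_given_not_c
    return like_c / (like_c + like_not_c)

posterior({"multiple_logins"})                  # one weak alert: still low
posterior(set(CPT))                             # all alerts: near certainty
```

A single noisy alert barely moves the posterior above the prior, which is exactly how correlating sources filters out false notifications.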
Incident analysis
• Estimate the probability that the vector represents an attack, given the features
[Example: for an observed vector of binary features, the network yields P(C) = 0.31]
Preliminary testbed
• A preliminary implementation with Apache Storm
• Tested with synthetic logs emulating the activity of 2.5 million users, generating 5 million log entries per day (IDS logs and user access logs)
Log lines    Time (ms)
4,300,000    140,886
4,400,000    143,960
4,500,000    150,448
4,600,000    147,024
4,700,000    153,551
4,800,000    159,567
4,900,000    162,642
[Figure: Storm topology — LogStreamer (spout) → FactorCompute (bolt) → AlertProcessor (bolt)]
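The spout/bolt pipeline can be mimicked in plain Python with generator stages; this is only a behavioral sketch of the topology shape, not the actual Storm implementation, and the tuple format and aggregation rule are invented:

```python
# Minimal model of the Storm topology: LogStreamer (spout) emits raw log
# lines, FactorCompute (bolt) extracts per-user evidence, AlertProcessor
# (bolt) aggregates it and flags suspicious users.

def log_streamer(lines):
    """Spout: emit one raw log line at a time."""
    yield from lines

def factor_compute(tuples):
    """Bolt: turn each 'user,alert' log line into a (user, alert) pair."""
    for line in tuples:
        user, alert = line.split(",")
        yield user, alert

def alert_processor(tuples):
    """Bolt: count evidence per user; flag users with 2+ alerts."""
    counts = {}
    for user, _alert in tuples:
        counts[user] = counts.get(user, 0) + 1
    return [u for u, n in counts.items() if n >= 2]

logs = ["alice,unknown_address", "bob,multiple_logins",
        "alice,suspicious_download"]
print(alert_processor(factor_compute(log_streamer(logs))))  # -> ['alice']
```

In the real topology each stage runs in parallel on Storm workers and tuples flow continuously; the generators here only capture the dataflow structure.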