Automatic Trust Management for
Adaptive Survivable Systems (ATM for ASS’s)
Howard Shrobe, MIT AI Lab
Jon Doyle, MIT Lab for Computer Science
A Motivating Example: Background
• In the MIT AI Lab, an ensemble of computers runs a Visual Surveillance and Monitoring application.
• On January 12, 2001, several of the machines experience unusual traffic from outside the lab.
• Intrusion detection systems report several password scans and other probes.
• After about 3 days of varying levels of such activity, things seem to return to normal.
• For another 3 weeks no unusual activity is noticed.
• Then, a crucial machine (Harding) begins to experience unusually high load averages and the components that run on this machine begin to receive less than the expected quality of service.
• The load average, degradation of service, the consumption of disk space and the amount of traffic to and from unknown outside machines continue to increase to annoying levels.
• Then they level off.
A Motivating Example: The Quandary
• On March 2, a high performance machine in the ensemble (Grant) crashes.
• The application has been written in a way which allows it to migrate the computations on Grant.
[Figure: component C1 on Grant, with a proposed migration to Harding, which is marked with its load average and as potentially hacked.]
• Harding has been behaving oddly and is heavily loaded.
• Grant’s computations are critical to the application.
• Should the system migrate Grant’s computations to Harding?
A Motivating Example: Explaining the Decision
• The system needed to run the computations somewhere.
[Figure: C1 on Grant with a proposed migration to Harding (potentially hacked); load averages shown for Harding, Thing1, and Thing2; annotation: “Hack isn’t relevant”.]
• Although more loaded than expected, Harding was still the best pool of available resources.
– Other machines were even more heavily loaded with other critical computations of the application.
• Hackers had correctly guessed a user password on Harding.
– They had set up a public FTP site containing pirated software.
– They had not, in fact, gained root access.
• There was, therefore, no serious worry in migrating critical computations to Harding.
A Different Example
• The application was being run to protect a US embassy in Africa during a period of international tension.
• We had observed a variety of information attacks being aimed at Harding.
• At least some of these attacks are of a type known to be effective in gaining root access to a machine like Harding.
• They are followed by a period of no anomalous behavior other than a periodic low volume communication with an unknown outside host.
• When Grant crashes, should Harding be used as the backup?
The Explanation
• It is likely that an intruder has gained root access to Harding.
• It is also likely that the intent of the intrusion is malicious and political.
• It is less likely, but still possible, that the periodic connection to an outside host is an attempt to contact a control source for a “go signal” that will initiate serious spoofing of the application.
• Under these circumstances, it is wiser to shift the computations to a more trusted machine (Grant) even though it is more overloaded than Harding.
The Core Thesis
Survivable systems make careful judgments about
the trustworthiness of their computational environment
and they make rational resource allocation decisions
based on their assessment of trustworthiness.
The Thesis In Detail: Trust Model
• It is crucial to estimate to what degree and for what purposes a computational resource may be trusted.
• This influences decisions about:
– What tasks should be assigned to which resources.
– What contingencies should be provided for.
– How much effort to spend watching over the resources.
• The trust estimate depends on having a model of the possible ways in which a computational resource may be compromised.
The Thesis in Detail: Perpetual Analytic Monitoring
• Trust Models depend on having a system for long term monitoring and analysis of the computational infrastructure.
• Monitoring must detect complex temporal patterns.
– E.g. “a period of attacks followed by quiescence followed by increasing degradation of service”
• The monitoring system must assimilate information from:
– Self-checking observation points within the application itself
– Intrusion detection systems
– Firewalls, filtering routers
– Other health status indicators
The Thesis in Detail: Adaptive Survivable Systems
• The application itself must be capable of self-monitoring and diagnosis.
– It must know the purposes of its components.
– It must check that these are achieved.
– If these purposes are not achieved, it must localize and characterize the failure.
• The application itself must be capable of adaptation so that it can best achieve its purposes within the available infrastructure.
– It must have more than one way to effect each critical computation.
– It should choose an alternative approach if the first one failed.
– It should make its initial choices in light of the trust model.
The Thesis in Detail: Rational Resource Allocation
• This depends on the ability of the application, monitoring, and control systems to engage in rational decision making about what resources they should use to achieve the best balance of expected benefit to risk.
• The amount of resources dedicated to monitoring should vary with the threat level
• The methods used to achieve computational goals and the location of the computations should vary with the threat
• Somewhat compromised systems will sometimes have to be used to achieve a goal
• Sometimes doing nothing will be the best choice
The Active Trust Management Architecture
[Figure: architecture diagram with the following labeled elements: Self Adaptive Survivable Systems; Perpetual Analytical Monitoring; Trust Model (Trustworthiness, Compromises, Attacks); Rational Decision Making; Rational Resource Allocation; Trend Templates; System Models & Domain Architecture; Other Information Sources: Intrusion Detectors.]
The Nature of a Trust Model
• Trust is a continuous, probabilistic notion.
– All computational resources must be considered suspect to some degree.
• Trust is a dynamic notion.
– The degree of trustworthiness may change with further compromises.
– The degree of trustworthiness may change with efforts at amelioration.
– The degree of trustworthiness may depend on the political situation and the motivation of a potential attacker.
• Trust is a multidimensional notion.
– A system may be trusted to deliver a message while not being trusted to preserve its privacy.
– A system may be unsafe for one user but relatively safe for another.
• The Trust Model must occupy at least three tiers:
– The trustworthiness of each resource for each specific purpose
– The nature of the compromise to the resources
– The nature of the attacks on the resources
– Most work has only looked at attacks (intrusion detection).
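The three tiers and the per-purpose notion of trust can be pictured as a small record per resource. The following Python sketch is only an illustration: the class name, field names, purpose labels, and all numeric values are assumptions made for the example, not part of the presented system.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceTrust:
    """Three-tier record for one computational resource (hypothetical layout)."""
    name: str
    attacks: list = field(default_factory=list)      # attack tier: observed history
    compromises: dict = field(default_factory=dict)  # compromise tier: kind -> probability
    trust: dict = field(default_factory=dict)        # trust tier: purpose -> probability

    def trusted_for(self, purpose: str, threshold: float) -> bool:
        # Unlisted purposes default to 0.0: every resource is suspect to some degree.
        return self.trust.get(purpose, 0.0) >= threshold


# Illustrative values only; none of these numbers come from the slides.
harding = ResourceTrust(
    name="Harding",
    attacks=["password scan", "probe"],
    compromises={"stolen user password": 0.9, "root access": 0.05},
    trust={"deliver message": 0.95, "preserve privacy": 0.40},
)

can_deliver = harding.trusted_for("deliver message", 0.9)
can_keep_private = harding.trusted_for("preserve privacy", 0.9)
```

Keeping trust keyed by purpose is what lets the same machine be acceptable for delivering a message and unacceptable for keeping it private.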
Tiers of a Trust Model
• Attack Level: history of “bad” behaviors
– Penetration, denial of service, unusual access, flooding
• Compromise Level: state of mechanisms that provide:
– Privacy: stolen passwords, stolen data, packet snooping
– Integrity: parasitized, changed data, changed code
– Authentication: changed keys, stolen keys
– Non-repudiation: compromised keys, compromised algorithms
– QoS: slow execution
– Command and Control Properties: compromises to the monitoring infrastructure
• Trust Level: degree of confidence in key properties
– Compromise states
– Intent of attackers
– Political situation
Perpetual Analytical Monitoring (MAITA)
• Collects evidence from a broad variety of sources:
– Intrusion detection systems
– Network monitors
– Firewalls
– Self-monitoring application systems
• Filters, aggregates, correlates and conditions the data.
• Matches these against a knowledge base of trend templates.
– Templates represent temporal patterns indicative of the “etiology of a disease”.
– Degree of match is indicative of the likelihood of compromise.
• Sends alerts to running applications in the case of “alarm situations”.
• Can be directed to increase/decrease its activity to:
– Bolster trust in a desirable resource
– Carefully monitor a potentially “dicey situation”
– Free “computrons” for more important activities
What is a Trend Template?
• A Temporal Pattern characteristic of etiology
• Key Features:
– Landmark points
– Temporal constraints between Landmarks
– Intervals bounded by Landmarks
– Value Constraints relating Variables within the intervals
• Trend templates represent a conditional probability of the conclusion given a match to the template.
– Probabilistic inference also depends on the degree of fit to each trend template.
– More than a single template may match the data.
• Trend templates are used to recognize compromised resources after the fact.
• Trend templates are also used to recognize attacks.
– Some attacks cannot be expressed without them.
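The idea of intervals bounded by landmarks, each with temporal and value constraints, can be sketched in a few lines of Python. This is a deliberately crude stand-in and not the MAITA representation: value constraints are reduced to symbolic labels, the degree of fit is simply the fraction of matching phases, and the class names, duration bounds, and the probability 0.8 are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    label: str        # value constraint, reduced here to a symbolic label
    min_hours: float  # temporal constraint between the bounding landmarks
    max_hours: float


@dataclass
class TrendTemplate:
    conclusion: str
    p_given_match: float  # conditional probability of the conclusion (assumed value)
    phases: list

    def degree_of_fit(self, observed):
        """Fraction of phases whose label and duration fit, compared in order."""
        if len(observed) != len(self.phases):
            return 0.0
        hits = sum(
            1
            for phase, (label, hours) in zip(self.phases, observed)
            if phase.label == label and phase.min_hours <= hours <= phase.max_hours
        )
        return hits / len(self.phases)

    def likelihood(self, observed):
        # Crude stand-in for probabilistic inference: scale by the degree of fit.
        return self.p_given_match * self.degree_of_fit(observed)


# "Attacks, then quiescence, then degradation"; the bounds are illustrative.
template = TrendTemplate(
    conclusion="compromised resource",
    p_given_match=0.8,
    phases=[
        Phase("attacks", 24, 72),
        Phase("quiet", 168, 504),
        Phase("degradation", 168, 504),
    ],
)

full_match = [("attacks", 72), ("quiet", 504), ("degradation", 200)]
partial_match = [("attacks", 72), ("quiet", 10), ("degradation", 200)]
fit_full = template.degree_of_fit(full_match)
fit_partial = template.degree_of_fit(partial_match)
```

Because several templates may match the same data, a monitor built this way would evaluate every template and keep all nonzero likelihoods rather than stopping at the first hit.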
Trend Templates
[Figure: example trend template. Phases, in order: Increasing Attacks and Probes; Variable but High Rate of Attacks and Probes; Dropping Rate of Attacks; Quiet; Increasing Disk Use, Increasing External Access. Durations shown, in order: 1 - 3 days, 1 - 3 days, 1 - 3 hours, 1 - 3 weeks, 1 - 3 weeks. Conclusion: Compromise: Stolen Password, Exposed FTP Site.]
Adaptive Survivable Systems
[Figure: adaptive survivable system architecture, spanning a Development Environment and a Runtime Environment. Labeled elements: Super routines (Layer1, Layer2, Layer3); Plan Structures for components A, B, C and Foo, with annotations “Post Condition 1 of Foo because Post Cond 2 of B and Post Cond 1 of C” and “PreReq 1 of B because Post Cond 1 of A”; Synthesized Sentinels; Self Monitoring; alerts; Diagnostic Service; Repair Plan Selector; Resource Allocator; Rollback Designer; Enactment; Component Asset Base with methods 1, 2, 3 for Foo, B, and A; Rational Selection (“To: Execute Foo”, “Method 3 is most attractive”); Diagnosis & Recovery.]
How to Build Adaptive Survivable Systems
• Make Systems Fundamentally dynamic
– Systems should have more than one way to achieve any goal.
– Always allow choices to be made (or revised) late.
• Inform the runtime environment with design information
– Systems should know the purposes of their component computations.
– Systems should know the quality of different methods for the same goal.
• Make the System responsible for achieving its goals
– Systems should diagnose the failure to achieve intended goals.
– Systems should select alternative techniques in the event of failure.
• Build a trust model by pervasive monitoring and temporal analysis
• Optimize dynamically in light of the trust model
– Balance between quality of goal achieved vs risk encountered
Dynamic Rational Component Selection
• Systems have more than one method for each task.
• Each method specifies:
– Quality of Service provided
– Resources Consumed
– Likelihood of success
• Likelihood of success is updated to reflect the current state of the trust model.
• Select the method with the greatest Expected Net Benefit.
• Generalizes “Method Dispatch”: replace the notion of “Most Specific Method” by that of “Most Beneficial Method”.
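Selection by expected net benefit can be sketched as follows. The slides do not give a formula, so the one used here (likelihood of success times benefit, minus resource cost, with the likelihood discounted by the current trust in the resource) is an assumption, as are the class name, the field names, and the numbers.

```python
from dataclasses import dataclass


@dataclass
class Method:
    name: str
    resource: str     # where the method would run
    benefit: float    # value of the quality of service delivered on success
    cost: float       # value of the resources consumed
    p_success: float  # likelihood of success on a fully trusted resource


def expected_net_benefit(method, trust):
    # Discount the likelihood of success by the current trust in the resource.
    p = method.p_success * trust.get(method.resource, 0.0)
    return p * method.benefit - method.cost


def select_method(methods, trust):
    """Return the most beneficial method, or None when doing nothing is best."""
    if not methods:
        return None
    best = max(methods, key=lambda m: expected_net_benefit(m, trust))
    return best if expected_net_benefit(best, trust) > 0 else None


# Illustrative values only.
methods = [
    Method("fast", "Harding", benefit=100, cost=20, p_success=0.95),
    Method("slow", "Thing1", benefit=60, cost=10, p_success=0.90),
]

choice_trusted = select_method(methods, {"Harding": 0.9, "Thing1": 0.99})
choice_suspect = select_method(methods, {"Harding": 0.3, "Thing1": 0.99})
choice_hostile = select_method(methods, {"Harding": 0.1, "Thing1": 0.1})
```

With these numbers the fast method wins while Harding is trusted, the slow method wins once trust in Harding drops, and no method is selected when every resource is badly compromised, which corresponds to the case where doing nothing is the best choice.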
Informing the Runtime with Design Info
• Plan Structures
– Goals
– Invariants
– Dependencies
• Dispatching
– Alternative Methods for common tasks
– Applicability conditions for alternatives
– Decision Criteria for selecting methods
• Active Sentinels
– Monitors for expected conditions
– Data Collectors for runtime statistics
– Data on long term failure rates
Making the System Responsible for Achieving Its Goals
[Figure: diagnosis and recovery flow. Labeled elements: Monitor; alerts; Diagnostic Service (Localization & Characterization); Repair Plan Selector (Scope of Recovery, Selection of Alternative); Concrete Repair Plan; Resource Allocator; Resource Plan; Rollback Designer; Enactment; components A and B linked by Condition-1 with the relations “achieves”, “requires”, and “prerequisite”.]
The Space of Intrusion Detection
[Figure: the space of intrusion detection. Labeled elements: Statistical Profile; Structural Model/Pattern; Match to Bad; Discrepancy from Good; Anomaly; Suspicious; Violation; Symptom; Model of Expected Behavior. Sources of models: unsupervised learning from normal runs; supervised learning from attack runs; handcoded structural models of attacks.]
A symptom may indicate an attack or a compromise
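Model Based Troubleshooting: GDE
[Figure: GDE-style example circuit of three multipliers (Times) and two adders (Plus), annotated with input, predicted, and observed values, and with lists of Conflicts and Diagnoses. Diagnoses shown: Blue or Violet broken; Green broken, Red with compensating fault; Green broken, Yellow with masking fault.]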
Moving to a Bayesian Framework
• The model has two (or more) levels of detail specifying computations, the underlying resources and the mapping of computations to resources.
• Each computation has models of its normal and compromised states (and a model for “everything else”).
• Each resource has models of its normal and compromised states.
• The modes of the resource models are linked to the modes of the computational models by conditional probabilities.
• The Model can be viewed as a Bayesian Network.
[Figure: Component 1, located on Node17. Component 1 has models Normal (delay 2 to 4), Delayed (delay 4 to +inf), and Accelerated (delay -inf to 2). Node17 has models Normal (probability 90%), Parasite (probability 9%), and Other (probability 1%). Links between the two sets of models are labeled with conditional probabilities .2, .4, and .3.]
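The link between resource modes and computation modes can be illustrated with a single application of Bayes' rule. In the sketch below the prior mode probabilities are the ones shown for Node17 (90%, 9%, 1%); the likelihoods of observing a delayed computation under each resource mode are assumed values chosen for the example, since the transcript does not say which link each conditional probability belongs to.

```python
def posterior(prior, likelihood):
    """Bayes' rule over discrete modes: P(mode | obs) ~ P(obs | mode) * P(mode)."""
    joint = {mode: prior[mode] * likelihood[mode] for mode in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("observation has zero probability under every mode")
    return {mode: p / total for mode, p in joint.items()}


# Prior mode probabilities of the resource, as shown for Node17.
prior = {"normal": 0.90, "parasite": 0.09, "other": 0.01}

# ASSUMED: P(computation is delayed | resource mode). Not taken from the slides.
p_delayed = {"normal": 0.05, "parasite": 0.80, "other": 0.30}

post = posterior(prior, p_delayed)
```

Under these assumed likelihoods, one observation of a delayed computation is enough to make the parasite mode more probable than the normal mode, even though its prior is only 9%.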
The Diagnostic Algorithm
• Begin with all normal modes.
• Repeat until there are no conflicts:
– Extend the Bayesian network with a new node indicating the incompatibility of the models in the conflict.
– Remove all supersets of the conflict.
– Recompute the probabilities in the Bayesian network.
– Choose the least likely model in the conflict and replace it with its most likely alternative.
• The Bayesian network now has posterior probabilities for all models.
• By adding nodes to the Bayesian network for any remaining diagnosis (i.e. a model for each component) we can compute the posterior probability of that particular diagnosis.
• The updated probabilities are fed back to the monitoring system.
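The control flow of the loop can be sketched in Python under heavy simplification. This sketch does not build or update a Bayesian network: mode probabilities stay fixed at their priors, and conflict detection is delegated to a caller-supplied function. The names `diagnose` and `find_conflict`, the component names, and the numbers are all hypothetical.

```python
def diagnose(modes, find_conflict, max_steps=100):
    """modes: component -> {mode: probability}; every component needs a 'normal' mode.

    find_conflict(assignment) returns a set of components whose currently
    assumed modes are jointly inconsistent with the observations, or None.
    """
    assignment = {c: "normal" for c in modes}  # begin with all normal modes
    rejected = {c: set() for c in modes}
    for _ in range(max_steps):
        conflict = find_conflict(assignment)
        if not conflict:
            return assignment
        # Only components that still have an untried alternative can be changed.
        candidates = [
            c for c in conflict
            if set(modes[c]) - rejected[c] - {assignment[c]}
        ]
        if not candidates:
            raise RuntimeError("conflict cannot be resolved")
        # Choose the least likely model in the conflict ...
        suspect = min(candidates, key=lambda c: modes[c][assignment[c]])
        rejected[suspect].add(assignment[suspect])
        # ... and replace it with its most likely remaining alternative.
        remaining = {m: p for m, p in modes[suspect].items()
                     if m not in rejected[suspect]}
        assignment[suspect] = max(remaining, key=remaining.get)
    raise RuntimeError("no consistent diagnosis found")


# Illustrative use: the observations rule out "everything is normal".
modes = {
    "web": {"normal": 0.9, "slow": 0.1},
    "db": {"normal": 0.6, "slow": 0.4},
}


def find_conflict(assignment):
    all_normal = all(mode == "normal" for mode in assignment.values())
    return {"web", "db"} if all_normal else None


diagnosis = diagnose(modes, find_conflict)
```

In the approach described on the slide, the probabilities would be recomputed after each conflict is added to the network, so the choice of suspect would reflect posteriors rather than the fixed priors used here.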
Final Model Probabilities
Resource        Hacked      Hacked   Normal      Normal
                Posterior   Prior    Posterior   Prior
Trader-Joe      .324        .300     .676        .700
Bonds-R-Us      .207        .200     .793        .800
JPMorgan-Net    .450        .150     .550        .850
WallSt-Server   .267        .100     .733        .900

Computation       Mode       Probability
Web-Server        Off-Peak   .028
                  Peak       .541
                  Normal     .432
Dollar-Monitor    Slow       .738
                  Normal     .262
Yen-Monitor       Slower     .516
                  Slow       .339
                  Normal     .145
Bond-Trader       Slow       .590
                  Fast       .000
                  Normal     .410
Currency-Trader   Slow       .612
                  Fast       .065
                  Normal     .323
Work Plan
• New start July 1, 2000
• Tasks in Base Effort
– Trust Models: Ontologies of Attacks, Compromises, Intentions, and Trust Positions
– Perpetual Analytic Monitoring: Trend templates dealing with compromises, informed by IT systems and self-monitoring applications
– Rational Trust Management: Decision theoretic models and algorithms for allocating resources to computations
– Test bed for above
• Tasks in Options
– Self-Adaptive Application Infrastructure: Synthesis of monitors, diagnostic techniques,
Milestones
• Trust models
– Publish final ontology of attacks, compromises etc: 12 months
• Perpetual Analytic Monitoring
– Demonstrate trust monitoring library: 12 months
– Demonstrate reconfiguration of monitoring infrastructure: 18 months
– Final exam: 18 months
• Rational Trust Management
– Demonstrate initial models and algorithms: 12 months
– Demonstrate capability in high-frequency real-time environment: 18 months
• Options:
– Demonstrate initial prototype of Adaptive System infrastructure which utilizes the initial trust model ontology: 24 months
– Demonstrate prototype of integrated Adaptive System development and runtime environment, including all aspects of intrusion tolerance: 30 months
– Final exam: 36 months
Major Risks
• How to guard against monitoring infrastructure itself becoming compromised or denied service
• Comprehensiveness of the ontologies (attacks, compromises, trust states, etc.)
• Decision making in real-time (method selection, resource allocation)
Mitigation Strategy
• In principle, much of the other technology could be used to harden our infrastructure
• Our techniques for adaptivity can be applied to our infrastructure
• Other projects are also working on cataloging the same knowledge; we will broaden our list of collaborators.
• Replace explicit decision theoretic techniques by qualitative analogs and by switching between rule-based policies that approximate decision theoretic conclusions.
Testing Strategy
• Deploy small application in the AI Lab environment
– Open university environment
– Subject to frequent hackery
• Use this as Test Bed
• Incrementally deploy research techniques in test bed
• Measure effectiveness
– Against usual background
– Against intentional, staged attacks (by us and our friends)
Conclusion
Survivable systems make careful judgments about
the trustworthiness of their computational environment
and they make rational resource allocation decisions
based on their assessment of trustworthiness.
We hope to demonstrate the feasibility of this approach over the next 18 months and then enrich the technical solutions during the remaining contract period.