Automatic Trust Management for Adaptive Survivable Systems (ATM for ASS’s) Howard Shrobe MIT AI...

Automatic Trust Management for Adaptive Survivable Systems (ATM for ASS’s)

Howard Shrobe, MIT AI Lab
Jon Doyle, MIT Lab for Computer Science

A Motivating Example: Background

• In the MIT AI Lab, an ensemble of computers runs a Visual Surveillance and Monitoring application.

• On January 12, 2001 several of the machines experience unusual traffic from outside the lab.

• Intrusion detection systems report several password scans and other probes.

• After about 3 days of varying levels of such activity, things seem to return to normal.

• For another 3 weeks no unusual activity is noticed.

• Then, a crucial machine (Harding) begins to experience unusually high load averages and the components that run on this machine begin to receive less than the expected quality of service.

• The load average, degradation of service, the consumption of disk space and the amount of traffic to and from unknown outside machines continue to increase to annoying levels.

• Then they level off.

A Motivating Example: The Quandary

• On March 2, a high performance machine in the ensemble (Grant) crashes.

• The application has been written in a way which allows it to migrate the computations on Grant.

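[Figure: computation C1 on Grant, with a questioned ("?") migration to Harding; Harding is annotated with its load average and marked "Potentially Hacked".]

• Harding has been behaving oddly and is heavily loaded.

• Grant’s computations are critical to the application.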

• Should the system migrate Grant’s computations to Harding?

A Motivating Example: Explaining the Decision

• The system needed to run the computations somewhere.

[Figure: computation C1 on Grant, with a questioned ("?") migration to Harding; load averages shown for Harding, Thing1 and Thing2; Harding is marked "Potentially Hacked" with the annotation "Hack isn’t relevant".]

• Although more loaded than expected, Harding was still the best pool of available resources.
– Other machines were even more heavily loaded with other critical computations of the application.

• Hackers had correctly guessed a user password on Harding.
– They had set up a public FTP site containing pirated software.
– They had not, in fact, gained root access.

• There was, therefore, no serious worry in migrating critical computations to Harding.

A Different Example

• The application was being run to protect a US embassy in Africa during a period of international tension.

• We had observed a variety of information attacks being aimed at Harding.

• At least some of these attacks are of a type known to be effective in gaining root access to a machine like Harding.

• They are followed by a period of no anomalous behavior other than a periodic low volume communication with an unknown outside host.

• When Grant crashes, should Harding be used as the backup?

The Explanation

• It is likely that an intruder has gained root access to Harding.

• It is also likely that the intent of the intrusion is malicious and political.

• It is less likely, but still possible, that the periodic connection to an outside host is an attempt to contact a control source for a “go signal” that will initiate serious spoofing of the application.

• Under these circumstances, it is wiser to shift the computations to a more trusted machine (Grant) even though it is more overloaded than Harding.

The Core Thesis

Survivable systems make careful judgments about the trustworthiness of their computational environment, and they make rational resource allocation decisions based on their assessment of trustworthiness.

The Thesis In Detail: Trust Model

• It is crucial to estimate to what degree and for what purposes a computational resource may be trusted.

• This influences decisions about:
– What tasks should be assigned to which resources.
– What contingencies should be provided for.
– How much effort to spend watching over the resources.

• The trust estimate depends on having a model of the possible ways in which a computational resource may be compromised.

The Thesis in Detail: Perpetual Analytic Monitoring

• Trust Models depend on having a system for long term monitoring and analysis of the computational infrastructure.

• Monitoring must detect complex temporal patterns.
– E.g. “a period of attacks followed by quiescence followed by increasing degradation of service”

• The monitoring system must assimilate information from:
– Self-checking observation points within the application itself
– Intrusion detection systems
– Firewalls, filtering routers
– Other health status indicators

The Thesis in Detail: Adaptive Survivable Systems

• The application itself must be capable of self-monitoring and diagnosis.
– It must know the purposes of its components.
– It must check that these are achieved.
– If these purposes are not achieved, it must localize and characterize the failure.

• The application itself must be capable of adaptation so that it can best achieve its purposes within the available infrastructure.
– It must have more than one way to effect each critical computation.
– It should choose an alternative approach if the first one failed.
– It should make its initial choices in light of the trust model.

The Thesis in Detail: Rational Resource Allocation

• This depends on the ability of the application, monitoring, and control systems to engage in rational decision making about what resources they should use to achieve the best balance of expected benefit to risk.

• The amount of resources dedicated to monitoring should vary with the threat level

• The methods used to achieve computational goals and the location of the computations should vary with the threat

• Somewhat compromised systems will sometimes have to be used to achieve a goal

• Sometimes doing nothing will be the best choice

The Active Trust Management Architecture

[Figure: architecture diagram. Components: Self Adaptive Survivable Systems; Perpetual Analytical Monitoring; Trust Model (Trustworthiness, Compromises, Attacks); Rational Decision Making; Other Information Sources (Intrusion Detectors); Trend Templates; System Models & Domain Architecture; Rational Resource Allocation.]

The Nature of a Trust Model

• Trust is a continuous, probabilistic notion
– All computational resources must be considered suspect to some degree.

• Trust is a dynamic notion
– The degree of trustworthiness may change with further compromises.
– The degree of trustworthiness may change with efforts at amelioration.
– The degree of trustworthiness may depend on the political situation and the motivation of a potential attacker.

• Trust is a multidimensional notion
– A system may be trusted to deliver a message while not being trusted to preserve its privacy.
– A system may be unsafe for one user but relatively safe for another.

• The Trust Model must occupy at least three tiers
– The trustworthiness of each resource for each specific purpose
– The nature of the compromise to the resources
– The nature of the attacks on the resources
– Most work has only looked at attacks (intrusion detection).

Tiers of a Trust Model

• Attack Level: history of “bad” behaviors
– Penetration, denial of service, unusual access, flooding

• Compromise Level: state of mechanisms that provide:
– Privacy: stolen passwords, stolen data, packet snooping
– Integrity: parasitized, changed data, changed code
– Authentication: changed keys, stolen keys
– Non-repudiation: compromised keys, compromised algorithms
– QoS: slow execution
– Command and Control Properties: compromises to the monitoring infrastructure

• Trust Level: degree of confidence in key properties
– Compromise states
– Intent of attackers
– Political situation

Perpetual Analytical Monitoring (MAITA)

• Collects evidence from a broad variety of sources
– Intrusion detection systems
– Network monitors
– Firewalls
– Self monitoring application systems

• Filters, aggregates, correlates and conditions the data

• Matches these against a knowledge base of trend templates
– Templates represent temporal patterns indicative of the “etiology of a disease”
– Degree of match is indicative of the likelihood of compromise

• Sends alerts to running applications in the case of “alarm situations”

• Can be directed to increase/decrease its activity to:
– Bolster trust in a desirable resource
– Carefully monitor a potentially “dicey situation”
– Free “computrons” for more important activities

What is a Trend Template?

• A Temporal Pattern characteristic of etiology

• Key Features:
– Landmark points
– Temporal constraints between Landmarks
– Intervals bounded by Landmarks
– Value Constraints relating Variables within the intervals

• Trend templates represent a conditional probability of the conclusion given a match to the template
– Probabilistic inference also depends on the degree of fit to each trend template
– More than a single template may match the data

• Trend templates are used to recognize compromised resources after the fact.

• Trend templates are also used to recognize attacks
– Some attacks cannot be expressed without them
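
The key features listed above (intervals bounded by landmarks, temporal constraints, value constraints) can be sketched as a small data structure. This is a minimal illustration, not the MAITA representation: the names `Interval`, `degree_of_match` and `STOLEN_PASSWORD`, the thresholds, and the choice to score a match as the fraction of satisfied intervals are all assumptions made for the example. The interval durations loosely follow the motivating example (days of probing, weeks of quiet, weeks of growing disk use).

```python
from dataclasses import dataclass
from typing import Callable, Sequence

HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY


@dataclass
class Interval:
    """One interval of a trend template, bounded by two landmark points."""
    label: str
    min_hours: float          # temporal constraint: shortest allowed duration
    max_hours: float          # temporal constraint: longest allowed duration
    constraint: Callable[[Sequence[float]], bool]  # value constraint on samples


def at_least(threshold):
    return lambda xs: len(xs) > 0 and min(xs) >= threshold


def at_most(threshold):
    return lambda xs: len(xs) > 0 and max(xs) <= threshold


def increasing(xs):
    return len(xs) >= 2 and all(b > a for a, b in zip(xs, xs[1:]))


def degree_of_match(template, segments):
    """Fraction of template intervals satisfied by the observed segments.

    segments is a list of (duration_hours, samples) pairs, one per interval,
    already cut at candidate landmark points. Hypothetical scoring rule.
    """
    if len(segments) != len(template):
        return 0.0
    hits = sum(
        1
        for interval, (hours, samples) in zip(template, segments)
        if interval.min_hours <= hours <= interval.max_hours
        and interval.constraint(samples)
    )
    return hits / len(template)


# Hypothetical template: each interval checks a single monitored variable.
STOLEN_PASSWORD = [
    Interval("high rate of attacks and probes",
             1 * HOURS_PER_DAY, 3 * HOURS_PER_DAY, at_least(20)),
    Interval("quiet", 1 * HOURS_PER_WEEK, 3 * HOURS_PER_WEEK, at_most(5)),
    Interval("increasing disk use",
             1 * HOURS_PER_WEEK, 3 * HOURS_PER_WEEK, increasing),
]

full = degree_of_match(
    STOLEN_PASSWORD,
    [(48, [30, 45, 40]), (336, [1, 0, 2]), (240, [10, 20, 35])],
)
partial = degree_of_match(
    STOLEN_PASSWORD,
    [(48, [30, 45, 40]), (24, [1, 0, 2]), (240, [10, 20, 35])],  # quiet too short
)
```

A real matcher would also have to search for the landmark points themselves and turn the degree of fit into a probability of the conclusion; the sketch assumes the segmentation is given.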

Trend Templates

[Figure: trend template timeline. Intervals, in order: Increasing Attacks and Probes; Variable but High Rate of Attacks and Probes; Dropping Rate of Attacks; Quiet; Increasing Disk Use, Increasing External Access. Interval durations, in order as extracted: 1-3 days, 1-3 days, 1-3 hours, 1-3 weeks, 1-3 weeks. Conclusion: Compromise: Stolen Password, Exposed FTP Site.]

Adaptive Survivable Systems

[Figure: development and runtime environment for adaptive survivable systems. Development Environment: super routines in Layer1, Layer2 and Layer3; plan structures for components A, B, C and Foo, annotated "Post Condition 1 of Foo because Post Cond 2 of B and Post Cond 1 of C" and "PreReq 1 of B because Post Cond 1 of A"; Synthesized Sentinels; Component Asset Base with methods 1, 2, 3 for Foo, B and A. Runtime Environment: Self Monitoring of Condition-1 between A and B; alerts; Diagnostic Service; Repair Plan Selector; Resource Allocator; Rollback Designer; Enactment; Rational Selection ("To: Execute Foo", "Method 3 is most attractive"); Diagnosis & Recovery.]

How to Build Adaptive Survivable Systems

• Make Systems Fundamentally dynamic
– Systems should have more than one way to achieve any goal
– Always allow choices to be made (or revised) late

• Inform the runtime environment with design information
– Systems should know the purposes of their component computations
– Systems should know the quality of different methods for the same goal

• Make the System responsible for achieving its goals
– Systems should diagnose the failure to achieve intended goals
– Systems should select alternative techniques in the event of failure

• Build a trust model by pervasive monitoring and temporal analysis

• Optimize dynamically in light of the trust model
– Balance between quality of goal achieved vs risk encountered

Dynamic Rational Component Selection

• Systems have more than one method for each task.

• Each method specifies
– Quality of Service provided
– Resources Consumed
– Likelihood of success

• Likelihood of success is updated to reflect current state of trust model

• Select the method with greatest Expected Net Benefit

• Generalizes “Method Dispatch”
– Replace notion of “Most Specific Method” by that of “Most Beneficial Method”
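
A minimal sketch of "most beneficial method" dispatch follows. The slides do not give a formula, so the one used here is an assumption: expected net benefit is taken as (likelihood of success, scaled by current trust in the resource) times quality, minus resource cost, with quality and cost in the same arbitrary units. The `Method` class, the method names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Method:
    name: str
    quality: float      # quality of service delivered if the method succeeds
    cost: float         # resources consumed, in the same units as quality
    p_success: float    # likelihood of success before consulting the trust model
    resource: str       # resource the method would run on


def expected_net_benefit(method, trust):
    """Assumed formula: trust-adjusted success probability * quality - cost."""
    p = method.p_success * trust.get(method.resource, 0.0)
    return p * method.quality - method.cost


def select_method(methods, trust):
    """Pick the most beneficial method rather than the most specific one."""
    return max(methods, key=lambda m: expected_net_benefit(m, trust))


methods = [
    Method("fast-on-harding", quality=10.0, cost=2.0, p_success=0.9,
           resource="Harding"),
    Method("slow-on-thing1", quality=8.0, cost=3.0, p_success=0.9,
           resource="Thing1"),
]

# Harding is suspected of compromise, so the lower-quality method wins.
suspicious = select_method(methods, {"Harding": 0.5, "Thing1": 0.95})

# Once trust in Harding is restored, the higher-quality method wins again.
restored = select_method(methods, {"Harding": 0.95, "Thing1": 0.95})
```

The same method list yields different choices as the trust model changes, which is the sense in which the likelihood of success is updated to reflect the current state of the trust model.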

Informing the Runtime with Design Info

• Plan Structures
– Goals
– Invariants
– Dependencies

• Dispatching
– Alternative Methods for common tasks
– Applicability conditions for alternatives
– Decision Criteria for selecting methods

• Active Sentinels
– Monitors for expected conditions
– Data Collectors for runtime statistics
– Data on long term failure rates

Making the System Responsible for Achieving Its Goals

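[Figure: recovery pipeline. A Monitor watches Condition-1, which component A achieves and component B requires as a prerequisite; alerts go to the Diagnostic Service (Localization & Characterization), then to the Repair Plan Selector (Scope of Recovery, Selection of Alternative), which produces a Concrete Repair Plan; the Resource Allocator produces a Resource Plan; the Rollback Designer and Enactment carry out the repair.]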

The Space of Intrusion Detection

[Figure: the space of intrusion detection. Axis labels: Statistical Profile, Structural Model/Pattern; Match to Bad, Discrepancy from Good. Cell labels as extracted: Anomaly; Suspicious Violation; Symptom; Model of Expected Behavior. Sources of models: unsupervised learning from normal runs; supervised learning from attack runs; handcoded structural models of attacks.]

A symptom may indicate an attack or a compromise.

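Model Based Troubleshooting: GDE

[Figure: GDE example with three Times components and two Plus components, referred to by color. Values shown, in order as extracted: 3, 5, 3, 5, 5; 40, 40, 35, 40; 25, 20; 15, 15, 25. Conflicts: not recoverable from the extraction. Diagnoses: Blue or Violet Broken; Green Broken, Red with compensating fault; Green Broken, Yellow with masking fault.]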

Moving to a Bayesian Framework

• The model has two (or more) levels of detail specifying computations, the underlying resources and the mapping of computations to resources

• Each computation has models of its normal and compromised states (and a model for “everything else”)

• Each resource has models of its normal and compromised states

• The modes of the resource models are linked to the modes of the computational models by conditional probabilities

• The Model can be viewed as a Bayesian Network

[Figure: Bayesian network fragment. Labels: Component 1, Located On, Node17, Has models. Delay models: Normal: delay 2,4; Delayed: delay 4,+inf; Accelerated: delay -inf,2. Mode probabilities: Normal 90%, Parasite 9%, Other 1%. One link is annotated: conditional probability = .2.]
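
To make the link between resource modes and computation modes concrete, here is a minimal Bayes-rule sketch. It takes the mode probabilities 90% / 9% / 1% from this slide as priors over a resource's modes, which is an assumed reading of the figure. The likelihoods of observing a delayed computation under each resource mode are hypothetical values chosen for illustration and are not taken from the slides.

```python
def posterior(priors, likelihoods):
    """Bayes rule over a discrete set of modes.

    priors[mode]      = P(mode)
    likelihoods[mode] = P(observation | mode)
    returns             P(mode | observation)
    """
    joint = {mode: priors[mode] * likelihoods[mode] for mode in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under every mode")
    return {mode: p / evidence for mode, p in joint.items()}


# Priors over resource modes (assumed reading of the figure).
resource_priors = {"normal": 0.90, "parasite": 0.09, "other": 0.01}

# Hypothetical P(computation is Delayed | resource mode).
p_delayed_given_mode = {"normal": 0.2, "parasite": 0.9, "other": 0.5}

# Belief about the resource after observing a delayed computation.
resource_posterior = posterior(resource_priors, p_delayed_given_mode)
```

Observing a delay shifts probability mass from the normal mode toward the parasite mode. A full Bayesian network would propagate such updates across every computation located on the same resource.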



The Diagnostic Algorithm

• Begin with all normal modes

• Repeat until there are no conflicts
– Extend the Bayesian network with a new node indicating the incompatibility of the models in the conflict
– Remove all supersets of the conflict
– Recompute the probabilities in the Bayesian network
– Choose the least likely model in the conflict and replace it with its most likely alternative

• The Bayesian network now has posterior probabilities for all models

• By adding nodes to the Bayesian network for any remaining diagnosis (i.e. a model for each component) we can compute the posterior probability of that particular diagnosis

• The updated probabilities are fed back to the monitoring system.
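
The control loop above can be sketched in a heavily simplified form. This is not the algorithm from the slides: it replaces the Bayesian network with fixed, independent mode priors, so no probabilities are recomputed and no posteriors are produced. It keeps only the "start normal, find a conflict, swap the least likely model for its most likely alternative" skeleton. The function `diagnose`, the caller-supplied `find_conflict` callback, and the example numbers are hypothetical.

```python
def diagnose(priors, find_conflict, max_steps=100):
    """Greedy repair loop over component modes.

    priors: {component: {mode: prior probability}}; every component needs a
            "normal" mode.
    find_conflict(assignment): returns a set of components whose current modes
            are jointly inconsistent with the observations, or an empty set.
    Returns a conflict-free assignment, or None if none is found.
    """
    assignment = {component: "normal" for component in priors}
    tried = {component: {"normal"} for component in priors}
    for _ in range(max_steps):
        conflict = find_conflict(assignment)
        if not conflict:
            return assignment
        # Only components with untried modes can still be changed.
        open_components = sorted(
            c for c in conflict if set(priors[c]) - tried[c]
        )
        if not open_components:
            return None
        # Least likely current model in the conflict.
        suspect = min(open_components, key=lambda c: priors[c][assignment[c]])
        # Its most likely untried alternative.
        untried = sorted(set(priors[suspect]) - tried[suspect])
        replacement = max(untried, key=lambda m: priors[suspect][m])
        assignment[suspect] = replacement
        tried[suspect].add(replacement)
    return None


priors = {
    "A": {"normal": 0.9, "slow": 0.1},
    "B": {"normal": 0.7, "slow": 0.3},
}


def find_conflict(assignment):
    # Hypothetical observation: the output arrived late, so A and B cannot
    # both be in their normal mode.
    if all(mode == "normal" for mode in assignment.values()):
        return set(assignment)
    return set()


result = diagnose(priors, find_conflict)
```

Here B is less likely than A to be normal, so B is the first model to be replaced, and the loop stops once the conflict disappears.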

Final Model Probabilities

Resource        Hacked Posterior   Hacked Prior   Normal Posterior   Normal Prior
Trader-Joe      .324               .300           .676               .700
Bonds-R-Us      .207               .200           .793               .800
JPMorgan-Net    .450               .150           .550               .850
WallSt-Server   .267               .100           .733               .900

Computation       Mode       Probability
Web-Server        Off-Peak   .028
                  Peak       .541
                  Normal     .432
Dollar-Monitor    Slow       .738
                  Normal     .262
Yen-Monitor       Slower     .516
                  Slow       .339
                  Normal     .145
Bond-Trader       Slow       .590
                  Fast       .000
                  Normal     .410
Currency-Trader   Slow       .612
                  Fast       .065
                  Normal     .323

Work Plan

• New start July 1, 2000

• Tasks in Base Effort
– Trust Models: Ontologies of Attacks, Compromises, Intentions, and Trust Positions
– Perpetual Analytic Monitoring: Trend templates dealing with compromises, informed by IT systems and self-monitoring applications
– Rational Trust Management: Decision theoretic models and algorithms for allocating resources to computations
– Test bed for above

• Tasks in Options
– Self-Adaptive Application Infrastructure: Synthesis of monitors, diagnostic techniques,

Milestones

• Trust models
– Publish final ontology of attacks, compromises, etc. (month 12)

• Perpetual Analytic Monitoring
– Demonstrate trust monitoring library (month 12)
– Demonstrate reconfiguration of monitoring infrastructure (month 18)
– Final exam (month 18)

• Rational Trust Management
– Demonstrate initial models and algorithms (month 12)
– Demonstrate capability in high-frequency real-time environment (month 18)

• Options:
– Demonstrate initial prototype of Adaptive System infrastructure which utilizes the initial trust model ontology (month 24)
– Demonstrate prototype of integrated Adaptive System development and runtime environment, including all aspects of intrusion tolerance (month 30)
– Final exam (month 36)

Major Risks

• How to guard against monitoring infrastructure itself becoming compromised or denied service

• Comprehensiveness of the ontologies (attacks, compromises, trust states, etc.)

• Decision making in real-time (method selection, resource allocation)

Mitigation Strategy

• In principle, much of the other technology could be used to harden our infrastructure

• Our techniques for adaptivity can be applied to our infrastructure

• Other projects are also cataloging the same knowledge; we can broaden our list of collaborators

• Replace explicit decision theoretic techniques by qualitative analogs and by switching between rule-based policies that approximate decision theoretic conclusions.

Testing Strategy

• Deploy small application in the AI Lab environment
– Open university environment
– Subject to frequent hackery

• Use this as Test Bed

• Incrementally deploy research techniques in test bed

• Measure effectiveness
– Against usual background
– Against intentional, staged attacks (by us and our friends)

Conclusion

Survivable systems make careful judgments about the trustworthiness of their computational environment, and they make rational resource allocation decisions based on their assessment of trustworthiness.

We hope to demonstrate the feasibility of this approach over the next 18 months and then enrich the technical solutions during the remaining contract period.
