Transcript of RABA’s Red Team Assessments

Page 1: RABA’s Red Team Assessments

RABA’s Red Team Assessments

14 December 2005

QuickSilver

Page 2: RABA’s Red Team Assessments

Agenda

• Tasking for this talk…
• Projects Evaluated
• Approach / Methodology
• Lessons Learned
  o and Validations Achieved
• The Assessments
  o General Strengths / Weaknesses
  o AWDRAT (MIT)
    • Success Criteria
    • Assessment Strategy
    • Strengths / Weaknesses
  o LRTSS (MIT)
  o QuickSilver / Ricochet (Cornell)
  o Steward (JHU)

Page 3: RABA’s Red Team Assessments

The Tasking

“Lee would like a presentation from the Red Team perspective on the experiments you've been involved with. He's interested in a talk that's heavy on lessons learned and benefits gained. Also of interest would be red team thoughts on strengths and weaknesses of the technologies involved. Keeping in mind that no rebuttal would be able to take place beforehand, controversial observations should be either generalized (i.e., false positives as a problem across several projects) or left to the final report.”

-- John Frank e-mail (November 28, 2005)

Page 4: RABA’s Red Team Assessments

Specific Teams We Evaluated

• Architectural-Differencing, Wrappers, Diagnosis, Recovery, Adaptive Software and Trust Management (AWDRAT)
  o October 18-19, 2005
  o MIT
• Learning and Repair Techniques for Self-Healing Systems (LRTSS)
  o October 25, 2005
  o MIT
• QuickSilver / Ricochet
  o November 8, 2005
  o Cornell University
• Steward
  o December 9, 2005
  o JHU

Page 5: RABA’s Red Team Assessments

Basic Methodology

• Planning
  o Present High Level Plan at July PI Meeting
  o Interact with White Team to schedule
  o Prepare Project Overview
  o Prepare Assessment Plan
• Coordinate with Blue Team and White Team
• Learning
  o Study documentation provided by team
  o Conference Calls
  o Visit with Blue Team day prior to assessment
    • Use system, examine output, gather data
• Test
• Formal De-Brief at end of Test Day

Page 6: RABA’s Red Team Assessments

Lessons Learned

(and VALIDATIONS achieved)

Page 7: RABA’s Red Team Assessments

Validation / Lessons Learned

• Consistent Discontinuity of Expectations
  o Scope of the Assessment + Success Criteria
    • Boiling it down to “Red Team Wins” or “Blue Team Wins” on each test required significant clarity
      o Unique to these assessments because the metrics were unique
    • Lee/John instituted an assessment scope conference call ½ way through
      o we think that helped a lot
  o Scope of Protection for the systems
    • Performer’s Assumptions vs. Red Team’s Expectations
    • In all cases, we wanted to see a more holistic approach to the security model
    • We assert each program needs to define its security policy
      o And especially document what it assumes will be protected / provided by other components or systems

Page 8: RABA’s Red Team Assessments

LL: Scope of Protection
(Diagram contrasting the Existing System’s Perspective with the Red Team’s Perspective on what each system should protect)

AWDRAT:
- OS Environment (s/w & data at rest; services)
- Complete Path Protection

LRTSS:
- Protect ALL Data Structures
- Protect Dependent Relationships
- Detect Pointer Corruption

QuickSilver:
- OS Environment (s/w & data at rest; services)

Steward:
- Protect Keys & Key Mgmt
- Defense against Evil Clients
- OS Environment (s/w & data at rest; services)

Page 9: RABA’s Red Team Assessments

Validation / Lessons Learned

• More time would have helped A LOT
  o Longer Test Period (2-3 day test vice 1 day test)
    • Having an evening to digest then return to test would have allowed more effective additional testing and insight
  o We planned an extra 1.5 days for most, and that was very helpful
    • We weren’t rushing to get on an airplane
    • We could reduce the data and come back for clarifications if needed
    • We could defer non-controversial tests to the next day to allow focus with Government present
• More Communication with Performers
  o Pre-Test Site/Team Visit (~2-3 weeks prior to test)
    • Significant help in preparing testing approach
    • The half-day that we implemented before the test was crucial for us
  o More conference calls would have helped, too
  o Hard to balance against performers’ main focus, though

Page 10: RABA’s Red Team Assessments

Validation / Lessons Learned

• A Series of Tests Might Be Better
  o Perhaps one day of tests similar to what we did
  o Then a follow-up test a month or two later as prototypes matured
    • With the same test team to leverage the understanding of the system gained
• We Underestimated the Effort in Our Bid
  o Systems were more unique and complex than we anticipated
  o 20-25% more hours would have helped us a lot in data reduction
• Multi-talented team proved vital to success
  o We had programming (multi-lingual), traditional red team, computer security, systems engineering, OS, system admin, network engineering, etc. talent present for each test
• Highly tailored approach proved appropriate and necessary
  o Using a more traditional network-oriented Red Team Assessment approach would have failed

Page 11: RABA’s Red Team Assessments

The Assessments

Page 12: RABA’s Red Team Assessments

Overall Strengths / Weaknesses of Projects

• Strengths
  o Teams worked hard to support our assessments
  o The technologies are exciting and powerful
• Weaknesses
  o Most Suffered a Lack of System Documentation
    • We understand there is a balance to strike – these are essentially research prototypes, after all
    • Really limited our ability to prepare for the assessment
  o All are Prototypes -- stability needed for deterministic test results
  o All provide incomplete security / protection almost by definition
  o Most Suffered a Lack of Configuration Management / Control
  o Test “Harnesses” far from optimal for Red Team use
    • Of course, they are oriented around supporting the development
    • But, we’re fairly limited in using other tools due to the uniqueness of the technologies

Page 13: RABA’s Red Team Assessments

AWDRAT Assessment

October 18-19, 2005

Page 14: RABA’s Red Team Assessments

Success Criteria

• The target application can successfully and/or correctly perform its mission
• The AWDRAT system can
  o detect an attacked client’s misbehavior
  o interrupt a misbehaving client
  o reconstitute a misbehaving client in such a way that the reconstituted client is not vulnerable to the attack in question
• The AWDRAT system must
  o Detect / Diagnose at least 10% of attacks/root causes
  o Take effective corrective action on at least 5% of the successfully identified compromises/attacks (a hedged scoring sketch follows below)
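A minimal sketch of how per-test results could be tallied against the 10% detection and 5% corrective-action thresholds above. This is not RABA's actual scoring tool; the data structure and field names are assumptions for illustration only.

# Hypothetical scoring sketch: tally Red Team test results against the AWDRAT
# success thresholds quoted above (>=10% detection/diagnosis of attacks,
# effective corrective action on >=5% of successfully identified attacks).
from dataclasses import dataclass

@dataclass
class TestResult:
    attack_id: str
    detected: bool      # AWDRAT detected/diagnosed the attack or root cause
    corrected: bool     # AWDRAT took effective corrective action

def meets_success_criteria(results: list[TestResult],
                           detect_floor: float = 0.10,
                           correct_floor: float = 0.05) -> bool:
    if not results:
        return False
    detected = [r for r in results if r.detected]
    detect_rate = len(detected) / len(results)
    # Corrective-action rate is measured over the successfully identified attacks.
    correct_rate = (sum(r.corrected for r in detected) / len(detected)) if detected else 0.0
    return detect_rate >= detect_floor and correct_rate >= correct_floor

if __name__ == "__main__":
    results = [  # made-up outcomes for illustration
        TestResult("dos-buffer-overflow", detected=True, corrected=True),
        TestResult("false-negative-probe", detected=False, corrected=False),
        TestResult("state-disruption", detected=True, corrected=False),
    ]
    print(meets_success_criteria(results))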

Page 15: RABA’s Red Team Assessments

Assessment Strategy

• Denial of Service
  o aimed at disabling or significantly modifying the operation of the application to an extent that mission objectives cannot be accomplished
  o attacks using buffer-overflow and corrupted data injection to gain system access
• False Negative Attacks
  o a situation in which a system fails to report an occurrence of anomalous or malicious behavior
  o Red Team hoped to perform actions that would fall "under the radar". We targeted the modules of AWDRAT that support diagnosis and detection.
• False Positive Attacks
  o system reports an occurrence of malicious behavior when the activity detected was non-malicious
  o Red Team sought to perform actions that would excite AWDRAT's monitors. Specifically, we targeted the modules supporting diagnosis and detection.
• State Disruption Attacks
  o interrupt or disrupt AWDRAT's ability to maintain its internal state machines
• Recovery Attacks
  o disrupt attempts to recover or regenerate a misbehaving client
  o target the Adaptive Software and Recovery and Regeneration modules in an attempt to allow a misbehaving client to continue operating

Page 16: RABA’s Red Team Assessments

Strengths / Weaknesses

• Strengths
  o With a reconsideration of the system’s scope of responsibility, we anticipate the system would have performed far better in the tests
  o We see great power in the concept of wrapping all the functions
• Weaknesses
  o Scope of Responsibility / Protection far too Limited
  o Need to Develop Full Security Policy
  o Single points of failure
  o Application-Specific Limitations
  o Application Model Issues
    • Incomplete – by design?
    • Manually Created
    • Limited Scope
    • Doesn’t really enforce multi-layered defense

Page 17: RABA’s Red Team Assessments

LRTSS Assessment

October 25, 2005

Page 18: RABA’s Red Team Assessments

Success Criteria

• The instrumented Freeciv server does not core dump under a condition in which the uninstrumented Freeciv server does core dump
• The LRTSS system can
  o Detect a corruption in a data structure that causes an uninstrumented Freeciv server to exit
  o Repair the data corruption in such a way that the instrumented Freeciv server can continue running (a hedged sketch of this detect-and-repair pattern follows below)
• The LRTSS system must
  o Detect / Diagnose at least 10% of attacks/root causes
  o Take effective corrective action on at least 5% of the successfully identified compromises/attacks
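A minimal sketch of the general detect-and-repair pattern described in the criteria above. This is not LRTSS's actual consistency engine, which operates on C data structures inside the Freeciv server; the invariants and field names here are illustrative assumptions only.

# Illustrative detect-and-repair pattern: check simple invariants over a game
# state and restore a conservative consistent value when an invariant is
# violated, so the "server" can keep running instead of crashing.
# NOT the LRTSS algorithm; it only mirrors the check/repair cycle above.

def check_and_repair(game_state: dict) -> list[str]:
    repairs = []
    # Invariant 1: every city must reference a player that actually exists.
    players = set(game_state.get("players", []))
    for city in game_state.get("cities", []):
        if city.get("owner") not in players:
            city["owner"] = next(iter(players), None)  # conservative repair
            repairs.append(f"reassigned orphaned city {city.get('name')}")
    # Invariant 2: population counts must be non-negative.
    for city in game_state.get("cities", []):
        if city.get("population", 0) < 0:
            city["population"] = 0
            repairs.append(f"reset negative population in {city.get('name')}")
    return repairs

if __name__ == "__main__":
    state = {"players": ["alice"],
             "cities": [{"name": "thebes", "owner": "ghost", "population": -3}]}
    for note in check_and_repair(state):
        print("repair:", note)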

Page 19: RABA’s Red Team Assessments

Assessment Strategy

• Denial of Service
  o Aimed at disabling or significantly modifying the operation of the Freeciv server to an extent that mission objectives cannot be accomplished
  o In this case, not achieving mission objectives is defined as the Freeciv server exiting or dumping core
  o Attacks using buffer-overflow, corrupted data injection, and resource utilization
  o Various data corruptions aimed at causing the server to exit
  o Formulated the attacks by targeting the uninstrumented server first, then running the same attack against the instrumented server (a hedged harness sketch follows below)
• State Disruption Attacks
  o interrupt or disrupt LRTSS's ability to maintain its internal state machines
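A minimal harness sketch for the attack-formulation approach above: run the same corruption attack against an uninstrumented and an instrumented server build and compare whether each survives. The binary and tool names are placeholders, not RABA's actual tooling.

# Hypothetical attack-comparison harness (command paths are assumptions).
import subprocess

def survives(server_cmd, attack_cmd, grace=10):
    """Start a server, deliver one attack, report whether the server survived."""
    server = subprocess.Popen(server_cmd)
    subprocess.run(attack_cmd, check=False)   # deliver the corruption attack
    try:
        server.wait(timeout=grace)            # server exited within the grace period
        return server.returncode == 0         # clean exit vs. crash/core dump
    except subprocess.TimeoutExpired:
        server.terminate()                    # still running after the attack: survived
        return True

if __name__ == "__main__":
    attack = ["./inject_corruption", "--target", "localhost:5556"]   # placeholder tool
    plain = survives(["./civserver"], attack)          # uninstrumented Freeciv server
    healed = survives(["./civserver-lrtss"], attack)   # instrumented build (name assumed)
    print(f"uninstrumented survived: {plain}, instrumented survived: {healed}")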

Page 20: RABA’s Red Team Assessments

Strengths / Weaknesses

• Strengths
  o Performs very well under simple data corruptions
    • (that would cause the system to crash under normal operation)
  o Performs well under a large number of these simple data corruptions
    • (200 to 500 corruptions are repaired successfully)
  o Learning and Repair algorithms well thought out
• Weaknesses
  o Scope of Responsibility / Protection too limited
  o Complex Data Structure Corruptions not handled well
  o Secondary Relationships are not protected against
  o Pointer Data Corruptions not entirely tested
  o Timing of Check and Repair Cycles not optimal
  o Description of “Mission Failure” as core dump may be excessive

Page 21: RABA’s Red Team Assessments

QuickSilver Assessment

November 8, 2005

Page 22: RABA’s Red Team Assessments

Success Criteria

• Ricochet can successfully and/or correctly perform its mission
  o “Ricochet must consistently achieve a fifteen-fold reduction in latency (with benign failures) for achieving consistent values of data shared among one hundred to ten thousand participants, where all participants can send and receive events.”
• Per client direction, elected to use average latency time as the comparative metric (a hedged metric sketch follows below)
  o Ricochet’s Average Recovery demonstrates 15-fold improvement over SRM
  o Additional constraint levied requiring 98% update saturation (imposing the use of the NACK failover for Ricochet)
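A minimal sketch of the comparative metric described above: average packet-recovery latency for Ricochet versus SRM, with the 98% update-saturation constraint. The sample numbers are made up; RABA's actual data reduction was far richer.

# Hypothetical metric sketch (illustrative values only).
from statistics import mean

def evaluate(ricochet_latencies_ms, srm_latencies_ms,
             updates_delivered, updates_sent,
             required_ratio=15.0, required_saturation=0.98):
    saturation = updates_delivered / updates_sent
    ratio = mean(srm_latencies_ms) / mean(ricochet_latencies_ms)
    return ratio >= required_ratio and saturation >= required_saturation, ratio, saturation

if __name__ == "__main__":
    ok, ratio, sat = evaluate(
        ricochet_latencies_ms=[2.1, 1.8, 2.4],    # made-up sample data
        srm_latencies_ms=[40.0, 35.5, 38.2],
        updates_delivered=9850, updates_sent=10000)
    print(f"meets criteria: {ok} (ratio={ratio:.1f}x, saturation={sat:.1%})")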

Page 23: RABA’s Red Team Assessments

Assessment Strategy

• Scalability Experiments -- test scalability in terms of number of groups per node and number of nodes per group. Here, no node failures will be simulated, and no packet losses will be induced (aside from those that occur as a by-product of normal network traffic).
  o Baseline Latency
  o Group Scalability
  o Large Repair Packet Configuration
  o Large Data Packet Storage Configuration
• Simulated Node Failures – simulate benign node failures.
  o Group Membership Overhead / Intermittent Network Failure
• Simulated Packet Losses – introduce packet loss into the network (a hedged fault-injection sketch follows this list).
  o High Packet Loss Rates
    • Node-driven Packet Loss
    • Network-driven Packet Loss
    • Ricochet-driven Packet Loss
  o High Ricochet Traffic Volume
  o Low Bandwidth Network
• Simulated Network Anomalies – simulate benign routing and network errors that might exist on a deployed network. The tests will establish whether or not the protocol is robust in its handling of typical network anomalies, as well as those atypical network anomalies that may be induced by an attacker.
  o Out of Order Packet Delivery
  o Packet Fragmentation
  o Duplicate Packets
  o Variable Packet Sizes
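A minimal fault-injection sketch for the packet loss, reordering, and duplication conditions listed above, using Linux netem. The presentation does not say what tooling RABA used; this is an assumed illustration, and the interface name is a placeholder. Requires root.

# Hypothetical netem-based fault injection (interface name is an assumption).
import subprocess

DEV = "eth0"  # placeholder interface

def apply_netem(*netem_args: str) -> None:
    subprocess.run(["tc", "qdisc", "replace", "dev", DEV, "root", "netem", *netem_args],
                   check=True)

def clear_netem() -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", DEV, "root"], check=False)

if __name__ == "__main__":
    apply_netem("loss", "5%")                                # high packet loss
    apply_netem("delay", "100ms", "20ms", "reorder", "25%")  # out-of-order delivery
    apply_netem("duplicate", "1%")                           # duplicate packets
    clear_netem()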

Page 24: RABA’s Red Team Assessments

Strengths / Weaknesses

• Strengths
  o Appears to be very resilient when operating within its assumptions
  o Very stable software
  o Significant performance gains over SRM
• Weaknesses
  o FEC-orientation – the focus of the statistics obscures valuable data regarding complete packet delivery
  o Batch-oriented Test Harness
    • Impossible to perform interactive attacks
    • Very limited insight into blow-by-blow performance
  o Metrics collected were very difficult to understand fully

Page 25: RABA’s Red Team Assessments

STEWARD Assessment

December 9, 2005

Page 26: RABA’s Red Team Assessments

Success Criteria

• The STEWARD system must:
  o Make progress in the system when under attack
    • Progress is defined as the eventual global ordering, execution, and reply to any request which is assigned a sequence number within the system
  o Maintain consistency of the data replicated on each of the servers in the system

Page 27: RABA’s Red Team Assessments

Assessment Strategy

• Validation Activities - tests we will perform to verify that STEWARD can endure up to five Byzantine faults while maintaining a three-fold reduction in latency with respect to BFT (a hedged validation-check sketch follows this list)
  o Byzantine Node Threshold
  o Benchmark Latency
• Progress Attacks - attacks we will launch to prevent STEWARD from progressing to a successful resolution of an ordered client request
  o Packet Loss
  o Packet Delay
  o Packet Duplication
  o Packet Re-ordering
  o Packet Fragmentation
  o View Change Message Flood
  o Site Leader Stops Assigning Sequence Numbers
  o Site Leader Assigns Non-Contiguous Sequence Numbers
  o Suppressed New-View Messages
  o Consecutive Pre-Prepare Messages in Different Views
  o Out of Order Messages
  o Byzantine Induced Failover
• Data Integrity Attacks - attempts to create an inconsistency in the data replicated on the various servers in the network
  o Arbitrarily Execute Updates
  o Multiple Pre-Prepare Messages using Same Sequence Numbers and Different Request Data
  o Spurious Prepare, Null Messages
  o Suppressed Checkpoint Messages
  o Prematurely Perform Garbage Collection
  o Invalid Threshold Signature
• Protocol State Attacks - attacks focused on interrupting or disrupting STEWARD's ability to maintain its internal state machines
  o Certificate Threshold Validation Attack
  o Replay Attack
  o Manual Exploit of Client or Server

Note: We did not try to validate or break the encryption algorithms.
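A minimal sketch of the validation check described under Validation Activities above: confirm that measured STEWARD latency stays at least three times lower than BFT's while up to five nodes behave Byzantine. The numbers are illustrative; the official scoring belongs to the White Team report.

# Hypothetical validation-check sketch (sample data is made up).
from statistics import mean

def validates(steward_latency_ms_by_faults: dict[int, list[float]],
              bft_latency_ms: list[float],
              max_faults: int = 5, required_ratio: float = 3.0) -> bool:
    bft_avg = mean(bft_latency_ms)
    for faults in range(max_faults + 1):
        samples = steward_latency_ms_by_faults.get(faults)
        # Require samples at every fault level and a >=3x latency advantage.
        if not samples or bft_avg / mean(samples) < required_ratio:
            return False
    return True

if __name__ == "__main__":
    made_up = {f: [20.0 + f, 21.0 + f, 19.5 + f] for f in range(6)}  # illustrative
    print(validates(made_up, bft_latency_ms=[90.0, 95.0, 88.0]))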

Page 28: RABA’s Red Team Assessments

Strengths / Weaknesses

• Strengths
  o First system that assumes and actually tolerates corrupted components (Byzantine attack)
  o Blue Team spent extensive time up front in analysis, design and proof of the protocol – it was clear in the performance
  o System was incredibly stable and resilient
  o We did not compromise the system
• Weaknesses
  o Limited Scope of Protection
    • Relies on external entity to secure and manage keys which are fundamental to the integrity of the system
    • STEWARD implicitly and completely trusts the client
  o Client-side attacks were out of scope of the assessment

Page 29: RABA’s Red Team Assessments

Going Forward

• White Team will generate the definitive report on this Red Team Test activity

o It will have the official scoring and results

• RABA (Red Team) will generate a test report from our perspective

o We will publish to:
  • PI for the Project

• White Team (Mr. Do)

• DARPA (Mr. Badger)

Page 30: RABA’s Red Team Assessments

Questions or Comments

Any Questions, Comments, or Concerns?