An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection
Matt Mahoney
Feb. 18, 2003
Is the DARPA/Lincoln Labs IDS Evaluation Realistic?
• The most widely used intrusion detection evaluation data set.
• 1998 data used in KDD cup competition with 25 participants.
• 8 participating organizations submitted 18 systems to the 1999 evaluation.
• Tests host or network based IDS.
• Tests signature or anomaly detection.
• 58 types of attacks (more than any other evaluation).
• 4 target operating systems.
• Training and test data released after evaluation to encourage IDS development.
Problems with the LL Evaluation
• Background network data is synthetic.
• SAD (Simple Anomaly Detector) detects too many attacks.
• Comparison with real traffic – the range of attribute values is too small and static (TTL, TCP options, client addresses…).
• Injecting real traffic removes suspect detections from PHAD, ALAD, LERAD, NETAD, and SPADE.
1. Simple Anomaly Detector (SAD)
• Examines only inbound client TCP SYN packets.
• Examines only one byte of the packet.
• Trains on attack-free data (week 1 or 3).
• A value never seen in training is an anomaly.
• If there have been no anomalies for 60 seconds, then output an alarm with score 1 (sketched in code below).
[Example: training bytes 001110111; test bytes 010203001323011, with two 60-second intervals marked in the test sequence.]
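A minimal sketch of SAD in Python, assuming the caller supplies already-filtered inbound client TCP SYN packets as (timestamp, bytes) pairs and a byte_offset selecting the single byte to examine; the names and structure here are illustrative, not taken from the original code.

def train_sad(training_packets, byte_offset):
    """Learn the set of byte values seen at byte_offset in attack-free traffic."""
    allowed = set()
    for _, pkt in training_packets:
        allowed.add(pkt[byte_offset])
    return allowed

def sad_alarms(test_packets, allowed, byte_offset):
    """Yield (timestamp, score) alarms.  A never-seen value is an anomaly;
    an alarm (score 1) fires only if no anomaly occurred in the last 60 s."""
    last_anomaly = float("-inf")
    for timestamp, pkt in test_packets:
        if pkt[byte_offset] not in allowed:
            if timestamp - last_anomaly >= 60.0:
                yield (timestamp, 1.0)
            last_anomaly = timestamp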
DARPA/Lincoln Labs Evaluation
• Weeks 1 and 3: attack-free training data.
• Week 2: training data with 43 labeled attacks.
• Weeks 4 and 5: 201 test attacks.
[Test bed diagram: attack traffic from the Internet passes through a router and an inside sniffer to four target hosts: SunOS, Solaris, Linux, and NT.]
SAD Evaluation
• Develop on weeks 1-2 (available in advance of 1999 evaluation) to find good bytes.
• Train on week 3 (no attacks).
• Test on weeks 4-5 inside sniffer traffic (177 visible attacks).
• Count detections and false alarms using the 1999 evaluation criteria (a simplified counting sketch follows below).
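The following is a simplified counting sketch, not the official scoring software: it assumes an alarm counts as a detection if it names the victim address within 60 seconds of the attack's time interval, and that alarms are spent in order of decreasing score until the false alarm budget is exhausted. The data layout is an assumption made for this sketch.

def count_detections(alarms, attacks, max_false_alarms=100):
    """alarms: list of (time, victim_ip, score); attacks: list of
    (start, end, victim_ip).  Returns (detections, false_alarms_used)."""
    alarms = sorted(alarms, key=lambda a: -a[2])      # highest score first
    detected = set()
    false_alarms = 0
    for t, ip, _ in alarms:
        hit = False
        for i, (start, end, victim) in enumerate(attacks):
            if ip == victim and start - 60 <= t <= end + 60:
                detected.add(i)
                hit = True
        if not hit:
            false_alarms += 1
            if false_alarms >= max_false_alarms:
                break
    return len(detected), false_alarms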
SAD Results
• Variants (bytes) that do well: source IP address (any of 4 bytes), TTL, TCP options, IP packet size, TCP header size, TCP window size, source and destination ports.
• Variants that do well on weeks 1-2 (available in advance) usually do well on weeks 3-5 (evaluation).
• Very low false alarm rates.
• Most detections are not credible.
SAD vs. 1999 Evaluation
• The top system in the 1999 evaluation, Expert 1, detects 85 of 169 visible attacks (50%) at 100 false alarms (10 per day) using a combination of host and network based signature and anomaly detection.
• SAD detects 79 of 177 visible attacks (45%) with 43 false alarms using the third byte of the source IP address.
1999 IDS Evaluation vs. SAD
[Bar chart: recall % and precision for Expert 1, Expert 2, Dmine, Forensics, and SAD variants (TTL, TCP header, source IP byte 3); scale 0 to 100.]
SAD Detections by Source Address (that should have been missed)
• DOS on public services: apache2, back, crashiis, ls_domain, neptune, warezclient, warezmaster
• R2L on public services: guessftp, ncftp, netbus, netcat, phf, ppmacro, sendmail
• U2R: anypw, eject, ffbconfig, perl, sechole, sqlattack, xterm, yaga
2. Comparison with Real Traffic
• Anomaly detection systems flag rare events (e.g. previously unseen addresses or ports).
• “Allowed” values are learned during training on attack-free traffic.
• Novel values in background traffic would cause false alarms.
• Are novel values more common in real traffic?
Measuring the Rate of Novel Values
• r = Number of values observed in training.
• r1 = Fraction of values seen exactly once (Good-Turing probability estimate that the next value will be novel).
• rh = Fraction of values seen only in the second half of training.
• rt = Fraction of training time to observe half of all values.
Larger values in real data would suggest a higher false alarm rate (a computation sketch follows below).
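As a concrete illustration, here is one way these statistics could be computed from a time-ordered list of (timestamp, value) observations; the tuple layout and names are assumptions made for this sketch, and r1 is taken as the fraction of the r distinct values that occur exactly once.

from collections import Counter

def novelty_stats(observations):
    """observations: list of (timestamp, value) pairs, ordered by time."""
    counts = Counter(v for _, v in observations)
    r = len(counts)                                        # distinct values seen
    r1 = sum(1 for c in counts.values() if c == 1) / r     # fraction seen exactly once

    half = len(observations) // 2
    first_half = {v for _, v in observations[:half]}
    rh = sum(1 for v in counts if v not in first_half) / r  # seen only in 2nd half

    # rt: fraction of training time elapsed when half of the distinct values
    # have appeared at least once.
    t0, t_end = observations[0][0], observations[-1][0]
    seen = set()
    rt = 1.0
    for t, v in observations:
        seen.add(v)
        if len(seen) * 2 >= r:
            rt = (t - t0) / (t_end - t0)
            break
    return r, r1, rh, rt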
Network Data for Comparison
• Simulated data: inside sniffer traffic from weeks 1 and 3, filtered from 32M packets to 0.6M packets.
• Real data: collected from www.cs.fit.edu Oct-Dec. 2002, filtered from 100M to 1.6M.
• Traffic is filtered and rate limited to extract start of inbound client sessions (NETAD filter, passes most attacks).
Attributes measured
• Packet header fields (all filtered packets) for Ethernet, IP, TCP, UDP, ICMP.
• Inbound TCP SYN packet header fields.
• HTTP, SMTP, and SSH requests (other application protocols are not present in both sets).
Comparison results
• Synthetic attributes are too predictable: TTL, TOS, TCP options, TCP window size, HTTP, SMTP command formatting.
• Too few sources: Client addresses, HTTP user agents, ssh versions.
• Too “clean”: no checksum errors, fragmentation, garbage data in reserved fields, malformed commands.
TCP SYN Source Address
              Simulated    Real
Packets, n    50650        210297
r             29           24924
r1            0            45%
rh            3%           53%
rt            0.1%         49%
r1 ≈ rh ≈ rt ≈ 50% is consistent with a Zipf distribution and a constant growth rate of r.
3. Injecting Real Traffic
• Mix equal durations of real traffic into weeks 3-5 (both sets filtered, 344 hours each).
• We expect r ≥ max(rSIM, rREAL), giving a realistic false alarm rate.
• Modify PHAD, ALAD, LERAD, NETAD, and SPADE not to separate data.
• Test at 100 false alarms (10 per day) on 3 mixed sets.
• Compare fraction of “legitimate” detections on simulated and mixed traffic for median mixed result.
PHAD
• Models 34 packet header fields – Ethernet, IP, TCP, UDP, ICMP
• Global model (no rule antecedents)
• Only novel values are anomalous
• Anomaly score = tn/r (sketched in code below), where
  – t = time since last anomaly
  – n = number of training packets
  – r = number of allowed values
• No modifications needed
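A minimal sketch of the per-field t*n/r score described above, assuming each of the 34 header fields keeps its own model; field parsing and the summing of per-field scores into a packet score are omitted, and the class layout is an illustration rather than the original implementation.

class FieldModel:
    def __init__(self, start_time=0.0):
        self.allowed = set()            # values seen in training
        self.n = 0                      # number of training packets
        self.last_anomaly = start_time  # time of the most recent anomaly

    def train(self, value):
        self.n += 1
        self.allowed.add(value)

    def score(self, value, t_now):
        """Return t*n/r if the value is novel, else 0 (only novel values score)."""
        if value in self.allowed:
            return 0.0
        t = t_now - self.last_anomaly            # t: time since last anomaly
        self.last_anomaly = t_now
        return t * self.n / len(self.allowed)    # n/r learned in training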
ALAD
• Models inbound TCP client requests – addresses, ports, flags, application keywords.
• Score = tn/r
• Conditioned on destination port/address.
• Modified to remove address conditions and protocols not present in real traffic (telnet, FTP).
LERAD
• Models inbound client TCP (addresses, ports, flags, 8 words in payload).
• Learns conditional rules with high n/r.
• Discards rules that generate false alarms in the last 10% of training data.
• Modified to weight rules by the fraction of real traffic.
Example rule (see the sketch below): If port = 80 then word1 = GET, POST (n/r = 10000/2)
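The example rule above could be represented roughly as follows; the record layout (a dict of attribute values per TCP session) and field names are assumptions for this sketch, and the rule-learning step itself (sampling candidate rules and keeping those with high n/r) is not shown.

class Rule:
    def __init__(self, condition, attribute, allowed, n):
        self.condition = condition    # e.g. {"port": 80}
        self.attribute = attribute    # e.g. "word1"
        self.allowed = set(allowed)   # values seen in training where the condition holds
        self.n = n                    # training instances matching the condition
        self.last_anomaly = 0.0

    def score(self, record, t_now):
        """Return t*n/r if the record matches the condition but has a novel value."""
        if any(record.get(k) != v for k, v in self.condition.items()):
            return 0.0
        if record[self.attribute] in self.allowed:
            return 0.0
        t = t_now - self.last_anomaly
        self.last_anomaly = t_now
        return t * self.n / len(self.allowed)

# The slide's example: if port = 80 then word1 in {GET, POST}, n/r = 10000/2.
http_rule = Rule({"port": 80}, "word1", {"GET", "POST"}, n=10000)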
NETAD
• Models inbound client request packet bytes – IP, TCP, TCP SYN, HTTP, SMTP, FTP, telnet.
• Score = tn/r + ti/fi, allowing previously seen values (sketched below).
  – ti = time since value i was last seen
  – fi = frequency of i in training
• Modified to remove telnet and FTP.
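Following only the summary formula on this slide (the full NETAD scoring is more detailed), a per-byte model might look like this: novel values contribute the t*n/r term and previously seen values the ti/fi term. The names and structure are assumptions for this sketch.

from collections import Counter

class ByteModel:
    def __init__(self, start_time=0.0):
        self.freq = Counter()            # fi: training frequency of each value
        self.n = 0                       # number of training packets
        self.last_seen = {}              # last time each value was seen
        self.last_anomaly = start_time
        self.start = start_time

    def train(self, value):
        self.n += 1
        self.freq[value] += 1

    def score(self, value, t_now):
        if value not in self.freq:                         # novel value: t*n/r
            t = t_now - self.last_anomaly
            self.last_anomaly = t_now
            s = t * self.n / max(len(self.freq), 1)
        else:                                              # seen value: ti/fi
            ti = t_now - self.last_seen.get(value, self.start)
            s = ti / self.freq[value]
        self.last_seen[value] = t_now
        return s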
SPADE (Hoagland)
• Models inbound TCP SYN.
• Score = 1/P(src IP, dest IP, dest port); see the sketch below.
• Probability by counting.
• Always in training mode.
• Modified by randomly replacing real destination IP with one of 4 simulated targets.
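A minimal sketch of the counting-based score, assuming the joint probability is estimated directly from running counts of (source IP, destination IP, destination port) tuples; since the model is always in training mode, the counts are updated on every packet.

from collections import Counter

class SpadeModel:
    def __init__(self):
        self.counts = Counter()      # occurrences of each (src, dst, dport) tuple
        self.total = 0               # total TCP SYN packets seen

    def score(self, src_ip, dst_ip, dst_port):
        key = (src_ip, dst_ip, dst_port)
        self.counts[key] += 1        # always in training mode: update first
        self.total += 1
        p = self.counts[key] / self.total
        return 1.0 / p               # rare tuples receive high anomaly scores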
Criteria for Legitimate Detection
• Source address – target server must authenticate source.
• Destination address/port – attack must use or scan that address/port.
• Packet header field – attack must write/modify the packet header (probe or DOS).
• No U2R or Data attacks.
Mixed Traffic: Fewer Detections, but More are Legitimate
Detections out of 177 at 100 false alarms
[Bar chart: total and legitimate detections for PHAD, ALAD, LERAD, NETAD, and SPADE; scale 0 to 140.]
Conclusions
• SAD suggests the presence of simulation artifacts and artificially low false alarm rates.
• The simulated traffic is too clean, static and predictable.
• Injecting real traffic reduces suspect detections in all 5 systems tested.
Limitations and Future Work
• Only one real data source tested – may not generalize.
• Tests on real traffic cannot be replicated due to privacy concerns (root passwords in the data, etc.).
• Each IDS must be analyzed and modified to prevent data separation.
• Is host data affected (BSM, audit logs)?
Limitations and Future Work (continued)
• Real data may contain unlabeled attacks. We found over 30 suspicious HTTP requests in our data (to a Solaris-based host).
IIS exploit with double URL encoding (IDS evasion?)
GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir
Probe for Code Red backdoor:
GET /MSADC/root.exe?/c+dir HTTP/1.0