Sensitive Information Sweep

Sensitive Information Sweep Using Cornell’s Spider Wyman Miles, Cornell University Kerry Havens, University of Colorado at Boulder Steve Lovaas, Colorado State University


Page 1: Sensitive Information Sweep

Sensitive Information Sweep

Using Cornell’s Spider

Wyman Miles, Cornell University

Kerry Havens, University of Colorado at Boulder

Steve Lovaas, Colorado State University

Page 2: Sensitive Information Sweep

Overview

• Quick Background

• The Technical Problem (Kerry)

• The Organizational Problem (Steve)

• Spider (Wyman)

• Summary & Questions

Page 3: Sensitive Information Sweep

What is “Sensitive Information”?

• A Growing Concern

• A Moving Target

• SSN, Credit Card, Driver’s License, Medical Records, Student Information, Proprietary Research,…

• Data in Context – Aggregation

Page 4: Sensitive Information Sweep

Why Are We All Here?

• The Front Page!

• CDW-G 2006 Survey – more than 3 million college students may have lost personal information in the last year.

• Identity theft is the fastest growing crime in the U.S.

• By far the biggest culprit? Lost or stolen computers.

Page 5: Sensitive Information Sweep

Regulations, Standards, & Laws

• Federal – HIPAA, FERPA, SarbOx, GLB,… Identity Theft Protection Act?

• State – Many states are passing identity theft protection laws; New York & Colorado have state CISOs

• Industry – PCI DSS

Page 6: Sensitive Information Sweep

The Technical Problem: Finding sensitive information in a haystack

Kerry Havens

University of Colorado at Boulder

Page 7: Sensitive Information Sweep

SSN Remediation

• At CU-Boulder, SSNs were used as a student identifier before 2004

• House Bill 03-1175, approved in 2003, required institutions to change this practice to protect the privacy of students’ Social Security numbers

• CU-Boulder started issuing student IDs to new students in July 2004 and converting SSNs to SIDs in 2005

Page 8: Sensitive Information Sweep

Where the data is not stored

• File type exclusions – fine tuning

– Binary files where the data cannot be read

– Received input from community for fine tuning

• False positives

– International telephone numbers

– Examples for web form validation

• Why is the department webpage asking for SSNs?
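The exclusion idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Spider's implementation: the `SKIP_EXTENSIONS` set, the NUL-byte heuristic, and the function names are all assumptions made for the example, and a real skip list would be tuned with input from the community.

```python
import os

# Hypothetical skip list; tune it for the local environment.
SKIP_EXTENSIONS = {".exe", ".dll", ".jpg", ".png", ".zip", ".iso"}


def looks_binary(sample: bytes) -> bool:
    """Heuristic (an assumption): data containing a NUL byte is binary."""
    return b"\x00" in sample


def should_scan(path: str, sample: bytes) -> bool:
    """Return True when a file is worth searching for sensitive data."""
    ext = os.path.splitext(path)[1].lower()
    if ext in SKIP_EXTENSIONS:
        return False
    return not looks_binary(sample)
```

A heuristic like this would also skip text/binary hybrids such as .doc or .xls, so those formats need text extraction first rather than a blanket exclusion.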

Page 9: Sensitive Information Sweep

OS and File Encoding Problems

• HTML encoding problems

• Representations (pictures) of sensitive data are not found

– Examples include PDF

• Searching a UNIX filesystem

– Preparing the file before searching for private data

– For example, using strings to extract text from text/binary hybrids like .doc or .xls
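The preparation step mentioned above, pulling readable text out of a text/binary hybrid before searching it, can be approximated in Python. This is a minimal sketch that mimics the basic behaviour of the UNIX `strings` utility (runs of at least four printable ASCII characters); the function name `extract_strings` is hypothetical, and text stored as UTF-16 would not be found by it.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII characters."""
    # Printable ASCII spans 0x20 (space) through 0x7e (tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]
```

The extracted strings can then be fed to the same regular expressions used for plain-text files.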

Page 10: Sensitive Information Sweep

Where the data is stored

• Typical file types of discovered data

– Gradebooks

– Course web pages

– Homework assignments

– Travel authorization forms

– Personal financial documents

– Email

Page 11: Sensitive Information Sweep

Regular Expressions

• Returns too much data: /\d{3}-\d{2}-\d{4}/

• Searching for environment-specific data in the hope that common data will lead us to more data: /\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

• State-specific information can be found at http://www.ssa.gov/employer/stateweb.htm

Page 12: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Page 13: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Boundary

Page 14: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

First acceptable digit

Page 15: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

2, 4, or 6 digits in a row

Page 16: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Delimited by dash or space

Page 17: Sensitive Information Sweep

Regular Expressions

• Let’s dissect this…

/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b/

Colorado specific prefix, not delimited
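To make the dissection concrete, here is a hedged Python sketch that compares the broad pattern from the earlier slide with the narrower one. It is an adaptation, not the presenters' code: the names `NAIVE`, `REFINED`, and `find_ssn_candidates` are hypothetical, and the delimiter class is written as `[-\s]` because a `|` inside a character class matches a literal pipe rather than acting as alternation.

```python
import re

# Broad pattern: any 3-2-4 digit group, even inside a longer number.
NAIVE = re.compile(r"\d{3}-\d{2}-\d{4}")

# Narrower pattern, annotated piece by piece.
REFINED = re.compile(
    r"""
    \b(                                     # word boundary
        [0-7]\d{2} [-\s] \d{2} [-\s] \d{4}  # first digit 0-7, dash or whitespace delimiters
      |
        (?:52[1-4]|65[0-3]) \d{6}           # undelimited, prefix 521-524 or 650-653
    )\b
    """,
    re.VERBOSE,
)


def find_ssn_candidates(text: str) -> list[str]:
    """Return substrings that match the narrower pattern."""
    return [m.group(0) for m in REFINED.finditer(text)]
```

On a line such as `phone 1234-56-78901`, the broad pattern reports a hit inside the longer number while the bounded pattern does not. Every match is still only a candidate that needs human review.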

Page 18: Sensitive Information Sweep

CU Experiences

• Pitfalls

– Users’ interpretations of the log file

– Fine tuning file extension exceptions and regular expressions

• Recommendations

– Keep current environment in mind

Page 19: Sensitive Information Sweep

The Organizational Problem: a really big haystack

Steve Lovaas

Network Security Manager

Colorado State University

Page 20: Sensitive Information Sweep

Organizational Vision

• Support from the top

– Cabinet-level committee driving the project

– Spurred by headlines and state mandates

– VP for IT who really gets security

• Campus PR campaign

– Web site

– Public meetings

• Tied SSN purge to the rollout of a new CSUID in Fall 2006

Page 21: Sensitive Information Sweep

Using Resources

• Project Constraints

– Tight timeline

– No budget

– Not a trivial programming project

• Buy / Build / Leverage tools?

• Goal: 100% coverage vs. Best Effort

• Spider chosen for Windows, Linux, Mac

• Manual searching on AIX, mainframe

Page 22: Sensitive Information Sweep

Ultimate Responsibility

• Original thought: deans / dept. heads

• Revised edition: individual employees

• Developed a personal attestation for every employee to sign, submitted in bulk by colleges

• More work for central IT

• Senior VP: Doing the scan and signing the form is a CONDITION OF EMPLOYMENT

Page 23: Sensitive Information Sweep

Individual Attestation Form

• Every employee

• 2 choices:

– I don’t interact with SSNs in the course of my job

– SSNs in all electronic files under my control have been removed or encrypted

• VP for IT must approve exceptions

Page 24: Sensitive Information Sweep

CSU Experiences

• Pitfalls

– Beta tool for a live project requires quick response and careful management of user expectations & acceptance

– Careful of deadlines, it’s a lot of work!

• Recommendations

– Don’t do this kind of project without active support from the very top

– Anticipate the need for analysis/parsing tools

– Have a supported encryption solution for exceptions

Page 25: Sensitive Information Sweep

Cornell Spider

Wyman Miles

Sr. Security Engineer

Cornell University

Page 26: Sensitive Information Sweep

A Brief History of Spider

• Early 2005, scan Web for SSNs

• Later, scan disk images for SSNs/CCNs

• March 2006, debut at BU Security Camp

• April 2006, Educause, demand for a Windows version

• Version 1.0 in May, 2.0 in June

Page 27: Sensitive Information Sweep

A Brief History, II

• June 2006, major feedback from Steve: bug reports, tests, feature requests

• Engine developed that same month: internal incident response

• OSX Spider Sept 2006

• Windows Spider rewrite

• April 2007, GPL release of all Spiders

Page 28: Sensitive Information Sweep

Current Spider

• SSN, SIN, CCN, NINO discovery in many file types

• Various data type validators

• Web scanning, back to its roots

• Scan for data in unallocated space

• Faster. More readable source
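As an illustration of what a data type validator can do, the sketch below applies the Luhn check-digit test, which is the standard checksum for credit card numbers, to a candidate string. The slides do not say which validators Spider implements, so treating Luhn as one of them is an assumption, and the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Running a check like this on each regular expression match discards many random digit strings before they reach the log, which cuts down on false positives.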

Page 29: Sensitive Information Sweep

Various Spiders

• Windows Spider, aka Spider3

• OSX Spider

• Engine, general UNIX spider

• LinSpider, our oldest version

• Spider Simple: Windows Spider preconfigured to skip noisy files

Page 30: Sensitive Information Sweep

Future Spider

• Feature set convergence between Engine, OSX, Windows

• Community Development

• Possible I2 hosting of distribution and documentation

• More documentation!

• Client-Server model revisited

Page 31: Sensitive Information Sweep

Spider Log

Page 32: Sensitive Information Sweep

Spider at Cornell

• Incident response: a compromise has happened, what was at risk?

• Pre-emptive

– Dan Elswit, CALS Security Officer

Page 33: Sensitive Information Sweep

Spider in CIT

• CIT abandoned SSNs a few years ago, but they remain

• Tech support uses Spider Simple to discover lurking SSNs

• Manual process

Page 34: Sensitive Information Sweep

Athletics

• Spider Simple

• Unique log names to network share

• Centralized analysis
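One way to get unique log names on a shared drive is to build the file name from the machine's host name and the scan time. The sketch below is hypothetical throughout: the slides do not describe how the Athletics deployment names its logs, and `unique_log_name`, the `spider` prefix, and the timestamp format are all assumptions.

```python
from datetime import datetime


def unique_log_name(host: str, when: datetime, prefix: str = "spider") -> str:
    """Build a log file name that is unique per host and scan time."""
    stamp = when.strftime("%Y%m%dT%H%M%S")
    # Replace characters that are awkward in file names on a network share.
    safe_host = "".join(c if c.isalnum() or c in "-_" else "_" for c in host)
    return f"{prefix}_{safe_host}_{stamp}.log"
```

In practice the host could come from `socket.gethostname()` and the time from `datetime.now()`. A central analyst can then tell at a glance which machine produced each log.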

Page 35: Sensitive Information Sweep

Spider Downloads

• http://www.cit.cornell.edu/security/tools

Page 36: Sensitive Information Sweep

Summary

• Purging sensitive information is something we’re going to have to get good at

• Get support from the highest levels

• Tune regular expressions and file/ext skip lists for your environment

• Anticipate parsing needs, exceptions

• New Spider features, more users, broader OS support

• Spider also for ongoing support, forensics

Page 37: Sensitive Information Sweep

Questions?

• Wyman Miles: [email protected]

• Kerry Havens: [email protected]

• Steve Lovaas: [email protected]

• The Spider users’ list: [email protected]