The Center for Education and Research in Information Assurance and Security
DATA SPILLAGE IN HADOOP CLUSTERS
Project Investigators: Oluwatosin (Tosin) Alabi, Joe Beckman, Dheeraj Gurugubelli
Technical Directors: Andy Sampson and Aaron Piper
PROBLEM OVERVIEW
Problem Statement:
Detection and removal of classified information on an unauthorized Hadoop Distributed File System (HDFS).
Motivations:
• As institutions create their own private cloud services and other data warehousing solutions for data storage and processing, the loss of control over sensitive and protected data can become a serious threat to business operations and national security (NSA Mitigations Group, 2012).
• Working towards developing an incident response procedural framework would help mitigate data spillage security risks.
Research Question:
Can classified data that is leaked by user error into an unauthorized Hadoop Distributed File System (HDFS) be located, recovered, and removed completely from the server?
[1] NSA Mitigations Group. (2012). Securing Data and Handling Spillage Events [White Paper]. Retrieved from https://www.nsa.gov/ia/_files/factsheets/final_data_spill.pdf
BACKGROUND: SPILLAGE LOCATION (SHVACHKO ET AL, 2010)
• The files that support HDFS are stored as part of the local file system on each node
• The Hadoop Distributed File System interface is patterned after Unix
• NameNode: Stores the edit log and file metadata (i.e. location, file size, file type)
• DataNodes: Store application data and its metadata (i.e. checksum and generation stamp)
• Deleted items are transferred to the /.Trash folder within the NameNode
• Snapshots are stored image files used by system administrators for data backup and recovery
[2] Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on (pp. 1-10). IEEE.
Lim, S., Yoo, B., Park, J., Byun, K., & Lee, S. (2012). A research on the investigation method of digital forensics for a VMware Workstation’s virtual machine. Mathematical and Computer Modelling, 55(1), 151-160.
BACKGROUND: DATA RECOVERY (LIM ET AL, 2012)
• Recovery procedures for Virtual Machine disks (Lim et al, 2012)
• Adapted to:
  Retrieve metadata from the NameNode
  Recover file remnants from the DataNodes
[Flowchart of the VM disk recovery procedure. Nodes: Start; Undamaged? (yes/no); Copy VM image (e.g. VMDK); Recoverable? (yes/no); Recover VM image; Extract filesystem metadata & file carving; Mount image; Analyze; Report; End]
BACKGROUND: DATA DELETION (HUGHES & COUGHLIN, 2006)
• Data deletion protocols for storage systems have different characteristics
• When selecting an erasure protocol, there is often a tradeoff between sanitization security level and time
Hughes, G., & Coughlin, T. (2006). Tutorial on disk drive data sanitization. cmrr.ucsd.edu/people/Hughes/DataSanitizationTutorial.pdf
Data Sanitization Approach | Average Time (100GB) | Data Sanitization Security Level | Comments
Normal File Deletion | Minutes | Very Poor | Deletes only file pointers, not actual data
DoD 5220 Block Erase | Up to several days | Medium | Needs 3 writes + verify, cannot erase reassigned blocks
Secure Erase | 1-2 hours | High | In-drive overwrite of all user-accessible records
NIST 800-88 Enhanced Secure Erase | Seconds | Very High | Change in-drive encryption key
• Apply a deletion protocol
• Determine whether the data can still be recovered in HDFS, and to what extent
WHAT CAN BE RECOVERED & TO WHAT EXTENT?
Project Scope
HDFS
Load Tagged Document:
• Retrieve NameNode metadata
• Recover DataNode data remnants
Delete Tagged Document:
• Retrieve NameNode metadata
• Recover DataNode data remnants
Apply digital forensics procedure to locate data remnants
Assess/qualify the data sanitization level
Compare results using the delete file method to a more secure method (e.g. NIST 800-88)
RESEARCH PROCESS
CLOUD STORAGE FORENSIC FRAMEWORK: ADAPTED FROM NIST (KENT ET AL, 2006) & MCKEMMISH (1999)
Scope: Determine the investigation boundaries and limitations
Preparation: Prepare the cluster environment
Identification: Identify the location of spilled data on the DataNodes using the metadata on the NameNode
Collection & Preservation: Acquire the .vmdk image files of the impacted nodes
Examine & Analyze: Conduct forensic procedure(s) to find any spilled data remnants on the disks
Presentation: Document and report findings
The process is iterative.
PREPARATION OF CLUSTER ENVIRONMENT
VIRTUALIZED CLUSTER IN VMWARE HOST MACHINE
Set-up of an 8-node Hadoop Cluster:
• Hardware: 1 Quanta QCT Server
• Host OS: VMware ESXi™ 5.5
• Virtualization Layer: 8 Virtual Machines (VMs)
Memory allocated to each VM: 80GB
APP: Hadoop 2.0 using CDH 5.2.0 (Cloudera Manager package)
Max Block Size per file: 64MB
Replication factor: 3 per file block
1 NameNode
7 DataNodes
We loaded data files (evidence) and then started to implement the incident response process.
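The block size and replication factor listed above determine how many physical copies of a spilled file have to be found. The following is a minimal arithmetic sketch in Python; the 150MB file size is a hypothetical example, not a value from the study.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64MB block size, from the cluster setup
REPLICATION = 3                 # replication factor, from the cluster setup

def block_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of HDFS blocks, total block replicas) for one file."""
    if file_size_bytes <= 0:
        return 0, 0
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, blocks * replication

# Hypothetical 150MB spilled document: 3 blocks, so 9 replicas to locate.
print(block_footprint(150 * 1024 * 1024))
```

With 7 DataNodes and a replication factor of 3, even a single-block file leaves remnants on three separate virtual disks.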
PREPARATION OF CLUSTER ENVIRONMENT
[Diagram: HDFS cluster after the test file is loaded]
NameNode:
• Maintains the namespace tree and the mapping of blocks to DataNodes
• Metadata: inodes and block list
• Edit logs: journal
DataNodes (1, 2, 3, 4, and 8 shown; each sends a heartbeat to the NameNode):
• Store the actual data in blocks of up to 64MB
• Store each block's metadata
Test File 1: loaded into the cluster
Replicas shown across the DataNodes: File1R1, File1R2, File1R3, File2R1, File2R2, File2R3
IDENTIFICATION OF DATA LOCATION
[Diagram: the NameNode (namespace tree, block-to-DataNode mapping, inodes, block list, edit log journal) and DataNodes 1, 2, 3, 4, and 8 holding replicas File1R1, File1R2, File1R3, File2R1, File2R2, and File2R3]
Retrieve file evidence from the NameNode:
• File metadata
• DataNode locations
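One way to obtain the block-to-DataNode mapping from the NameNode is the `hdfs fsck <path> -files -blocks -locations` command. The Python sketch below builds that command and parses block IDs and DataNode addresses out of its text output. The sample output, the path `/user/alice/spill.doc`, and the IP addresses are illustrative assumptions; the exact fsck line format varies between Hadoop versions, so the regular expressions may need adjusting.

```python
import re

# Block IDs look like blk_1073741825; a trailing _<generation stamp> is optional.
BLOCK_RE = re.compile(r"(blk_\d+)(?:_\d+)?")
# DataNode addresses written as ip:port (assumed output style).
NODE_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+")

def fsck_command(path):
    """Build the fsck invocation that lists blocks and their DataNode locations."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]

def parse_block_locations(fsck_output):
    """Map each block ID to the DataNode IPs listed after it on the same line."""
    locations = {}
    for line in fsck_output.splitlines():
        block = BLOCK_RE.search(line)
        if block:
            # Only look at the text after the block ID, so addresses embedded
            # in the block pool name are ignored.
            locations[block.group(1)] = NODE_RE.findall(line[block.end():])
    return locations

# Hypothetical fsck output for a single-block file.
SAMPLE = """\
/user/alice/spill.doc 1048576 bytes, 1 block(s):  OK
0. BP-1-10.0.0.1-1:blk_1073741825_1001 len=1048576 repl=3 [10.0.0.2:50010, 10.0.0.3:50010, 10.0.0.4:50010]
"""
print(parse_block_locations(SAMPLE))
```

The block IDs returned here are the names to look for on the DataNode disks, where each replica is stored as a `blk_<id>` file with an accompanying `.meta` file.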
COLLECTION & PRESERVATION OF IMAGE FILES
Shut down the VM and copy the disk image from the VMware datastore.
[Diagram: the NameNode (tracks metadata and data location logs) with heartbeats from DataNodes 1, 2, 4, and 8; DataNode 3 is shut down ("Dead", no heartbeat). Each DataNode stores the actual data in blocks of up to 64MB plus each block's metadata. Replicas shown: File1R1, File1R2, File1R3, File2R1, File2R2, File2R3]
.VMDK: Virtual Machine Disk file
Recover file evidence
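The slides do not say how the copied images were verified. A common forensic practice, shown here only as an assumed illustration, is to hash the original .vmdk and the working copy and compare the digests before any analysis. The function names below are hypothetical.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large disk image never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(original_path, copy_path):
    """Return True when the working copy is bit-for-bit identical to the original."""
    return sha256_of_file(original_path) == sha256_of_file(copy_path)
```

Recording the digest alongside the acquired image lets later examiners confirm that the evidence was not altered during analysis.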
EXAMINE & ANALYZE: FORENSIC PROCESS
Forensic Acquisition
• Mount the acquired disk images
Log Analysis (Name Node)
• HDFS Event Logs
• Disk IDs (before delete)
• Moved to /.Trash (after delete)
Forensic Analysis (Data Node)
• Block-Based Carving
• Header/Footer Carving
• File structure based Carving
• Live Search
• Index Search
Forensic Tools
FTK Imager
Forensic Toolkit (FTK) 5.3
EXAMINE & ANALYZE: LOG ANALYSIS
Hdfs-audit.log
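The HDFS audit log records each namespace operation as tab-separated key=value fields (such as `cmd`, `src`, and `dst`). The Python sketch below extracts delete events and moves into a `.Trash` directory. The two sample lines, the user name, and the paths are illustrative assumptions, not entries from the study's logs.

```python
def parse_audit_line(line):
    """Split one audit line into a dict of its tab-separated key=value fields."""
    fields = {}
    for token in line.rstrip("\n").split("\t"):
        # The first token carries the timestamp and logger name before "allowed=".
        if "allowed=" in token:
            token = token[token.find("allowed="):]
        key, sep, value = token.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def trash_events(lines):
    """Yield (src, dst) for deletes and for renames whose target is under .Trash."""
    for line in lines:
        f = parse_audit_line(line)
        if f.get("cmd") == "delete":
            yield f.get("src"), None
        elif f.get("cmd") == "rename" and "/.Trash/" in f.get("dst", ""):
            yield f.get("src"), f.get("dst")

# Hypothetical audit entries: a file is created, then moved to the trash.
SAMPLE = [
    "2015-01-01 10:00:00,000 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.9\tcmd=create\tsrc=/user/alice/spill.doc\tdst=null\tperm=alice:alice:rw-r--r--",
    "2015-01-01 10:05:00,000 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.9\tcmd=rename\tsrc=/user/alice/spill.doc\tdst=/user/alice/.Trash/Current/user/alice/spill.doc\tperm=alice:alice:rw-r--r--",
]
print(list(trash_events(SAMPLE)))
```

A rename into `.Trash` marks the "after delete" state, while the earlier create entry gives the original path to look up in the NameNode metadata.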
FORENSIC ANALYSIS – DATA CARVING
Example: Header/Footer Method
Extension | Header (Hex) | Footer (Hex)
DOC | D0 CF 11 E0 A1 B1 1A E1 | 57 6F 72 64 2E 44 6F 63 75 6D 65 6E 74 2E
XLS | D0 CF 11 E0 A1 B1 1A E1 | FE FF FF FF 00 00 00 00 00 00 00 00 57 00 6F 00 72 00 6B 00 62 00 6F 00 6F 00 6B 00
PPT | D0 CF 11 E0 A1 B1 1A E1 | 50 00 6F 00 77 00 65 00 72 00 50 00 6F 00 69 00 6E 00 74 00 20 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74
ZIP | 50 4B 03 04 14 | 50 4B 05 06 00
JPG | FF D8 FF E0 00 10 4A 46 49 46 00 01 01 | D9 (“Better To Use File size Check”)
GIF | 47 49 46 38 39 61 4E 01 53 00 C4 | 21 00 00 3B 00
PDF | 25 50 44 46 2D 31 2E | 25 25 45 4F 46
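The header/footer method can be sketched in a few lines of Python: scan the raw bytes for a known header, then take everything up to the next matching footer. This is a simplified illustration, not the procedure FTK implements; the signatures are the PDF and ZIP rows of the table, and the `max_size` limit is an assumed safeguard against runaway matches.

```python
SIGNATURES = {
    # (header, footer) pairs taken from the PDF and ZIP rows of the table.
    "pdf": (bytes.fromhex("25 50 44 46 2D 31 2E"), bytes.fromhex("25 25 45 4F 46")),
    "zip": (bytes.fromhex("50 4B 03 04 14"), bytes.fromhex("50 4B 05 06 00")),
}

def carve(data, header, footer, max_size=50 * 1024 * 1024):
    """Return byte strings that start at a header and end at the next footer."""
    carved = []
    start = data.find(header)
    while start != -1:
        # Look for the footer after the header, within max_size bytes of it.
        end = data.find(footer, start + len(header), start + max_size)
        if end != -1:
            carved.append(data[start:end + len(footer)])
            start = data.find(header, end + len(footer))
        else:
            start = data.find(header, start + 1)
    return carved

# Synthetic "disk image": padding around one small PDF-like fragment.
image = b"\x00" * 100 + b"%PDF-1.4 body %%EOF" + b"\xff" * 50
header, footer = SIGNATURES["pdf"]
print(carve(image, header, footer))
```

Because HDFS splits large files into blocks stored on different DataNodes, a document larger than one block may have its header and footer on different disk images, so a sketch like this only recovers remnants that lie within a single image.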
EXAMINE & ANALYZE: DATA CARVE
POSSIBLE DATA REMNANT CONTAINERS
• /.Trash
• DataNodes
• Data at rest
Future Direction - Automation
FUTURE WORK
• Exploration of the efficacy of various secure data sanitization methods for data removal in a virtual HDFS cluster
• Extension of this process to include the minimization of cluster downtime during the removal process
• Extension of this process to include detection and removal of other file types
• Automation of the data removal process in HDFS
QUESTIONS?
THANK YOU
COLLECTION & PRESERVATION OF IMAGE FILES
HDFS automatically replicates data when a node is unavailable (as detected by the NameNode).
[Diagram: the NameNode (tracks metadata and data location logs) with heartbeats from DataNodes 1, 2, 4, and 8; DataNode 3 is dead (no heartbeat). The label "Replication:" marks DataNode 4, and an additional File1R2 replica appears alongside File1R1, File1R2, File1R3, File2R1, File2R2, and File2R3]
.VMDK: Virtual Machine Disk file
Recover file evidence