Data Spillage - insurehub.org · The files that support HDFS are stored as part of the local system...


The Center for Education and Research in Information Assurance and Security

DATA SPILLAGE IN HADOOP CLUSTERS

Project Investigators: Oluwatosin (Tosin) Alabi, Joe Beckman, Dheeraj Gurugubelli

Technical Directors: Andy Sampson and Aaron Piper


PROBLEM OVERVIEW

Problem Statement:

Detection and removal of classified information on an unauthorized Hadoop Distributed File System (HDFS).

Motivations:

• As institutions create their own private cloud services and other data warehousing solutions for data storage and processing, the loss of control over sensitive and protected data can become a serious threat to business operations and national security (NSA Mitigations Group, 2012).

• Developing a procedural incident response framework would help mitigate data spillage security risks.

Research Question:

Can classified data that is leaked by user error into an unauthorized Hadoop Distributed File System (HDFS) be located, recovered, and completely removed from the server?

[1] NSA Mitigations Group. (2012). Securing Data and Handling Spillage Events [White Paper]. Retrieved from https://www.nsa.gov/ia/_files/factsheets/final_data_spill.pdf


BACKGROUND: SPILLAGE LOCATION (SHVACHKO ET AL, 2010)

The files that support HDFS are stored on each node's local file system

The HDFS interface is patterned after the Unix file system

NameNode: Stores the edit log and file metadata (e.g. location, file size, file type)

DataNodes: Store application data and its metadata (i.e. checksums & generation stamp)

Deleted items are moved to the /.Trash folder recorded in the NameNode's namespace

Snapshots are stored image files used by system administrators for data back-up and recovery


[2] Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on (pp. 1-10). IEEE.


BACKGROUND: DATA RECOVERY (LIM ET AL, 2012)

• Recovery procedures for Virtual Machine disks (Lim et al, 2012)

• Adapted to:

Retrieve metadata from the NameNode

Recover file remnants from the DataNodes

[Flowchart: Start, "Undamaged?", Copy VM image (e.g. VMDK), Mount Image, Analyze, Report, End. A second decision, "Recoverable?", leads to either "Recover VM image" or "Extract filesystem metadata & File Carving". Both decisions have yes/no branches.]

Lim, S., Yoo, B., Park, J., Byun, K., & Lee, S. (2012). A research on the investigation method of digital forensics for a VMware Workstation’s virtual machine. Mathematical and Computer Modelling, 55(1), 151-160.


BACKGROUND: DATA DELETION (HUGHES & COUGHLIN, 2006)

• Data deletion protocols for storage systems have different characteristics

• When selecting an erasure protocol, there is often a tradeoff between sanitization security level and time


Hughes, G., & Coughlin, T. (2006). Tutorial on disk drive data sanitization. cmrr.ucsd.edu/people/Hughes/DataSanitizationTutorial.pdf

Data Sanitization Approach | Average Time (100GB) | Security Level | Comments
Normal File Deletion | Minutes | Very Poor | Deletes only file pointers, not actual data
DoD 5220 Block Erase | Up to several days | Medium | Needs 3 writes + verify, cannot erase reassigned blocks
Secure Erase | 1-2 hours | High | In-drive overwrite of all user accessible records
NIST 800-88 Enhanced Secure Erase | Seconds | Very High | Change in-drive encryption key


WHAT CAN BE RECOVERED & TO WHAT EXTENT?

Project Scope

HDFS

Load Tagged Document

• Retrieve NameNode metadata

• Recover DataNode data remnants

Delete Tagged Document

• Retrieve NameNode metadata

• Recover DataNode data remnants

• Apply a deletion protocol

• Determine whether data can be recovered in HDFS, and to what extent

Apply digital forensics procedure to locate data remnants

Assess/qualify the data sanitization level

Compare results using the delete file method to a more secure method (e.g. NIST 800-88)


RESEARCH PROCESS

CLOUD STORAGE FORENSIC FRAMEWORK: ADAPTED FROM NIST (KENT ET AL 2006) & MCKEMMISH (1999)

Scope: Determine the investigation boundaries and limitations

Preparation: Prepare the cluster environment

Identification: Identify the location of spilled data on the DataNodes using the metadata on the NameNode

Collection & Preservation: Acquire the .vmdk image files of impacted nodes

Examine & Analyze: Conduct forensic procedure(s) to find any spilled data remnants on the disks

Presentation: Document and report findings

The process is iterative.


PREPARATION OF CLUSTER ENVIRONMENT

VIRTUALIZED CLUSTER IN VMWARE HOST MACHINE

Set-up of an 8-node Hadoop Cluster:

• Hardware: 1 Quanta QCT server

• Host OS: VMware ESXi 5.5

• Virtualization layer: 8 virtual machines (VMs)

• Memory allocated to each VM: 80GB

• Application: Hadoop 2.0 using CDH 5.2.0 (Cloudera Manager package)

• Max block size per file: 64MB

• Replication factor: 3 per file block

• 1 NameNode

• 7 DataNodes
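
As a rough illustration of what these settings mean for an investigation, the sketch below estimates how many block replicas a single spilled file leaves across the DataNodes, assuming the 64MB block size and replication factor of 3 listed above. The function name and the 100MB example file are hypothetical and not part of the project.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # max block size from the cluster setup (64MB)
REPLICATION = 3                 # replication factor from the cluster setup


def replica_count(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, replicas) that HDFS would create for one file."""
    # An empty file has no blocks; otherwise the last block may be partial.
    blocks = math.ceil(file_size_bytes / block_size) if file_size_bytes > 0 else 0
    return blocks, blocks * replication


# Hypothetical 100MB file: it spans 2 blocks, each stored 3 times.
blocks, replicas = replica_count(100 * 1024 * 1024)
print(blocks, replicas)
```

Every one of those replicas is a separate location that has to be found and sanitized.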


PREPARATION OF CLUSTER ENVIRONMENT

We loaded data files (evidence) and then started to implement the incident response process.

[Diagram: Test File 1 is loaded into the HDFS cluster.]

NameNode:

• Maintains the namespace tree and the mapping of blocks to DataNodes

• Metadata: inodes & block list

• Edit logs: journal

DataNodes (1, 2, 3, 4, 8), each sending a heartbeat to the NameNode:

• Store the actual data in blocks of up to 64MB

• Store each block's metadata

Replicas shown on the DataNodes: File1R1, File1R2, File1R3, File2R1, File2R2, File2R3


IDENTIFICATION OF DATA LOCATION

[Diagram: the same cluster layout as on the previous slide (NameNode, DataNodes 1, 2, 3, 4 and 8, replicas File1R1-R3 and File2R1-R3), annotated "Retrieve file evidence: file metadata, DataNode locations".]
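
One way to read the block-to-DataNode mapping from the NameNode is the `hdfs fsck <path> -files -blocks -locations` command. The sketch below parses a sample of such output into a block-to-location map. The sample text, path, block IDs, and addresses are all made up, and the exact output layout differs between Hadoop versions, so the regular expressions are an assumption to adapt rather than a fixed format.

```python
import re

# Hypothetical sample resembling `hdfs fsck <path> -files -blocks -locations`
# output; real output differs between Hadoop versions.
SAMPLE_FSCK = """\
/user/test/file1.doc 70000000 bytes, 2 block(s):  OK
0. BP-1-10.0.0.1-1:blk_1073741825_1001 len=67108864 repl=3 [10.0.0.2:50010, 10.0.0.3:50010, 10.0.0.4:50010]
1. BP-1-10.0.0.1-1:blk_1073741826_1002 len=2891136 repl=3 [10.0.0.3:50010, 10.0.0.5:50010, 10.0.0.8:50010]
"""

# Block ID, generation stamp, length, and the bracketed replica list.
BLOCK_LINE = re.compile(r"(blk_\d+)_(\d+)\s+len=(\d+)\s+repl=\d+\s+\[(.*)\]")
ADDRESS = re.compile(r"\d+\.\d+\.\d+\.\d+:\d+")


def parse_block_locations(fsck_text):
    """Map each block ID to the DataNode addresses that hold a replica."""
    locations = {}
    for line in fsck_text.splitlines():
        match = BLOCK_LINE.search(line)
        if match:
            block_id, _gen_stamp, _length, nodes = match.groups()
            locations[block_id] = ADDRESS.findall(nodes)
    return locations


for block_id, nodes in parse_block_locations(SAMPLE_FSCK).items():
    print(block_id, nodes)
```

The resulting map tells the investigator which DataNode disk images need to be acquired.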


COLLECTION & PRESERVATION OF IMAGE FILES

Shut down the VM and copy its disk image from the VMware datastore.

[Diagram: DataNode 3 is shut down ("Dead DataNode 3", no heartbeat) while the NameNode (tracks metadata, data location logs) and DataNodes 1, 2, 4 and 8 (store the actual data in blocks of up to 64MB, plus each block's metadata) keep sending heartbeats. Replicas shown: File1R1, File1R2, File1R3, File2R1, File2R2, File2R3. File evidence is recovered from the .VMDK (Virtual Machine Disk file).]


EXAMINE & ANALYZE: FORENSIC PROCESS

Forensic Acquisition

• Mount the acquired disk images

Log Analysis (NameNode)

• HDFS Event Logs

• Disk IDs (Before Delete)

• Moved to /.Trash (After Delete)

Forensic Analysis (Data Node)

• Block-Based Carving

• Header/Footer Carving

• File structure based Carving

• Live Search

• Index Search

Forensic Tools

FTK Imager

Forensic Toolkit (FTK) 5.3
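
Before carving, it can help to check whether the block files are still present in the mounted image. The sketch below walks a mounted DataNode image and collects the files that belong to a set of block IDs. It assumes the DataNode naming convention of a data file `blk_<id>` plus a metadata file `blk_<id>_<generation stamp>.meta`. The function name and the mount point are hypothetical. It only finds blocks that still exist in the file system; deleted blocks need the carving methods listed above.

```python
import os
import re

# Matches a block data file (blk_<id>) or its metadata file (blk_<id>_<gen>.meta).
BLOCK_FILE = re.compile(r"^(blk_\d+)(?:_\d+\.meta)?$")


def find_block_files(mount_root, wanted_block_ids):
    """Walk a mounted image and return {block_id: [paths]} for wanted blocks."""
    wanted = set(wanted_block_ids)
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in filenames:
            match = BLOCK_FILE.match(name)
            if match and match.group(1) in wanted:
                hits.setdefault(match.group(1), []).append(os.path.join(dirpath, name))
    return hits
```

A call such as `find_block_files("/mnt/datanode3", ["blk_1073741825"])` would use a mount point and block ID taken from the identification step; both values here are placeholders.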


EXAMINE & ANALYZE: LOG ANALYSIS

hdfs-audit.log
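
The audit log can be filtered for every operation that touched the spilled file. The sketch below assumes the common audit layout of tab-separated `key=value` fields after an `audit:` marker, and treats a `rename` into a `.Trash` directory as the trace of a delete. The sample lines, user, paths, and addresses are hypothetical, and the layout may differ between Hadoop versions.

```python
# Hypothetical audit lines; fields are tab-separated key=value pairs.
SAMPLE_AUDIT = "\n".join([
    "2015-03-02 10:15:30,123 INFO FSNamesystem.audit: allowed=true\tugi=hdfs (auth:SIMPLE)\t"
    "ip=/10.0.0.9\tcmd=create\tsrc=/user/hdfs/file1.doc\tdst=null\tperm=hdfs:supergroup:rw-r--r--",
    "2015-03-02 10:20:00,456 INFO FSNamesystem.audit: allowed=true\tugi=hdfs (auth:SIMPLE)\t"
    "ip=/10.0.0.9\tcmd=listStatus\tsrc=/user/hdfs\tdst=null\tperm=null",
    "2015-03-02 11:02:11,789 INFO FSNamesystem.audit: allowed=true\tugi=hdfs (auth:SIMPLE)\t"
    "ip=/10.0.0.9\tcmd=rename\tsrc=/user/hdfs/file1.doc\t"
    "dst=/user/hdfs/.Trash/Current/user/hdfs/file1.doc\tperm=hdfs:supergroup:rw-r--r--",
])


def parse_audit_line(line):
    """Split one audit line into a dict of its key=value fields."""
    _prefix, _sep, payload = line.partition("audit: ")
    fields = {}
    for part in payload.strip().split("\t"):
        key, sep, value = part.partition("=")
        if sep:
            fields[key] = value
    return fields


def events_for(path_fragment, log_text):
    """Return (cmd, src, dst) for every audit entry that mentions the path."""
    events = []
    for line in log_text.splitlines():
        fields = parse_audit_line(line)
        if path_fragment in fields.get("src", "") or path_fragment in fields.get("dst", ""):
            events.append((fields.get("cmd"), fields.get("src"), fields.get("dst")))
    return events


for event in events_for("file1.doc", SAMPLE_AUDIT):
    print(event)
```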


FORENSIC ANALYSIS – DATA CARVING

Example: Header/Footer Method

Extension | Header (Hex) | Footer (Hex)
DOC | D0 CF 11 E0 A1 B1 1A E1 | 57 6F 72 64 2E 44 6F 63 75 6D 65 6E 74 2E
XLS | D0 CF 11 E0 A1 B1 1A E1 | FE FF FF FF 00 00 00 00 00 00 00 00 57 00 6F 00 72 00 6B 00 62 00 6F 00 6F 00 6B 00
PPT | D0 CF 11 E0 A1 B1 1A E1 | 50 00 6F 00 77 00 65 00 72 00 50 00 6F 00 69 00 6E 00 74 00 20 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74
ZIP | 50 4B 03 04 14 | 50 4B 05 06 00
JPG | FF D8 FF E0 00 10 4A 46 49 46 00 01 01 | D9 (“Better To Use File size Check”)
GIF | 47 49 46 38 39 61 4E 01 53 00 C4 | 21 00 00 3B 00
PDF | 25 50 44 46 2D 31 2E | 25 25 45 4F 46
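
The header/footer method can be sketched in a few lines using the PDF signatures from the table above. This is a minimal illustration on an in-memory buffer, not the procedure the forensic tools implement. The function name and the test buffer are hypothetical, and the sketch assumes files are stored contiguously and pairs each header with the first footer that follows it, which can cut short a PDF that contains more than one `%%EOF` marker.

```python
PDF_HEADER = bytes.fromhex("25 50 44 46 2D 31 2E")  # "%PDF-1."
PDF_FOOTER = bytes.fromhex("25 25 45 4F 46")        # "%%EOF"


def carve(data, header, footer):
    """Return every byte run that starts at a header and ends at the next footer."""
    carved = []
    start = data.find(header)
    while start != -1:
        end = data.find(footer, start + len(header))
        if end == -1:
            break  # header without a footer: nothing more to carve
        end += len(footer)
        carved.append(data[start:end])
        start = data.find(header, end)
    return carved


# Hypothetical raw buffer standing in for unallocated space on a disk image.
raw = (b"\x00" * 16 + b"%PDF-1.4\nhello\n%%EOF"
       + b"\xff" * 8 + b"%PDF-1.5\nworld\n%%EOF" + b"junk")
for blob in carve(raw, PDF_HEADER, PDF_FOOTER):
    print(len(blob), blob[:8])
```

On a real image the buffer would be read in chunks from the mounted disk file, and each carved candidate would still need to be validated by opening it.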


EXAMINE & ANALYZE: DATA CARVE



POSSIBLE DATA REMNANT CONTAINERS

/.Trash

Data Nodes


Data at Rest

Future Direction - Automation


FUTURE WORK

Exploration of the efficacy of various secure data sanitization methods for data removal in a virtual HDFS cluster

Extension of this process to include the minimization of cluster downtime during the removal process

Extension of this process to include detection and removal of other file types

Automation of the data removal process in HDFS


QUESTIONS?


THANK YOU


COLLECTION & PRESERVATION OF IMAGE FILES

HDFS automatically replicates data when a node is unavailable (as detected by the NameNode).

[Diagram: DataNode 3 is shut down ("Dead DataNode 3", no heartbeat) while the NameNode (tracks metadata, data location logs) and DataNodes 1, 2, 4 and 8 keep sending heartbeats. Replica File1R2 appears a second time next to the label "Replication: DataNode 4", alongside File1R1, File1R3, File2R1, File2R2 and File2R3. File evidence is recovered from the .VMDK (Virtual Machine Disk file).]
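
Because of this behavior, taking a DataNode offline for imaging can create additional copies of the spilled data on other nodes. The toy model below shows the effect. It is not HDFS's real placement policy: it assumes, purely for illustration, that the first live node without a copy receives the new replica. The function name, node names, and file layout are hypothetical.

```python
def rereplicate(block_map, live_nodes, dead_node, replication=3):
    """Toy model: restore the replication factor after `dead_node` is lost.

    Assumption: the target is simply the first live node (in sorted order)
    without a copy, which is NOT HDFS's real placement policy.
    The sets in `block_map` are updated in place.
    """
    new_copies = []
    for block, holders in sorted(block_map.items()):
        holders.discard(dead_node)
        for node in sorted(live_nodes):
            if len(holders) >= replication:
                break
            if node != dead_node and node not in holders:
                holders.add(node)
                new_copies.append((block, node))
    return new_copies


layout = {
    "File1": {"DataNode1", "DataNode2", "DataNode3"},
    "File2": {"DataNode2", "DataNode4", "DataNode8"},
}
nodes = ["DataNode1", "DataNode2", "DataNode3", "DataNode4", "DataNode8"]
print(rereplicate(layout, nodes, dead_node="DataNode3"))
```

Any node that receives a new copy during collection becomes one more location that has to be examined and sanitized.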