Hadoop and Big Data Security

Kevin T. Smith, 11/14/2013
Ksmith <AT> Novetta.COM



Big Data Security: Why Should We Care?

New Challenges Related to Data Management, Security, and Privacy

• As data growth is explosive, so is the complexity of our IT environments
• Many organizations are required to enforce access control & privacy restrictions on data sets (HIPAA, Privacy Laws), or face steep penalties & fines
• Organizations are increasingly required to enforce access control for their data scientists based on Need-to-Know, User Authorization levels, and what data they are allowed to see, especially in Healthcare, Finance, and Government
• Organizations are struggling to understand what data they can release

Mismanagement of Data Sets: Costly

• AOL Research “Data Valdez” Incident
  - CNNMoney: “101 Dumbest Moments in Business”
  - $5 Million Settlement, plus $100 to each member of AOL between 3/2006-5/2006, + $50 to each member who believed their data was in the released data; Fired employees, CTO Resignation
• The Netflix Contest Anonymized Data Set Incident
  - Class-Action Lawsuit, $9 Million Settlement
• Massachusetts Hospital Record Incident

Cyber Security Attacks are on the Rise

• Ponemon Institute: the Average Cost of a Data Breach in the U.S. is 5.4 Million dollars*
• PlayStation (2011): Experts predict costs between 2.2 and 2.4 Billion

* (Breach Study: Global Analysis, May 2013)


A (Brief) History of Hadoop Security

Hadoop developed without Security in Mind

• Originally No Security model
• No authentication of users or services
• Anyone could submit arbitrary code to be executed
• Later authorization added, but any user could impersonate other users with a command-line switch

In 2009, Yahoo! focused on Hadoop Authentication, and did a Hadoop Redesign, But...

• Resulting Security Model is Complex
• Security Configuration is Complex & Easy to Mess Up
• No Data at Rest Encryption
• Kerberos-Centric
• Limited Authorization Capabilities

Things are Changing, But Slowly...

• It is important to understand how Hadoop Security is Currently Implemented & Configured
• It is important to understand how to meet your organization’s security requirements


Hadoop Security Data Flow

• Distributed Security is a Challenge
• Since the 0.20.20x distributions of Hadoop, much of the model is Kerberos-centric, as you see to the right
• The model is quite complex, as you will see on the next slide


Token Delegation & Hadoop Security Flow

Token: Used For

• Kerberos TGT: Kerberos initial authentication to KDC.
• Kerberos service ticket: Kerberos initial authentication between users, client processes, and services.
• Delegation token: Token issued by the NameNode to the client, used by the client or any services working on the client’s behalf to authenticate them to the NameNode.
• Block Access token: Token issued by the NameNode after validating authorization to a particular block of data, based on a shared secret with the DataNode. Clients (and services working on the client’s behalf) use the Block Access token to request blocks from the DataNode.
• Job token: This is issued by the JobTracker to TaskTrackers. Tasks communicating with TaskTrackers for a particular job use this token to prove they are associated with the job.
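The Block Access token idea above (a token the NameNode signs with a secret it shares with the DataNode) can be sketched in a few lines of Python. This is a simplified illustration only, not Hadoop's actual token format or key management: the names `SHARED_SECRET`, `issue_block_token`, and `verify_block_token`, the JSON payload, and the choice of HMAC-SHA256 are all assumptions made for the example.

```python
import hashlib
import hmac
import json
import time

# Assumption: one static secret known to the NameNode and every DataNode.
# A real cluster would distribute and rotate keys rather than hard-code one.
SHARED_SECRET = b"demo-secret-not-for-production"


def issue_block_token(user, block_id, modes, lifetime_s=600, now=None):
    """NameNode side: after authorizing the user, sign a token for one block."""
    now = time.time() if now is None else now
    payload = {
        "user": user,
        "block_id": block_id,
        "modes": sorted(modes),
        "expires": now + lifetime_s,
    }
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    mac = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return {"body": body, "mac": mac}


def verify_block_token(token, block_id, mode, now=None):
    """DataNode side: recompute the MAC, then check block, mode, and expiry."""
    now = time.time() if now is None else now
    expected = hmac.new(SHARED_SECRET, token["body"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["mac"]):
        return False  # token was forged or altered
    payload = json.loads(token["body"])
    return (
        payload["block_id"] == block_id
        and mode in payload["modes"]
        and now < payload["expires"]
    )
```

The point of the design is that the DataNode never has to call back to the NameNode: possession of the shared secret is enough to check that the NameNode authorized this client for this block.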


Some Vendor Activity in Hadoop Security

• Cloudera Sentry - Fine Grained Access Control for Apache Hive & Cloudera Impala
• IBM InfoSphere Optim Data Masking - Optim Data Masking provides “De-identification” of data by obfuscating corporate secrets, Guardium provides monitoring & auditing
• Intel’s Secure Hadoop Distribution - Encryption in transit & at rest, Granular access control with HBase
• DataStax Enterprise - Encryption in Transit & at Rest (using Cassandra for storage)
• DataGuise for Hadoop - Detects & protects sensitive data, setting access permission, masking or encrypting data, authorization based access
• Knox Gateway (Hortonworks) - Perimeter security, integration with IDAM environments, manage security across multiple clusters; now an Apache Project
• Protegrity - Big Data Protector provides Encryption & tokenization, Enterprise Security Administrator provides central policy, key mgmt, auditing, reporting
• Sqrrl - Builds on Apache Accumulo’s security capabilities for Hadoop
• Zettaset Secure Orchestrator - security wrapper around Hadoop

Seems to be a New One Every Week!


Apache Accumulo

• Cell-Level Access Control via visibility
• By default, uses its own db for users & credentials
• Can be extended in code to use other Identity & Access Management Infrastructure
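To illustrate what cell-level access control via visibility means, here is a minimal Python sketch of checking a visibility expression against a user's authorizations. It is not Accumulo's implementation or API: the function `is_visible` and the label names are hypothetical, and the supported label characters are a simplifying assumption. It does follow two rules Accumulo applies: an empty expression is visible to everyone, and mixing `&` and `|` at the same level requires parentheses.

```python
import re

# Assumption: labels use only letters, digits, and _ - . : (no quoted labels).
_TOKEN = re.compile(r"\s*([A-Za-z0-9_\-.:]+|[&|()])")


def _tokenize(expression):
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if a user holding `authorizations` may read the cell."""
    tokens = _tokenize(expression)
    if not tokens:  # an empty label means the cell is visible to everyone
        return True
    auths = set(authorizations)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths  # a bare label is satisfied if the user holds it

    def parse_expr():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if operator == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

For example, a cell labeled `admin|(analyst&pii)` would be readable by a user holding both `analyst` and `pii`, but not by a user holding only `analyst`.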


Project Rhino

Intel launched this open source effort to improve security capabilities of Hadoop & contributed code to Apache in early 2013.

• Encrypted Data at Rest - JIRA Tasks HADOOP-9331 (Hadoop Crypto Codec Framework and Crypto Codec Implementation) and MAPREDUCE-5025 (Key Distribution and Management for Supporting Crypto Codec in MapReduce). ZOOKEEPER-1688 will provide the ability for transparent encryption of snapshots and commit logs on disk, protecting against the leakage of sensitive information from files at rest.
• Token-Based Authentication & Unified Authorization Framework - JIRA Tasks HADOOP-9392 (Token-Based Authentication and Single Sign-On) and HADOOP-9466 (Unified Authorization Framework)
• Improved Security in HBase - The JIRA Task HBASE-6222 (Add Per-KeyValue Security) adds cell-level authorization to HBase, something that Apache Accumulo has but HBase does not. HBASE-7544 builds on the encryption framework being developed, extending it to HBase, providing transparent table encryption.


What’s the Best Guidance Now?

Identify and Understand the Sensitivity Levels of Your Data

• Are there access control policies associated with your data?

Understand the Impact of the Release of Your Data

• Netflix example: Could someone couple your data with open source data to gain new (and unintended) insight?

Develop Policies & Procedures relating to Security & Privacy of Your Data Sets

• Data Ingest
• Access Control within Your Organization
• Cleansing/Sanitization/Destruction
• Auditing
• Monitoring Procedures
• Incident Response

Develop a Technical Security Approach that Complements Hadoop Security
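As a toy illustration of the coupling risk raised by the Netflix example above, the Python sketch below joins a "de-identified" release with a public data set on shared attributes. Every record, name, and field here is made up for the example, and the function `reidentify` is hypothetical; real linkage attacks are usually fuzzier than this exact-match join.

```python
# Hypothetical "anonymized" release: names removed, other attributes kept.
released = [
    {"zip": "02138", "birth_year": 1960, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical openly available data set that still carries names.
public = [
    {"name": "Pat Example", "zip": "02138", "birth_year": 1960, "sex": "F"},
    {"name": "Sam Sample", "zip": "02139", "birth_year": 1985, "sex": "M"},
]

# Attributes present in both data sets that can act as a join key.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def reidentify(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Map each name to a released row when the join key matches exactly one row."""
    by_key = {}
    for row in released_rows:
        by_key.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = {}
    for person in public_rows:
        candidates = by_key.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match links a name to a record
            matches[person["name"]] = candidates[0]
    return matches
```

In this made-up data, one person is uniquely matched to a diagnosis even though the release contains no names, while the other is protected only because two released rows share the same attribute combination.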


Questions?

Ksmith <AT> Novetta.COM