Hadoop Security Architecture

Owen O’Malley [email protected]

description

A deep dive on Hadoop security given at Yahoo to other architects when Yahoo was rolling out the first version of Hadoop with security 0.20.100.

Transcript of Hadoop Security Architecture

Page 1: Hadoop Security Architecture

Hadoop

Hadoop Security Architecture

Owen O’Malley

[email protected]

Page 2: Hadoop Security Architecture

Outline

• Problem Statement

• Security Threats

• Solutions to Threats

• HDFS

• MapReduce

• Oozie

• Interfaces

• Performance

• Reliability and Availability

• Operations and Monitoring

Page 3: Hadoop Security Architecture

Problem Statement

• The fundamental goal of adding Hadoop security is that Yahoo's data stored in HDFS must be secure from unauthorized access. Furthermore, it must do so without adding significant effort to operating or using the Grid. Based on that goal, there are a few implications.

– All HDFS clients must be authenticated to ensure that the user is who they claim to be. That implies that all MapReduce users, including services such as Oozie, must also be authenticated and that tasks must run with the privileges and identity of the submitting user.

– Since Data Nodes and TaskTrackers are entrusted with user data and credentials, they must authenticate themselves to ensure they are running as part of the Grid and are not trojan horses.

– Kerberos will be the underlying authentication service so that users can be authenticated using their system credentials.

Page 4: Hadoop Security Architecture

Communication and Threats

Page 5: Hadoop Security Architecture

Security Threats in Hadoop

• User to Service Authentication

– No User Authentication on NameNode or JobTracker

• Client code supplies user and group names

– No User Authorization on DataNode – Fixed in 0.21

• Users can read/write any block

– No User Authorization on JobTracker

• Users can modify or kill other users’ jobs

• Users can modify the persistent state of JobTracker

• Service to Service Authentication

– No Authentication of DataNodes and TaskTrackers

• Users can start fake DataNodes and TaskTrackers

• No Encryption on Wire or Disk

Page 6: Hadoop Security Architecture

Solutions to Threats

• Add Kerberos-based authentication to NameNode and JobTracker.

• Add delegation tokens to HDFS and support for them in MapReduce.

• Determine user’s group membership on the NameNode and JobTracker.

• Protect MapReduce system directory from users.

• Add an authorization model to MapReduce so that only the submitting user can modify or kill a job.

• Add Backyard authentication to Web UIs.

Page 7: Hadoop Security Architecture

Out of Scope for 0.20.100

• Protecting against root on slave nodes:

– Encryption of RPC messages

– Encryption of block transfer protocol

– Encryption of MapReduce transient files

– Encryption of HDFS block files

• Passing Kerberos tickets to MapReduce tasks for third-party Kerberized services.

Page 8: Hadoop Security Architecture

HDFS Security

• Users authenticate via Kerberos

• MapReduce jobs can obtain delegation tokens for later use.

• When clients are reading or writing an HDFS file, the NameNode generates a block access token that will be verified by the DataNode.

Page 9: Hadoop Security Architecture

HDFS Authentication

• Clients authenticate to NameNode via:

– Kerberos

– Delegation tokens

• Client demonstrates authorization to DataNode via block access token

• DataNode authenticates to NameNode via Kerberos

Page 10: Hadoop Security Architecture

What does this *really* look like?

• Need a Kerberos ticket to work

– kinit -l 7d [email protected]

– hadoop fs -ls

– hadoop jar my.jar in-dir out-dir

• Works using ticket cache!

– Can display ticket cache with klist.

Page 11: Hadoop Security Architecture

Kerberos Dataflows

Page 12: Hadoop Security Architecture

Delegation Token

• Advantages over using Kerberos directly:

– Don’t trust JobTracker with credentials

– Avoid MapReduce task authorization flood

– Renewable by a third party (i.e., JobTracker)

– Revocable when job finishes

• tokenId = {owner prin, renewer prin, issueDate, maxDate}

• tokenAuthenticator = HMAC(masterKey, tokenId)

• Token = {tokenId, tokenAuthenticator}
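The token construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field encoding, the function names (`issue`, `verify`, `authenticator`) and the choice of SHA-1 for the HMAC are hypothetical and are not taken from the Hadoop source; the slides only give the shape tokenAuthenticator = HMAC(masterKey, tokenId).

```python
import hashlib
import hmac
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenId:
    owner: str       # principal the token was issued to
    renewer: str     # principal allowed to renew it (e.g. the JobTracker)
    issue_date: int  # seconds since the epoch
    max_date: int    # the token is not valid past this time

    def to_bytes(self) -> bytes:
        # Hypothetical encoding; the real wire format is not given in the slides.
        fields = [self.owner, self.renewer, str(self.issue_date), str(self.max_date)]
        return "\x00".join(fields).encode("utf-8")


def authenticator(master_key: bytes, token_id: TokenId) -> bytes:
    # tokenAuthenticator = HMAC(masterKey, tokenId); SHA-1 is an assumption.
    return hmac.new(master_key, token_id.to_bytes(), hashlib.sha1).digest()


def issue(master_key, owner, renewer, lifetime_s, now=None):
    # Token = {tokenId, tokenAuthenticator}
    now = int(time.time()) if now is None else now
    token_id = TokenId(owner, renewer, now, now + lifetime_s)
    return token_id, authenticator(master_key, token_id)


def verify(master_key, token_id, presented, now=None):
    # Only a holder of masterKey can recompute the authenticator, so a client
    # cannot alter owner, renewer, or dates without being detected.
    now = int(time.time()) if now is None else now
    expected = authenticator(master_key, token_id)
    return hmac.compare_digest(expected, presented) and now <= token_id.max_date
```

Because the server can recompute the authenticator from the tokenId and its own master key, it does not need to store the authenticator itself, and the task never sees the user's Kerberos credentials.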

Page 13: Hadoop Security Architecture

Block Access Token

• Only the NameNode knows the set of users allowed to access a specific block, so the NameNode gives authorized clients a block access token.

• Capabilities include read, write, copy, or replace.

• The NameNode and DataNodes share a dynamically rolled secret key to secure the tokens.

• tokenId={expiration, keygen, owner, block, access}

• tokenAuthenticator = HMAC(blockKey, tokenId)

• token = {tokenId, tokenAuthenticator}
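A rough sketch of how a DataNode might check such a token follows. It assumes that `keygen` identifies which generation of the rolled secret key signed the token, that the HMAC uses SHA-1, and that fields are joined with a simple delimiter; all function names and the encoding are hypothetical, not Hadoop's actual implementation.

```python
import hashlib
import hmac
import time


def _encode(expiration, key_id, owner, block_id, access):
    # Hypothetical encoding of tokenId = {expiration, keygen, owner, block, access}.
    modes = ",".join(sorted(access))
    return f"{expiration}|{key_id}|{owner}|{block_id}|{modes}".encode("utf-8")


def issue_block_token(keys, key_id, owner, block_id, access, ttl_s, now=None):
    # NameNode side: sign the tokenId with the current shared block key.
    now = int(time.time()) if now is None else now
    token_id = (now + ttl_s, key_id, owner, block_id, frozenset(access))
    mac = hmac.new(keys[key_id], _encode(*token_id), hashlib.sha1).digest()
    return token_id, mac


def datanode_allows(keys, token, block_id, mode, now=None):
    # DataNode side: recompute the HMAC with the key generation named in the
    # token, then check expiry, block, and requested access mode.
    now = int(time.time()) if now is None else now
    (expiration, key_id, owner, token_block, access), mac = token
    key = keys.get(key_id)
    if key is None:  # that key generation has been rolled out of the table
        return False
    expected = hmac.new(
        key, _encode(expiration, key_id, owner, token_block, access), hashlib.sha1
    ).digest()
    return (
        hmac.compare_digest(expected, mac)
        and now <= expiration
        and token_block == block_id
        and mode in access
    )
```

The DataNode never consults the NameNode during the check; it only needs the shared key table, which is why the key can be rolled without per-request coordination.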

Page 14: Hadoop Security Architecture

MapReduce Security

• Require Kerberos authentication from client.

• Secure the information about pending and running jobs

– Store the job configuration and input splits in HDFS under ~user/.staging/$jobid

– Store the job’s location and secrets in private directory

• JobTracker creates a random job token. It is used for:

– Connecting to TaskTracker’s RPC

– Authorizing HTTP GET for shuffle

• HMAC(job token, URL) sent from reduce tasks to TaskTracker
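The shuffle check can be illustrated as below. This is a minimal sketch, assuming the reduce task sends a Base64-encoded HMAC-SHA1 of the request URL alongside the request; the function names and the example URL are hypothetical and the real header and URL layout are not given in the slides.

```python
import base64
import hashlib
import hmac


def sign_url(job_token: bytes, url: str) -> str:
    # HMAC(job token, URL), Base64-encoded so it can travel in an HTTP header.
    mac = hmac.new(job_token, url.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(mac).decode("ascii")


def tasktracker_accepts(job_token: bytes, url: str, presented: str) -> bool:
    # The TaskTracker holds the same job token, so it recomputes the HMAC over
    # the URL it actually received and compares in constant time.
    return hmac.compare_digest(sign_url(job_token, url), presented)
```

For example, a reduce task holding the job token would call `sign_url(job_token, url)` for the map output it wants, and the TaskTracker would serve the data only if `tasktracker_accepts` returns true for that same URL. A task from another job has a different job token and cannot produce a matching value.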

Page 15: Hadoop Security Architecture

MapReduce Authentication

• Client authenticates to JobTracker via Kerberos

• TaskTracker authenticates to JobTracker via Kerberos

• Task authenticates to the TaskTrackers using the job token

• Task authenticates to HDFS using a delegation token

• NFS is not Kerberized.

Page 16: Hadoop Security Architecture

MapReduce Task Security

• Users have separate task directories with permissions set to 700.

• Distributed cache is now divided based on the source’s visibility

– Global – shared with other users

– Private - protected from other users
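As a small illustration of the 700 permission on task directories, the following sketch creates a directory that only its owner can read, write, or traverse. The function name and path layout are hypothetical, and the sketch does not cover changing ownership to the submitting user, which a real TaskTracker would also need to arrange.

```python
import os
import stat
import tempfile


def make_private_dir(path: str) -> int:
    # Create the directory, then set the mode explicitly because the mode
    # passed to makedirs is filtered through the process umask.
    os.makedirs(path, mode=0o700, exist_ok=True)
    os.chmod(path, 0o700)
    # Return the resulting permission bits so the caller can check them.
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        mode = make_private_dir(os.path.join(root, "alice", "job_1"))
        print(oct(mode))
```

With mode 700, other users on the same node cannot list or open files in the directory, which is what keeps one user's task output away from another user's tasks.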

Page 17: Hadoop Security Architecture

Web UI

• MapReduce makes heavy use of Web UI for displaying state of cluster and running jobs.

• HDFS also has a web browsing interface.

• Use Backyard to authenticate Web UI users

• Only allow submitting user of job to view stdout and stderr of job’s tasks.

• HDFS web browser checks user’s authorization.

Page 18: Hadoop Security Architecture

Oozie

• Client authenticates to Oozie

– Custom auth for Yahoo!

• Oozie authenticates to HDFS and MapReduce as “oozie” principal

• “oozie” is configured as a super-user for HDFS and MapReduce and may act as other users.
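The trust decision described here can be sketched as a single check on the server side. This is a simplified model, not Hadoop's code: `effective_user`, `do_as`, and `trusted_proxies` are hypothetical names, and a real deployment would likely restrict a proxy further (for example by host or group).

```python
from typing import Optional, Set


def effective_user(authenticated: str, do_as: Optional[str],
                   trusted_proxies: Set[str]) -> str:
    # An ordinary request runs as whoever authenticated.
    if do_as is None or do_as == authenticated:
        return authenticated
    # Only a configured super-user such as "oozie" may act as someone else.
    if authenticated in trusted_proxies:
        return do_as
    raise PermissionError(f"{authenticated} may not act as {do_as}")
```

With `trusted_proxies = {"oozie"}`, a request authenticated as "oozie" on behalf of "alice" runs as "alice", while the same request from an ordinary user is rejected.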

Page 19: Hadoop Security Architecture

Proxy Services Trust Model

• Requires trust that the service (e.g., Oozie) principal is secure.

• Explored and rejected

– Having user headless principals stored on Oozie machine “x/oozie” for user “x”

– Passing user headless principal keytab to Oozie

– Generalizing delegation token to have token granting tokens.

Page 20: Hadoop Security Architecture

Protocols

• RPC

– Change RPC to use SASL and either:

• Kerberos authentication (GSSAPI)

• Tokens (DIGEST-MD5)

– User’s Kerberos tickets obtained at login used automatically.

– Changes RPC format

– Can easily add encryption later

• Block transfer protocol

– Block access tokens in data stream
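For the RPC case, the choice between the two SASL mechanisms can be pictured as follows. The function is hypothetical, and the order of preference (token first, then Kerberos) is an assumption made for this sketch; the slides only state that RPC uses either GSSAPI or DIGEST-MD5.

```python
def choose_sasl_mechanism(has_token: bool, has_kerberos_ticket: bool) -> str:
    # A task holding a delegation or job token authenticates with it, which
    # avoids a round trip to the KDC.
    if has_token:
        return "DIGEST-MD5"
    # A user at login has only the Kerberos ticket cache to draw on.
    if has_kerberos_ticket:
        return "GSSAPI"
    raise PermissionError("no token or Kerberos ticket available")
```

Under this assumption, a freshly logged-in user gets "GSSAPI", while a MapReduce task carrying a delegation token gets "DIGEST-MD5".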

Page 21: Hadoop Security Architecture

Protocols

• HTTP

– User/Browser facing

• Yahoo – Custom Authentication

• External – SPNEGO or Kerberos login module

– Web Services

• HFTP – Hadoop File Transfer Protocol

• Others later

• SPNEGO or Delegation Token via RPC

– Shuffle

• Use HMAC of URL hashed with Job Token

Page 22: Hadoop Security Architecture

Summary

• RPC

– Kerberos

• Application to NameNode, JobTracker

• DataNode to NameNode

• TaskTracker to JobTracker

– Digest-MD5

• MapReduce task to NameNode, TaskTracker

• Block Access Token

• Backyard

– User to Web UI

Page 23: Hadoop Security Architecture

Figure: authentication paths into the NameNode (NN) and DataNode (DN). Labels recovered from the diagram (pairings are approximate):

• NN RPC: Kerberos (user initial access, 2ndNN, Balancer); DIGEST-MD5 (task accessing as user)

• NN HTTP: Backyard (browser, 2ndNN, fsck); HFTP (task, distcp accessing as user)

• DN socket: access token (user, DN, Balancer, task)

• DN HTTP: Backyard (browser); HFTP (task, distcp accessing as user)

• Other labels: forward delegation token; HTTP-DIGEST w/ delegation token; Kerberos

Page 24: Hadoop Security Architecture

Figure: authentication paths into the JobTracker (JT) and TaskTracker (TT). Labels recovered from the diagram (pairings are approximate):

• JT RPC: Kerberos (user, TT)

• JT HTTP: Backyard (browser)

• TT RPC: DIGEST-MD5 (local task)

• TT HTTP: Backyard (browser); HMAC (reduce task getting map output)

• Other labels: Kerberos

Page 25: Hadoop Security Architecture

Authentication Paths

Page 26: Hadoop Security Architecture

Interfaces (and their scope and stability)

• Imported Interfaces

– JAAS – Java API for supporting authentication

– SASL – Standard for supporting token and Kerberos authentication

– GSSAPI – Kerberos part of SASL authentication

– HMAC-SHA1 – Shared secret authentication

– SPNEGO – Use Kerberos tickets over HTTP

• Exported Interfaces – Both Limited Private

– HDFS adds a method to get delegation tokens

– RPC adds a doAs method

• Major Internal (Inter-system Interfaces)

– MapReduce Shuffle uses HMAC-SHA1

– RPC uses Kerberos and DIGEST-MD5

Page 27: Hadoop Security Architecture

Pluggability

• Pluggability in Hadoop supports different environments

• HTTP browser user authentication

– Yahoo – Backyard

– External – SPNEGO or Kerberos login module

• RPC transport

– SASL supports DIGEST-MD5, Kerberos, and others

• Acquiring credentials

– JAAS supports Kerberos, and others

Page 28: Hadoop Security Architecture

Performance

• Authentication should not introduce substantial performance penalties.

• Delegation tokens are designed to avoid an authentication flood from MapReduce tasks.

• The overhead is required to be less than 3% on GridMix.

Page 29: Hadoop Security Architecture

Reliability and Availability

• The Kerberos KDC cannot be a single point of failure.

– Kerberos clients automatically fail over to secondary KDCs

– Secondary KDCs can be synced automatically from the primary since the data rarely changes.

• The cluster must remain stable when Kerberos fails.

– The slaves (TaskTrackers and DataNodes) will lose their ability to reconnect to the master only when all three of the following occur: their RPC socket closes, their service ticket has expired, and both the primary and secondary KDCs have failed.

– Decided not to use special tokens to handle this case.

• Once the MapReduce job is submitted, the KDC is not required for the job to continue running.

Page 30: Hadoop Security Architecture

Operations and Monitoring

• The number of Kerberos authorizations will be logged on the NameNode and JobTracker.

• Authorization failures will be logged.

• Authentication failures will be logged.

• The authorization logs will be a separate log4j logger, so they can be directed to a separate file.