
HDFS: Optimization, Stabilization and Supportability

April 13, 2016

Chris Nauroth
email: [email protected]
twitter: @cnauroth


About Me

Chris Nauroth

• Member of Technical Staff, Hortonworks

– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Major contributor to HDFS ACLs, Windows compatibility, and operability improvements

• Hadoop user since 2010
– Prior employment experience deploying, maintaining and using Hadoop clusters



Motivation

• HDFS engineers are on the front line for operational support of Hadoop.
– HDFS is the foundational storage layer for typical Hadoop deployments.
– Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
– Conversely, application problems can become visible at the layer of HDFS operations.

• Analysis of Hadoop Support Cases
– Support case trends reveal common patterns for HDFS operational challenges.
– Those challenges inform what needs to improve in the software.

• Software Improvements
– Optimization: Identify bottlenecks and make them faster.
– Stabilization: Prevent unusual circumstances from harming cluster uptime.
– Supportability: When something goes wrong, provide visibility and tools to fix it.

Thank you to the entire community of Apache contributors.


Logging

• Logging requires a careful balance.
– Too little logging hides valuable operational information.
– Too much logging causes information overload, increased load and greater garbage collection overhead.

• Logging APIs
– Hadoop codebase currently uses a mix of logging APIs.
– Commons Logging and Log4J 1 require additional guard logic to prevent construction of expensive messages:

if (LOG.isDebugEnabled()) {
  LOG.debug("Processing block: " + block); // expensive toString() implementation!
}

– SLF4J simplifies this:

LOG.debug("Processing block: {}", block); // calls toString() only if debug enabled

• Pitfalls
– Forgotten guard logic.
– Logging in a tight loop.
– Logging while holding a shared resource, such as a mutually exclusive lock (see the sketch below).
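
To make the last pitfall concrete, here is a minimal, hypothetical sketch (the class and method names are invented for illustration, not taken from HDFS): the logging call itself becomes the bottleneck when it runs inside a critical section.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BlockMap {
  private static final Logger LOG = LoggerFactory.getLogger(BlockMap.class);
  private final Object lock = new Object();

  public void removeBad(String blockId) {
    synchronized (lock) {
      // Anti-pattern: message formatting and appender I/O happen while the
      // lock is held, so every thread contending on the lock waits on logging.
      LOG.info("Removing block: {}", blockId);
      // ... mutate shared state ...
    }
  }

  public void removeGood(String blockId) {
    synchronized (lock) {
      // ... mutate shared state ...
    }
    // Better: log after releasing the lock.
    LOG.info("Removed block: {}", blockId);
  }
}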


HADOOP-12318: better logging of LDAP exceptions

• Failure to log full details of an authentication failure.
– Very simple patch, huge payoff.
– Include exception details when logging failure.

• Before:

throw new SaslException("PLAIN auth failed: " + e.getMessage());

• After:

throw new SaslException("PLAIN auth failed: " + e.getMessage(), e);


HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds

• Logging is too verbose

– Summary of patch: don't log too much!
– Move detailed logging to trace level.
– It's still accessible for edge case troubleshooting, but it doesn't impact base operations.

• Before:

LOG.info("BLOCK* processOverReplicatedBlock: "
    + "Postponing processing of over-replicated " + block
    + " since storage + " + storage
    + "datanode " + cur + " does not yet have up-to-date "
    + "block information.");

• After:

if (LOG.isTraceEnabled()) {
  LOG.trace("BLOCK* processOverReplicatedBlock: Postponing " + block
      + " since storage " + storage
      + " does not yet have up-to-date information.");
}


Troubleshooting

• Kerberos is hard.
– Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
– Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
– When it doesn't work, finding root cause is challenging.

• Metrics are vital for diagnosis of most operational problems.
– Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
– Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)


HADOOP-12426: kdiag

• Kerberos misconfiguration diagnosis.
– Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems:
– DNS
– Hadoop configuration files
– KDC configuration

• kdiag: a command-line tool for diagnosis of Kerberos problems (example invocation below)
– Automatically triggers Java diagnostics, such as -Dsun.security.krb5.debug.
– Prints various environment variables, Java system properties and Hadoop configuration options related to security.
– Attempts a login.
– If a keytab is used, prints principal information from the keytab.
– Prints krb5.conf.
– Validates the kinit executable (used for ticket renewals).
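
A typical invocation looks like the following sketch. The keytab path and principal are placeholders; kdiag shipped with Hadoop 2.8, so check hadoop kdiag --help for the exact options on your version.

hadoop kdiag \
  --keytab /etc/security/keytabs/nn.service.keytab \
  --principal nn/nn.example.com@EXAMPLE.COM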


HDFS-6982: nntop

• Find activity trends of HDFS operations.
– HDFS audit log contains a record of each file system operation to the NameNode.
– NameNode metrics contain raw counts of operations.
– Identifying load trends from particular users or particular operations has always required ad-hoc scripting to analyze the above sources of information.

• nntop: HDFS operation counts aggregated per operation and per user within time windows.

– curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'

– Look for the "TopUserOpCounts" section in the returned JSON (excerpt):

"ops": [
  {
    "totalCount": 1,
    "opType": "delete",
    "topUsers": [
      {
        "count": 1,
        "user": "chris"
      }


HDFS-7182: JMX metrics aren't accessible when NN is busy

• Lock contention while attempting to query NameNode JMX metrics.
– JMX metrics are often queried in response to operational problems.
– Some metrics data required acquisition of a lock inside the NameNode. If another thread held this lock, then metrics could not be accessed.
– During times of high load, the lock is likely to be held by another thread.
– At a time when the metrics are most likely to be needed, they were inaccessible.
– This patch addressed the problem by acquiring the metrics data without requiring the lock to be held (see the sketch below).
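
A minimal sketch of the general technique, not the actual NameNode code: writers maintain a lock-free mirror of the metric while holding the main lock, and metrics readers consult only the mirror.

import java.util.concurrent.atomic.AtomicLong;

public class NamespaceStats {
  private final Object namespaceLock = new Object();
  // Metric mirror updated by writers; readable without the namespace lock.
  private final AtomicLong blocksTotal = new AtomicLong();

  public void addBlock() {
    synchronized (namespaceLock) {
      // ... mutate namespace state under the lock ...
      blocksTotal.incrementAndGet();
    }
  }

  // JMX getter: safe to call even while another thread holds namespaceLock.
  public long getBlocksTotal() {
    return blocksTotal.get();
  }
}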


Managing Load

• RPC call load.
– It's too easy for a single inefficient job to overwhelm a cluster with too much RPC load.
– RPC servers accept calls into a single shared queue.
– Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient job that caused the problem.
– Load problems can be mitigated with enhanced admission control, client back-off and throttling policies tailored to real-world usage patterns.


HADOOP-10282: FairCallQueue

• Hadoop RPC Architecture
– Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
– Worker threads consume the incoming calls from that shared queue and process them.
– In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
– At the extreme, the queue overflows, which then requires rejecting the calls.
– This tends to punish all callers, not just the caller that triggered the unusually high load.

• RPC Congestion Control with FairCallQueue (configuration sketch below)
– Replace single shared queue with multiple prioritized queues.
– Call is placed into a queue with priority selected based on the calling user's current history.
– Calls are dequeued and processed with greater frequency from higher-priority queues.
– Under normal operations, when the RPC server can keep up with load, this is not noticeably different from the original architecture.
– Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
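
FairCallQueue is enabled per RPC server port in core-site.xml. A minimal sketch, assuming the NameNode's RPC server listens on port 8020 (the property name is keyed by whatever port yours uses):

<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>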


HADOOP-10597: RPC Server signals backoff to clients when all request queues are full

• Client-side backoff from overloaded RPC servers.

– Builds upon work of the RPC FairCallQueue.
– If an RPC server's queue is full, then optionally send a signal to additional incoming clients to request backoff.
– Clients are aware of the signal, and react by performing exponential backoff before sending additional calls.
– Improves quality of service for clients when the server is under heavy load. RPC calls that would have failed will instead succeed, but with longer latency.
– Improves likelihood of the server recovering, because client backoff will give it more opportunity to catch up.
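
Backoff is likewise a per-port switch in core-site.xml. A sketch, again assuming port 8020:

<property>
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>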


HADOOP-12916: Allow RPC scheduler/callqueue backoff using response times

• More flexibility in back-off policies.

– Triggering backoff when the queue is full is in some sense too late. The problem has already grown too severe.
– Instead, track call response time, and trigger backoff when response time exceeds bounds.
– Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can prevent the problem from becoming so severe that the queue overflows.
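
A sketch of how this might be configured in core-site.xml. The port, scheduler class and threshold values below are assumptions based on the DecayRpcScheduler that underlies FairCallQueue (one response-time threshold per priority level), so verify the property names against your Hadoop version:

<property>
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.backoff.responsetime.enable</name>
  <value>true</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.backoff.responsetime.thresholds</name>
  <value>10s,20s,30s,40s</value>
</property>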


Performance

• Garbage Collection
– NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
– Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are larger, therefore the memory footprint has increased for tracking block state.)
– Much has been written about garbage collection tuning for large heap JVM processes.
– In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure.

• Block Reporting– The process by which DataNodes report information about their stored blocks to the NameNode.

– Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.

– Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.

– All block reporting occurs asynchronously from user-facing operations, so it does not impact end user latency directly.

– However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user operations sufficiently.


HDFS-7097: Allow block reports to be processed during checkpointing on standby name node

• Coarse-grained locking impedes block report processing.

– NameNode has a global lock required to enforce mutual exclusion for some operations.
– One such operation is checkpointing performed at the HA standby NameNode: the process of creating a new fsimage representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
– Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.

• Coarse-grained lock contention can lead to cascading failure and downtime.
– Checkpointing holds the lock.
– Frequent incremental block reports from DataNodes block waiting to acquire the lock.
– Eventually this consumes all available RPC handler threads, all waiting to acquire the lock.
– In the extreme case, this blocks HA NameNode failover, because there is no RPC handler thread available to handle the failover request.
– Even if HA failover can succeed, it may still leave the cluster in a state where it appears many nodes have gone dead, because their blocked heartbeats couldn't be processed.

• Solution: allow block report processing without holding the global lock.
– Block reports now can be processed concurrently with a checkpoint in progress.
– Like most multi-threading and locking logic, this required careful reasoning to ensure the change was safe.


HDFS-7435: PB encoding of block reports is very inefficient

• Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
– HDFS RPC messages are encoded using Protocol Buffers.
– Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
– Behind the scenes, this becomes an ArrayList with a default capacity of 10.
– DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost guaranteed.
– The data type contained in the ArrayList is Long (note capitalization, not primitive long).
– Boxing and unboxing causes additional allocation requirements.

• Solution: a more GC-friendly encoding of block reports (sketch below).
– Within the Protocol Buffers RPC message, take over serialization directly.
– Manually encode number of longs, followed by list of primitive longs.
– Eliminates ArrayList reallocation costs.
– Eliminates boxing and unboxing costs by deserializing straight to primitive long.
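
A minimal sketch of the idea, not the actual BlockListAsLongs implementation: write a count followed by raw varint-encoded primitive longs into a single bytes field, so deserialization never allocates Long objects or resizes an ArrayList.

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.CodedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public final class PrimitiveLongCodec {
  // Encode count + values as varints; the result is carried in one bytes field.
  static byte[] encode(long[] values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    CodedOutputStream out = CodedOutputStream.newInstance(bytes);
    out.writeUInt64NoTag(values.length);
    for (long v : values) {
      out.writeUInt64NoTag(v);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Decode straight into a primitive long[]: no boxing, no ArrayList growth.
  static long[] decode(byte[] encoded) throws IOException {
    CodedInputStream in = CodedInputStream.newInstance(encoded);
    int count = (int) in.readUInt64();
    long[] values = new long[count];
    for (int i = 0; i < count; i++) {
      values[i] = in.readUInt64();
    }
    return values;
  }
}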


HDFS-7609: Avoid retry cache collision when Standby NameNode loading edits

• Idempotence and at-most-once delivery of HDFS RPC messages.

– Some RPC message processing is inherently idempotent: can be applied multiple times, and the final result is still the same. Example: setPermission.

– Other messages are not inherently idempotent, but the NameNode can still provide an “at-most-once” processing guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.

– The data structure that does this is called the RetryCache (a toy sketch follows below).
– This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send the same message more than once.

• Erroneous multiple RetryCache entries for the same operation.
– Duplicate entries caused slowdown.
– Particularly noticeable during an HA transition.
– Bug fix to prevent duplicate entries.
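
For intuition, a toy sketch of the at-most-once pattern, greatly simplified relative to the real RetryCache (which also caches responses and expires entries); all names here are invented:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ToyRetryCache {
  // Key: a unique (clientId, callId) pair identifying one logical RPC call.
  private final Map<String, Boolean> completed = new ConcurrentHashMap<>();

  public boolean rename(String clientId, int callId, String src, String dst) {
    String key = clientId + ":" + callId;
    // putIfAbsent returns non-null if this exact call was already processed,
    // so a retransmitted RPC is acknowledged without re-executing the rename.
    if (completed.putIfAbsent(key, Boolean.TRUE) != null) {
      return true; // duplicate delivery: report prior success, do nothing
    }
    return doRename(src, dst); // first delivery: actually execute
  }

  private boolean doRename(String src, String dst) {
    // ... perform the namespace operation ...
    return true;
  }
}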


HDFS-9710: Change DN to send block receipt IBRs in batches

• Incremental block reports trigger multiple RPC calls.
– When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
– Even multiple block receipts translate to multiple individual incremental block report RPCs.
– With consideration of all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.

• Solution: batch multiple block receipt events into a single RPC message (sketch below).
– Reduces RPC overhead of sending multiple messages.
– Scales better with respect to number of nodes and number of blocks in a cluster.
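
A minimal sketch of the batching pattern with invented names, not the DataNode's actual IBR code: receipts accumulate in a queue, and a single sender thread drains whatever has queued up and ships it in one RPC.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchingReporter implements Runnable {
  private final BlockingQueue<Long> receivedBlockIds = new LinkedBlockingQueue<>();

  // Called once per block receipt; cheap, no RPC here.
  public void blockReceived(long blockId) {
    receivedBlockIds.add(blockId);
  }

  @Override
  public void run() {
    List<Long> batch = new ArrayList<>();
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Block for the first receipt, then sweep up everything queued since.
        batch.add(receivedBlockIds.take());
        receivedBlockIds.drainTo(batch);
        sendIncrementalBlockReport(batch); // one RPC covering the whole batch
        batch.clear();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  private void sendIncrementalBlockReport(List<Long> blockIds) {
    // ... a single NameNode RPC carrying every receipt in blockIds ...
  }
}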


Liveness

• "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections, parts of the program that cannot be simultaneously run by multiple processes." -Wikipedia

• DataNode Heartbeats
– Responsible for reporting health of a DataNode to the NameNode.
– Operational problems of managing load and performance can block timely heartbeat processing.
– Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).

• Blocked heartbeat processing can cause cascading failure and downtime.
– Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and therefore are not running.
– DataNodes that stop running are flagged by the NameNode as dead.
– Too many dead DataNodes makes the cluster inoperable as a whole.
– Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
– Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.


HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health

• The lifeline keeps the DataNode alive, despite conditions of unusually high load.

– Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by DataNodes (configuration sketch below).
– Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
– Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
– Prevents erroneous and costly re-replication activity.
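
The lifeline server is opt-in: configuring a lifeline RPC address starts it. A sketch in hdfs-site.xml; the host and port are placeholders:

<property>
  <name>dfs.namenode.lifeline.rpc-address</name>
  <value>nn.example.com:8050</value>
</property>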


HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server.

• RPC offload of HA health check and failover messages.

– Similar to the problem of timely heartbeat message delivery.
– NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
– Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
– A NameNode overwhelmed with unusually high load cannot process these messages.
– Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
– The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the case of unusually high load.


Optimizing Applications

• HDFS Utilization Patterns
– Sometimes it's helpful to look a layer higher and assess what applications are doing with HDFS.
– The FileSystem API unfortunately can make it too easy to implement inefficient call patterns.


HIVE-10223: Consolidate several redundant FileSystem API calls.

• The Hadoop FileSystem API can cause applications to make redundant RPC calls.

• Before:

if (fs.isFile(file)) { // RPC #1
  ...
} else if (fs.isDirectory(file)) { // RPC #2
  ...
}

• After:

FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
if (fileStatus.isFile()) { // Local, no RPC
  ...
} else if (fileStatus.isDirectory()) { // Local, no RPC
  ...
}

• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.


PIG-4442: Eliminate redundant RPC call to get file information in HPath.

• A similar story of redundant RPC within Pig code.

• Before:

long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2

• After:

FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
long blockSize = fileStatus.getBlockSize(); // Local, no RPC
short replication = fileStatus.getReplication(); // Local, no RPC

• Revealed from inspection of the HDFS audit log.
– The HDFS audit log shows a record of each file system operation executed against the NameNode.
– This continues to be one of the most significant sources of HDFS troubleshooting information.
– In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.


HDFS-9924: Asynchronous HDFS Access

• Current Hadoop FileSystem API is inherently synchronous.
– Issue a single synchronous file system call.
– In the case of HDFS, that call is implemented with a synchronous RPC.
– Block waiting for the result.
– Then, the client application may proceed.

• Some application usage patterns would benefit from asynchronous access.
– Some applications regularly issue a large sequence of multiple file system calls, with no data dependencies between the results of those calls.
– For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename can execute independently, with no data dependencies on the results of other renames.

public Future<Boolean> rename(Path src, Path dst) throws IOException;
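
A sketch of how a client might use such an API, taking the Future-returning rename signature above as given. The AsyncFileSystem type is hypothetical; HDFS-9924 was still under development at the time of this talk.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.Path;

public class AsyncRenameExample {
  public static void renameAll(AsyncFileSystem asyncFs, List<Path> srcs, Path destDir)
      throws IOException, InterruptedException, ExecutionException {
    // Issue all renames without blocking between them...
    List<Future<Boolean>> pending = new ArrayList<>();
    for (Path src : srcs) {
      pending.add(asyncFs.rename(src, new Path(destDir, src.getName())));
    }
    // ...then collect the results once, overlapping all of the RPC round trips.
    for (Future<Boolean> f : pending) {
      if (!f.get()) {
        throw new IOException("rename failed");
      }
    }
  }
}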


Summary

• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.

• Optimization– Performance– Optimizing Applications

• Stabilization– Liveness– Managing Load

• Supportability– Logging– Troubleshooting


Thank you!

Q&A