
Transcript of Hdfs 2016-hadoop-summit-san-jose-v4

Page 1: Hdfs 2016-hadoop-summit-san-jose-v4

HDFS: Optimization, Stabilization and Supportability

June 28, 2016

Chris Nauroth – email: [email protected] – twitter: @cnauroth

Arpit Agarwal – email: [email protected] – twitter: @aagarw

Page 2: Hdfs 2016-hadoop-summit-san-jose-v4


About Us

Chris Nauroth
• Member of Technical Staff, Hortonworks
  – Apache Hadoop committer, PMC member, and Apache Software Foundation member
  – Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
  – Prior employment experience deploying, maintaining, and using Hadoop clusters

Arpit Agarwal
• Member of Technical Staff, Hortonworks
  – Apache Hadoop committer, PMC member
  – Major contributor to HDFS Heterogeneous Storage Support, Windows compatibility

Page 3: Hdfs 2016-hadoop-summit-san-jose-v4


Motivation

• HDFS engineers are on the front line for operational support of Hadoop.
  – HDFS is the foundational storage layer for typical Hadoop deployments.
  – Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
  – Conversely, application problems can become visible at the HDFS layer of operations.

• Analysis of Hadoop Support Cases
  – Support case trends reveal common patterns for HDFS operational challenges.
  – Those challenges inform what needs to improve in the software.

• Software Improvements
  – Optimization: Identify and mitigate bottlenecks.
  – Stabilization: Prevent unusual circumstances from harming cluster uptime.
  – Supportability: When something goes wrong, provide visibility and tools to fix it.

Thank you to the entire community of Apache contributors.

Page 4: Hdfs 2016-hadoop-summit-san-jose-v4


Performance

• Garbage Collection
  – NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
  – Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks, and those disks are larger, so the memory footprint for tracking block state has increased.)
  – Much has been written about garbage collection tuning for large-heap JVM processes.
  – In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure. (Illustrative JVM settings follow below.)
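As a purely illustrative sketch (not from the slides), large-heap NameNode GC tuning on a JDK 7/8-era cluster is usually expressed through HADOOP_NAMENODE_OPTS in hadoop-env.sh; the heap and new-generation sizes below are placeholders that must be derived from your own file, directory, and block counts:

  # hadoop-env.sh -- illustrative only; sizes depend on the object counts tracked by the NameNode
  export HADOOP_NAMENODE_OPTS="-Xms64g -Xmx64g -XX:NewSize=8g -XX:MaxNewSize=8g \
    -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
    -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hadoop/hdfs/gc.log ${HADOOP_NAMENODE_OPTS}"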

Page 5: Hdfs 2016-hadoop-summit-san-jose-v4


Performance

• Block Reporting
  – The process by which DataNodes report information about their stored blocks to the NameNode.
  – Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.
  – Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
  – All block reporting occurs asynchronously from user-facing operations, so it does not impact end user latency directly.
  – However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user operations sufficiently. (Full block report cadence is governed by the configuration sketched below.)
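For orientation (not on the slide), the cadence and shape of full block reports are controlled in hdfs-site.xml; the property names and defaults below are as documented for Apache Hadoop 2.x and should be verified against your release:

  <!-- hdfs-site.xml: illustrative; verify names and defaults against your release -->
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>21600000</value> <!-- full block report roughly every 6 hours (default) -->
  </property>
  <property>
    <name>dfs.blockreport.split.threshold</name>
    <value>1000000</value> <!-- above this block count, send one report per storage instead of one combined report -->
  </property>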

Page 6: Hdfs 2016-hadoop-summit-san-jose-v4


HDFS-7435: PB encoding of block reports is very inefficient

• Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
  – HDFS RPC messages are encoded using Protocol Buffers.
  – Block reports encode each block as a sequence of 3 64-bit long fields.
  – Behind the scenes, this becomes an ArrayList<Long> with a default capacity of 10.
  – DataNodes almost always send a larger block report than this, so array reallocation churn is almost guaranteed.
  – Boxing and unboxing cause additional allocation requirements.

• Solution: a more GC-friendly encoding of block reports.
  – Take over serialization directly.
  – Manually encode the number of longs, followed by the list of primitive longs.
  – Eliminates ArrayList reallocation costs.
  – Eliminates boxing and unboxing costs by deserializing straight to primitive long. (A simplified sketch of the encoding idea follows.)
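The sketch below is not the HDFS implementation; it only illustrates the encoding idea, assuming each block contributes three long fields. It writes a count followed by raw primitive longs and decodes straight into a long[], so no boxed Long objects or ArrayList growth are involved:

  import java.io.*;

  /** Illustrative sketch (not the HDFS code): encode a block list as a count
   *  followed by raw primitive longs, avoiding boxing and list reallocation. */
  public class PackedBlockReport {

    /** Each block contributes 3 longs: blockId, length, generationStamp. */
    public static byte[] encode(long[] blockFields) throws IOException {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(blockFields.length);        // number of longs up front
      for (long field : blockFields) {
        out.writeLong(field);                  // primitive write, no boxing
      }
      out.flush();
      return bytes.toByteArray();
    }

    public static long[] decode(byte[] data) throws IOException {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      long[] fields = new long[in.readInt()];  // allocate exactly once
      for (int i = 0; i < fields.length; i++) {
        fields[i] = in.readLong();             // primitive read, no unboxing
      }
      return fields;
    }
  }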

Page 7: Hdfs 2016-hadoop-summit-san-jose-v4


HDFS-9710: Change DN to send block receipt IBRs in batches

• Incremental block reports trigger multiple RPC calls.
  – When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
  – Even multiple block receipts translate to multiple individual incremental block report RPCs.
  – Across all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.

• Solution: batch multiple block receipt events into a single RPC message.
  – Reduces the RPC overhead of sending multiple messages.
  – Scales better with respect to the number of nodes and number of blocks in a cluster. (A simplified sketch of the batching idea follows.)
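A minimal sketch of the batching idea, not the actual DataNode code; the ReceivedBlock and NameNodeRpc types are placeholders for the real IBR structures and the NameNode RPC proxy:

  import java.util.ArrayList;
  import java.util.List;

  /** Illustrative sketch: accumulate block-receipt events and flush them to the
   *  NameNode as one batched RPC instead of one RPC per received block. */
  public class IbrBatcher {

    /** Placeholder for the per-block details carried in an incremental block report. */
    public static final class ReceivedBlock {
      public final long blockId;
      ReceivedBlock(long blockId) { this.blockId = blockId; }
    }

    /** Hypothetical callback standing in for the NameNode RPC proxy. */
    public interface NameNodeRpc {
      void blockReceivedAndDeleted(List<ReceivedBlock> batch);
    }

    private final List<ReceivedBlock> pending = new ArrayList<>();
    private final NameNodeRpc nameNode;

    public IbrBatcher(NameNodeRpc nameNode) { this.nameNode = nameNode; }

    /** Called when a block finishes being written; just enqueue, no RPC yet. */
    public synchronized void onBlockReceived(long blockId) {
      pending.add(new ReceivedBlock(blockId));
    }

    /** Called periodically (e.g. alongside the heartbeat loop); one RPC per batch. */
    public synchronized void flush() {
      if (pending.isEmpty()) {
        return;
      }
      nameNode.blockReceivedAndDeleted(new ArrayList<>(pending));
      pending.clear();
    }
  }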

Page 8: Hdfs 2016-hadoop-summit-san-jose-v4


Liveness

• "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections..." -Wikipedia

• DataNode Heartbeats
  – Responsible for reporting the health of a DataNode to the NameNode.
  – Operational problems of managing load and performance can block timely heartbeat processing.
  – Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).

• Blocked heartbeat processing can cause cascading failure and downtime.
  – Blocked heartbeat processing looks the same as a DataNode not running at all.
  – DataNodes not running are flagged by the NameNode as stale, then dead.
  – Multiple stale DataNodes: reduced cluster capacity.
  – Multiple dead DataNodes: a storm of wasteful re-replication activity. (The stale and dead timers involved are sketched in the configuration below.)
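The stale and dead transitions mentioned above are driven by timing properties like these (defaults shown are the documented Hadoop 2.x values; listed for orientation, not as tuning advice):

  <!-- hdfs-site.xml: defaults shown, for orientation only -->
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value> <!-- seconds between DataNode heartbeats -->
  </property>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value> <!-- ms without a heartbeat before a DataNode is considered stale -->
  </property>
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>300000</value> <!-- ms; dead timeout is 2 * recheck + 10 * heartbeat (about 10.5 minutes) -->
  </property>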

Page 9: Hdfs 2016-hadoop-summit-san-jose-v4


HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health

• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
  – Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by DataNodes.
  – Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
  – Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
  – Prevents erroneous and costly re-replication activity. (Enabling the lifeline server is sketched below.)
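Turning the lifeline on amounts to giving the NameNode a dedicated lifeline RPC address (the property was introduced by HDFS-9239 itself; the host and port below are hypothetical):

  <!-- hdfs-site.xml: hypothetical host and port -->
  <property>
    <name>dfs.namenode.lifeline.rpc-address</name>
    <value>nn1.example.com:8050</value> <!-- dedicated lifeline RPC endpoint on the NameNode -->
  </property>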

Page 10: Hdfs 2016-hadoop-summit-san-jose-v4


HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server

• RPC offload of HA health check and failover messages.
  – Similar to the problem of timely heartbeat message delivery.
  – NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
  – Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
  – A NameNode overwhelmed with unusually high load cannot process these messages.
  – Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
  – The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the case of unusually high load.

Page 11: Hdfs 2016-hadoop-summit-san-jose-v4


Optimizing Applications

• HDFS Utilization Patterns
  – Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
  – The FileSystem API unfortunately can make it too easy to implement inefficient call patterns.

Page 12: Hdfs 2016-hadoop-summit-san-jose-v4


HIVE-10223: Consolidate several redundant FileSystem API calls

• The Hadoop FileSystem API can cause applications to make redundant RPC calls.

• Before:

  if (fs.isFile(file)) {                  // RPC #1
    ...
  } else if (fs.isDirectory(file)) {      // RPC #2
    ...
  }

• After:

  FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
  if (fileStatus.isFile()) {              // Local, no RPC
    ...
  } else if (fileStatus.isDirectory()) {  // Local, no RPC
    ...
  }

• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.

Page 13: Hdfs 2016-hadoop-summit-san-jose-v4


PIG-4442: Eliminate redundant RPC call to get file information in HPath

• A similar story of redundant RPC within Pig code.

• Before:

  long blockSize = fs.getHFS().getFileStatus(path).getBlockSize();      // RPC #1
  short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2

• After:

  FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
  long blockSize = fileStatus.getBlockSize();               // Local, no RPC
  short replication = fileStatus.getReplication();          // Local, no RPC

• Revealed by inspection of the HDFS audit log.
  – The HDFS audit log shows a record of each file system operation executed against the NameNode.
  – It continues to be one of the most significant sources of HDFS troubleshooting information.
  – In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.

Page 14: Hdfs 2016-hadoop-summit-san-jose-v4


Managing NameNode Load

• The NameNode is no longer a single point of failure.
  – However, NameNode performance can still be a bottleneck.

• There is an assumption that applications will be well-behaved.

• A single inefficient job can easily overwhelm the NameNode with too much RPC load.

Page 15: Hdfs 2016-hadoop-summit-san-jose-v4


Hadoop RPC Architecture

• Hadoop RPC admits incoming calls into a shared queue.
• Worker threads consume incoming calls from that shared queue and process them.
• In an overloaded situation, calls spend longer waiting in the queue for a worker thread to become available.
• If the RPC queue overflows, requests are queued in the OS socket buffers.
  – More buffering leads to higher RPC latencies and potentially client-side timeouts.
  – Timeouts often result in job failures and restarts.
  – Restarted jobs cause more work: a positive feedback loop.
• This affects all callers, not just the caller that triggered the unusually high load. (The queue and handler settings involved are sketched below.)
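For context (the slides do not name these settings), the size of the shared call queue and the worker pool are controlled by properties such as the following; the values are illustrative, not recommendations:

  <!-- hdfs-site.xml / core-site.xml: illustrative values only -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value> <!-- NameNode RPC worker threads consuming from the call queue -->
  </property>
  <property>
    <name>ipc.server.handler.queue.size</name>
    <value>100</value> <!-- queued calls allowed per handler thread -->
  </property>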

Page 16: Hdfs 2016-hadoop-summit-san-jose-v4


HADOOP-10597: RPC Server signals backoff to clients when all request queues are full

• If an RPC server’s queue is full, respond to new requests with a backoff signal.

• Clients react by performing exponential backoff before retrying the call.
  – Reduces job failures by avoiding client timeouts.

• Improves QoS for clients when the server is under heavy load.

• RPC calls that would have timed out will instead succeed, but with longer latency. (A hedged configuration sketch follows.)
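Backoff is enabled per RPC server port; the property pattern comes from HADOOP-10597, and the sketch below assumes 8020 is the NameNode client RPC port in your deployment:

  <!-- core-site.xml: 8020 assumed to be the NameNode client RPC port -->
  <property>
    <name>ipc.8020.backoff.enable</name>
    <value>true</value> <!-- respond with a retriable backoff signal instead of letting calls pile up -->
  </property>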

Page 17: Hdfs 2016-hadoop-summit-san-jose-v4


HADOOP-10282: FairCallQueue

• Replaces the single RPC queue with multiple prioritized queues.

• The server maintains a sliding window of RPC request counts, by user.

• New RPC calls are placed into queues with priority based on the calling user’s history.

• Calls are de-queued and processed with higher probability from higher-priority queues.

• De-prioritizes heavy users under high load and prevents starvation of other jobs.

• Complements RPC congestion control. (A hedged configuration sketch follows.)
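FairCallQueue is likewise selected per port; a minimal sketch assuming the NameNode client RPC port is 8020 and using the class name from the Apache Hadoop ipc package:

  <!-- core-site.xml: 8020 assumed to be the NameNode client RPC port -->
  <property>
    <name>ipc.8020.callqueue.impl</name>
    <value>org.apache.hadoop.ipc.FairCallQueue</value> <!-- prioritized queues instead of a single FIFO queue -->
  </property>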

Page 18: Hdfs 2016-hadoop-summit-san-jose-v4


HADOOP-12916: Allow RPC scheduler/CallQueue backoff using response times

• Flexible back-off policies.
  – Triggering backoff only when the queue is full is often too late.
  – Clients may already be experiencing timeouts before the RPC queue overflows.

• Instead, track call response time and trigger backoff when response time exceeds bounds.

• Further reduces the probability of client timeouts and hence reduces job failures. (A hedged configuration sketch follows.)
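A hedged sketch of the response-time backoff settings (property names as documented for the FairCallQueue/DecayRpcScheduler in later Apache Hadoop releases; verify against your version, and 8020 is again an assumed port):

  <!-- core-site.xml: illustrative; requires the decay scheduler on port 8020 -->
  <property>
    <name>ipc.8020.decay-scheduler.backoff.responsetime.enable</name>
    <value>true</value> <!-- back off based on observed response times, not just queue overflow -->
  </property>
  <property>
    <name>ipc.8020.decay-scheduler.backoff.responsetime.thresholds</name>
    <value>10s,20s,30s,40s</value> <!-- per-priority-level response time thresholds -->
  </property>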

Page 19: Hdfs 2016-hadoop-summit-san-jose-v4


HADOOP-13128: Manage Hadoop RPC resource usage via resource coupon (proposed feature)

• Multi-tenancy is a key challenge in large enterprise deployments.

• Allows HDFS and the YARN ResourceManager to coordinate allocation of RPC resources to multiple applications running concurrently in a multi-tenant deployment.

• FairCallQueue can lead to priority inversion.
  – The NameNode is not aware of the relative priorities of YARN jobs.
  – Requests from a high priority application can be demoted to a lower-priority RPC call queue.
  – A resource coupon is presented by incoming RPC requests.

• Allows the ResourceManager to request a slice of NameNode capacity via a coupon.

Page 20: Hdfs 2016-hadoop-summit-san-jose-v4


Logging

• Logging requires a careful balance.

• Too much logging causes:
  – Information overload.
  – Increased system load: rendering strings is expensive and creates garbage.

• Too little logging hides valuable operational information.

Page 21: Hdfs 2016-hadoop-summit-san-jose-v4


Too much logging

• Benign errors can confuse administrators

– INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 32 on 8021, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 192.168.22.1:60216 Call#9371 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby

– ERROR datanode.DataNode (DataXceiver.java:run(278)) – myhost.hortonworks.com:50010:DataXceiver error processing unknown operation src: /127.0.0.1:60681 dst: /127.0.0.1:50010 java.io.EOFException


Page 22: Hdfs 2016-hadoop-summit-san-jose-v4


Logging Pitfalls

• Forgotten guard logic.

  if (LOG.isDebugEnabled()) {
    LOG.debug("Processing block: " + block); // expensive toString() implementation!
  }

• Switching the logging API to SLF4J can eliminate the need for log-level guards in most cases.

  LOG.debug("Processing block: {}", block); // calls toString() only if debug enabled

• Logging in a tight loop.

• Logging while holding a shared resource, such as a mutually exclusive lock.

Page 23: Hdfs 2016-hadoop-summit-san-jose-v4


HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds

• Logging is too verbose.
  – Summary of patch: don’t log too much!
  – Move detailed logging to debug or trace level.

• Before:

  LOG.info("BLOCK* processOverReplicatedBlock: "
      + "Postponing processing of over-replicated " + block
      + " since storage + " + storage + "datanode " + cur
      + " does not yet have up-to-date " + "block information.");

• After:

  LOG.trace("BLOCK* processOverReplicatedBlock: Postponing {}"
      + " since storage {} does not yet have up-to-date information.",
      block, storage);

Page 24: Hdfs 2016-hadoop-summit-san-jose-v4


Troubleshooting

• Metrics are vital for diagnosis of most operational problems.
  – Metrics must be capable of showing that there is a problem (e.g. an RPC call volume spike).
  – Metrics also must be capable of identifying the source of that problem (e.g. the user issuing the RPC calls).

Page 25: Hdfs 2016-hadoop-summit-san-jose-v4


HDFS-6982: nntop

• Find activity trends of HDFS operations.
  – The HDFS audit log contains a record of each file system operation sent to the NameNode.

  2015-11-16 21:00:00,109 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/192.168.1.5 cmd=listStatus src=/app-logs/pcd_batch/application_1431545431771/ dst=null perm=null

  – However, identifying sources of load from the audit log requires ad-hoc scripting.

• nntop: HDFS operation counts aggregated per operation and per user within time windows.
  – TopUserOpCounts: default time windows of 1 minute, 5 minutes, 25 minutes.
  – curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'

Page 26: Hdfs 2016-hadoop-summit-san-jose-v4


nnTop sample Output

"windowLenMs": 60000, "ops": [ { "opType": "create", "topUsers": [ { "user": "[email protected]", "count": 4632 }, { "user": "[email protected]", "count": 1387 } ], "totalCount": 6019 }...

Page 26

Page 27: Hdfs 2016-hadoop-summit-san-jose-v4


Troubleshooting Kerberos

• Kerberos is hard.
  – Many moving parts: KDC, DNS, principals, keytabs, and Hadoop configuration.
  – Management tools like Apache Ambari automate initial provisioning of principals, keytabs, and configuration.
  – When it doesn’t work, finding the root cause is challenging.

Page 28: Hdfs 2016-hadoop-summit-san-jose-v4


HADOOP-12426: kdiag

• Kerberos misconfiguration diagnosis.
  – DNS
  – Hadoop configuration files
  – KDC configuration

• kdiag: a command-line tool for diagnosis of Kerberos problems.
  – Prints various environment variables, Java system properties, and Hadoop configuration options related to security.
  – Attempts a login.
  – If a keytab is used, prints principal information from the keytab.
  – Prints krb5.conf.
  – Validates the kinit executable (used for ticket renewals).

Page 29: Hdfs 2016-hadoop-summit-san-jose-v4


kdiag Sample Output - misconfigured DNS

[hdfs@c6401 ~]$ hadoop org.apache.hadoop.security.KDiag

== Kerberos Diagnostics scan at Mon Jun 27 23:13:40 UTC 2016 ==

16/06/27 23:13:40 ERROR security.KDiag: java.net.UnknownHostException: java.net.UnknownHostException: c6401.ambari.apache.org: c6401.ambari.apache.org: unknown error
    at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
    at org.apache.hadoop.security.KDiag.execute(KDiag.java:266)
    at org.apache.hadoop.security.KDiag.run(KDiag.java:221)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.security.KDiag.exec(KDiag.java:926)
    at org.apache.hadoop.security.KDiag.main(KDiag.java:936)

...

Page 30: Hdfs 2016-hadoop-summit-san-jose-v4


Summary

• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.

• Optimization

– Performance

– Optimizing Applications

• Stabilization

– Liveness

– Managing Load

• Supportability

– Logging

– Troubleshooting


Page 31: Hdfs 2016-hadoop-summit-san-jose-v4


Thank you! Q&A

• A few recommended best practices while we address questions…
  – Enable HDFS audit logs and periodically monitor audit logs/nnTop for unexpected patterns.
  – Configure service heap settings correctly.
    https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html
  – Use dedicated disks for NameNode metadata directories/JournalNode directories.
    http://hortonworks.com/blog/hdfs-metadata-directories-explained/
  – Run the balancer (and soon the disk-balancer) periodically.
    http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
  – Monitor for LDAP group lookup performance issues.
    https://community.hortonworks.com/content/kbentry/38591/hadoop-and-ldap-usage-load-patterns-and-tuning.html
  – Use SmartSense for proactive analysis of potential issues and recommended fixes.
    http://hortonworks.com/products/subscriptions/smartsense/