Anti-Patterns in Hadoop Cluster Deployment

Presented by Rohith Sharma, Naganarasimha & Sunil


Page 1: Anti-Patterns in Hadoop Cluster Deployment

Presented by Rohith Sharma, Naganarasimha & Sunil

Page 2: About Us

Rohith Sharma K S
- Hadoop Committer, works for Huawei
- 5+ years of experience in the Hadoop ecosystem

Naganarasimha G R
- Apache Hadoop Contributor for YARN, Huawei
- 4+ years of experience in the Hadoop ecosystem

Sunil Govind
- Apache Hadoop Contributor for YARN and MapReduce
- 3+ years of experience in the Hadoop ecosystem

Page 3: Agenda

➔ Overview of general cluster deployment

➔ Walkthrough of YARN cluster resource configurations

➔ Anti-patterns
  ◆ MapReduce
  ◆ YARN
    ● RM Restart/HA
    ● Queue Planning

➔ Summary

Page 4: Brief Overview: General Cluster Deployment

A sample Hadoop cluster layout with HA.

[Diagram: an active/standby ResourceManager pair, an active/standby NameNode pair, several worker nodes each running a NodeManager and a DataNode, a client, an Application Timeline Server, and a three-node ZooKeeper cluster.]

Legend: RM - Resource Manager, NM - Node Manager, NN - Name Node, DN - Data Node, ATS - Application Timeline Server, ZK - ZooKeeper

Page 5: YARN Configuration: An Example

Legacy NodeManagers and DataNodes used to run with low resource configurations. Nowadays most systems are high-end, and customers want powerful machines with fewer nodes (50~100) to achieve better performance.

A sample NodeManager configuration could look like:

- 64 GB of memory
- 8/16 CPU cores
- 1 Gb network cards
- 100 TB of disk (or disk arrays)

We now focus on this class of deployment, and will cover its anti-patterns and best usages in the coming slides.

Page 6: YARN Configuration: Related to Resources

NodeManager:

● yarn.nodemanager.resource.memory-mb
● yarn.nodemanager.resource.cpu-vcores
● yarn.nodemanager.vmem-pmem-ratio
● yarn.nodemanager.log-dirs
● yarn.nodemanager.local-dirs

Scheduler:

● yarn.scheduler.minimum-allocation-mb
● yarn.scheduler.maximum-allocation-mb

MapReduce:

● mapreduce.map/reduce.java.opts
● mapreduce.map/reduce.memory.mb
● mapreduce.map/reduce.cpu.vcores

YARN and MapReduce provide these resource-tuning configurations to enable better resource allocation.

● With vmem-pmem-ratio (2:1, for example), the NodeManager kills a container if its virtual memory grows to twice its configured physical memory.
● It is advisable to configure local-dirs and log-dirs on different mount points.
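As a rough illustration (our values, not official recommendations), the 64 GB / 16-core NodeManager from the previous slide might be configured along these lines in yarn-site.xml, leaving headroom for the OS and the DataNode/NodeManager daemons (mount points are hypothetical):

    <!-- yarn-site.xml: illustrative values for a 64 GB, 16-core node -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>57344</value> <!-- ~56 GB for containers; the rest for daemons/OS -->
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>14</value> <!-- reserve ~2 cores for DN/NM/OS -->
    </property>
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>2.1</value> <!-- container is killed if vmem exceeds 2.1x pmem -->
    </property>
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/data1/nm-local,/data2/nm-local</value> <!-- hypothetical mount points -->
    </property>
    <property>
      <name>yarn.nodemanager.log-dirs</name>
      <value>/data3/nm-logs</value> <!-- on a different mount point than local-dirs -->
    </property>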

Page 7: Anti-Pattern in MRAppMaster

Page 8: Container Memory vs. Container Heap Memory

Customer: "Enough container memory is configured, but the job still runs slowly, and sometimes when there is relatively more data, tasks fail with OOM."

Resolution:

1. Container memory and container heap size are two different configurations.

2. If mapreduce.map/reduce.memory.mb is configured, make sure to also configure the heap size via mapreduce.map/reduce.java.opts.

3. Since this was a common user mistake, trunk now handles this scenario: the heap size is derived as 0.8 of the container's configured/requested memory.

   a. If mapreduce.map/reduce.memory.mb values are specified but no -Xmx is supplied in mapreduce.map/reduce.java.opts, the -Xmx value is derived from the former.

   b. For both of these conversions, a scaling factor specified by the property mapreduce.job.heap.memory-mb.ratio (default 0.8) is used, to account for the overhead between heap usage and actual physical memory usage.
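As an illustration (the numbers are ours), a 4 GB map container paired with a heap of roughly 0.8 of the container size could be configured in mapred-site.xml as below; on releases with the automatic derivation described above, omitting -Xmx and relying on mapreduce.job.heap.memory-mb.ratio has the same effect:

    <!-- mapred-site.xml: illustrative pairing of container and heap sizes -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>4096</value> <!-- container size requested from YARN -->
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx3276m</value> <!-- ~0.8 x 4096 MB; leaves room for non-heap memory -->
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>8192</value>
    </property>
    <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx6553m</value> <!-- ~0.8 x 8192 MB -->
    </property>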

Page 9: Shuffle Phase Is Taking a Long Time

Customer: "A job over 500 GB of data finished in 4 hours, while a job over 1000 GB of data on the same cluster has been in the reducer phase for 12 hours. I think the job is stuck."

On enquiring further about resource configuration: the same resource configurations were used for both jobs.

Resolution:

1. The job is NOT hung/stuck; the time is being spent copying map output.

2. Increase the task resources.

3. Tune the following configurations:

mapreduce.reduce.shuffle.parallelcopies
mapreduce.reduce.shuffle.input.buffer.percent
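A sketch of such tuning in mapred-site.xml (the values are starting points to experiment with, not recommendations; the right numbers depend on reducer heap size and network bandwidth):

    <!-- mapred-site.xml: illustrative shuffle tuning -->
    <property>
      <name>mapreduce.reduce.shuffle.parallelcopies</name>
      <value>20</value> <!-- default is 5; more parallel fetches of map output -->
    </property>
    <property>
      <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
      <value>0.80</value> <!-- default is 0.70; fraction of reducer heap buffering map output -->
    </property>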

Page 10: Anti-Pattern in YARN

Page 11: RM Restart: RMStateStore Limit

Customer: "yarn.resourcemanager.max-completed-applications is configured to 100000. Completed applications in the cluster have reached this limit, and many applications are running. The observation is that it takes 10-15 seconds for the RM service to come up."

Resolution:

1. It is NOT advisable to configure max-completed-applications to 100000.

2. It is suggested to use the TimelineServer for the history of YARN applications instead.

3. The higher the value, the bigger the impact on RM recovery, since these applications are reloaded from the RMStateStore on restart.
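A more restrained setup could look like this (illustrative values):

    <!-- yarn-site.xml: keep few completed apps in the RM, use ATS for history -->
    <property>
      <name>yarn.resourcemanager.max-completed-applications</name>
      <value>1000</value> <!-- illustrative; far below 100000 -->
    </property>
    <property>
      <name>yarn.timeline-service.enabled</name>
      <value>true</value> <!-- serve application history from the TimelineServer -->
    </property>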

Page 12: Queue Planning

Page 13: Queue Planning: Queue Mapping

Page 14: Queue Planning: Queue Capacity Planning and Preemption

Page 15: Queue Planning: Queue Capacity Planning for Multiple Users

Customer: "I have multiple users submitting apps to a queue; it seems like all the resources have been taken by a single user's app(s), even though other apps are activated."

Queue Capacity Planning:

Capacity Scheduler provides options to control the resources used by different users within a queue. yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent and yarn.scheduler.capacity.<queue-path>.user-limit-factor are the configurations that determine how much of the queue's resources each user gets.

yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent defaults to 100, which implies no user limit is imposed. It defines the minimum share of the queue's resources that each active user is guaranteed.

yarn.scheduler.capacity.<queue-path>.user-limit-factor defaults to 1, which means a single user can never take more than the queue's configured capacity. It should be configured to control how much a particular user can take even when other users are not using the queue. An example follows.
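For instance (a sketch; the queue name root.dev is hypothetical), to guarantee each active user at least 25% of the queue while letting a lone user expand up to twice the queue's configured capacity:

    <!-- capacity-scheduler.xml: per-user limits for queue root.dev -->
    <property>
      <name>yarn.scheduler.capacity.root.dev.minimum-user-limit-percent</name>
      <value>25</value> <!-- with 4 or more active users, each gets at least 25% -->
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
      <value>2</value> <!-- a single user may grow to 2x the queue's capacity -->
    </property>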

Page 16: Queue Planning: AM Resource Limit

Customer: "Hey buddy, most of my jobs are in the ACCEPTED state and never start to run. What could be the problem?"

"All my jobs were running fine. But after an RM switchover, a few jobs did not resume their work. Why is the RM not able to allocate new containers to these jobs?"

Resolution:

1. Users need to ensure that the AM Resource Limit is properly configured for their deployment's needs. The maximum resource limit for running AM containers needs to be analyzed and configured correctly to ensure effective progress of applications.

   a. Refer to yarn.scheduler.capacity.maximum-am-resource-percent.

2. If a few NMs do not register back after an RM switchover, the cluster size can shrink compared to what it was prior to the failover. This affects the AM Resource Limit, and hence fewer AMs will be activated after the restart.

3. For analytical workloads: a higher AM limit. For batch queries: a lower AM limit. See the sketch below.
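The limit can be set cluster-wide and overridden per queue; a sketch (the queue name root.analytics is hypothetical, values are illustrative):

    <!-- capacity-scheduler.xml: AM resource limits -->
    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.2</value> <!-- cluster-wide: up to 20% of resources for AM containers -->
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.maximum-am-resource-percent</name>
      <value>0.5</value> <!-- per-queue override for many small concurrent apps -->
    </property>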

Page 17: Queue Planning: Application Priority within Queue

Customer: "I have many applications running in my cluster, and a few are very important jobs that have to execute fast. I currently use separate queues to run the most important applications. The configuration seems very complex, and I feel cluster resources are not utilized well because of this."

Resolution:

    root
    ├── sales (50%)
    │   ├── high (20%)
    │   ├── med (40%)
    │   └── low (40%)
    └── inventory (50%)
        ├── high (20%)
        ├── med (40%)
        └── low (40%)

The configuration for such a layout is very complex, and cluster resources may not be utilized well.

We suggest using Application Priority instead.
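To see why, here is a sketch of just part of the capacity-scheduler.xml needed for the layout above (queue names taken from the diagram); the same block repeats for every queue and priority level:

    <!-- capacity-scheduler.xml: partial sketch of the per-priority queue layout -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>sales,inventory</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.sales.capacity</name>
      <value>50</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.sales.queues</name>
      <value>high,med,low</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.sales.high.capacity</name>
      <value>20</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.sales.med.capacity</name>
      <value>40</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.sales.low.capacity</name>
      <value>40</value>
    </property>
    <!-- ...and the same three capacities again under root.inventory -->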

Page 18: Queue Planning: Application Priority within Queue (contd.)

Resolution:

Application Priority is available in YARN from the 2.8 release onwards. A brief heads-up about this feature:

1. Configure "yarn.cluster.max-application-priority" in yarn-site.xml. This sets the maximum priority that can be configured for any user/application.

2. Within a queue, applications are currently selected using an OrderingPolicy (FIFO/Fair). If applications are submitted with a priority, Capacity Scheduler also considers the application's priority in FifoOrderingPolicy. Hence the application with the highest priority is always picked first for resource allocation.

3. For MapReduce, use "mapreduce.job.priority" to set the priority.
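A minimal sketch on a 2.8+ cluster (job name, paths, and the application id are hypothetical):

    <!-- yarn-site.xml: allow priorities 0..10 -->
    <property>
      <name>yarn.cluster.max-application-priority</name>
      <value>10</value>
    </property>

Usage could then look like:

    # Submit an MR job with an elevated priority
    hadoop jar my-job.jar MyJob -Dmapreduce.job.priority=8 /input /output

    # Or raise the priority of an already running application (2.8+ CLI)
    yarn application -appId application_1465012345678_0042 -updatePriority 8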


Page 19: Resource Request Limits

Customer: "I am not very sure about the capacity of the NodeManagers and the maximum-allocation resource configuration. But my application is not getting any containers, or it is getting killed."

Resolution/Suggestion:

Suppose the NMs have no more than 6 GB of memory each. If a container request demands more memory/CPU than any single NodeManager has, yet less than the default maximum-allocation-mb, the request will never be served by the RM. Unfortunately, this is not raised as an error to the user, and the application will wait for the allocation indefinitely. On the other side, the scheduler will also keep waiting for some node that can meet this heavy resource request.

Use yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores effectively by aligning them with the NodeManagers' memory/CPU limits.
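For example, if the smallest NodeManager in the cluster offers 6 GB and 6 vcores, cap single-container requests accordingly (illustrative values):

    <!-- yarn-site.xml: align per-container maximums with the smallest NM -->
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>6144</value> <!-- the default of 8192 could never be satisfied here -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>6</value>
    </property>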

Page 20: Reservation Issue

Customer: "My application has a reserved container on a node but is never able to get new containers."

Resolution:

The reservation feature in Capacity Scheduler serves a great deal in ensuring better linear resource allocation. However, there are a few corner cases. For example, an application has made a reservation on a node, but that node is running various long-lived containers, so the chance of getting free resources from it in the near term is minimal.

The configuration below can help avoid being pinned to a single node's reservation, keeping cluster usage effective.

● yarn.scheduler.capacity.reservations-continue-look-all-nodes lets the scheduler keep looking for suitable resources on other nodes as well.
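In capacity-scheduler.xml this is a boolean flag (shown explicitly here; depending on your version it may already default to true):

    <!-- capacity-scheduler.xml -->
    <property>
      <name>yarn.scheduler.capacity.reservations-continue-look-all-nodes</name>
      <value>true</value> <!-- keep scanning other nodes instead of waiting only on the reserved one -->
    </property>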

Page 21: Suggestions in Resource Configuration

Page 22: Thank You