BigDataTech 2015 Is Hadoop Enterprise ready?

Is Hadoop Enterprise ready? Building Hadoop cluster Krzysztof Adamski


Is Hadoop Enterprise ready?

Building Hadoop cluster

Krzysztof Adamski

Agenda

• About ISP
• Team
• Architecture
• Automated Hadoop deployment
• Monitoring
• Security
• Q&A

About ING Services Polska

ISP Service Catalogue

ISP promotes ambitious goals

(actuals 2014)

WPC: 85%

FTEs: 489.3

Headcount: 490

Process maturity: 4 (assessed by Ernst & Young)

(actuals 2014)

Average systems availability

100%

of KPIs on target

General IT controls

86%

Security monitoring, Remote management, System hosting, Security services

ISP has been growing as a solid business partner

Countries: 18

SLAs: 191

Business partners: 35+

Customer satisfaction: 8.1 (1-10 scale, Q4 2014)

Services

The team

The „A” team

• Don’t hire, train them!
• Break out of the silo mentality
• DevOps
• Agile
• Let them choose their own tools
• Automation

http://www.pragmatictestlabs.com/

Architecture

Hadoop deployment options

Cloud vs on-premise

• Legal and Regulatory Issues (e.g. data locality, limited responsibility)
• Network speed (we are talking BIG data)
• Time to market
• Initial costs

http://www.softwarefit.com/cloud-erp-vs-on-premise-erp/

Basic network principles

• Machines should be on a network isolated from the rest of the data center
• Machines should have static IPs
• Reverse DNS should be set up
• Top-of-rack switches – Hadoop servers are quite chatty
• Multi-homed networks are tricky
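The static IP and reverse DNS points above can be verified per node before installation. The following is a minimal sketch, not the deck's own tooling: it checks that a name resolves to an IP and that the IP resolves back to the same name. The function name and the injectable resolver parameters are assumptions made for illustration and testing.

```python
import socket
from typing import Callable, Tuple


def reverse_dns_matches(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], Tuple] = socket.gethostbyaddr,
) -> bool:
    """Return True if fqdn -> IP -> name round-trips back to the same FQDN."""
    try:
        ip = forward(fqdn)
        name, _aliases, _addresses = reverse(ip)
    except OSError:
        # socket.gaierror and socket.herror are both OSError subclasses.
        return False
    # DNS names are case-insensitive; ignore a trailing root dot.
    return name.rstrip(".").lower() == fqdn.rstrip(".").lower()
```

Calling `reverse_dns_matches("node1.hadoop.example.com")` (a hypothetical host name) on each cluster node uses the system resolver; the two optional parameters exist only so the check can be exercised without a network.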

VLAN configuration example

VLAN          Fabric  NIC Port  Function                       Failover
vlan160_mgmt  A       eth0      Management, User connectivity  Fabric failover to B
vlan12_HDFS   B       eth1      Hadoop                         Fabric failover to A
vlan11_DATA   A       eth2      SAN/NAS access, ETL            Fabric failover to B

Cisco reference architecture

ToR vs Cisco ref. architecture

Linux general recommendations

• Use FQDNs – required by Ambari, Kerberos
• Disable iptables – since we are within an isolated network
• Disable SELinux – enabling it can be very challenging
• Set swappiness to 1
• Set ulimits to 64k
• Disable Transparent Huge Pages
• Disable atime
• Enable NTP
• JBOD for Hadoop drives
• RAID1 for system drives (if dedicated)
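Some of the recommendations above (swappiness, Transparent Huge Pages, atime) can be audited by reading kernel files. This is a rough sketch under stated assumptions: the helper names are invented here, the paths are the usual ones on a current Linux kernel (RHEL 6 exposes THP under /sys/kernel/mm/redhat_transparent_hugepage/enabled instead), and /grid/0 is a hypothetical data mount point.

```python
from pathlib import Path

# Usual locations; adjust for the distribution in use (see the note above).
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
MOUNTS_PATH = Path("/proc/mounts")


def swappiness_ok(text: str, wanted: int = 1) -> bool:
    """True if the contents of the swappiness file equal the wanted value."""
    return int(text.strip()) == wanted


def thp_disabled(text: str) -> bool:
    """The kernel brackets the active THP mode, e.g. 'always madvise [never]'."""
    return "[never]" in text.split()


def mounted_noatime(mounts_text: str, mount_point: str) -> bool:
    """Look for the noatime option on a mount point in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            return "noatime" in fields[3].split(",")
    return False
```

On a node this could be used as `swappiness_ok(SWAPPINESS_PATH.read_text())`, `thp_disabled(THP_PATH.read_text())` and `mounted_noatime(MOUNTS_PATH.read_text(), "/grid/0")`. The functions take text rather than paths so they can be checked without touching the host.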

http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/

What else do we need?

• Code repository, e.g. Stash, GitLab
• Open Source package repository for Python (pip), Perl (cpan), R (cran), Maven Repository Manager …
• Integration tools, e.g. Jenkins
• Stepping stone (edge) server
• Other RDBMS to store aggregates, e.g. MySQL, PostgreSQL
• Data scientists server – RStudio, IPython etc.

Did you know?

Hadoop DR strategy

• No inherent cross data center replication
• DistCp can be used for large inter/intra-cluster copying
• Data can be ingested into two separate Hadoop clusters
• WANdisco Non-Stop Hadoop

https://www.wandisco.com/system/files/documentation/WD-Datasheet-NonStop-Hadoop-HortonWorks-WEB.pdf
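For the DistCp option in the list above, a scheduled copy between a primary and a DR cluster might be assembled as follows. This is a sketch only: the NameNode URIs are hypothetical, and the helper merely builds the `hadoop distcp` command line (with the standard `-update` flag, which copies only files missing or changed on the target) without running it.

```python
import shlex
from typing import List


def distcp_command(source_uri: str, target_uri: str, update: bool = True) -> List[str]:
    """Build the argv for an inter-cluster DistCp run."""
    argv = ["hadoop", "distcp"]
    if update:
        # Skip files that already exist unchanged on the target.
        argv.append("-update")
    argv += [source_uri, target_uri]
    return argv


def as_shell(argv: List[str]) -> str:
    """Render an argv list as a safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Hypothetical primary and DR NameNode addresses.
command = distcp_command("hdfs://nn-primary:8020/data", "hdfs://nn-dr:8020/data")
print(as_shell(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from a cron job or a Jenkins task on an edge node that can reach both clusters.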

Automated deployment

RHEL

• Kickstart installation
• Bladelogic jobs to provision software components, e.g. monitoring agents, security monitoring components
• Bladelogic jobs to harden RHEL security according to best practices
• Red Hat Satellite as package distribution and versioning center

• Let the Hadoop team manage servers themselves – create organization
• Create server profile template
• Create profiles from a template

UCS Manager - organisation

UCS Manager – fabric interconnect

Ambari


Ambari HA wizard

Ambari blueprints

Ambari blueprint example

{
  "configurations" : [
    {
      "configuration-type" : {
        "property-name" : "property-value",
        "property-name2" : "property-value"
      }
    },
    {
      "configuration-type2" : {
        "property-name" : "property-value"
      }
    }
    ...
  ],
  "host_groups" : [
    {
      "name" : "host-group-name",
      "components" : [...

https://cwiki.apache.org/confluence/display/AMBARI/Blueprints
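A blueprint following the structure above can be generated rather than hand-written. The sketch below is illustrative: `build_blueprint` is a helper invented here, the blueprint name and the two host groups are hypothetical, and the component list is deliberately incomplete (a real HDP topology needs more components than shown). The "Blueprints" section with blueprint_name, stack_name and stack_version follows the layout described on the Ambari wiki page linked above.

```python
import json
from typing import Dict, List, Optional


def build_blueprint(
    blueprint_name: str,
    stack_name: str,
    stack_version: str,
    host_groups: Dict[str, List[str]],
    configurations: Optional[List[dict]] = None,
) -> dict:
    """Assemble a blueprint document from host-group -> component-name lists."""
    return {
        "configurations": configurations or [],
        "host_groups": [
            {"name": group, "components": [{"name": c} for c in components]}
            for group, components in host_groups.items()
        ],
        "Blueprints": {
            "blueprint_name": blueprint_name,
            "stack_name": stack_name,
            "stack_version": stack_version,
        },
    }


# Hypothetical two-group layout; names and values are illustrative only.
blueprint = build_blueprint(
    "small-cluster",
    "HDP",
    "2.2",
    {
        "master": ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"],
        "worker": ["DATANODE", "NODEMANAGER"],
    },
    configurations=[{"hdfs-site": {"dfs.replication": "3"}}],
)
print(json.dumps(blueprint, indent=2))
```

Keeping the host-group layout as plain data makes it easy to store in the code repository and to diff between environments before registering the blueprint with Ambari.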

Ambari REST API

curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Start HDFS via REST"}, "Body": {"ServiceInfo": {"state": "STARTED"}}}' http://AMBARI_SERVER_HOST:8080/api/v1/clusters/CLUSTER_NAME/services/HDFS

curl -u admin:$PASSWORD -H 'X-Requested-By: ambari' -X GET "http://AMBARI_SERVER_HOST:8080/api/v1/clusters/ing_hdp/components/?ServiceComponentInfo/category.in(SLAVE,MASTER)&host_components/HostRoles/host_name=CLUSTERNODE&fields=host_components/HostRoles/component_name,host_components/HostRoles/state"

https://cwiki.apache.org/confluence/display/AMBARI/API+usage+scenarios%2C+troubleshooting%2C+and+other+FAQs
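The same service start call as the first curl command above can be scripted, which helps when it has to be repeated across services. A minimal sketch using only the standard library follows; the function name, host name and credentials are placeholders, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request


def service_state_request(
    base_url: str,
    cluster: str,
    service: str,
    state: str,
    user: str,
    password: str,
    context: str,
) -> urllib.request.Request:
    """Build the PUT request that asks Ambari to move a service to a state."""
    body = {
        "RequestInfo": {"context": context},
        "Body": {"ServiceInfo": {"state": state}},
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/clusters/{cluster}/services/{service}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            # Ambari rejects modifying calls that lack this header.
            "X-Requested-By": "ambari",
        },
    )


# Placeholder server, cluster and credentials.
request = service_state_request(
    "http://ambari.example.com:8080", "CLUSTER_NAME", "HDFS",
    "STARTED", "admin", "secret", "Start HDFS via REST",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would submit it. Stopping a service uses the same call with the state "INSTALLED".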

Leverage docker

http://blog.sequenceiq.com/blog/2014/07/25/cloudbreak-technology/

Did you know?

• Upgrading the Hadoop stack can still be a painful process (80 manual pages)
http://docs.hortonworks.com/HDPDocuments/Ambari-1.7.0.0/Ambari_Upgrade_v170/Ambari_Upgrade_v170.pdf

• Automated rolling upgrade process: TBD
https://issues.apache.org/jira/browse/AMBARI-7804

Monitoring

Hadoop Availability Monitoring (service health)

Hadoop metrics monitoring

http://hakunamapdata.com/ganglia-configuration-for-a-small-hadoop-cluster-and-some-troubleshooting/

Ambari roadmap – AMBARI-5707

https://issues.apache.org/jira/browse/AMBARI-5707

Did you know?

Check your region and language settings ;)

Security

Hadoop security

• Hadoop is not a single product, choose your components wisely
• Up until recently there was no single point for user management
• Maintaining ACLs in HDFS is a painful process
• No out-of-the-box Active Directory integration

http://blogs.gartner.com/merv-adrian/2014/01/21/security-for-hadoop-dont-look-now/

Hadoop – ring of defense

Apache Knox Gateway

Typical Flow – Add Wire and File Encryption

http://www.slideshare.net/hortonworks/hdp-security-overview

Is there anything we can do? Start simple!

1. Do not store sensitive data within Hadoop
2. Separate Hadoop environment in a separate network zone (dedicated vlan/s, firewall filtered traffic)
3. Kerberize cluster environment
   a) Watch for unkerberized components
   b) Keep your keytabs safe
4. LDAP for central user management
5. Manage your ACLs – start simple with POSIX groups
6. Auditing
7. Automated HDP cluster kerberization (TBD) https://issues.apache.org/jira/secure/attachment/12671235/12671235_AmbariClusterKerberization.pdf
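"Keep your keytabs safe" in the list above can be partly automated by auditing file permissions. The sketch below assumes one reasonable policy, that a keytab should be a regular file with no group or other access at all; the function names and the example path are hypothetical, and a real deployment may legitimately grant group read access to a service group.

```python
import os
import stat


def mode_is_private(mode: int) -> bool:
    """True for a regular file whose mode grants nothing to group or others."""
    no_group_or_other = (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return stat.S_ISREG(mode) and no_group_or_other


def keytab_is_private(path: str) -> bool:
    """Check the on-disk permissions of a keytab file."""
    return mode_is_private(os.stat(path).st_mode)
```

For example, `keytab_is_private("/etc/security/keytabs/nn.service.keytab")` (an illustrative path) returns False for a file with mode 0644 and True for mode 0400 or 0600.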

IPA

At the most basic level, Red Hat Identity Management is a domain controller for Linux and Unix machines. 

IPA – server – client communication


Did you know?

IPA 3 for RHEL 6 has issues when installed with the external CA option

Central user and policy management

Ranger

Where to continue from here?

• Hadoop distribution best practices
• Reference architecture papers
• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.0/Cluster_Plan_Gd_v22/Cluster_Plan_Gd_v22.pdf
• http://hortonworks.com/get-started/
• http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
• http://www.slideshare.net/vinnies12/hadoop-security-today-tomorrow-apache-knox
• http://www.slideshare.net/Hadoop_Summit/radia-srinivas-june261120amroom210c
• http://www.slideshare.net/KevinMinder/knox-hadoopsummit20140505v6pub
• http://blog.sequenceiq.com/blog/2014/12/04/multinode-ambari-1-7-0/

Interesting books and docs