Hadoop workshop

Cloud Connect Shanghai, Sep 15, 2013

Ari Flink – Operations Architect

Mac Fang – Manager, Hadoop development

Dean Zhu – Hadoop Developer

Agenda

1. Introductions (5 minutes)

2. Hadoop and Big Data Concepts (20 minutes)

3. Cisco Webex Hadoop architecture (10 minutes)

4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)

5. Exercise 1 (30 minutes): Configure a Hadoop single node VM on a laptop

6. Hive and Impala concepts (15 minutes)

7. Exercise 2 (30 minutes): Analytics using Apache Hive and Cloudera Impala

8. Q & A

Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, opportunities and use case examples
– Hadoop architecture concepts

What is Big Data?

For our purposes, big data refers to distributed computing architectures specifically aimed at the “3 V’s” of data: Volume, Velocity, and Variety.

Traditional Enterprise Data Management

Operational (OLTP) → ETL → EDW → BI/Reports

– OLTP: Online Transactional Processing
– ETL: Extract, Transform, and Load (batch processing)
– EDW: Enterprise Data Warehouse
– BI: Business Intelligence

Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Real-time, but limited reporting/analytics

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

So what has changed? The Explosion of Unstructured Data

• 1.8 trillion gigabytes of data was created in 2011
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
• Most unstructured data is neither stored nor analyzed!

[Chart: structured vs. unstructured data, 2005 to 2015]

Source: Cloudera

Enterprise Data Management with Big Data

[Diagram labels: Machine, Operational (OLTP), ETL, Big Data (Hadoop, etc.), MPP EDW, BI/Reports, Dashboards, In-memory analytics]

Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Fast data, real-time

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

Big Data

Lower value, semi-structured, multi-source, raw/“dirty”

• Which products do customers click on the most and/or spend the most time browsing without buying?

• How do we optimally set pricing for each product in each store for individual customers every day?

• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?

Example: Web and Location Analytics

An iPhone searches Amazon for Vizio TVs in Electronics

1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"
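To make the idea of web analytics on raw logs concrete, here is a minimal Python sketch that pulls fields out of the sample proxy log line above. The field layout (timestamp, client IP, result/status, bytes, method, URL, quoted user agent) is an assumption read off this single line, not a formal log specification, and the device guess is deliberately crude. The truncated URL is copied as it appears on the slide.

```python
import re
from datetime import datetime, timezone

# Sample line copied from the slide; the URL is truncated with "…" in the source.
LINE = (
    '1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET '
    'http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) '
    'AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"'
)

# Assumed field layout, inferred from the one sample line.
PATTERN = re.compile(
    r'(?P<ts>\d+\.\d+) (?P<client>\S+) (?P<result>[A-Z_]+)/(?P<status>\d{3}) '
    r'(?P<bytes>\d+) (?P<method>[A-Z]+) (?P<url>\S+) "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.fromtimestamp(float(record.pop("ts")), tz=timezone.utc)
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    # Crude device guess: first token inside the user agent's parentheses.
    device = re.search(r'\(([^;)]+)', record["agent"])
    record["device"] = device.group(1) if device else "unknown"
    return record


record = parse_line(LINE)
print(record["client"], record["status"], record["bytes"], record["device"])
```

At cluster scale the same parsing step would run inside a map task over many log files rather than over a single string.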

Big Data and Key Infrastructure Attributes

What big data isn’t:

• Usually not blade servers (not enough local storage)
• Usually not virtualized (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS

Move the compute to the storage

Low-cost, DAS-based, scale-out clustered filesystem

Cost, Performance, and Capacity

Enterprise Database

Massive Scale-Out Column Store

Hadoop, NoSQL

[Chart axes: data capacity vs. cost]

Structured Data: Relational Database

Unstructured Data: Machine Logs, Web Click Stream, Call Data Records, Satellite Feeds, GPS Data, Sensor Readings, Sales Data, Blogs, Emails, Video

$20K/TB

$10K/TB

$300-$1K/TB

HW:SW $ split 70:30

HW:SW $ split 30:70

Big Data Software Architectures

Three basic big data software architectures

MPP Relational Database (scale-out BI/DW)
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata

Batch-oriented Hadoop (heavy lifting, processing)
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*

Real-time NoSQL (fast key-value store/retrieve)
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo

*Cisco Partners

What Is Hadoop?

Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.

Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.

Hadoop Components and Operations: Hadoop Distributed File System (HDFS)

[Diagram: a file divided into Block 1 through Block 6]

• Scalable and fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks – 64MB default, typically 128MB – 512MB
• Data is stored reliably. Each block is replicated 3 times by default
• Types of node functions
– Name Node: manages HDFS
– Job Tracker: manages MapReduce jobs
– Data Node/Task Tracker: stores blocks/does work
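The block and replication arithmetic described for HDFS can be sketched in a few lines of Python. This is a toy model under stated assumptions: the 300 MB file and the node names are hypothetical, and the round-robin placement is a simplification chosen for readability, not the placement policy HDFS itself uses.

```python
BLOCK_SIZE_MB = 64   # default block size named on the slide
REPLICATION = 3      # default replication factor named on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file is divided into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[f"blk_{b + 1}"] = [
            data_nodes[(b + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


nodes = [f"datanode{i}" for i in range(1, 6)]   # hypothetical node names
blocks = split_into_blocks(300)                 # hypothetical 300 MB file
layout = place_replicas(len(blocks), nodes)

for block, holders in layout.items():
    print(block, holders)
```

With these inputs the file becomes four full 64 MB blocks plus one 44 MB block, and losing any single node still leaves two copies of every block.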

[Diagram: Data nodes 1-13, the Name Node, and the Job Tracker, grouped under three ToR FEX/switches]

HDFS Architecture

[Diagram: Data nodes 1-15 in three groups of five, each group under a ToR FEX/switch, connected through a switch to the Name Node]

Name Node metadata:

/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4

Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3
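The two Name Node mappings shown above (file to blocks, and data node to blocks) are enough to answer "where do I read this file from?". A minimal Python sketch of that lookup follows, using only the entries listed on the slide. The function name `locate` is made up for illustration and is not an HDFS API.

```python
# File-to-block map, as listed on the slide.
file_blocks = {
    "/usr/sean/foo.txt": ["blk_1", "blk_2"],
    "/usr/jacob/bar.txt": ["blk_3", "blk_4"],
}

# Blocks reported by each data node, as listed on the slide.
# No location for blk_4 is shown there, so none appears here.
node_blocks = {
    "Data node 1": ["blk_1"],
    "Data node 2": ["blk_2", "blk_3"],
    "Data node 3": ["blk_3"],
}


def locate(path):
    """Return [(block, [nodes holding it]), ...] for the given file path."""
    result = []
    for block in file_blocks[path]:
        holders = [node for node, blocks in node_blocks.items() if block in blocks]
        result.append((block, holders))
    return result


for block, holders in locate("/usr/jacob/bar.txt"):
    print(block, holders)
```

A client would ask the Name Node for this list once and then read each block directly from one of the listed data nodes.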

Rack Awareness

Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks” (i.e. failure domains)

Logical “racks” may or may not correspond to physical data center racks

Distributes blocks across different “racks” to avoid failure domain of a single “rack”

It can also lessen block movement between “racks”

[Diagram: “Rack” 1 holds Data nodes 1-5, “Rack” 2 holds Data nodes 6-10, “Rack” 3 holds Data nodes 11-15]

MapReduce Example: Word Count

Stages: Input, Map, Shuffle & Sort, Reduce, Output

Input
– the quick brown fox
– the fox ate the mouse
– how now brown cow

Map
– the, 1; quick, 1; brown, 1; fox, 1
– the, 1; fox, 1; ate, 1; the, 1; mouse, 1
– how, 1; now, 1; brown, 1; cow, 1

Reduce output
– Reducer 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
– Reducer 2: ate, 1; cow, 1; mouse, 1; quick, 1
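The word count flow can be imitated in plain Python to show what each stage contributes. This is a single-process sketch, not Hadoop code: the function names are illustrative, and the three input lines are taken from the slide, reading the first one as "the quick brown fox".

```python
from collections import defaultdict

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle & sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, ones):
    """Reduce: sum the values collected for one key."""
    return word, sum(ones)


mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(word, ones) for word, ones in grouped.items())
print(counts)
```

In a real job the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves each key's values across the network to the reducer responsible for that key.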

MapReduce Architecture

[Diagram: Task Trackers 1-15 in three groups of five, each group under a ToR FEX/switch, connected through a switch to the Job Tracker]

Job Tracker assignments:

Job1: TT1: Mapper1, Mapper2
Job1: TT4: Mapper3, Reducer1
Job2: TT6: Reducer2
Job2: TT7: Mapper1, Mapper3

Cisco Webex Cloud and Hadoop Architecture

Cisco WebEx Collaboration Cloud

Datacenter / PoP

Leased network link

Global Scale: 13 datacenters & iPoPs around the globe

Dedicated network: dual path 10G circuits between DCs

Multi-tenant: 95k sites

Real-time collaboration: voice, desktop sharing, video, chat

Things happen...

• People make mistakes
• Hardware fails
• Software fails
• Even failovers sometimes fail

Cisco WebEx log collection overview

[Diagram labels: Syslog, Thrift, AMQP, RDBMS, HTTP/REST; unstructured/semi-structured data, structured data; HDFSSink, SolrSink, other sinks; SolrCloud; raw data, Solr index]

Cisco UCS C240 M3 servers: 12 x 3TB = 36 TB / server

Cisco UCS and Big Data

Building a big data cluster with the UCS Common Platform Architecture (CPA)

CPA Networking

CPA Sizing and Scaling

The evolution of big data deployments

Experimental use of Big Data

Deployed into IT Ops mandated infrastructures

“Skunk works”

Small to medium clusters

App team mandated infrastructure

Purpose built for Big Data

Big Data has established business value

Performance matters

Large or small clusters

IT Infrastructure

Big Data

VMware

WEB

SAP

Generic IT servers

General Purpose IT Data Center

X86 servers

Big Data

Dedicated “Pod” for Big Data

Hadoop Hardware Evolving in the Enterprise

Typical 2009 Hadoop node

• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $

Economics favor “fat” nodes

• 6x-9x more data/node

• 3x-6x more IOPS/node

• Saturated gigabit, 10GE on the rise

• Fewer total nodes lowers licensing/support costs

• Increased significance of node and switch failure

Typical 2013 Hadoop node

• 2RU server
• 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 8-core CPU
• 1-2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$

Cisco UCS Common Platform Architecture (CPA): Building Blocks for Big Data

UCS 6200 Series Fabric Interconnects

Nexus 2232 Fabric Extenders

UCS Manager

UCS C240 M3 Servers

LAN, SAN, Management

CPA Network Design for Big Data

CPA: Topology

Single wire for data and management

• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port-channel (static pinning)
• 2 x 10GE links per server for all traffic, data and management

CPA Recommended FEX Connectivity: 2 FEXs and 2 FIs

• 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks

Can Hadoop really push 10GE?

Analytic workloads tend to be lighter on the network

Transform workloads tend to be heavier on the network

Hadoop has numerous parameters which affect network

Take advantage of 10GE CPA:
– mapred.reduce.slowstart.completed.maps
– dfs.balance.bandwidthPerSec
– mapred.reduce.parallel.copies
– mapred.reduce.tasks
– mapred.tasktracker.reduce.tasks.maximum
– mapred.compress.map.output

It can, depending on workload, so tune for it!
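As a rough illustration of where these parameters live, the Python sketch below renders them into Hadoop's `<configuration>`/`<property>` XML layout, with `mapred.*` keys destined for mapred-site.xml and `dfs.*` keys for hdfs-site.xml. Every value shown is an illustrative assumption, not a recommendation from this deck; suitable settings depend on the workload and cluster.

```python
import xml.etree.ElementTree as ET

# Illustrative values only (assumptions); tune against your own workload.
settings = {
    "mapred.reduce.slowstart.completed.maps": "0.80",
    "dfs.balance.bandwidthPerSec": str(100 * 1024 * 1024),  # bytes per second
    "mapred.reduce.parallel.copies": "20",
    "mapred.reduce.tasks": "32",
    "mapred.tasktracker.reduce.tasks.maximum": "8",
    "mapred.compress.map.output": "true",
}


def to_site_xml(props):
    """Render a dict of properties as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


mapred_xml = to_site_xml({k: v for k, v in settings.items() if k.startswith("mapred.")})
hdfs_xml = to_site_xml({k: v for k, v in settings.items() if k.startswith("dfs.")})

print(mapred_xml)
print(hdfs_xml)
```

Generating the files from one dictionary keeps the tuning values in a single reviewable place, which helps when comparing runs with different settings.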

CPA Sizing and Scaling for Big Data

Cisco UCS Reference Configurations for Big Data

Full Rack UCS Solutions Bundle for Hadoop: Capacity

• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (LFF)
– 2 x E5-2640 (12 cores)
– 128GB
– 12 x 3TB 7.2K SATA

Full Rack UCS Solutions Bundle for Hadoop, NoSQL: Performance

• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (SFF)
– 2 x E5-2665 (16 cores)
– 256GB
– 24 x 1TB 7.2K SAS

Sizing

Start with current storage requirement
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)

If I/O requirement known, use next table for guidance

Most big data architectures are very linear, so more nodes = more capacity and better performance

Strike a balance between price/performance of individual nodes vs. total # of nodes

Part science, part art
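The sizing factors above can be combined into a back-of-the-envelope calculator. The Python sketch below is one plausible way to chain them, not a formula given in the deck: all inputs in the example (100 TB today, 0.5 TB/day ingest, a one-year horizon, 2:1 compression, 25% free space) are hypothetical, and the 36 TB per node figure comes from the 12 x 3TB server configuration.

```python
import math


def required_raw_tb(current_tb, daily_ingest_tb, days, replication=3,
                    compression_ratio=1.0, free_fraction=0.25, daily_growth=0.0):
    """Estimate raw disk capacity (TB) needed at the end of the planning horizon."""
    # Ingest accumulated over the horizon, with the ingest rate compounding daily.
    ingest = sum(daily_ingest_tb * (1 + daily_growth) ** d for d in range(days))
    logical = current_tb + ingest
    # Replication multiplies the footprint; compression divides it.
    stored = logical * replication / compression_ratio
    # Reserve a fraction of every disk as free/temp space.
    return stored / (1 - free_fraction)


def nodes_needed(raw_tb, node_capacity_tb=36):
    """Number of nodes, rounding up, for a given raw capacity."""
    return math.ceil(raw_tb / node_capacity_tb)


raw = required_raw_tb(current_tb=100, daily_ingest_tb=0.5, days=365,
                      compression_ratio=2.0, free_fraction=0.25)
print(round(raw, 1), "TB raw,", nodes_needed(raw), "nodes")
```

With these hypothetical inputs the estimate comes to 565 TB of raw disk, or 16 nodes at 36 TB each, which happens to be one full rack. A result like this is only a starting point, since I/O and CPU requirements may call for more nodes than capacity alone.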

CPA sizing and application guidelines

                          Best Performance                Best Price/TB
Server
  CPU                     2 x E5-2690     2 x E5-2665     2 x E5-2640
  Memory (GB)             256             256             128
  Disk Drives             24 x 600GB 10K  24 x 1TB 7.2K   12 x 3TB 7.2K
  IO Bandwidth (GB/Sec)   2.6             2.0             1.1
Rack-Level
  Cores                   256             256             192
  Memory (TB)             4               4               2
  Capacity (TB)           225             384             576
  IO Bandwidth (GB/Sec)   41.3            31.9            16.9
  Applications            MPP DB, NoSQL   Hadoop, NoSQL   Hadoop

Scaling the CPA

Single Rack 16 servers

Single Domain Up to 10 racks, 160 servers

Multiple Domains

L2/L3 Switching

Consider intra- and inter-domain bandwidth:

All figures except the first column are per fabric; bandwidth is server-to-server in Gbits/sec.

Servers per domain    Available north-bound   Southbound         Northbound         Intra-domain   Inter-domain
(pair of FIs)         10GE ports              oversubscription   oversubscription   bandwidth      bandwidth
160                   16                      2:1                5:1                5              1
144                   24                      2:1                3:1                5              1.67
128                   32                      2:1                2:1                5              2.5

Scaling the Common Platform Architecture: multiple domains based on 16 servers per rack and 2 x 2232 FEXs
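The rows of the scaling table can be reproduced with a little arithmetic, which makes the trade-off between servers per domain and inter-domain bandwidth easier to see. The Python sketch below assumes a budget of 96 ports per fabric interconnect; that number is an inference from the table (FEX uplinks plus north-bound ports sum to 96 in every row), not a figure stated on the slide.

```python
FI_PORTS = 96          # assumption: port budget per fabric interconnect, inferred from the table
SERVERS_PER_RACK = 16
UPLINKS_PER_FEX = 8    # 10GE uplinks from each FEX to its fabric interconnect
LINK_GBPS = 10


def domain_bandwidth(servers):
    """Return (north ports, south oversub, north oversub, intra Gbps, inter Gbps) per fabric."""
    racks = servers // SERVERS_PER_RACK
    fex_uplinks = racks * UPLINKS_PER_FEX        # FI ports consumed by the racks
    north_ports = FI_PORTS - fex_uplinks         # ports left for north-bound links
    south_oversub = SERVERS_PER_RACK / UPLINKS_PER_FEX
    north_oversub = fex_uplinks / north_ports
    intra = LINK_GBPS / south_oversub            # worst-case share within the domain
    inter = intra / north_oversub                # worst-case share between domains
    return north_ports, south_oversub, north_oversub, intra, round(inter, 2)


for servers in (160, 144, 128):
    print(servers, domain_bandwidth(servers))
```

Dropping from ten racks to eight per domain frees sixteen more north-bound ports and raises the worst-case inter-domain share from 1 to 2.5 Gbits/sec per fabric.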

Multi-Domain CPA Customer Example

• 10 Gbits/sec Intra-Domain Server to Server NW Bandwidth

• 5 Gbits/sec Inter-Domain Server to Server NW Bandwidth

• Static pinning from FEX to FI (no port-channel)

Recommendations: UCS Domains and Racks

Single Domain Recommendation

Turn off or enable at physical rack level

• For simplicity and ease of use, leave Rack Awareness off

• Consider turning it on to limit physical rack level fault domain (e.g. localized failures due to physical data center issues – water, power, cooling, etc.)

Multi Domain Recommendation

Create one Hadoop rack per UCS Domain

• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack

• Provides HDFS data protection across domains

• Helps minimize cross-domain traffic
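Rack Awareness is typically enabled by pointing Hadoop at a topology script (the `net.topology.script.file.name` property, or `topology.script.file.name` in Hadoop 1.x) that receives host addresses as arguments and prints one rack path per address. A minimal Python sketch of such a script, mapping each UCS domain to one Hadoop rack, might look like the following. The IP addresses and rack names are hypothetical placeholders.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-domain map; a real deployment would load this
# from an inventory file rather than hard-coding it.
HOST_TO_RACK = {
    "10.0.1.11": "/ucs-domain-1",
    "10.0.1.12": "/ucs-domain-1",
    "10.0.2.11": "/ucs-domain-2",
    "10.0.2.12": "/ucs-domain-2",
}
DEFAULT_RACK = "/default-rack"   # fallback for hosts missing from the map


def resolve(hosts):
    """Return the rack path for each host, in the order given."""
    return [HOST_TO_RACK.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more addresses as arguments and reads the racks from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

Because every server in a domain maps to the same rack path, HDFS spreads block replicas across domains, and the scheduler can prefer tasks that stay inside one domain.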

Exercise 1

Set up a single node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc

An SQL-like interface to Hadoop

Top level Apache project – http://hive.apache.org/

Hive history
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics

Hive Components

– Shell: allows interactive queries
– Driver: session handles, fetch, execute
– Compiler: parse, plan, optimize
– Execution engine: DAG of stages (MR, HDFS, metadata)
– Metastore: schema, location in HDFS, SerDe

Data Model

Tables
– Typed columns (int, float, string, boolean)
– Also list and map types (for JSON-like data)

Partitions
– For example, range-partition tables by date

Buckets
– Hash partitions within ranges (useful for sampling, join optimization)

               DBMS                                               Hive
Language       SQL-92 standard                                    Subset of SQL-92 plus Hive extensions
Updates        INSERT, UPDATE, DELETE                             INSERT OVERWRITE; no UPDATE or DELETE
Transactions   Yes                                                No
Latency        Sub-second                                         Minutes to hours
Indexes        Any number of indexes, important to performance    No indexes, data is always scanned in parallel
Dataset size   TBs                                                PBs

Metastore

– Database: namespace containing a set of tables
– Holds table definitions (column types, physical layout)
– Holds partitioning information
– Can be stored in Derby, MySQL, and other relational databases

Source: cc-licensed slide by Cloudera

Hive components

Source: cc-licensed slide by Cloudera

Hive MetaStore

[Diagram labels: MetaStore, InputFormat, Hadoop cluster, Impala, HCatalog, HiveServer2, HiveCLI, BeelineCLI]

Hive Physical Layout

Warehouse directory in HDFS
– E.g., /user/hive/warehouse

Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables

Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary format with a custom SerDe
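The directory layout described above can be shown with a small path-building helper. This Python sketch assumes Hive's `key=value` naming for partition subdirectories; the table name `weblogs` and the partition columns `dt` and `country` are hypothetical examples.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"   # warehouse directory from the slide


def partition_path(table, **partitions):
    """Build the HDFS directory that holds one partition of a table."""
    # Each partition column becomes one "key=value" subdirectory, in order.
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return posixpath.join(WAREHOUSE, table, *parts)


path = partition_path("weblogs", dt="2013-09-15", country="cn")
print(path)
```

A query that filters on the partition columns only needs to read the matching subdirectories, which is why date partitioning suits log data well.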

External and Hive managed tables

Hive managed tables
– Data moved to location /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost

hive> CREATE TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

External tables
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, use the EXTERNAL keyword and point to the location of the data while creating the table

hive> CREATE EXTERNAL TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/home/test/data';

Hive: Example

Hive looks similar to an SQL database

Relational join on two tables:
– Table of word counts from Shakespeare collection
– Table of word counts from the Bible

SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 5;

the   25848   62394
I     23031   8854
and   19671   38985
to    18038   13526
of    16700   34654
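To try the join without a cluster, the same query can be run against an in-memory SQLite database from Python. SQLite is only a stand-in here to illustrate the join semantics; it is not how Hive executes the query. The two tables are loaded with just the five rows shown in the result above, so the output matches by construction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    CREATE TABLE bible (word TEXT, freq INTEGER);
""")

# (word, Shakespeare frequency, Bible frequency), taken from the result on the slide.
rows = [
    ("the", 25848, 62394),
    ("I", 23031, 8854),
    ("and", 19671, 38985),
    ("to", 18038, 13526),
    ("of", 16700, 34654),
]
conn.executemany("INSERT INTO shakespeare VALUES (?, ?)", [(w, s) for w, s, _ in rows])
conn.executemany("INSERT INTO bible VALUES (?, ?)", [(w, b) for w, _, b in rows])

query = """
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN bible k ON (s.word = k.word)
    WHERE s.freq >= 1 AND k.freq >= 1
    ORDER BY s.freq DESC
    LIMIT 5
"""
result = conn.execute(query).fetchall()
for word, s_freq, k_freq in result:
    print(word, s_freq, k_freq)
```

In Hive the same statement would be compiled into MapReduce stages, which is where the minutes-scale latency comes from.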

Impala

Impala: general purpose MPP SQL query engine for Hadoop
– Query latency from milliseconds to hours, suitable for interactive data exploration
– Runs on the existing Hadoop cluster, on existing HDFS files and hardware

High performance
– Written in C++
– Direct access to HDFS and HBase data, no MapReduce

Unified platform
– Uses existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or Thrift API

Performance
– Disk throughput limited by hardware to 100MB/sec
– 3x to 90x faster than Hive, depending on the type of query

Impala Details

[Diagram: SQL App, Hive Metastore, HDFS NN, and StateStored, plus three impalad daemons. Each impalad contains a Query Planner, Query Coordinator, and Query Exec Engine and is shown with an HDFS DN and HBase.]

HiveQL interface

Unified metadata

Each impalad stays in contact with StateStored to update its state and to receive metadata for query planning.

The query coordinator initiates execution on remote impalads.

Intermediate results are streamed between impalads, and query results are streamed back to the client.

Exercise 2

Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc
