Hadoop workshop

Post on 26-Jan-2015


Hadoop workshop

Cloud Connect Shanghai, Sep 15, 2013

Ari Flink – Operations Architect

Mac Fang – Manager, Hadoop development

Dean Zhu – Hadoop Developer

© 2013 Cisco and/or its affiliates. All rights reserved. Cloud Connect 2013 Shanghai

Agenda

1. Introductions (5 minutes)

2. Hadoop and Big Data Concepts (20 minutes)

3. Cisco Webex Hadoop architecture (10 minutes)

4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)

5. Exercise 1 (30 minutes) – Configure a Hadoop single node VM on a laptop

6. Hive and Impala concepts (15 minutes)

7. Exercise 2 (30 minutes) – Analytics using Apache Hive and Cloudera Impala

8. Q & A


Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, Opportunities and Use case examples
– Hadoop architecture concepts


What is Big Data?

For our purposes, big data refers to distributed computing architectures specifically aimed at the “3 V’s” of data: Volume, Velocity, and Variety.

Traditional Enterprise Data Management

[Diagram: multiple Operational (OLTP) sources feed ETL, then the EDW, then BI/Reports]

– OLTP: Online Transactional Processing
– ETL: Extract, Transform, and Load (batch processing)
– EDW: Enterprise Data Warehouse
– BI: Business Intelligence

Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Real-time, but limited reporting/analytics

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

Enterprise Data Warehouse

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?


So what has changed? The Explosion of Unstructured Data

• 1.8 trillion gigabytes of data was created in 2011
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
• Most unstructured data is neither stored nor analyzed!

[Chart: GB of data (in billions), 0 to 10,000, structured vs. unstructured data, 2005 to 2015. Source: Cloudera]

Enterprise Data Management with Big Data

[Diagram: machine, web, and operational (OLTP) sources; ETL; Big Data (Hadoop, etc.) and MPP EDW; BI/reports, dashboards, and in-memory analytics]

Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Fast data, real-time

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

Enterprise Data Warehouse

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

Big Data

Lower value, semi-structured, multi-source, raw/“dirty”

• Which products do customers click on the most and/or spend the most time browsing without buying?

• How do we optimally set pricing for each product in each store for individual customers every day?

• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?


Example: Web and Location Analytics

An iPhone user searches Amazon for Vizio TVs in Electronics; the proxy log records the request:

1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"
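A log line like the one above only becomes useful for web and location analytics once it is split into fields. The following is a minimal Python sketch, assuming the field order seen in this single sample (timestamp, client IP, cache result/HTTP status, bytes, method, URL, quoted user agent). The pattern and the `parse_log_line` helper are illustrative names, not part of any tool mentioned in the slides, and the truncated URL is kept exactly as it appears.

```python
import re
from datetime import datetime, timezone

# Field layout is an assumption inferred from the one sample line on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ts>\d+\.\d+)\s+(?P<client>\S+)\s+(?P<result>\w+)/(?P<status>\d{3})\s+'
    r'(?P<size>\d+)\s+(?P<method>\S+)\s+(?P<url>\S+)\s+"(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Split one proxy log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The first field is a Unix timestamp with millisecond precision.
    record["time"] = datetime.fromtimestamp(float(record.pop("ts")), tz=timezone.utc)
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    # Very rough device detection from the user agent string.
    record["device"] = "iPhone" if "iPhone" in record["agent"] else "other"
    return record

sample = (
    '1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET '
    'http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 '
    '(KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"'
)

record = parse_log_line(sample)
print(record["client"], record["status"], record["size"], record["device"])
```

In a Hadoop setting, a function of this shape would typically run inside the map step so that each node parses the log blocks it stores locally.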


Big Data and Key Infrastructure Attributes

What big data isn’t:
• Usually not blade servers (not enough local storage)
• Usually not virtualized (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS

Instead:
• Move the compute to the storage
• Low-cost, DAS-based, scale-out clustered filesystem

Cost, Performance, and Capacity

[Chart: data storage capacity vs. cost for Enterprise Database, Massive Scale-Out Column Store, Hadoop, and NoSQL]

• Structured Data: Relational Database
• Unstructured Data: Machine Logs, Web Click Stream, Call Data Records, Satellite Feeds, GPS Data, Sensor Readings, Sales Data, Blogs, Emails, Video
• Cost points shown: $20K/TB, $10K/TB, $300-$1K/TB
• HW:SW $ splits shown: 70:30 and 30:70

Big Data Software Architectures


Three basic big data software architectures

MPP Relational Database (scale-out BI/DW)
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata

Batch-oriented Hadoop (heavy lifting, processing)
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*

Real-time NoSQL (fast key-value store/retrieve)
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo

*Cisco Partners

What Is Hadoop?

Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.

Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.

Hadoop Components and Operations: Hadoop Distributed File System (HDFS)

• Scalable & fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks – 64MB default, typically 128MB – 512MB
• Data is stored reliably. Each block is replicated 3 times by default
• Types of node functions
  – Name Node: manages HDFS
  – Job Tracker: manages MapReduce jobs
  – Data Node/Task Tracker: stores blocks/does work

[Diagram: a file split into Blocks 1-6 and distributed across Data nodes 1-13 behind three ToR FEX/switches, with a Name Node and a Job Tracker]
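The block and replication figures above translate directly into arithmetic. As a rough sketch (the function name `hdfs_footprint` is made up for illustration), the following Python snippet estimates how many blocks a file is split into and how much raw disk the replicas consume, using the 64MB block size and 3x replication defaults quoted on the slide.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, block replicas, raw MB on disk) for one file.

    Defaults follow the slide: 64MB blocks, each replicated 3 times.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies its actual size, so raw usage is
    # simply the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

print(hdfs_footprint(1000))        # 64MB blocks
print(hdfs_footprint(1000, 128))   # 128MB blocks: fewer, larger blocks
```

Larger block sizes reduce the number of blocks the Name Node must track, which is one reason clusters are often configured above the default.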

HDFS Architecture

[Diagram: Data nodes 1-15 in three groups, each behind a ToR FEX/switch, connected through a switch to the Name Node; replicas of blocks 1-4 are spread across the data nodes]

Name Node metadata:
– File to blocks: /usr/sean/foo.txt: blk_1, blk_2
– File to blocks: /usr/jacob/bar.txt: blk_3, blk_4
– Node to blocks: Data node 1: blk_1
– Node to blocks: Data node 2: blk_2, blk_3
– Node to blocks: Data node 3: blk_3
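The two Name Node tables shown on the slide (file to blocks, and data node to blocks) are enough to answer "where does this file live?". The sketch below is a toy in-memory model in Python, not the actual Name Node implementation; it uses only the entries visible on the slide, so blk_4 has no known location here.

```python
from collections import defaultdict

# Metadata copied from the slide; real clusters hold far more entries.
file_to_blocks = {
    "/usr/sean/foo.txt": ["blk_1", "blk_2"],
    "/usr/jacob/bar.txt": ["blk_3", "blk_4"],
}
node_to_blocks = {
    "Data node 1": ["blk_1"],
    "Data node 2": ["blk_2", "blk_3"],
    "Data node 3": ["blk_3"],
}

def locate(path):
    """Return [(block, [data nodes holding it])] for a file path."""
    block_to_nodes = defaultdict(list)
    for node, blocks in node_to_blocks.items():
        for block in blocks:
            block_to_nodes[block].append(node)
    return [(block, block_to_nodes.get(block, [])) for block in file_to_blocks[path]]

for block, nodes in locate("/usr/jacob/bar.txt"):
    print(block, nodes)
```

A client reading a file follows the same two steps: ask for the block list, then fetch each block from one of the data nodes that holds a replica.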

Rack Awareness

Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks” (i.e. failure domains)

Logical “racks” may or may not correspond to physical data center racks

Distributes blocks across different “racks” to avoid failure domain of a single “rack”

It can also lessen block movement between “racks”

[Diagram: Data nodes 1-5 in “Rack” 1, Data nodes 6-10 in “Rack” 2, and Data nodes 11-15 in “Rack” 3; blocks 1-4 each have three replicas placed on nodes in these racks]
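Hadoop learns the logical “rack” of each node from an administrator-supplied topology script: it passes one or more node addresses as arguments and reads back one rack path per address. The following Python sketch shows the shape of such a script. The address prefixes and rack names are hypothetical placeholders, and the configuration property that points at the script (commonly `net.topology.script.file.name` in Hadoop 2.x) should be checked against the documentation for your distribution.

```python
import sys

# Hypothetical mapping from address prefix to logical rack.
# Replace with your own inventory; these values are illustrative only.
RACKS = {
    "10.1.1.": "/rack1",
    "10.1.2.": "/rack2",
    "10.1.3.": "/rack3",
}
DEFAULT_RACK = "/default-rack"

def resolve(address):
    """Return the logical rack path for one node address."""
    for prefix, rack in RACKS.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    # Hadoop may pass several addresses at once; print one rack per address.
    for address in sys.argv[1:]:
        print(resolve(address))
```

Because the mapping is purely logical, the same mechanism can group nodes by physical rack or by any other failure domain.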

MapReduce Example: Word Count

Stages: Input, Map, Shuffle & Sort, Reduce, Output

Input splits:
– the quick brown fox
– the fox ate the mouse
– how now brown cow

Map output:
– Split 1: the, 1; quick, 1; brown, 1; fox, 1
– Split 2: the, 1; fox, 1; the, 1; ate, 1; mouse, 1
– Split 3: how, 1; now, 1; brown, 1; cow, 1

Reduce output:
– Reducer 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
– Reducer 2: ate, 1; cow, 1; mouse, 1; quick, 1
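The word count flow can be imitated on a single machine to make the three stages concrete. This is a plain Python sketch of the map, shuffle & sort, and reduce steps, not Hadoop API code; the function names are chosen for illustration and the input is the three lines from the slide.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle & sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

mapped = [pair for line in splits for pair in map_phase(line)]
result = dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())
print(result)
```

On a cluster, each map call runs on the node holding that input split, and the shuffle moves pairs across the network so that all counts for one word reach the same reducer.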

MapReduce Architecture

[Diagram: Task Trackers 1-15 in three groups, each behind a ToR FEX/switch, connected through a switch to the Job Tracker; mappers (M1-M3) and reducers (R1, R2) run on individual task trackers]

Job Tracker task assignments:
– Job1: TT1: Mapper1, Mapper2
– Job1: TT4: Mapper3, Reducer1
– Job2: TT6: Reducer2
– Job2: TT7: Mapper1, Mapper3

Cisco Webex Cloud and Hadoop Architecture

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved.

Cisco WebEx Collaboration Cloud

[Map legend: datacenter / PoP; leased network link]

Global Scale: 13 datacenters & iPoPs around the globe

Dedicated network: dual path 10G circuits between DCs

Multi-tenant: 95k sites

Real-time collaboration: voice, desktop sharing, video, chat

Things happen ...

• People make mistakes
• Hardware fails
• Software fails
• Even failovers sometimes fail

Cisco WebEx log collection overview

[Diagram: log collection pipeline. Unstructured/semi-structured data (application state & APIs) reaches Flume via Log4j, File, Avro, Syslog, Thrift, AMQP, and HTTP/REST. Flume sinks include HDFSSink, SolrSink, and other sinks; raw data lands in HDFS and the Solr index in SolrCloud. Structured data from RDBMS/MySQL is brought in with Sqoop.]

Cisco UCS C240 M3 servers: 12 x 3TB = 36 TB / server

Cisco UCS and Big Data

– Building a big data cluster with the UCS Common Platform Architecture (CPA)
– CPA Networking
– CPA Sizing and Scaling

The evolution of big data deployments

Experimental use of Big Data
• Deployed into IT Ops mandated infrastructures
• “Skunk works”
• Small to medium clusters

Big Data has established business value
• App team mandated infrastructure
• Purpose built for Big Data
• Performance matters
• Large or small clusters

[Diagram: General Purpose IT Data Center (Big Data alongside VMware, web, and SAP on generic IT servers) vs. Dedicated “Pod” for Big Data (Big Data on x86 servers)]

Hadoop Hardware Evolving in the Enterprise

Typical 2009 Hadoop node

• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $

Economics favor “fat” nodes

• 6x-9x more data/node

• 3x-6x more IOPS/node

• Saturated gigabit, 10GE on the rise

• Fewer total nodes lowers licensing/support costs

• Increased significance of node and switch failure

Typical 2013 Hadoop node

• 2RU server
• 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 8-core CPU
• 1-2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$

Cisco UCS Common Platform Architecture (CPA): Building Blocks for Big Data

• UCS 6200 Series Fabric Interconnects
• Nexus 2232 Fabric Extenders
• UCS Manager
• UCS C240 M3 Servers
• LAN, SAN, Management

CPA Network Design for Big Data


CPA Topology: Single wire for data and management

• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port-channel (static pinning)
• 2 x 10GE links per server for all traffic, data and management

CPA Recommended FEX Connectivity: 2 FEXs and 2 FIs

• 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks

Can Hadoop really push 10GE?

It can, depending on workload, so tune for it!

• Analytic workloads tend to be lighter on the network
• Transform workloads tend to be heavier on the network
• Hadoop has numerous parameters which affect the network
• Parameters to tune to take advantage of 10GE in the CPA:
  – mapred.reduce.slowstart.completed.maps
  – dfs.balance.bandwidthPerSec
  – mapred.reduce.parallel.copies
  – mapred.reduce.tasks
  – mapred.tasktracker.reduce.tasks.maximum
  – mapred.compress.map.output

CPA Sizing and Scaling for Big Data


Cisco UCS Reference Configurations for Big Data

Full Rack UCS Solutions Bundle for Hadoop Capacity
• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (LFF), each with:
  – 2 x E5-2640 (12 cores)
  – 128GB
  – 12 x 3TB 7.2K SATA

Full Rack UCS Solutions Bundle for Hadoop, NoSQL Performance
• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (SFF), each with:
  – 2 x E5-2665 (16 cores)
  – 256GB
  – 24 x 1TB 7.2K SAS

Sizing

Start with current storage requirement
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)

If I/O requirement known, use next table for guidance

Most big data architectures are very linear, so more nodes = more capacity and better performance

Strike a balance between price/performance of individual nodes vs. total # of nodes

Part science, part art
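The storage side of the sizing steps above can be written down as a small calculation. The Python sketch below is one way to combine the factors, under stated assumptions: compression is expressed as a ratio (2.0 means data shrinks to half), free space is a fraction of raw capacity, ingest is constant (growth in the ingest rate is not modeled), and the 36 TB per node default comes from the 12 x 3TB server mentioned earlier. The function names are illustrative.

```python
import math

def projected_data_tb(current_tb, daily_ingest_tb, days):
    """Data volume after `days` of constant ingest (growth in ingest not modeled)."""
    return current_tb + daily_ingest_tb * days

def nodes_needed(data_tb, replication=3, compression_ratio=1.0,
                 free_fraction=0.25, node_capacity_tb=36.0):
    """Estimate node count from logical data size.

    replication       copies of each block (typically 3x)
    compression_ratio uncompressed size / compressed size (1.0 = none)
    free_fraction     share of raw capacity kept free for temp space
    node_capacity_tb  raw disk per node (12 x 3TB = 36 TB)
    """
    stored_tb = data_tb * replication / compression_ratio
    raw_tb = stored_tb / (1.0 - free_fraction)
    return math.ceil(raw_tb / node_capacity_tb)

print(nodes_needed(100))                         # no compression
print(nodes_needed(100, compression_ratio=2.0))  # 2:1 compression
print(nodes_needed(projected_data_tb(100, 0.5, 365)))
```

This only covers capacity. I/O bandwidth and the price/performance balance between node size and node count still need the judgment the slide describes.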


CPA sizing and application guidelines

Configurations are ordered left to right from Best Performance to Best Price/TB.

Server
  CPU                     2 x E5-2690      2 x E5-2665      2 x E5-2640
  Memory (GB)             256              256              128
  Disk Drives             24 x 600GB 10K   24 x 1TB 7.2K    12 x 3TB 7.2K
  IO Bandwidth (GB/Sec)   2.6              2.0              1.1

Rack-Level
  Cores                   256              256              192
  Memory (TB)             4                4                2
  Capacity (TB)           225              384              576
  IO Bandwidth (GB/Sec)   41.3             31.9             16.9

Applications              MPP DB, NoSQL    Hadoop, NoSQL    Hadoop

Scaling the CPA

• Single Rack: 16 servers
• Single Domain: up to 10 racks, 160 servers
• Multiple Domains: L2/L3 Switching

Scaling the Common Platform Architecture

Multiple domains based on 16 servers per rack and 2 x 2232 FEXs. Consider intra- and inter-domain bandwidth. A domain is a pair of Fabric Interconnects; all columns except the first are per fabric.

Servers per   Available northbound   Southbound         Northbound         Intra-domain server-to-server   Inter-domain server-to-server
domain        10GE ports             oversubscription   oversubscription   bandwidth (Gbits/sec)           bandwidth (Gbits/sec)
160           16                     2:1                5:1                5                               1
144           24                     2:1                3:1                5                               1.67
128           32                     2:1                2:1                5                               2.5
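The rows of the scaling table can be reproduced with a few lines of arithmetic. The Python sketch below is one reading of how the figures relate, assuming one 10GE link per server per fabric, 16 servers and 8 uplinks per FEX, and that inter-domain bandwidth is the intra-domain figure divided by the northbound oversubscription. The function name and parameters are illustrative.

```python
LINK_GBPS = 10.0  # each server has one 10GE link per fabric

def domain_bandwidth(servers, northbound_ports, servers_per_fex=16, fex_uplinks=8):
    """Return (southbound oversub, northbound oversub, intra Gbps, inter Gbps) per fabric."""
    southbound = servers_per_fex / fex_uplinks          # 16 servers over 8 uplinks = 2:1
    fi_downlinks = servers / southbound                 # FEX uplinks landing on one FI
    northbound = fi_downlinks / northbound_ports        # FI downlinks per northbound port
    intra = LINK_GBPS / southbound                      # server-to-server inside the domain
    inter = intra / northbound                          # server-to-server across domains
    return southbound, northbound, intra, round(inter, 2)

for servers, ports in [(160, 16), (144, 24), (128, 32)]:
    print(servers, ports, domain_bandwidth(servers, ports))
```

The trade-off is visible directly: giving up server ports on the Fabric Interconnect for more northbound ports lowers the servers per domain but raises cross-domain bandwidth.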

Multi-Domain CPA Customer Example

• 10 Gbits/sec Intra-Domain Server to Server NW Bandwidth

• 5 Gbits/sec Inter-Domain Server to Server NW Bandwidth

• Static pinning from FEX to FI (no port-channel)


Recommendations: UCS Domains and Racks


Single Domain Recommendation

Turn off or enable at physical rack level

• For simplicity and ease of use, leave Rack Awareness off

• Consider turning it on to limit physical rack level fault domain (e.g. localized failures due to physical data center issues – water, power, cooling, etc.)

Multi Domain Recommendation

Create one Hadoop rack per UCS Domain

• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack

• Provides HDFS data protection across domains

• Helps minimize cross-domain traffic


Exercise 1

Set up a single node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc

Hive

An SQL-like interface to Hadoop

Top level Apache project – http://hive.apache.org/

Hive history
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics

Hive Components

• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe

Data Model

Tables
– Typed columns (int, float, string, boolean)
– Also list and map types (for JSON-like data)

Partitions
– For example, range-partition tables by date

Buckets
– Hash partitions within ranges (useful for sampling, join optimization)

Hive

               DBMS                             Hive
Language       SQL-92 standard                  Subset of SQL-92 plus Hive extensions
Updates        INSERT, UPDATE, DELETE           INSERT OVERWRITE; no UPDATE or DELETE
Transactions   Yes                              No
Latency        Sub-second                       Minutes to hours
Indexes        Any number of indexes,           No indexes, data is always
               important to performance         scanned in parallel
Dataset size   TBs                              PBs

Metastore

• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and other relational databases

Source: cc-licensed slide by Cloudera


Hive components

[Diagram: Hive MetaStore, SerDe, and InputFormat in relation to the Hadoop cluster]

Source: cc-licensed slide by Cloudera

Hive MetaStore

[Diagram: the MetaStore and its backing RDBMS, together with Impala, HCatalog, Pig, HiveServer2, HiveCLI, and BeelineCLI]

Hive Physical Layout

Warehouse directory in HDFS
– E.g., /user/hive/warehouse

Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables

Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary format with a custom SerDe
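The directory layout described above can be illustrated by building the HDFS path for a partition. This Python sketch assumes a table in the default database under the /user/hive/warehouse directory named on the slide, and the usual `key=value` naming of partition subdirectories. The table name `weblogs` and partition column `dt` are hypothetical examples.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # warehouse directory from the slide

def partition_path(table, partitions=(), warehouse=WAREHOUSE):
    """Build the HDFS directory for a table partition.

    `partitions` is an ordered sequence of (column, value) pairs; each
    pair becomes one `column=value` subdirectory under the table.
    """
    parts = [f"{column}={value}" for column, value in partitions]
    return posixpath.join(warehouse, table, *parts)

print(partition_path("weblogs"))
print(partition_path("weblogs", [("dt", "2013-09-15")]))
```

Because each partition is its own directory, a query that filters on the partition column only needs to read the matching subdirectories.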

External and Hive managed tables

Hive managed tables
– Data moved to location /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost

hive> CREATE TABLE test (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;

External tables
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, use CREATE EXTERNAL TABLE and point to the location of the data

hive> CREATE EXTERNAL TABLE test (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/home/test/data';

Hive: Example

Hive looks similar to an SQL database

Relational join on two tables:
– Table of word counts from Shakespeare collection
– Table of word counts from the bible

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC
LIMIT 5;

the   25848   62394
I     23031   8854
and   19671   38985
to    18038   13526
of    16700   34654
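To see what the Hive query computes, the same join can be written over two small in-memory tables. The Python sketch below is only an illustration of the query's semantics (join, filter, sort, limit), not how Hive executes it. The sample data contains just the five rows shown in the slide's output, not the full word count tables.

```python
# Sample rows taken from the query output on the slide (word -> frequency).
shakespeare = {"the": 25848, "I": 23031, "and": 19671, "to": 18038, "of": 16700}
bible = {"the": 62394, "I": 8854, "and": 38985, "to": 13526, "of": 34654}

def top_common_words(s, k, limit=5):
    """Equivalent of the HiveQL join: words in both tables, ordered by s.freq."""
    joined = [(word, s[word], k[word]) for word in s if word in k]   # JOIN ON s.word = k.word
    kept = [row for row in joined if row[1] >= 1 and row[2] >= 1]    # WHERE both freqs >= 1
    kept.sort(key=lambda row: row[1], reverse=True)                  # ORDER BY s.freq DESC
    return kept[:limit]                                              # LIMIT 5

for word, s_freq, k_freq in top_common_words(shakespeare, bible):
    print(word, s_freq, k_freq)
```

In Hive the same logic is compiled into MapReduce stages, which is why the latency is minutes rather than the sub-second response of this toy version.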

Impala


General purpose MPP SQL query engine for Hadoop
– Query latency from milliseconds to hours; supports interactive data exploration
– Runs on the existing Hadoop cluster on existing HDFS files and hardware

High performance
– Written in C++
– Direct access to HDFS and HBase data, no MapReduce

Unified platform
– Uses existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or Thrift API

Performance
– Disk throughput limited by hardware to 100MB/sec
– 3x to 90x faster than Hive, depending on the type of query

Impala Details

[Diagram: a SQL app connects over ODBC to one of three impalad daemons. Each impalad contains a Query Planner, Query Coordinator, and Query Exec Engine and sits alongside an HDFS DN and HBase. The Hive Metastore, HDFS NN, and StateStored are shared services. Labels: HiveQL interface, unified metadata.]

• Each impalad stays in contact with StateStored to update its state and to receive metadata for query planning

• The query coordinator initiates execution on remote impalads

• Intermediate results are streamed between impalads, and query results are streamed back to the client

Exercise 2

Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc