Hortonworks.bdb

Hortonworks: We Do Hadoop.

Our mission is to enable your Modern Data Architecture by delivering Enterprise Apache Hadoop

March 2014

Our Mission: Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop

Our Commitment

• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills

Headquarters: Palo Alto, CA

Employees: 300+ and growing

Trusted Partners

Requirements for Enterprise Hadoop in the Modern Data Architecture

1. Key Services: Platform, operational and data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments

Requirements for Enterprise Hadoop

Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

Diagram: HDP service stack. Categories: CORE SERVICES, DATA SERVICES, OPERATIONAL SERVICES. Function labels: Storage, Resource Management, Process, Data Movement, Data Access, Data Security, Schedule, Cluster Mgmt, Dataset Mgmt. Components: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*.

HDP: A Complete Hadoop Distribution

Diagram: HORTONWORKS DATA PLATFORM (HDP) service stack. Deployment targets shown: OS/VM, Cloud, Appliance. Categories: CORE SERVICES, DATA SERVICES, OPERATIONAL SERVICES. Function labels: Storage, Resource Management, Process, Data Movement, Data Access, Schedule, Cluster Mgmt, Dataset Mgmt, LOAD & EXTRACT. Components: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*.

Store all data in a single place, interact in multiple ways

Hadoop 2: The Introduction of YARN

1st Gen of Hadoop: Single Use System (Batch Apps)

• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)

• Redundant, Reliable Storage (HDFS)
• Efficient Cluster Resource Management & Shared Services (YARN)
• Standard Query Processing: Hive, Pig
• Batch: MapReduce
• Interactive: Tez
• Online Data Processing: HBase, Accumulo
• Real Time Stream Processing: Storm
• others…

Apache Hadoop YARN

• Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
• Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
• Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads

The data operating system for Hadoop 2.0

Data Processing Engines Run Natively IN Hadoop

• BATCH: MapReduce
• INTERACTIVE: Tez
• STREAMING: Storm
• IN-MEMORY: Spark
• GRAPH: Giraph
• SAS: LASR, HPA
• ONLINE: HBase, Accumulo
• OTHERS

HDFS: Redundant, Reliable Storage

YARN: Cluster Resource Management

Driving Our Innovation Through Apache

147,933 lines

614,041 lines

End Users

449,768 lines

Total Net Lines Contributed to Apache Hadoop

Total Number of Committers to Apache Hadoop: 63 total

• Hortonworks: 21
• Yahoo: 10
• Cloudera: 7
• Facebook: 5
• IBM: 3
• LinkedIn: 3
• 10 Others

Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies

Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
ZooKeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48

Patterns for Hadoop Applications

DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR

Familiar and Existing Tools


BusinessObjects BI



SQL Interactive Query & Apache Hive




Stinger Initiative: Broad, community-based effort to deliver the next generation of Apache Hive

• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications against Hadoop
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)

Apache Hive

• The de facto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query

Diagram: Modern Data Architecture.

• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
• SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
• OPERATIONAL TOOLS: MANAGE & MONITOR
• DEV & DATA TOOLS: BUILD & TEST

Requirements for Enterprise Hadoop



Integrate with

• Applications: Business Intelligence, Developer IDEs, Data Integration
• Systems: Data Systems & Storage, Systems Management
• Platforms: Operating Systems, Virtualization, Cloud, Appliances

Broad Ecosystem Integration

Diagram: Modern Data Architecture with ecosystem integration. Labels: APPLICATIONS (BusinessObjects BI); DATA SYSTEM (RDBMS, EDW, MPP, HANA); SOURCES (Existing Sources: CRM, ERP, Clickstream, Logs; Emerging Sources: Sensor, Sentiment, Geo, Unstructured); OPERATIONAL TOOLS; DEV & DATA TOOLS; INFRASTRUCTURE.

Apache Hive and Stinger: SQL in Hadoop

Arun Murthy (@acmurthy)

Alan Gates (@alanfgates)

Owen O’Malley (@owen_omalley)

@hortonworks

Stinger Project (announced February 2013)

Batch AND Interactive SQL-IN-Hadoop

Stinger Initiative: A broad, community-based effort to drive the next generation of HIVE

Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN

Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing

Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop

…all IN Hadoop

Hive 0.12

Release Theme: Speed, Scale and SQL

Specific Features:
• 10x faster query launch when using large number (500+) of partitions
• ORCFile predicate pushdown speeds queries
• Evaluate LIMIT on the map side
• Parallel ORDER BY
• New query optimizer
• Introduces VARCHAR and DATE datatypes
• GROUP BY on structs or unions

Included Components: Apache Hive 0.12
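The "ORCFile predicate pushdown" item can be pictured with a small sketch. This is not ORC's reader code: the `Stripe` class and `scan_greater_than` function are hypothetical stand-ins, and the only assumption carried over is that each chunk of a file keeps min/max statistics for a column so a reader can skip chunks that cannot match the filter.

```python
class Stripe:
    """Hypothetical stand-in for one chunk of a columnar file."""

    def __init__(self, values):
        self.values = list(values)
        # Statistics are computed once at write time, not at query time.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(stripes, threshold):
    """Return rows with value > threshold and how many stripes were read."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        # Pushdown step: the stored max proves no row in this stripe matches.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = [Stripe([1, 5, 9]), Stripe([10, 14, 20]), Stripe([21, 30, 42])]
rows, stripes_read = scan_greater_than(stripes, 15)
print(rows, stripes_read)
```

In this toy data the first stripe is never read, which is the source of the speed-up: less data is read and decoded before the filter runs.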

SPEED: Increasing Hive Performance

Performance Improvements included in Hive 0.12

– Base & advanced query optimization

– Startup time improvement

– Join optimizations

Interactive Query Times across ALL use cases

• Simple and advanced queries in seconds

• Integrates seamlessly with existing tools

• Currently a >100x improvement in just nine months

Stinger Phase 3: Unlocking Interactive Query

Stinger Phase 3: Features and Benefits

• Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
• Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split size tuning
• Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput
• In-Memory Cache: Hot data kept in RAM for fast access
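A back-of-the-envelope model may help show why container pre-launch and re-use reduce latency. Everything here is an illustrative assumption: the function name `query_time`, the idea that tasks run in fixed-size waves, and the cost numbers are invented for the sketch and are not measurements from Tez.

```python
import math


def query_time(num_tasks, slots, startup_cost, task_cost, reuse, prelaunched):
    """Estimate wall-clock time units for a query under a simple wave model.

    Tasks run in waves of `slots` parallel containers. A cold container pays
    `startup_cost` (for example JVM startup) before it can run a task.
    """
    waves = math.ceil(num_tasks / slots)
    if reuse and prelaunched:
        startups = 0          # hot containers serve every wave
    elif reuse:
        startups = 1          # pay startup once, then keep the containers
    elif prelaunched:
        startups = waves - 1  # only the first wave finds hot containers
    else:
        startups = waves      # every wave launches fresh containers
    return waves * task_cost + startups * startup_cost


# Hypothetical workload: 100 tasks, 10 slots, startup 3 units, task 1 unit.
cold = query_time(100, 10, 3, 1, reuse=False, prelaunched=False)
reused = query_time(100, 10, 3, 1, reuse=True, prelaunched=False)
hot = query_time(100, 10, 3, 1, reuse=True, prelaunched=True)
print(cold, reused, hot)
```

Under these made-up numbers the startup cost dominates the cold case, which matches the slide's point that short queries benefit most from avoiding container launches.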

Stinger Phase 3: Speed, Scale, and SQL

Release Theme: Prove Hive for both large-scale and interactive SQL / analytics

Specific Features:
• < 10s SQL queries over 200GB datasets through Hive
• Tez container pre-launch
• Tez container re-use
• Use of Tez Intermediate Reduce pattern
• In-memory HDFS caching

Made available as part of the Tech Preview for Stinger Phase 3

Stinger Phase 3: Beyond Tech Preview

Release Theme: Speed, SQL,…and Security

Specific Features:

• Hive-on-Tez: Interactive query on Hive

• SQL Improvements:

• Sub-query for WHERE

• Standard JOIN semantics

• Support for Common Table Expressions (CTE)

• Phase 1 of ACID Semantics support

• Automatic JOIN order optimization

• CHAR datatype

• PAM authentication support

• SSL encryption

SQL: Enhancing SQL Semantics

Hive SQL Datatypes            Hive SQL Semantics
INT                           SELECT, INSERT
TINYINT/SMALLINT/BIGINT       GROUP BY, ORDER BY, SORT BY
BOOLEAN                       JOIN on explicit join key
FLOAT                         Inner, outer, cross and semi joins
DOUBLE                        Sub-queries in FROM clause
STRING                        ROLLUP and CUBE
TIMESTAMP                     UNION
BINARY                        Windowing Functions (OVER, RANK, etc)
DECIMAL                       Custom Java UDFs
ARRAY, MAP, STRUCT, UNION     Standard Aggregation (SUM, AVG, etc.)
DATE                          Advanced UDFs (ngram, Xpath, URL)
VARCHAR                       Sub-queries in WHERE, HAVING
CHAR                          Expanded JOIN Syntax
                              SQL Compliant Security (GRANT, etc.)
                              INSERT/UPDATE/DELETE (ACID)

Legend: Hive 0.12 | Available | Roadmap

SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop

Vectorized Query Execution

• Designed for Modern Processor Architectures
– Avoid branching in the inner loop.
– Make the most use of L1 and L2 cache.
• How It Works
– Process records in batches of 1,000 rows
– Generate code from templates to minimize branching.
• What It Gives
– 30x improvement in rows processed per second.
– Initial prototype: 100M rows/sec on laptop
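The batch-at-a-time idea can be sketched in a few lines. This is only a structural illustration under stated assumptions: the function names are hypothetical, the filter-and-sum query is invented, and pure Python will not show the CPU cache or branch-prediction gains the slide refers to. The batch size of 1,000 is the one quoted above.

```python
BATCH_SIZE = 1000  # rows per batch, as quoted on the slide


def sum_row_at_a_time(values, threshold):
    """Classic row-at-a-time evaluation: one data-dependent branch per row."""
    total = 0
    for v in values:
        if v > threshold:
            total += v
    return total


def sum_vectorized(values, threshold):
    """Evaluate the same filter and sum over fixed-size batches of rows."""
    total = 0
    for start in range(0, len(values), BATCH_SIZE):
        batch = values[start:start + BATCH_SIZE]
        # The comparison yields 0 or 1 and is multiplied in, so the inner
        # expression has no if/else on the data.
        total += sum(v * (v > threshold) for v in batch)
    return total


values = list(range(2500))
print(sum_row_at_a_time(values, 1999), sum_vectorized(values, 1999))
```

Both functions return the same answer; the difference is that the second keeps the per-row work uniform inside a tight loop over one batch, which is the shape a compiled engine can optimize.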

Hive-on-MR vs. Hive-on-Tez

SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg
FROM c GROUP BY x
ORDER BY avg;

Diagram: one query plan (SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price)) shown in two panels. Hive – MR: a chain of separate map (M) and reduce (R) jobs with HDFS between them. Hive – Tez: a single graph of map and reduce stages with no HDFS steps in between.

Tez avoids unneeded writes to HDFS
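The difference between the two panels can be approximated by counting where stage output is materialized. The plan below is a hypothetical three-stage simplification of the query on the slide, and the rule it encodes (every MapReduce job writes its output to HDFS, while a single DAG only writes its final result) is an assumption made for illustration rather than a description of either engine's internals.

```python
# Hypothetical plan: stage name -> list of stages it reads from.
plan = {
    "join_a_b": [],
    "join_a_c": ["join_a_b"],
    "group_by_state": ["join_a_c"],
}


def hdfs_writes(plan, engine):
    """Count stage outputs written to HDFS under the simplified model."""
    # A stage is "consumed" if some other stage lists it as an input.
    consumed = {dep for deps in plan.values() for dep in deps}
    if engine == "mr":
        # Each stage is its own job, so every output is materialized.
        return len(plan)
    if engine == "tez":
        # Only stages nobody consumes (the final results) are written.
        return sum(1 for stage in plan if stage not in consumed)
    raise ValueError(f"unknown engine: {engine}")


print(hdfs_writes(plan, "mr"), hdfs_writes(plan, "tez"))
```

For this plan the chained-jobs model writes three times and the DAG model once; the gap grows with the number of stages in the query.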

Tez Delivers Interactive Query - Out of the Box!

Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
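The "Runtime re-configuration of DAG" row describes choosing aggregation parallelism from statistics gathered while the query runs. A minimal sketch of that idea, assuming a simple bytes-per-reducer rule: the function name, the 256 MiB target and the cap of 100 reducers are all hypothetical values chosen for the example, not Tez or Hive defaults.

```python
import math

MIB = 1024 * 1024


def pick_reducer_count(observed_bytes, bytes_per_reducer=256 * MIB,
                       max_reducers=100):
    """Choose reducer parallelism from the upstream output size seen so far."""
    needed = math.ceil(observed_bytes / bytes_per_reducer)
    # Always run at least one reducer and never exceed the configured cap.
    return max(1, min(max_reducers, needed))


small = pick_reducer_count(10 * MIB)
medium = pick_reducer_count(1024 * MIB)
large = pick_reducer_count(1024 * 1024 * MIB)
print(small, medium, large)
```

The point of deciding at runtime is that the input to the aggregation is only known after the upstream stages have produced it, so a static guess made at compile time can be far off in either direction.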

How Stinger Phase 3 Delivers Interactive Query

Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce | Latency
Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics including histograms etc. | Latency
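Join re-ordering, as mentioned for the cost based optimizer, can be illustrated with a toy cost model. This sketch is not Optiq's algorithm: it assumes a left-deep join order, a single uniform join selectivity in place of real column statistics, and a cost equal to the sum of intermediate result sizes. The table names and row counts are invented.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    size = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Estimated rows after joining the running result with the next table.
        size = size * rows[table] * selectivity
        cost += size
    return cost


def best_join_order(rows, selectivity):
    """Exhaustively pick the cheapest order (fine for a handful of tables)."""
    return min(permutations(rows),
               key=lambda order: plan_cost(order, rows, selectivity))


# Hypothetical table cardinalities.
rows = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
order = best_join_order(rows, selectivity=0.0001)
naive = ("orders", "customers", "regions")
print(order, plan_cost(order, rows, 0.0001), plan_cost(naive, rows, 0.0001))
```

With these numbers the cheapest plans join the two small tables first and bring in the large table last, keeping intermediate results small. A real optimizer would estimate a separate selectivity for each join from column statistics such as histograms.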

Next Steps

• Blog: http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/
• Stinger Initiative: http://hortonworks.com/labs/stinger/
• Stinger Phase 3 Tech Preview: http://hortonworks.com/blog/announcing-stinger-phase-3-technical-preview/
• http://hadoopwrangler.com

Hortonworks: The Value of “Open” for You

Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community

Avoid Vendor Lock-In: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in

The Partners You Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments

Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use

Support from the Experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience

Validate & Try

1. Download the Hortonworks Sandbox
2. Learn Hadoop using the technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business case using your data in the sandbox

Engage

1. Execute a Business Case Discovery Workshop with our architects
2. Build a business case for Hadoop today
