Hortonworks.bdb
Uploaded by emil-andreas-siemes
![Page 1: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/1.jpg)
Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
March 2014
![Page 2: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/2.jpg)
Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
Our Commitment
Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
![Page 3: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/3.jpg)
Requirements for Enterprise Hadoop in the Modern Data Architecture
![Page 4: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/4.jpg)
Requirements for Enterprise Hadoop
1. Key Services: Platform, operational and data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
[Diagram: CORE SERVICES, DATA SERVICES and OPERATIONAL SERVICES, plus Enterprise Readiness (High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots). Functions shown: Storage, Resource Management, Process, Data Access, Data Movement, Schedule, Cluster Mgmt, Dataset Mgmt, Data Security. Components shown: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*.]
![Page 5: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/5.jpg)
HDP: A Complete Hadoop Distribution
[Diagram: the HORTONWORKS DATA PLATFORM (HDP) is drawn as CORE SERVICES, DATA SERVICES and OPERATIONAL SERVICES, plus Enterprise Readiness (High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots). Components shown: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, LOAD & EXTRACT (NFS, WebHDFS), Knox*, Oozie, Ambari, Falcon*. Runs on: OS/VM, Cloud, Appliance.]
![Page 6: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/6.jpg)
Hadoop 2: The Introduction of YARN
Store all data in a single place, interact in multiple ways
1st Gen of Hadoop: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
• Standard Query Processing: Hive, Pig
• Batch: MapReduce
• Interactive: Tez
• Online Data Processing: HBase, Accumulo
• Real Time Stream Processing: Storm
• others…
• Efficient Cluster Resource Management & Shared Services (YARN)
• Redundant, Reliable Storage (HDFS)
![Page 7: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/7.jpg)
Apache Hadoop YARN
The data operating system for Hadoop 2.0
Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data Processing Engines Run Natively IN Hadoop
• BATCH: MapReduce
• INTERACTIVE: Tez
• STREAMING: Storm
• IN-MEMORY: Spark
• GRAPH: Giraph
• SAS: LASR, HPA
• ONLINE: HBase, Accumulo
• OTHERS
YARN: Cluster Resource Management
HDFS: Redundant, Reliable Storage
![Page 8: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/8.jpg)
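Driving Our Innovation Through Apache
Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies
[Chart: Total Net Lines Contributed to Apache Hadoop. Values shown: 147,933 lines; 614,041 lines; 449,768 lines. One segment is labeled End Users; the other labels did not survive extraction.]
[Chart: Total Number of Committers to Apache Hadoop (63 total). Yahoo: 10; Cloudera: 7; IBM: 3; Facebook: 5; LinkedIn: 3; 10 Others; one further segment of 21 whose label did not survive extraction.]

| Apache Project | Committers | PMC Members |
|---|---|---|
| Hadoop | 21 | 13 |
| Tez | 10 | 4 |
| Hive | 11 | 3 |
| HBase | 8 | 3 |
| Pig | 6 | 5 |
| Sqoop | 1 | 0 |
| Ambari | 20 | 12 |
| Knox | 6 | 2 |
| Falcon | 2 | 2 |
| Oozie | 2 | 2 |
| Zookeeper | 2 | 1 |
| Flume | 1 | 0 |
| Accumulo | 2 | 2 |
| Storm | 1 | 0 |
| Drill | 1 | 0 |
| TOTAL | 95 | 48 |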
![Page 9: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/9.jpg)
Patterns for Hadoop Applications
DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR
![Page 10: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/10.jpg)
Familiar and Existing Tools
[Diagram: existing tools placed on the DEVELOP (COLLECT, PROCESS, BUILD), ANALYZE (EXPLORE, QUERY, DELIVER) and OPERATE (PROVISION, MANAGE, MONITOR) grid. The only tool name that survived extraction is BusinessObjects BI.]
![Page 11: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/11.jpg)
SQL Interactive Query & Apache Hive
Apache Hive
• The de facto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query
Stinger Initiative: Broad, community-based effort to deliver the next generation of Apache Hive
Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support broadest range of SQL semantics for analytic applications against Hadoop
![Page 12: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/12.jpg)
Requirements for Enterprise Hadoop
3. Integration: Interoperable with existing data center investments
Integrate with:
• Applications: Business Intelligence, Developer IDEs, Data Integration
• Systems: Data Systems & Storage, Systems Management
• Platforms: Operating Systems, Virtualization, Cloud, Appliances
[Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications); DATA SYSTEM (REPOSITORIES: RDBMS, EDW, MPP); SOURCES (Existing Sources: CRM, ERP, Clickstream, Logs; Emerging Sources: Sensor, Sentiment, Geo, Unstructured); OPERATIONAL TOOLS (MANAGE & MONITOR); DEV & DATA TOOLS (BUILD & TEST).]
![Page 13: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/13.jpg)
Broad Ecosystem Integration
[Diagram: APPLICATIONS (BusinessObjects BI); DATA SYSTEM (RDBMS, EDW, MPP, HANA); SOURCES (Existing Sources: CRM, ERP, Clickstream, Logs; Emerging Sources: Sensor, Sentiment, Geo, Unstructured); OPERATIONAL TOOLS; DEV & DATA TOOLS; INFRASTRUCTURE.]
![Page 14: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/14.jpg)
Apache Hive and Stinger:
SQL in Hadoop
Arun Murthy (@acmurthy)
Alan Gates (@alanfgates)
Owen O’Malley (@owen_omalley)
@hortonworks
![Page 15: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/15.jpg)
Stinger Project (announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: A broad, community-based effort to drive the next generation of HIVE
Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
![Page 16: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/16.jpg)
Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features:
• 10x faster query launch when using a large number (500+) of partitions
• ORCFile predicate pushdown speeds queries
• Evaluate LIMIT on the map side
• Parallel ORDER BY
• New query optimizer
• Introduces VARCHAR and DATE datatypes
• GROUP BY on structs or unions
Included Components: Apache Hive 0.12
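The predicate pushdown item above can be illustrated with a minimal sketch. It assumes, for illustration only, that a file is split into fixed-size blocks ("stripes") that each record the minimum and maximum of a column, so a reader can skip blocks that cannot match a filter. The `Stripe` class and both function names are hypothetical; this is not the ORCFile format or any Hive API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """A block of column values plus min/max statistics kept as metadata."""
    rows: List[int]
    min_value: int
    max_value: int


def make_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split values into stripes and record min/max for each one."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, skipping stripes whose max rules them out.

    Returns the matching values and the number of stripes actually read.
    """
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # statistics prove no row in this stripe can match
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


values = list(range(1000))            # sorted data: 10 stripes of 100 rows
stripes = make_stripes(values, 100)
matches, stripes_read = scan_greater_than(stripes, 949)
print(len(matches), "rows matched;", stripes_read, "of", len(stripes), "stripes read")
```

The benefit depends on how the data is laid out: on sorted or clustered columns most stripes can be skipped, while on randomly ordered data the min/max ranges overlap and little is saved.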
![Page 17: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/17.jpg)
SPEED: Increasing Hive Performance
Performance Improvements included in Hive 0.12
– Base & advanced query optimization
– Startup time improvement
– Join optimizations
Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
![Page 18: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/18.jpg)
Stinger Phase 3: Unlocking Interactive Query
Stinger Phase 3: Features and Benefits
• Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
• Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split size tuning
• Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput
• In-Memory Cache: Hot data kept in RAM for fast access
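A back-of-the-envelope sketch of why container re-use cuts latency. It assumes a deliberately simple model that is not taken from the deck: every task takes the same time, a fixed number of slots run in parallel, and each new container pays a fixed startup cost. The function name and all figures are hypothetical.

```python
def total_time(num_tasks: int, num_slots: int, startup_s: int, work_s: int,
               reuse: bool) -> int:
    """Wall-clock seconds to run num_tasks equal tasks on num_slots parallel slots."""
    waves = -(-num_tasks // num_slots)  # ceiling division: rounds of tasks per slot
    if reuse:
        # Each slot starts one container, which then picks up task after task.
        return startup_s + waves * work_s
    # Every task launches a fresh container and pays the startup cost again.
    return waves * (startup_s + work_s)


# Hypothetical figures: 100 one-second tasks, 10 slots, 2 s container startup.
without_reuse = total_time(100, 10, startup_s=2, work_s=1, reuse=False)  # 30
with_reuse = total_time(100, 10, startup_s=2, work_s=1, reuse=True)      # 12
print(without_reuse, with_reuse)
```

In this model the saving grows with the number of short tasks per slot, which is also why re-use makes small split sizes less costly.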
![Page 19: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/19.jpg)
Stinger Phase 3: Speed, Scale, and SQL
Release Theme: Prove Hive for both large-scale and interactive SQL / analytics
Specific Features:
• < 10s SQL queries over 200GB datasets through Hive
• Tez container pre-launch
• Tez container re-use
• Use of Tez Intermediate Reduce pattern
• In-memory HDFS caching
Made available as part of the Tech Preview for Stinger Phase 3
![Page 20: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/20.jpg)
Stinger Phase 3: Beyond Tech Preview
Release Theme: Speed, SQL,…and Security
Specific Features:
• Hive-on-Tez: Interactive query on Hive
• SQL Improvements:
• Sub-query for WHERE
• Standard JOIN semantics
• Support for Common Table Expressions (CTE)
• Phase 1 of ACID Semantics support
• Automatic JOIN order optimization
• CHAR datatype
• PAM authentication support
• SSL encryption
![Page 21: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/21.jpg)
SQL: Enhancing SQL Semantics
SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Legend (the colour coding of individual items did not survive extraction): Available, Hive 0.12, Roadmap
Hive SQL Datatypes
• INT
• TINYINT/SMALLINT/BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
• TIMESTAMP
• BINARY
• DECIMAL
• ARRAY, MAP, STRUCT, UNION
• DATE
• VARCHAR
• CHAR
Hive SQL Semantics
• SELECT, INSERT
• GROUP BY, ORDER BY, SORT BY
• JOIN on explicit join key
• Inner, outer, cross and semi joins
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• Windowing Functions (OVER, RANK, etc.)
• Custom Java UDFs
• Standard Aggregation (SUM, AVG, etc.)
• Advanced UDFs (ngram, Xpath, URL)
• Sub-queries in WHERE, HAVING
• Expanded JOIN Syntax
• SQL Compliant Security (GRANT, etc.)
• INSERT/UPDATE/DELETE (ACID)
![Page 22: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/22.jpg)
![Page 23: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/23.jpg)
Vectorized Query Execution
• Designed for Modern Processor Architectures
– Avoid branching in the inner loop.
– Make the most use of L1 and L2 cache.
• How It Works
– Process records in batches of 1,000 rows
– Generate code from templates to minimize branching.
• What It Gives
– 30x improvement in rows processed per second.
– Initial prototype: 100M rows/sec on laptop
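The difference between row-at-a-time and batch processing can be sketched as follows. This is an illustration of the idea only, under the assumption of a simple filter-and-sum query over two hypothetical columns (`prices`, `qtys`); it is not Hive's vectorized execution engine, and pure Python will not show the speedups quoted above.

```python
from typing import List, Tuple

BATCH_SIZE = 1000  # batch size quoted on the slide


def sum_row_at_a_time(rows: List[Tuple[int, int]], threshold: int) -> int:
    """Row-at-a-time: one data-dependent branch per row in the inner loop."""
    total = 0
    for price, qty in rows:
        if price > threshold:
            total += price * qty
    return total


def sum_vectorized(prices: List[int], qtys: List[int], threshold: int) -> int:
    """Batch-at-a-time: work on column slices, replacing the branch with a 0/1 mask."""
    total = 0
    for start in range(0, len(prices), BATCH_SIZE):
        p = prices[start:start + BATCH_SIZE]
        q = qtys[start:start + BATCH_SIZE]
        mask = [int(x > threshold) for x in p]  # selection computed for the whole batch
        total += sum(m * x * y for m, x, y in zip(mask, p, q))
    return total


# Hypothetical data: 2,500 rows, so the last batch is only partly filled.
prices = [i % 50 for i in range(2500)]
qtys = [1 + i % 3 for i in range(2500)]
row_result = sum_row_at_a_time(list(zip(prices, qtys)), threshold=25)
batch_result = sum_vectorized(prices, qtys, threshold=25)
print(row_result == batch_result)
```

In a compiled engine the batch form keeps each column slice in cache and lets the inner loops run without unpredictable branches, which is where the gain comes from.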
![Page 24: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/24.jpg)
![Page 25: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/25.jpg)
![Page 26: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/26.jpg)
Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVG(b.y) AS avg_y
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg_y
FROM c GROUP BY x
ORDER BY avg_y;
[Diagram: under Hive on MapReduce, the plan runs as a chain of separate map (M) and reduce (R) jobs, with HDFS shown between them. Under Hive on Tez, the same operators run as one graph of map and reduce tasks with no HDFS steps in between. Operator labels shown: SELECT a.state, c.itemId; SELECT c.price; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price).]
Tez avoids unneeded writes to HDFS
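A toy model of the point about HDFS writes. It assumes a plan that is a simple linear list of stages, and uses a hypothetical in-memory `FakeHDFS` class to count writes; neither the class nor the two runner functions correspond to any MapReduce or Tez API.

```python
from typing import Callable, Dict, List

Stage = Callable[[List[int]], List[int]]


class FakeHDFS:
    """In-memory stand-in for a file system that counts writes."""

    def __init__(self) -> None:
        self.files: Dict[str, List[int]] = {}
        self.writes = 0

    def write(self, path: str, data: List[int]) -> None:
        self.files[path] = list(data)
        self.writes += 1

    def read(self, path: str) -> List[int]:
        return list(self.files[path])


def run_as_job_chain(stages: List[Stage], data: List[int], fs: FakeHDFS) -> List[int]:
    """One job per stage: every stage writes its output and the next reads it back."""
    path = "input"
    fs.files[path] = list(data)  # pre-existing input, not counted as a write
    for i, stage in enumerate(stages):
        output = stage(fs.read(path))
        path = f"stage_{i}"
        fs.write(path, output)
    return fs.read(path)


def run_as_dag(stages: List[Stage], data: List[int], fs: FakeHDFS) -> List[int]:
    """One graph: stages hand data to each other directly, only the result is written."""
    for stage in stages:
        data = stage(data)
    fs.write("result", data)
    return fs.read("result")


stages: List[Stage] = [
    lambda rows: [r for r in rows if r % 2 == 0],  # filter
    lambda rows: [r * r for r in rows],            # transform
    lambda rows: [sum(rows)],                      # aggregate
]
chain_fs, dag_fs = FakeHDFS(), FakeHDFS()
chain_result = run_as_job_chain(stages, list(range(10)), chain_fs)
dag_result = run_as_dag(stages, list(range(10)), dag_fs)
print(chain_result, chain_fs.writes, dag_result, dag_fs.writes)
```

Both runs produce the same answer; the chained version writes once per stage, while the single-graph version writes only the final result.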
![Page 27: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/27.jpg)
Tez Delivers Interactive Query - Out of the Box!

| Feature | Description | Benefit |
|---|---|---|
| Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster | Latency |
| Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency |
| Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency |
| Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput |
| Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency |
| Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput |
![Page 28: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/28.jpg)
![Page 29: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/29.jpg)
![Page 30: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/30.jpg)
![Page 31: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/31.jpg)
![Page 32: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/32.jpg)
![Page 33: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/33.jpg)
![Page 34: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/34.jpg)
How Stinger Phase 3 Delivers Interactive Query

| Feature | Description | Benefit |
|---|---|---|
| Tez Integration | Tez is a significantly better engine than MapReduce | Latency |
| Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput |
| Query Planner | Using extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency |
| Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms etc. | Latency |
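Join re-ordering from statistics can be sketched with a deliberately crude cost model. The assumptions are illustrative and not taken from Hive or Optiq: only left-deep plans are considered, every join shares one selectivity factor, and a plan's cost is the sum of its estimated intermediate result sizes. The table names, row counts, and function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Tuple


def plan_cost(order: Tuple[str, ...], row_counts: Dict[str, int],
              selectivity: float) -> float:
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    rows = float(row_counts[order[0]])
    cost = 0.0
    for name in order[1:]:
        rows = rows * row_counts[name] * selectivity  # estimated join output size
        cost += rows
    return cost


def best_join_order(row_counts: Dict[str, int], selectivity: float) -> Tuple[str, ...]:
    """Try every order and keep the cheapest (fine for a handful of tables)."""
    return min(permutations(row_counts),
               key=lambda order: plan_cost(order, row_counts, selectivity))


# Hypothetical statistics: one large table and two small ones.
row_counts = {"a": 1_000_000, "b": 1_000, "c": 10}
best = best_join_order(row_counts, selectivity=0.001)
naive_cost = plan_cost(("a", "b", "c"), row_counts, 0.001)
best_cost = plan_cost(best, row_counts, 0.001)
print(best, best_cost, naive_cost)
```

With these figures the cheapest plans join the two small tables first and bring in the large table last, which keeps the intermediate results small. A real optimizer uses per-column statistics and avoids exhaustive enumeration for larger queries.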
![Page 35: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/35.jpg)
Next Steps
• Blog: http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/
• Stinger Initiative: http://hortonworks.com/labs/stinger/
• Stinger Phase 3 Tech Preview: http://hortonworks.com/blog/announcing-stinger-phase-3-technical-preview/
• http://hadoopwrangler.com
![Page 36: Hortonworks.bdb](https://reader034.fdocuments.in/reader034/viewer/2022051617/55a4ddd61a28ab37768b461e/html5/thumbnails/36.jpg)
Hortonworks: The Value of “Open” for You
Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community
Avoid Vendor Lock-In: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in
The Partners You Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments
Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use
Support from the Experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience
Validate & Try
1. Download the Hortonworks Sandbox
2. Learn Hadoop using the technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business case using your data in the sandbox
Engage
1. Execute a Business Case Discovery Workshop with our architects
2. Build a business case for Hadoop today