Where is Hadoop Going Next?
Source: datasys.cs.iit.edu/events/MTAGS14/keynote1.pdf

Owen O’Malley, @owen_omalley
November 2014
© Hortonworks Inc. 2014

Page 2: Who am I?
• Worked at Yahoo Search
  – WebMap in a Week
  – Dreadnaught to Juggernaut to …
• Hadoop
  – MapReduce
  – Security
• Hive
• Apache/Open Source Champion
• PhD in Software Engr from UC Irvine

Page 3: Topics
• Hadoop History
“A beginning is the time for taking the most delicate care that the balances are correct.”
- Herbert
• Themes
  – Storage
  – Computation
  – Security

Page 4: What was the Problem?
• Yahoo needed to build WebMaps faster
  – Whole web analysis for Yahoo Search
  – WebMap in a Week
• WebMap used Dreadnaught
  – Roughly like MapReduce and HDFS
  – Scaled to 800 machines
  – Assigned nodes in backup pairs
  – Single application cluster
• Started on C++ DFS & MapReduce

Page 5: What did Hadoop Do Right?
• Focus on a few customers
  – Helped Yahoo Search analytics team
  – Terasort benchmarks
• Expected Failures
  – Storage corrects automatically
  – Healthy in minutes instead of hours
  – Nodes are automatically assigned
• No chokepoints
  – Data never travels through singleton
• RAM isn’t large enough

Page 6: What did Hadoop Do Right?
• Simplified FileSystem abstraction
  – No random writes
• Apache
  – Many companies working together
  – Open governance
• Open Source
  – Many hands and eyes
  – “Use the source, Luke”
• Open platform

Page 7: Storage
“The more storage you have, the more stuff you accumulate.”
- Stewart

Page 8: HDFS
• Phases
  – Single HDFS NameNode
  – Cross cluster references
  – Federated HDFS NameNodes
• Need HDFS Block Storage factored out
  – Wider variety of applications
• Need co-location of files
  – Not entire table, but sections of the table
  – ACID (and HBase) base and delta files
  – Correlated tables

Page 9: File Formats
• Phases
  – Text and Sequence File
  – RCFile
  – Avro
  – ORC and Parquet
• Columnar formats
• Type specific encoding
• Self describing metadata at end

Page 10: ORC
• Light-weight indexes
  – Predicate pushdown
  – Answers from metadata
• Seeking within file
• Available from Hive, Pig, & MapReduce
• C++ reader/writer coming
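
The light-weight index bullets on this slide can be illustrated with a minimal sketch. It assumes each stripe of a file records the minimum, maximum, and row count for one numeric column. `StripeStats` and `LightweightIndexSketch` are hypothetical names invented for this example and are not the ORC reader API; the real index structures carry more detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-stripe statistics for one numeric column. */
final class StripeStats {
    final long min;
    final long max;
    final long rowCount;

    StripeStats(long min, long max, long rowCount) {
        this.min = min;
        this.max = max;
        this.rowCount = rowCount;
    }
}

final class LightweightIndexSketch {
    /** Predicate pushdown: indexes of stripes that may contain a value in [lo, hi]. */
    static List<Integer> stripesToRead(List<StripeStats> stats, long lo, long hi) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            StripeStats s = stats.get(i);
            // A stripe is skipped when its [min, max] range cannot overlap [lo, hi].
            if (s.max >= lo && s.min <= hi) {
                selected.add(i);
            }
        }
        return selected;
    }

    /** Answer from metadata: a row count computed without touching row data. */
    static long countRows(List<StripeStats> stats) {
        long total = 0;
        for (StripeStats s : stats) {
            total += s.rowCount;
        }
        return total;
    }

    public static void main(String[] args) {
        List<StripeStats> stats = Arrays.asList(
            new StripeStats(1, 100, 1000),
            new StripeStats(101, 200, 1000),
            new StripeStats(201, 300, 500));
        // Only the middle stripe can hold values between 150 and 180.
        System.out.println(stripesToRead(stats, 150, 180));
        System.out.println(countRows(stats));
    }
}
```

The same statistics serve both bullets: a range check decides which stripes to read at all, and an aggregate such as a row count can be answered from the statistics alone.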

Page 11: Computation
“A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it.”
- Herbert

Page 12: Why does Hadoop Need ACID?
• Hadoop and Hive have always…
  – Worked without ACID
  – Perceived as tradeoff for performance
  – Add or replace entire partitions
• But, your data isn’t static
  – It changes daily, hourly, or faster
  – Managing change makes the user’s life better
• Need consistent views of changing data!

Page 13: Use Cases
• Updating a Dimension Table
  – Changing a customer’s address
• Delete Old Records
  – Remove records for compliance
• Update/Restate Large Fact Tables
  – Fix problems after they are in the warehouse
• Streaming Data Ingest
  – A continual stream of data coming in

Page 14: Longer Term Use Cases
• Multiple statement transactions
  – Group statements that need to work together
• Query tables as they appeared in the past
  – Configurable length of history
• Row-level lineage
  – Track users and queries that updated each row

Page 15: Design
• HDFS Does Not Allow Arbitrary Writes
  – Store changes as delta files
  – Stitched together by client on read
• Writes get a Transaction ID
  – Sequentially assigned by Metastore
• Reads get Committed Transactions
  – Provides snapshot consistency
  – No locks required
  – Provide a snapshot of data from start of query
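
The design on this slide can be sketched as a merge-on-read over a base file and its deltas. This is a minimal illustration under stated assumptions, not Hive's actual implementation. `DeltaRecord` and `SnapshotReadSketch` are hypothetical names. Rows are modeled as a map from row id to value, and a `null` value stands for a delete. The reader is handed the set of transaction IDs that were committed when its query started.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** One change written by a transaction (hypothetical shape); a null value marks a delete. */
final class DeltaRecord {
    final long txnId;
    final long rowId;
    final String value;

    DeltaRecord(long txnId, long rowId, String value) {
        this.txnId = txnId;
        this.rowId = rowId;
        this.value = value;
    }
}

final class SnapshotReadSketch {
    /**
     * Stitches base rows and delta records together on read. Only deltas whose
     * transaction is in committedAtStart are applied, so the reader sees a
     * consistent snapshot without taking locks.
     */
    static Map<Long, String> read(Map<Long, String> base,
                                  List<DeltaRecord> deltas,
                                  Set<Long> committedAtStart) {
        Map<Long, String> view = new TreeMap<>(base); // copy, so the base is never modified
        List<DeltaRecord> ordered = new ArrayList<>(deltas);
        // Apply deltas in transaction-ID order so later committed writes win.
        ordered.sort(Comparator.comparingLong((DeltaRecord d) -> d.txnId));
        for (DeltaRecord d : ordered) {
            if (!committedAtStart.contains(d.txnId)) {
                continue; // not committed when the query started: invisible to this reader
            }
            if (d.value == null) {
                view.remove(d.rowId);
            } else {
                view.put(d.rowId, d.value);
            }
        }
        return view;
    }

    public static void main(String[] args) {
        Map<Long, String> base = new HashMap<>();
        base.put(1L, "a");
        base.put(2L, "b");
        List<DeltaRecord> deltas = Arrays.asList(
            new DeltaRecord(5, 1, "a2"),  // update row 1
            new DeltaRecord(6, 2, null),  // delete row 2
            new DeltaRecord(7, 3, "c"));  // insert row 3, still uncommitted
        Set<Long> committed = new HashSet<>(Arrays.asList(5L, 6L));
        System.out.println(read(base, deltas, committed));
    }
}
```

Because the base is never rewritten in place, the sketch stays within the "no arbitrary writes" constraint: every change is an appended delta, and visibility is decided entirely at read time.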

Page 16: Vectorization
• MapReduce’s RecordReader
  – boolean next(K key, V value);
• Better to process 1000 rows at a time
  – Amortizes the cost of method calls
  – Use primitive arrays for tight inner loops
  – No access methods
  – Extremely important for operator trees
  – Branches (including virtual dispatch) kill pipelining
• Can run at 100 million rows/second
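
The contrast with the row-at-a-time `next(K key, V value)` call can be sketched as follows. This is an illustrative sketch only: `LongColumnBatch` and `VectorizationSketch` are hypothetical names rather than Hive's vectorization classes, and the batch size of 1000 is taken from the slide.

```java
/** A batch of up to 1000 rows of one column, held in a primitive array (hypothetical). */
final class LongColumnBatch {
    static final int MAX_SIZE = 1000;
    final long[] values = new long[MAX_SIZE];
    int size; // number of valid rows in this batch
}

final class VectorizationSketch {
    /** Sums one batch in a tight loop over a primitive array: no per-row method call. */
    static long sumBatch(LongColumnBatch batch) {
        final long[] v = batch.values;
        final int n = batch.size;
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += v[i];
        }
        return sum;
    }

    /** Feeds the source through reusable batches: one operator call per 1000 rows. */
    static long sumAll(long[] source) {
        LongColumnBatch batch = new LongColumnBatch();
        long total = 0;
        for (int start = 0; start < source.length; start += LongColumnBatch.MAX_SIZE) {
            batch.size = Math.min(LongColumnBatch.MAX_SIZE, source.length - start);
            System.arraycopy(source, start, batch.values, 0, batch.size);
            total += sumBatch(batch);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] source = new long[2500];
        for (int i = 0; i < source.length; i++) {
            source[i] = i;
        }
        System.out.println(sumAll(source));
    }
}
```

The inner loop reads fields directly and contains no virtual calls, which is the property the slide credits for keeping the CPU pipeline full. The batch object is reused across calls, so no per-row allocation occurs either.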

Page 17: Tez
• Replacing MapReduce as basis for
  – Hive, Pig, Cascading
• Executes entire DAG of tasks
• More options for shuffle
• Scales up and down dynamically
• Queries scheduled as one application instead of waves of jobs.

Page 18: Hive Cost Based Optimizer
• Current optimizer is a mess of rules
  – Rule interactions are complex
• Optiq provides a framework
  – YACC for optimizers
• Make better choices
  – Huge impact on performance
• Obsoletes lots of old advice

Page 19: LLAP
• Live Long and Process
  – Persistent Hive execution engine
• JVM startup costs are huge
  – JIT cost alone is staggering
• Hot Table Data Caching
  – Keep hot columns and partitions in memory
• Sub-second answers

Page 20: Security
“There is no such thing as perfect security, only varying levels of insecurity.”
- Rushdie

Page 21: Audit and Authorization
• Three A’s of security
  – Authentication, Authorization, and Audit
• Phases
  – No users
  – Users, but no authentication
  – Authorization
• Next: centralized authorization and audit
• Encryption

Page 22: Encryption
• Underlying file system
  – Thief breaks into data center…
• HDFS encryption
  – Parallels HDFS authorization
  – Prevents AFN attacks
• Column encryption
  – Encrypt just PII columns, rolling keys
• Value encryption
  – No salt weak sauce so joins work
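
The value-encryption bullet appears to describe a trade-off: leaving out the salt weakens the encryption, but equal plaintext values then produce equal ciphertexts, so encrypted columns can still be joined. A minimal sketch of that property, assuming AES in ECB mode as a stand-in for an unsalted scheme (the deck does not name an algorithm). `UnsaltedValueEncryptionSketch` and the hard-coded demo key are hypothetical and for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

final class UnsaltedValueEncryptionSketch {
    /** Encrypts deterministically: the same key and value always give the same output. */
    static String encrypt(byte[] key16, String value) {
        try {
            // AES in ECB mode takes no IV or salt. That is weaker than a randomized
            // mode; it is used here only because it keeps equal values comparable.
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key16, "AES"));
            byte[] out = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    public static void main(String[] args) {
        // Demo key only; a real deployment would fetch keys from a key manager.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        String left = encrypt(key, "alice@example.com");
        String right = encrypt(key, "alice@example.com");
        // Equal plaintexts give equal ciphertexts, so an equi-join on the
        // encrypted column matches the same rows as a join on the plaintext.
        System.out.println(left.equals(right));
    }
}
```

The cost of this choice is that anyone who can see the column learns which rows share a value, which is why the slide calls the approach weak.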

Page 23: Thank You!
Questions & Answers