Kudu: New Hadoop Storage for Fast Analytics on Fast Data

Todd Lipcon on behalf of the Kudu team. Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop. © Cloudera, Inc. All rights reserved.

Transcript of Kudu: New Hadoop Storage for Fast Analytics on Fast Data

Page 1: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Todd Lipcon on behalf of the Kudu team

Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop


Page 2: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco

Page 3: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu
Storage for Fast Analytics on Fast Data

• New updating column store for Hadoop

• Apache-licensed open source

• Beta now available

[Diagram labels: Columnar Store; Kudu]

Page 4: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Motivation and Goals
Why build Kudu?


Page 5: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Motivating Questions

• Are there user problems that we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware landscape?

Page 6: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Current Storage Landscape in Hadoop

HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput

HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable

Gaps exist when these properties are needed simultaneously

Page 7: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Changing Hardware landscape

• Spinning disk -> solid state storage
  • NAND flash: up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access

Page 8: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu Design Goals

• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • “NoSQL” style scan/insert/update (Java client)

Page 9: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu Usage

• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!


Page 10: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Use cases and architectures

Page 11: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu Use Cases

Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes

● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: Network threat detection
  ○ Workload: Inserts, scans, lookups
● Online Reporting
  ○ Examples: ODS
  ○ Workload: Inserts, updates, scans, lookups

Page 12: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity

Considerations:
● How do I handle failure during this process?
● How often do I reorganize data streaming in into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?

[Diagram labels: Incoming Data (Messaging System); HBase; New Partition; Most Recent Partition; Historic Data; Parquet File; Reporting Request; Impala on HDFS]

Have we accumulated enough data? If so, reorganize HBase file into Parquet:
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file

Page 13: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Real-Time Analytics in Hadoop with Kudu

Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations

[Diagram labels: Incoming Data (Messaging System); Storage in Kudu (Historical and Real-time Data); Reporting Request]

Page 14: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


How it works
Replication and distribution


Page 15: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Tables and Tablets

• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)
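
A minimal sketch of how hash partitioning on the timestamp column could route rows to buckets (tablets). The CRC32 hash and the `bucket_for` helper are assumptions made for illustration; the slide does not say which hash function Kudu uses.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map the hash-partition column value to a bucket index.

    CRC32 is a stand-in hash chosen for this sketch, not Kudu's real one.
    """
    encoded = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(encoded) % num_buckets


# Rows keyed by the composite primary key (host, metric, timestamp).
rows = [
    ("host1", "cpu", 1445500000),
    ("host1", "cpu", 1445500001),
    ("host2", "mem", 1445500000),
]
for host, metric, ts in rows:
    print(host, metric, ts, "-> bucket", bucket_for(ts))
```

Because only the timestamp is hashed here, rows from different hosts with the same timestamp land in the same bucket.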


Page 16: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Metadata

• Replicated master*
  • Acts as a tablet directory (“META” table)
  • Acts as a catalog (table schemas, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC perf:
    • 99th percentile: 68us, 99.99th percentile: 657us
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them
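
The "ask as needed and cache" behavior of the client can be sketched as follows. `FakeMaster`, `Client`, and their methods are hypothetical names for this illustration and are not the Kudu client API.

```python
class FakeMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, locations):
        self.locations = locations  # table name -> list of tablet servers
        self.lookups = 0            # counts simulated location RPCs

    def get_table_locations(self, table):
        self.lookups += 1
        return self.locations[table]


class Client:
    """Asks the master only on a cache miss, then reuses the answer."""

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def locate(self, table):
        if table not in self.cache:
            self.cache[table] = self.master.get_table_locations(table)
        return self.cache[table]


master = FakeMaster({"metrics": ["ts-a", "ts-b", "ts-c"]})
client = Client(master)
client.locate("metrics")
client.locate("metrics")
print("master lookups:", master.lookups)
```

A real client would also need to invalidate stale entries, for example after a leader change; that is left out of this sketch.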


Page 17: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Page 18: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Raft consensus

[Diagram: Client; TS A hosting Tablet 1 (LEADER); TS B and TS C each hosting Tablet 1 (FOLLOWER); a WAL on each tablet server]

1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
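
The write sequence on this slide can be sketched as a majority-acknowledged append. This is a simplified, sequential illustration: the `Replica` class and `replicated_write` function are hypothetical, and real Raft also handles terms, log matching, and parallel RPCs.

```python
class Replica:
    """One tablet replica with an in-memory stand-in for its WAL."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.wal = []

    def append(self, op):
        """Append to the local WAL; return True on success."""
        if not self.alive:
            return False
        self.wal.append(op)
        return True


def replicated_write(leader, followers, op):
    """Return True once a majority of replicas have logged `op`."""
    acks = 1 if leader.append(op) else 0      # step 2b: leader writes WAL
    for follower in followers:                # steps 2a, 3, 4
        if follower.append(op):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1  # e.g. 2 of 3, 3 of 5
    return acks >= majority                   # steps 5 and 6


leader = Replica("TS A")
followers = [Replica("TS B"), Replica("TS C", alive=False)]
print(replicated_write(leader, followers, "INSERT row1"))
```

With one of three replicas down, the leader plus one follower still form a majority, so the write is acknowledged.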

Page 19: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Fault tolerance

• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
  • New LEADER is elected from remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
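
A short worked example of the numbers on this slide. The function and constant names are made up for the sketch; the values (1.5 second heartbeats, 3 missed heartbeats, (N-1)/2 failures) come from the bullets above, and the division is assumed to round down.

```python
HEARTBEAT_INTERVAL_SEC = 1.5
MISSED_HEARTBEATS_BEFORE_ELECTION = 3


def tolerated_failures(n_replicas: int) -> int:
    """N replicas handle (N-1)/2 failures, rounded down."""
    return (n_replicas - 1) // 2


def election_trigger_delay_sec() -> float:
    """Time without heartbeats before followers start an election."""
    return HEARTBEAT_INTERVAL_SEC * MISSED_HEARTBEATS_BEFORE_ELECTION


for n in (3, 5):
    print(n, "replicas tolerate", tolerated_failures(n), "failure(s)")
print("election starts after", election_trigger_delay_sec(), "seconds of silence")
```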


Page 20: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Fault tolerance (2)

• Permanent failure:
  • Leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • Master selects a new replica
  • Leader copies the data over to the new one, which joins as a new FOLLOWER


Page 21: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


How it works
Storage engine internals


Page 22: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Tablet design

• Inserts buffered in an in-memory store (like HBase’s memstore)
  • Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  • No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
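
A minimal sketch of the MVCC idea: updates are kept as timestamped deltas next to the base row, and a read "as of" a timestamp applies only the deltas at or before it. The `read_as_of` function and the data layout are assumptions for illustration, not Kudu's internal format.

```python
def read_as_of(base_row, deltas, snapshot_ts):
    """Return the row as it looked at `snapshot_ts`.

    base_row: dict of column -> value (the oldest version).
    deltas:   list of (timestamp, column, new_value) updates.
    """
    row = dict(base_row)  # never modify the base data in place
    for ts, column, value in sorted(deltas, key=lambda d: d[0]):
        if ts <= snapshot_ts:
            row[column] = value
    return row


base = {"name": "todd", "pay": "$1000", "role": "engineer"}
updates = [(20, "pay", "$1M"), (30, "role", "manager")]

print(read_as_of(base, updates, 10))  # before any update
print(read_as_of(base, updates, 25))  # sees only the pay update
print(read_as_of(base, updates, 99))  # "current time" view
```

The more deltas a row has accumulated, the more work each read does, which is consistent with the last bullet above.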


Page 23: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


LSM vs Kudu

• LSM – Log Structured Merge (Cassandra, HBase, etc)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex.
  • Slower writes in exchange for faster reads (especially scans)


Page 24: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


LSM Insert Path

[Diagram: INSERT -> MemStore -> flush -> HFile 1]

MemStore:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”

HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”

Page 25: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


LSM Insert Path

[Diagram: INSERT -> MemStore -> flush -> HFile 2, alongside the existing HFile 1]

MemStore:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“2”

HFile 2:
  Row=r2 col=c1 val=“blah2”
  Row=r2 col=c2 val=“2”

HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”

Page 26: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


LSM Update path

[Diagram: UPDATE -> MemStore; HFile 1 and HFile 2 are untouched]

MemStore:
  Row=r2 col=c1 val=“newval”

HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“2”

HFile 2:
  Row=r2 col=c1 val=“v2”
  Row=r2 col=c2 val=“5”

Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!

Page 27: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


LSM Read path

[Diagram: MemStore, HFile 1, and HFile 2 are merged, based on string row keys, into the read result]

MemStore:
  Row=r2 col=c1 val=“newval”

HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“2”

HFile 2:
  Row=r2 col=c1 val=“v2”
  Row=r2 col=c2 val=“5”

Merged result:
  R1: c1=blah c2=2
  R2: c1=newval c2=5
  ….

• CPU intensive!
• Must always read rowkeys
• Any given row may exist across multiple HFiles: must always merge!
• The more HFiles to merge, the slower it reads
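
The LSM read path on this slide can be sketched as a k-way merge over sorted stores in which the newest store wins for each cell. The `lsm_scan` function and the data layout are assumptions for illustration; HBase's real scanner is considerably more involved.

```python
import heapq


def lsm_scan(stores):
    """Merge sorted stores into one view; newer stores override older ones.

    stores: list ordered oldest to newest; each store is a list of
            ((row, col), value) pairs sorted by (row, col).
    """
    # Tag every cell with the negated store age so that, for equal keys,
    # the newest store sorts first in the merge.
    tagged = [
        [(key, -age, value) for key, value in store]
        for age, store in enumerate(stores)
    ]
    result = {}
    for key, _neg_age, value in heapq.merge(*tagged):
        result.setdefault(key, value)  # first seen for a key is the newest
    return result


hfile1 = [(("r1", "c1"), "blah"), (("r1", "c2"), "2")]
hfile2 = [(("r2", "c1"), "v2"), (("r2", "c2"), "5")]
memstore = [(("r2", "c1"), "newval")]

for (row, col), value in lsm_scan([hfile1, hfile2, memstore]).items():
    print(row, col, value)
```

Every cell costs a string key comparison in the merge, and each extra store adds another input to merge, which is the CPU cost the slide calls out.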

Page 28: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – Inserts and Flushes

[Diagram: INSERT(“todd”, “$1000”, “engineer”) goes to the MemRowSet, which is flushed to DiskRowSet 1 with columns name, pay, role]

Page 29: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – Inserts and Flushes

[Diagram: INSERT(“doug”, “$1B”, “Hadoop man”) goes to the MemRowSet, which is flushed to DiskRowSet 2; DiskRowSet 1 and DiskRowSet 2 each hold columns name, pay, role]

Page 30: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage - Updates

[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns name, pay, role) and a Delta MS]

Each DiskRowSet has its own DeltaMemStore to accumulate updates

Page 31: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage - Updates

[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns name, pay, role) and a Delta MS]

UPDATE set pay=“$1M” WHERE name=“todd”

• Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
• Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
• Search key column to find offset: rowid = 150
• Delta MS entry: 150: col 1=$1M
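
The update path on this slide can be sketched as: consult each DiskRowSet's Bloom filter, search the key column only when the filter says "maybe", and record the change in that rowset's DeltaMemStore under the row's ordinal offset. All class and function names below are hypothetical, and the MemRowSet check is omitted.

```python
import bisect
import hashlib


class Bloom:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class DiskRowSet:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys   # key column, kept sorted
        self.bloom = Bloom()
        for k in sorted_keys:
            self.bloom.add(k)
        self.delta_ms = {}        # rowid -> {column: new value}

    def find_rowid(self, key):
        """Binary-search the key column; return the ordinal offset or None."""
        i = bisect.bisect_left(self.keys, key)
        return i if i < len(self.keys) and self.keys[i] == key else None


def update(rowsets, key, column, value):
    """Record the update in the DeltaMemStore of the rowset owning `key`."""
    for drs in rowsets:
        if not drs.bloom.might_contain(key):
            continue                  # "Bloom says: no!"
        rowid = drs.find_rowid(key)   # "Bloom says: maybe!"
        if rowid is not None:
            drs.delta_ms.setdefault(rowid, {})[column] = value
            return rowid
    return None


drs1 = DiskRowSet(["alice", "todd"])
drs2 = DiskRowSet(["bob", "doug"])
print(update([drs2, drs1], "todd", "pay", "$1M"), drs1.delta_ms)
```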

Page 32: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – Read path

[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns name, pay, role) and a Delta MS; one Delta MS holds “150: pay=$1M”]

• Read rows in DiskRowSet 2
• Then, read rows in DiskRowSet 1
• Any row is only in exactly one DiskRowSet: no need to merge cross-DRS!
• Updates are merged based on ordinal offset within DRS: array indexing, no string compares
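
A minimal sketch of why this read path is cheap: deltas are keyed by ordinal row offset, so applying them to columnar base data is plain array indexing. The `scan_rowset` function and the dictionary layout are assumptions for illustration.

```python
def scan_rowset(columns, deltas):
    """Apply deltas to columnar base data by ordinal offset.

    columns: {column name: list of values}, one list entry per row.
    deltas:  {rowid: {column name: new value}}.
    """
    out = {name: list(values) for name, values in columns.items()}
    for rowid, changes in deltas.items():
        for name, value in changes.items():
            out[name][rowid] = value  # array indexing, no key comparisons
    return out


base = {
    "name": ["alice", "todd"],
    "pay": ["$900", "$1000"],
    "role": ["analyst", "engineer"],
}
delta_ms = {1: {"pay": "$1M"}}

print(scan_rowset(base, delta_ms))
```

Because each row lives in exactly one DiskRowSet, the results of scanning each rowset can simply be concatenated instead of merged by key.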

Page 33: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – Delta flushes

[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns name, pay, role) and a Delta MS; a Delta MS is flushed to a REDO DeltaFile containing “0: pay=foo”]

A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version

Page 34: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – Major delta compaction

[Diagram: one DiskRowSet (columns name, pay, role) before and after a major delta compaction]

DiskRowSet (pre-compaction): base data, Delta MS, and three REDO DeltaFiles
• Many deltas accumulate: lots of delta application work on reads

DiskRowSet (post-compaction): base data, Delta MS, unmerged REDO deltas, and UNDO deltas
• Merge updates for columns with high update percentage
• If a column has few updates, it doesn’t need to be re-written: those deltas are maintained in a new DeltaFile

Page 35: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – RowSet Compactions

Before compaction:
  DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
  DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
  DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]

After compaction:
  DRS 4 (32MB): [alice, bob, carl, joe]
  DRS 5 (32MB): [jon, julie, linda, mary]
  DRS 6 (32MB): [omar, zach, zeke, zoe]

Reorganize rows to avoid rowsets with overlapping key ranges
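
The rowset compaction on this slide can be sketched as merging the sorted input rowsets and cutting the result into new rowsets with disjoint key ranges. The `compact` function is hypothetical, and a fixed row count per output stands in for the 32MB size target.

```python
import heapq


def compact(rowsets, rows_per_output):
    """Merge sorted rowsets and split into non-overlapping output rowsets."""
    merged = list(heapq.merge(*rowsets))
    return [
        merged[i:i + rows_per_output]
        for i in range(0, len(merged), rows_per_output)
    ]


drs1 = ["alice", "joe", "linda", "zach"]
drs2 = ["bob", "jon", "mary", "zeke"]
drs3 = ["carl", "julie", "omar", "zoe"]

for out in compact([drs1, drs2, drs3], rows_per_output=4):
    print(out)
```

After compaction a key lookup only has to consult the one rowset whose key range covers it, instead of all three.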

Page 36: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu storage – Compaction policy

• Solves an optimization problem (knapsack problem)
  • Minimize “height” of rowsets for the average key lookup
  • Bound on number of seeks for write or random-read
• Restrict total IO of any compaction to a budget (128MB)
  • No long compactions, ever
  • No “minor” vs “major” distinction
  • Always be compacting or flushing
  • Low IO priority maintenance threads


Page 37: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu trade-offs

• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce “column groups” for applications where single-row access is more important
  • Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)


Page 38: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Benchmarks


Page 39: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


TPC-H (Analytics benchmark)

• 75 TS + 1 master cluster
  • 12 (spinning) disks each, enough RAM to fit dataset
  • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
    SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
    FROM customer, orders, lineitem, supplier, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_suppkey = s_suppkey
      AND c_nationkey = s_nationkey
      AND s_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
      AND o_orderdate >= date '1994-01-01'
      AND o_orderdate < '1995-01-01'
    GROUP BY n_name
    ORDER BY revenue desc;


Page 40: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)

Page 41: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)

[Bar chart, log-scale y-axis: Time (sec), ticks from 0.01 to 10000. Categories: Load, TPCH Q1, COUNT(*), COUNT(*) WHERE…, single-row lookup. Series: Phoenix, Kudu, Parquet. Data labels as extracted: 2152, 21976, 131, 0.04, 1918, 13.2, 1.7, 0.7, 0.15, 155, 9.3, 1.4, 1.5, 1.37]

Page 42: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


What about NoSQL-style random access? (YCSB)

• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops


Page 43: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


What Kudu is not


Page 44: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Kudu is…

• NOT a SQL database
  • “BYO SQL”
• NOT a filesystem
  • data must have tabular structure
• NOT a replacement for HBase or HDFS
  • Cloudera continues to invest in those systems
  • Many use cases where they’re still more appropriate
• NOT an in-memory database
  • Very fast for memory-sized workloads, but can operate on larger data too!


Page 45: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Getting started


Page 46: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Getting started as a user

• http://getkudu.io
• [email protected]
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster


Page 47: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Getting started as a developer

• http://github.com/cloudera/kudu
  • All commits go here first
• Public gerrit: http://gerrit.cloudera.org
  • All code reviews happening here
• Public JIRA: http://issues.cloudera.org
  • Includes bugs going back to 2013. Come see our dirty laundry!
• [email protected]
• Apache 2.0 license open source
  • Contributions are welcome and encouraged!


Page 48: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


Demo?
(if we have time and internet gods willing)


Page 49: Kudu: New Hadoop Storage for Fast Analytics on Fast Data


http://getkudu.io/
@getkudu