Kudu: New Hadoop Storage for Fast Analytics on Fast Data

© Cloudera, Inc. All rights reserved.

Todd Lipcon on behalf of the Kudu team

Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop



The conference for and by Data Scientists, from startup to enterprise: wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco

http://wrangleconf.com/


Kudu: Storage for Fast Analytics on Fast Data

• New updating column store for Hadoop

• Apache-licensed open source

• Beta now available

[Diagram labels: Columnar Store; Kudu]


Motivation and Goals: Why build Kudu?



Motivating Questions

• Are there user problems that we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware landscape?


Current Storage Landscape in Hadoop

HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput

HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable

Gaps exist when these properties are needed simultaneously


Changing Hardware landscape

• Spinning disk -> solid state storage
  • NAND flash: up to 450k read / 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access


Kudu Design Goals

• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • “NoSQL” style scan/insert/update (Java client)


Kudu Usage

• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!



Use cases and architectures


Kudu Use Cases

Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes

● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: Network threat detection
  ○ Workload: Inserts, scans, lookups
● Online Reporting
  ○ Examples: ODS
  ○ Workload: Inserts, updates, scans, lookups


Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity

Considerations:
● How do I handle failure during this process?
● How often do I reorganize data streaming in into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?

[Diagram labels: Incoming Data (Messaging System); HBase; Parquet File; New Partition; Most Recent Partition; Historic Data; Reporting Request; Impala on HDFS]

Have we accumulated enough data? If so, reorganize the HBase file into Parquet:
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file


Real-Time Analytics in Hadoop with Kudu

Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations

[Diagram labels: Incoming Data (Messaging System); Storage in Kudu (Historical and Real-time Data); Reporting Request]


How it works: Replication and distribution



Tables and Tablets

• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)
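
The hash partitioning above can be sketched as follows. This is a minimal illustration only: CRC32 stands in for whatever hash function Kudu really uses (the slides do not say), and `bucket_for` is a hypothetical helper name.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map the hashed column value to a bucket (tablet) index."""
    # Assumption: CRC32 is only a stand-in hash for this sketch.
    digest = zlib.crc32(str(timestamp).encode("utf-8"))
    return digest % num_buckets


# Spread 1000 rows keyed by (host, metric, timestamp) across the buckets.
rows = [("host1", "cpu", 1445500000 + i) for i in range(1000)]
counts = {}
for host, metric, ts in rows:
    bucket = bucket_for(ts)
    counts[bucket] = counts.get(bucket, 0) + 1
```

Because consecutive timestamps hash to unrelated buckets, a burst of recent inserts is spread over many tablets instead of landing on one.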



Metadata

• Replicated master*
  • Acts as a tablet directory (“META” table)
  • Acts as a catalog (table schemas, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC perf:
    • 99th percentile: 68us, 99.99th percentile: 657us
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them
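
The client-side caching described in the last bullet might look roughly like this. It is a sketch under stated assumptions: `TabletLocationCache` and the `ask_master` callback are hypothetical names, and the real client's RPCs and invalidation rules are not covered in the slides.

```python
from typing import Callable, Dict, List


class TabletLocationCache:
    """Ask the master for a tablet's location once, then serve it from memory."""

    def __init__(self, ask_master: Callable[[str], List[str]]):
        self._ask_master = ask_master          # hypothetical stand-in for the master RPC
        self._cache: Dict[str, List[str]] = {}
        self.master_calls = 0                  # counts round trips to the master

    def locate(self, tablet_id: str) -> List[str]:
        if tablet_id not in self._cache:
            self.master_calls += 1
            self._cache[tablet_id] = self._ask_master(tablet_id)
        return self._cache[tablet_id]

    def invalidate(self, tablet_id: str) -> None:
        # Drop a stale entry, e.g. after a request to a cached server fails.
        self._cache.pop(tablet_id, None)


def fake_master(tablet_id: str) -> List[str]:
    # Pretend every tablet lives on the same three tablet servers.
    return ["ts-a", "ts-b", "ts-c"]


cache = TabletLocationCache(fake_master)
first = cache.locate("tablet-1")
second = cache.locate("tablet-1")  # served from the cache, no master call
```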



Raft consensus

[Diagram: Client; TS A hosting Tablet 1 (LEADER); TS B and TS C each hosting Tablet 1 (FOLLOWER); each replica has its own WAL]

1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
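
The write path above can be sketched as a toy majority-ack loop. This is not Raft itself (no terms, elections, or log matching), and `Replica` and `leader_write` are hypothetical names chosen for illustration.

```python
class Replica:
    """One copy of a tablet with its own write-ahead log (WAL)."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.wal = []

    def update_consensus(self, op):
        # Steps 3 and 4: a live follower writes its WAL and acknowledges.
        if not self.alive:
            return False
        self.wal.append(op)
        return True


def leader_write(leader, followers, op):
    """Return True once a majority of replicas have the op in their WAL."""
    leader.wal.append(op)                  # step 2b: leader writes local WAL
    acks = 1                               # the leader counts toward the majority
    for follower in followers:             # step 2a: send to every follower
        if follower.update_consensus(op):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority                # steps 5 and 6


ts_a = Replica("TS A")
ts_b = Replica("TS B")
ts_c = Replica("TS C", alive=False)        # one follower is down
ok = leader_write(ts_a, [ts_b, ts_c], "INSERT row 1")
```

With three replicas, the write still succeeds when one follower is down, because the leader plus one follower form a majority.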


Fault tolerance

• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
  • New LEADER is elected from remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
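
The arithmetic behind these numbers, as a small sketch. It assumes that (N-1)/2 is meant as integer division; the function names are made up for illustration.

```python
HEARTBEAT_INTERVAL_S = 1.5   # from the slide
MISSED_HEARTBEATS = 3        # from the slide


def tolerated_failures(n_replicas: int) -> int:
    """A majority must survive, so N replicas tolerate (N-1)/2 failures."""
    return (n_replicas - 1) // 2


def election_trigger_delay_s() -> float:
    """Time without heartbeats before followers start an election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS


three = tolerated_failures(3)        # 1 failure tolerated
five = tolerated_failures(5)         # 2 failures tolerated
delay = election_trigger_delay_s()   # 4.5 seconds
```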



Fault tolerance (2)

• Permanent failure:
  • Leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • Master selects a new replica
  • Leader copies the data over to the new one, which joins as a new FOLLOWER



How it works: Storage engine internals



Tablet design

• Inserts buffered in an in-memory store (like HBase’s memstore)
  • Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  • No per row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
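
To illustrate the MVCC bullet, here is a minimal sketch of a timestamp-tagged cell that supports "as of" reads. `VersionedCell` is a hypothetical structure for illustration, not Kudu's on-disk format.

```python
import bisect


class VersionedCell:
    """Keeps every version of one cell, tagged with its write timestamp."""

    def __init__(self):
        self._timestamps = []   # kept sorted
        self._values = []       # parallel to _timestamps

    def write(self, ts, value):
        # Updates are not applied in place; a new version is recorded.
        i = bisect.bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def read_as_of(self, ts):
        """Return the latest value written at or before ts, or None."""
        i = bisect.bisect_right(self._timestamps, ts)
        return self._values[i - 1] if i else None


pay = VersionedCell()
pay.write(10, "$1000")
pay.write(20, "$1M")
old = pay.read_as_of(15)   # sees only the first write
new = pay.read_as_of(25)   # sees the update
```

A scan that reads every tablet as of the same timestamp sees one consistent snapshot, even if updates keep arriving.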



LSM vs Kudu

• LSM: Log Structured Merge (Cassandra, HBase, etc)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex.
  • Slower writes in exchange for faster reads (especially scans)



LSM Insert Path

INSERT -> MemStore:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”
flush -> HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”


LSM Insert Path

INSERT -> MemStore:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“2”
flush -> HFile 2:
  Row=r2 col=c1 val=“blah2”
  Row=r2 col=c2 val=“2”



LSM Update path

UPDATE -> MemStore:
  Row=r2 col=c1 val=“newval”
HFile 2:
  Row=r2 col=c1 val=“v2”
  Row=r2 col=c2 val=“5”

Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!


LSM Read path

MemStore:
  Row=r2 col=c1 val=“newval”
HFile 2:
  Row=r2 col=c1 val=“v2”
  Row=r2 col=c2 val=“5”

Merge based on string row keys (CPU intensive!):
  R1: c1=blah c2=2
  R2: c1=newval c2=5
  …

• Must always read rowkeys
• Any given row may exist across multiple HFiles: must always merge!
• The more HFiles to merge, the slower it reads
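
The merge-on-read behaviour can be sketched with a k-way merge over sorted stores. The data is illustrative and only loosely follows the rows on the slides; `lsm_scan` is a hypothetical name, and real HFiles carry timestamps and tombstones that this sketch leaves out.

```python
import heapq


def lsm_scan(stores):
    """Merge sorted stores of (row, col, val); stores are listed newest first."""
    # Tag each cell with its store's age so the newest version sorts first.
    tagged = [
        [((row, col), age, val) for row, col, val in store]
        for age, store in enumerate(stores)
    ]
    result = {}
    last_key = None
    for key, age, val in heapq.merge(*tagged):   # compares string row keys
        if key != last_key:                      # first hit is the newest version
            row, col = key
            result.setdefault(row, {})[col] = val
            last_key = key
    return result


memstore = [("r2", "c1", "newval")]
hfile_2 = [("r1", "c2", "2"), ("r2", "c1", "v2"), ("r2", "c2", "5")]
hfile_1 = [("r1", "c1", "blah"), ("r1", "c2", "1")]
merged = lsm_scan([memstore, hfile_2, hfile_1])
```

Every store takes part in every scan, which is why read cost grows with the number of HFiles.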


Kudu storage – Inserts and Flushes

INSERT(“todd”, “$1000”, “engineer”) -> MemRowSet
flush -> DiskRowSet 1 (columns: name, pay, role)


Kudu storage – Inserts and Flushes

INSERT(“doug”, “$1B”, “Hadoop man”) -> MemRowSet
flush -> DiskRowSet 2 (columns: name, pay, role)
DiskRowSet 1 (columns: name, pay, role) already on disk


Kudu storage - Updates

MemRowSet
DiskRowSet 1: base data (name, pay, role) + Delta MS
DiskRowSet 2: base data (name, pay, role) + Delta MS

Each DiskRowSet has its own DeltaMemStore to accumulate updates


Kudu storage - Updates

UPDATE set pay=“$1M” WHERE name=“todd”

Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
Search key column to find offset: rowid = 150
DiskRowSet 1 Delta MS: 150: col 1=$1M (base data untouched)
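
The update path on this slide can be sketched as follows. `TinyBloom` is a toy Bloom filter built on SHA-256 and `DiskRowSet` is a simplified stand-in; both are assumptions made for illustration and do not reflect Kudu's real data structures.

```python
import bisect
import hashlib


class TinyBloom:
    """Toy Bloom filter: may answer 'maybe' wrongly, never 'no' wrongly."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DiskRowSet:
    def __init__(self, keys):
        self.keys = sorted(keys)        # key column, sorted by primary key
        self.bloom = TinyBloom()
        for key in self.keys:
            self.bloom.add(key)
        self.delta_ms = {}              # rowid -> {column: new value}

    def find_rowid(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i if i < len(self.keys) and self.keys[i] == key else None


def update(rowsets, key, column, value):
    """Record an update in the DeltaMemStore of the rowset holding the key."""
    for drs in rowsets:
        if not drs.bloom.might_contain(key):
            continue                          # Bloom says: no!
        rowid = drs.find_rowid(key)           # Bloom says: maybe! Check the key column.
        if rowid is not None:
            drs.delta_ms.setdefault(rowid, {})[column] = value
            return rowid
    return None


drs_1 = DiskRowSet(["todd", "alice"])
drs_2 = DiskRowSet(["doug"])
rowid = update([drs_2, drs_1], "todd", "pay", "$1M")
```

The base data is never rewritten by the update; only the DeltaMemStore of the owning DiskRowSet changes.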


Kudu storage – Read path

MemRowSet
DiskRowSet 1: base data (name, pay, role) + Delta MS (150: pay=$1M)
DiskRowSet 2: base data (name, pay, role)

Read rows in DiskRowSet 2
Then, read rows in DiskRowSet 1

• Any row is in exactly one DiskRowSet: no need to merge cross-DRS!
• Updates are merged based on ordinal offset within DRS: array indexing, no string compares
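
A minimal sketch of this read path, assuming each DiskRowSet is a dictionary of column arrays plus per-column deltas keyed by row offset. The layout and the function names are hypothetical.

```python
def scan_column(base_column, deltas):
    """Apply updates to a column by ordinal offset: plain array indexing."""
    out = list(base_column)
    for rowid, new_value in deltas.items():
        out[rowid] = new_value          # no key comparison needed
    return out


def scan_tablet(rowsets, column):
    """Read one rowset after another; rows never span rowsets, so no merge."""
    for drs in rowsets:
        yield from scan_column(drs["base"][column], drs["deltas"].get(column, {}))


drs_2 = {"base": {"name": ["doug"], "pay": ["$1B"]}, "deltas": {}}
drs_1 = {
    "base": {"name": ["todd"], "pay": ["$1000"]},
    "deltas": {"pay": {0: "$1M"}},      # rowid 0 had its pay updated
}
pays = list(scan_tablet([drs_2, drs_1], "pay"))
names = list(scan_tablet([drs_2, drs_1], "name"))
```

Compared with the LSM merge, the scan only touches the requested column and never compares row keys.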


Kudu storage – Delta flushes

MemRowSet
DiskRowSet 1: base data (name, pay, role) + Delta MS
DiskRowSet 2: base data (name, pay, role)

Delta MS -> Flush -> REDO DeltaFile (0: pay=foo)

A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version


Kudu storage – Major delta compaction

DiskRowSet (pre-compaction): base data (name, pay, role) + Delta MS + three REDO DeltaFiles
• Many deltas accumulate: lots of delta application work on reads

DiskRowSet (post-compaction): base data (name, pay, role) + Delta MS + unmerged REDO deltas + UNDO deltas
• Merge updates for columns with high update percentage
• If a column has few updates, it doesn’t need to be rewritten: those deltas are maintained in a new DeltaFile
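
The per-column decision can be sketched like this. The 50% threshold, the data layout, and the idea that an UNDO delta simply stores the overwritten value are all assumptions made for illustration; the slides only name the pieces.

```python
def major_delta_compaction(base, redo, threshold=0.5):
    """base: {col: [values]}, redo: {col: {rowid: new value}}.

    Columns whose update fraction reaches the threshold are rewritten;
    the rest keep their REDO deltas unmerged.
    """
    new_base, unmerged_redo, undo = {}, {}, {}
    for col, values in base.items():
        col_deltas = redo.get(col, {})
        if values and len(col_deltas) / len(values) >= threshold:
            merged = list(values)
            undo[col] = {}
            for rowid, new_value in col_deltas.items():
                undo[col][rowid] = merged[rowid]   # remember the old value
                merged[rowid] = new_value
            new_base[col] = merged
        else:
            new_base[col] = values                 # column is not rewritten
            if col_deltas:
                unmerged_redo[col] = dict(col_deltas)
    return new_base, unmerged_redo, undo


base = {"pay": ["1", "2", "3", "4"], "role": ["eng", "eng", "eng", "eng"]}
redo = {"pay": {0: "10", 1: "20", 2: "30"}, "role": {3: "mgr"}}
new_base, unmerged_redo, undo = major_delta_compaction(base, redo)
```

Here `pay` (3 of 4 rows updated) is rewritten, while `role` (1 of 4) keeps its single REDO delta.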


Kudu storage – RowSet Compactions

Before compaction:
DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]

After compaction:
DRS 4 (32MB): [alice, bob, carl, joe]
DRS 5 (32MB): [jon, julie, linda, mary]
DRS 6 (32MB): [omar, zach, zeke, zoe]

Reorganize rows to avoid rowsets with overlapping key ranges
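
The reorganization on this slide amounts to merging the sorted rowsets and re-splitting them. A minimal sketch, with `compact_rowsets` as a hypothetical name and a row count standing in for the 32MB size limit:

```python
import heapq


def compact_rowsets(rowsets, rows_per_rowset):
    """Merge rowsets sorted by primary key, then cut into disjoint key ranges."""
    merged = list(heapq.merge(*rowsets))
    return [
        merged[i:i + rows_per_rowset]
        for i in range(0, len(merged), rows_per_rowset)
    ]


drs_1 = ["alice", "joe", "linda", "zach"]
drs_2 = ["bob", "jon", "mary", "zeke"]
drs_3 = ["carl", "julie", "omar", "zoe"]
compacted = compact_rowsets([drs_1, drs_2, drs_3], rows_per_rowset=4)
```

After compaction, a lookup for any key has to consult only one rowset instead of all three.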


Kudu storage – Compaction policy

• Solves an optimization problem (knapsack problem)
  • Minimize “height” of rowsets for the average key lookup
  • Bound on number of seeks for write or random-read
• Restrict total IO of any compaction to a budget (128MB)
  • No long compactions, ever
  • No “minor” vs “major” distinction
  • Always be compacting or flushing
  • Low IO priority maintenance threads
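
To make the knapsack framing concrete, here is a generic 0/1 knapsack sketch that picks rowsets under an IO budget. The `benefit` scores are invented placeholders for "height reduction"; how Kudu really scores candidates is not described in the slides, and `pick_rowsets` is a hypothetical name.

```python
def pick_rowsets(candidates, budget_mb=128):
    """candidates: list of (name, size_mb, benefit) with integer sizes.

    Returns (best total benefit, names chosen) without exceeding the budget.
    """
    # best[w] holds the best (benefit, names) achievable within w megabytes.
    best = [(0.0, ())] * (budget_mb + 1)
    for name, size_mb, benefit in candidates:
        # Walk capacities downward so each rowset is used at most once.
        for w in range(budget_mb, size_mb - 1, -1):
            candidate = best[w - size_mb][0] + benefit
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - size_mb][1] + (name,))
    return best[budget_mb]


candidates = [
    ("drs1", 32, 3.0),
    ("drs2", 32, 2.0),
    ("drs3", 64, 4.0),
    ("drs4", 96, 5.0),
]
total_benefit, chosen = pick_rowsets(candidates)
```

Capping the budget keeps every compaction short, which is what removes the "minor" versus "major" distinction.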



Kudu trade-offs

• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce “column groups” for applications where single-row access is more important
  • Especially slow at reading a row that has had many recent updates (e.g., YCSB “zipfian”)



Benchmarks



TPC-H (Analytics benchmark)

• 75 TS + 1 master cluster
  • 12 (spinning) disks each, enough RAM to fit dataset
  • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
  • TPC-H Scale Factor 100 (100GB)
• Example query:
    SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
    FROM customer, orders, lineitem, supplier, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_suppkey = s_suppkey
      AND c_nationkey = s_nationkey
      AND s_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
      AND o_orderdate >= date '1994-01-01'
      AND o_orderdate < '1995-01-01'
    GROUP BY n_name
    ORDER BY revenue desc;



- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)


What about Apache Phoenix?

• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)

[Chart: Time (sec), log scale from 0.01 to 10000, for Phoenix, Kudu, and Parquet across Load, TPCH Q1, COUNT(*), COUNT(*) WHERE…, and single-row lookup. Bar labels as extracted (mapping to series not recoverable): 2152, 21976, 131, 0.04, 1918, 13.2, 1.7, 0.7, 0.15, 155, 9.3, 1.4, 1.5, 1.37]


What about NoSQL-style random access? (YCSB)

• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops



What Kudu is not



Kudu is…

• NOT a SQL database
  • “BYO SQL”
• NOT a filesystem
  • data must have tabular structure
• NOT a replacement for HBase or HDFS
  • Cloudera continues to invest in those systems
  • Many use cases where they’re still more appropriate
• NOT an in-memory database
  • Very fast for memory-sized workloads, but can operate on larger data too!



Getting started



Getting started as a user

• http://getkudu.io
• [email protected]
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster



Getting started as a developer

• http://github.com/cloudera/kudu
  • All commits go here first
• Public gerrit: http://gerrit.cloudera.org
  • All code reviews happen here
• Public JIRA: http://issues.cloudera.org
  • Includes bugs going back to 2013. Come see our dirty laundry!
• [email protected]
• Apache 2.0 license open source
  • Contributions are welcome and encouraged!



Demo? (if we have time and the internet gods are willing)



http://getkudu.io/
@getkudu
