Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Uploaded by cloudera-inc. Category: Software.
© Cloudera, Inc. All rights reserved.
Todd Lipcon on behalf of the Kudu team
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco
Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Apache-licensed open source
• Beta now available
[Diagram labels: Columnar Store, Kudu]
Motivation and Goals
Why build Kudu?
Motivating Questions
• Are there user problems that we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware landscape?
Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously
Changing Hardware Landscape
• Spinning disk -> solid state storage
  • NAND flash: up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access
Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • “NoSQL” style scan/insert/update (Java client)
Kudu Usage
• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!
Use cases and architectures
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: Network threat detection
  ○ Workload: Inserts, scans, lookups
● Online Reporting
  ○ Examples: ODS
  ○ Workload: Inserts, updates, scans, lookups
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram labels: Incoming Data (Messaging System); HBase; Most Recent Partition; "Have we accumulated enough data?"; Reorganize HBase file into Parquet (wait for running operations to complete, then define new Impala partition referencing the newly written Parquet file); Parquet File; New Partition; Historic Data; Reporting Request; Impala on HDFS]
Real-Time Analytics in Hadoop with Kudu
Improvements:● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
Historical and Real-timeData
Incoming Data (Messaging
System)
Reporting Request
Storage in Kudu
How it works
Replication and distribution
Tables and Tablets
• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)
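A minimal sketch of the hash partitioning idea, assuming a deterministic hash over the timestamp column and the 100 buckets from the DISTRIBUTE BY example above. The function name `bucket_for` and the choice of CRC32 are illustrative assumptions; the slides do not say which hash function Kudu uses.

```python
import zlib

NUM_BUCKETS = 100  # mirrors "INTO 100 BUCKETS" from the slide


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a timestamp to a bucket (tablet) index.

    CRC32 is only a stand-in hash picked for determinism; it is an
    assumption, not Kudu's actual hash function.
    """
    key_bytes = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(key_bytes) % num_buckets


# Hypothetical rows shaped like the (host, metric, timestamp) primary key.
rows = [("host1", "cpu", 1445500000), ("host1", "cpu", 1445500001)]
for host, metric, ts in rows:
    print(host, metric, ts, "-> tablet", bucket_for(ts))
```

Because the bucket depends only on the hashed column, rows with adjacent timestamps tend to spread across tablets instead of piling onto one.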
Metadata
• Replicated master*
  • Acts as a tablet directory (“META” table)
  • Acts as a catalog (table schemas, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC perf:
    • 99th percentile: 68us, 99.99th percentile: 657us
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them
Raft consensus
[Diagram: Client; TS A hosting Tablet 1 (LEADER); TS B and TS C each hosting Tablet 1 (FOLLOWER); each replica has its own WAL]
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
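The commit rule in steps 4 to 6 can be sketched as a majority count. This is a simplified model, not Kudu's implementation: it assumes the leader's own WAL write always succeeds and ignores terms, log indexes, and retries. The function names are hypothetical.

```python
from typing import List


def majority(num_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def replicate_write(follower_acks: List[bool]) -> bool:
    """Return True if the write can be acknowledged to the client.

    follower_acks[i] is True when follower i wrote its WAL and replied
    with success. The leader's local WAL write counts as one ack
    (assumed to succeed in this sketch).
    """
    num_replicas = len(follower_acks) + 1
    acks = 1 + sum(follower_acks)  # leader + successful followers
    return acks >= majority(num_replicas)


print(replicate_write([True, False]))   # 3 replicas, one follower down -> True
print(replicate_write([False, False]))  # 3 replicas, both followers down -> False
```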
Fault tolerance
• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
  • New LEADER is elected from remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
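The numbers on this slide can be worked through directly. The sketch below uses only the figures stated above (1.5 second heartbeats, 3 missed heartbeats, (N-1)/2 failures); the function names are made up for illustration, and integer division is assumed for odd replica counts.

```python
HEARTBEAT_INTERVAL_SEC = 1.5  # from the slide
MISSED_HEARTBEATS = 3         # from the slide


def tolerated_failures(num_replicas: int) -> int:
    """(N-1)/2, rounded down, replicas may fail while a majority survives."""
    return (num_replicas - 1) // 2


def election_trigger_delay_sec() -> float:
    """Silence from the leader needed before followers call an election."""
    return HEARTBEAT_INTERVAL_SEC * MISSED_HEARTBEATS


for n in (3, 5):
    print(n, "replicas tolerate", tolerated_failures(n), "failure(s)")
print("election starts after about", election_trigger_delay_sec(), "seconds")
```

So 3 replicas survive one failure, 5 replicas survive two, and a dead leader is noticed after roughly 4.5 seconds of missed heartbeats.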
Fault tolerance (2)
• Permanent failure:
  • Leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • Master selects a new replica
  • Leader copies the data over to the new one, which joins as a new FOLLOWER
How it works
Storage engine internals
Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
  • Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  • No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
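To make the MVCC bullet concrete, here is a minimal sketch of a single cell whose updates are tagged with timestamps instead of being applied in place, so that a read "as of" a timestamp sees only older versions. The class `VersionedCell` and its integer timestamps are assumptions for illustration, not Kudu's on-disk representation.

```python
from typing import List, Optional, Tuple


class VersionedCell:
    """One column value of one row, kept as timestamped versions."""

    def __init__(self) -> None:
        self._versions: List[Tuple[int, str]] = []  # (timestamp, value)

    def write(self, timestamp: int, value: str) -> None:
        # Append a new version rather than overwriting the old one.
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda v: v[0])

    def read_as_of(self, timestamp: int) -> Optional[str]:
        """Return the newest value written at or before `timestamp`."""
        result = None
        for ts, value in self._versions:
            if ts > timestamp:
                break
            result = value
        return result


pay = VersionedCell()
pay.write(10, "$1000")
pay.write(20, "$1M")
print(pay.read_as_of(15))  # sees only the version written at t=10
print(pay.read_as_of(25))  # sees the update written at t=20
```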
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex.
  • Slower writes in exchange for faster reads (especially scans)
LSM Insert Path
[Diagram: INSERT -> MemStore -> flush -> HFile 1]
MemStore: Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“1”
HFile 1 (after flush): Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“1”
LSM Insert Path
[Diagram: INSERT -> MemStore -> flush -> HFile 2, alongside the existing HFile 1]
MemStore: Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“2”
HFile 2 (after flush): Row=r2 col=c1 val=“blah2”, Row=r2 col=c2 val=“2”
HFile 1: Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“1”
LSM Update path
[Diagram: UPDATE -> MemStore, with HFile 1 and HFile 2 on disk]
MemStore: Row=r2 col=c1 val=“newval”
HFile 1: Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“2”
HFile 2: Row=r2 col=c1 val=“v2”, Row=r2 col=c2 val=“5”
Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
LSM Read path
[Diagram: MemStore, HFile 1, and HFile 2 feed a merge step]
MemStore: Row=r2 col=c1 val=“newval”
HFile 1: Row=r1 col=c1 val=“blah”, Row=r1 col=c2 val=“2”
HFile 2: Row=r2 col=c1 val=“v2”, Row=r2 col=c2 val=“5”
Merge based on string row keys. CPU intensive!
Merged result: R1: c1=blah c2=2; R2: c1=newval c2=5; …
Must always read rowkeys
Any given row may exist across multiple HFiles: must always merge!
The more HFiles to merge, the slower it reads
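The read-time merge can be sketched with the exact cells from this slide. This is a simplification that assumes each store is a plain dictionary and that stores are visited newest first; a real LSM read performs a sorted multi-way merge over files, and the function name `lsm_read` is hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (row key, column name)


def lsm_read(stores_newest_first: List[Dict[Cell, str]]) -> Dict[str, Dict[str, str]]:
    """Merge every store into one view; the newest value of a cell wins."""
    merged: Dict[str, Dict[str, str]] = {}
    for store in stores_newest_first:
        for (row, col), val in store.items():
            # setdefault keeps the value from the newest store seen so far.
            merged.setdefault(row, {}).setdefault(col, val)
    return dict(sorted(merged.items()))  # ordered by string row key


memstore = {("r2", "c1"): "newval"}
hfile2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
hfile1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}

result = lsm_read([memstore, hfile2, hfile1])
print(result)
```

Every store has to be consulted for every row, which is the cost the slide calls out: more HFiles means more merge work per read.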
Kudu storage – Inserts and Flushes
[Diagram: INSERT(“todd”, “$1000”, “engineer”) -> MemRowSet -> flush -> DiskRowSet 1 (columns: name, pay, role)]
Kudu storage – Inserts and Flushes
[Diagram: INSERT(“doug”, “$1B”, “Hadoop man”) -> MemRowSet -> flush -> DiskRowSet 2 (columns: name, pay, role), alongside the existing DiskRowSet 1 (columns: name, pay, role)]
Kudu storage - Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS]
Each DiskRowSet has its own DeltaMemStore to accumulate updates
Kudu storage - Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS]
UPDATE set pay=“$1M” WHERE name=“todd”
Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
Search key column to find offset: rowid = 150
Delta MS entry for DiskRowSet 1: 150: col 1=$1M
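A rough sketch of the update path just described: ask each DiskRowSet's Bloom filter whether the key might be present, search the key column only when the answer is "maybe", and record the change in that rowset's delta store under the row's ordinal offset. All class and function names here (`TinyBloom`, `DiskRowSet`, `update`) and the SHA-256 based filter are illustrative assumptions, not Kudu's code.

```python
import bisect
import hashlib
from typing import Dict, Iterator, List, Optional


class TinyBloom:
    """Toy Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 256, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DiskRowSet:
    def __init__(self, sorted_keys: List[str]) -> None:
        self.keys = sorted_keys  # key column of the base data, sorted
        self.bloom = TinyBloom()
        for key in sorted_keys:
            self.bloom.add(key)
        self.delta_mem_store: Dict[int, Dict[str, str]] = {}

    def find_rowid(self, key: str) -> Optional[int]:
        if not self.bloom.might_contain(key):
            return None  # "Bloom says: no!" so skip the key search
        i = bisect.bisect_left(self.keys, key)  # "maybe": search key column
        return i if i < len(self.keys) and self.keys[i] == key else None


def update(rowsets: List[DiskRowSet], key: str, column: str, value: str) -> bool:
    for drs in rowsets:
        rowid = drs.find_rowid(key)
        if rowid is not None:
            drs.delta_mem_store.setdefault(rowid, {})[column] = value
            return True
    return False


drs1 = DiskRowSet(["alice", "todd"])
drs2 = DiskRowSet(["bob", "doug"])
print(update([drs2, drs1], "todd", "pay", "$1M"), drs1.delta_mem_store)
```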
Kudu storage – Read path
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS; one Delta MS holds 150: pay=$1M]
Read rows in DiskRowSet 2
Then, read rows in DiskRowSet 1
Any row is in exactly one DiskRowSet – no need to merge cross-DRS!
Updates are merged based on ordinal offset within DRS: array indexing, no string compares
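Applying deltas by ordinal offset can be sketched as plain list indexing. The sketch assumes one column of base data held as a list and a delta store keyed by row offset; the values other than "$1M" and "$1000" are made up, and `scan_column` is a hypothetical name.

```python
from typing import Dict, List


def scan_column(base_column: List[str], deltas: Dict[int, str]) -> List[str]:
    """Return the current values of one column of one DiskRowSet."""
    result = list(base_column)  # copy so the base data stays untouched
    for rowid, new_value in deltas.items():
        result[rowid] = new_value  # array indexing, no key comparison
    return result


pay_base = ["$500", "$1000", "$750"]  # hypothetical base data
pay_deltas = {1: "$1M"}               # update recorded at ordinal offset 1
print(scan_column(pay_base, pay_deltas))
```

No row keys are read or compared here, which is the contrast with the LSM read path shown earlier.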
Kudu storage – Delta flushes
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS; a Delta MS entry (0: pay=foo) is flushed to a REDO DeltaFile]
A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version
Kudu storage – Major delta compaction
[Diagram: DiskRowSet (pre-compaction) with base data (columns: name, pay, role), a Delta MS, and three REDO DeltaFiles; DiskRowSet (post-compaction) with base data (columns: name, pay, role), a Delta MS, unmerged REDO deltas, and UNDO deltas]
Many deltas accumulate: lots of delta application work on reads
Merge updates for columns with high update percentage
If a column has few updates, it doesn’t need to be re-written: those deltas are maintained in a new DeltaFile
Kudu storage – RowSet Compactions
Before compaction:
DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
After compaction:
DRS 4 (32MB): [alice, bob, carl, joe]
DRS 5 (32MB): [jon, julie, linda, mary]
DRS 6 (32MB): [omar, zach, zeke, zoe]
Reorganize rows to avoid rowsets with overlapping key ranges
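The reorganization on this slide amounts to merging sorted inputs and cutting the result into new, non-overlapping rowsets. The sketch below reproduces the slide's example with Python's `heapq.merge`; splitting by a fixed row count instead of a 32MB size target is a simplifying assumption, and `compact` is a hypothetical name.

```python
import heapq
from typing import List


def compact(rowsets: List[List[str]], rows_per_output: int) -> List[List[str]]:
    """Merge sorted rowsets and re-split them into disjoint key ranges."""
    merged = list(heapq.merge(*rowsets))  # inputs must each be sorted
    return [
        merged[i:i + rows_per_output]
        for i in range(0, len(merged), rows_per_output)
    ]


drs1 = ["alice", "joe", "linda", "zach"]
drs2 = ["bob", "jon", "mary", "zeke"]
drs3 = ["carl", "julie", "omar", "zoe"]

for rowset in compact([drs1, drs2, drs3], rows_per_output=4):
    print(rowset)
```

After compaction a key lookup only needs to consult the one rowset whose range covers the key, instead of all three.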
Kudu storage – Compaction policy
• Solves an optimization problem (knapsack problem)
  • Minimize “height” of rowsets for the average key lookup
  • Bound on number of seeks for write or random-read
• Restrict total IO of any compaction to a budget (128MB)
  • No long compactions, ever
  • No “minor” vs “major” distinction
  • Always be compacting or flushing
  • Low IO priority maintenance threads
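As a loose illustration of the knapsack framing, the sketch below picks the set of candidate rowsets with the highest total benefit whose combined size fits the 128MB IO budget. The benefit scores, candidate names, and the function `pick_rowsets` are invented for the example; the slides do not describe how Kudu actually scores rowsets.

```python
from typing import List, Tuple


def pick_rowsets(candidates: List[Tuple[str, int, float]], budget_mb: int) -> List[str]:
    """0/1 knapsack over (name, size_mb, benefit) candidates."""
    # best[b] = (total benefit, chosen names) using at most b MB of IO
    best: List[Tuple[float, List[str]]] = [(0.0, []) for _ in range(budget_mb + 1)]
    for name, size, benefit in candidates:
        # Iterate budgets downward so each candidate is used at most once.
        for b in range(budget_mb, size - 1, -1):
            with_item = best[b - size][0] + benefit
            if with_item > best[b][0]:
                best[b] = (with_item, best[b - size][1] + [name])
    return best[budget_mb][1]


# Hypothetical candidates: (rowset name, size in MB, benefit score).
candidates = [("a", 64, 3.0), ("b", 64, 2.5), ("c", 96, 4.0), ("d", 32, 1.0)]
print(pick_rowsets(candidates, budget_mb=128))
```

Capping every compaction at a fixed budget is what allows the "no long compactions, ever" guarantee: the work per pass is bounded regardless of table size.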
Kudu trade-offs
• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce “column groups” for applications where single-row access is more important
  • Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)
Benchmarks
TPC-H (Analytics benchmark)
• 75 TS + 1 master cluster
  • 12 (spinning) disks each, enough RAM to fit dataset
  • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
  • TPC-H Scale Factor 100 (100GB)
• Example query:
  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
  FROM customer, orders, lineitem, supplier, nation, region
  WHERE c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01'
    AND o_orderdate < '1995-01-01'
  GROUP BY n_name
  ORDER BY revenue desc;
- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)
What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
[Bar chart, log-scale y-axis: Time (sec), 0.01 to 10000. Series: Phoenix, Kudu, Parquet. Groups: Load, TPCH Q1, COUNT(*), COUNT(*) WHERE…, single-row lookup. Data labels as extracted: 2152, 21976, 131, 0.04, 1918, 13.2, 1.7, 0.7, 0.15, 155, 9.3, 1.4, 1.5, 1.37]
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops
What Kudu is not
Kudu is…
• NOT a SQL database
  • “BYO SQL”
• NOT a filesystem
  • data must have tabular structure
• NOT a replacement for HBase or HDFS
  • Cloudera continues to invest in those systems
  • Many use cases where they’re still more appropriate
• NOT an in-memory database
  • Very fast for memory-sized workloads, but can operate on larger data too!
Getting started
Getting started as a user
• http://getkudu.io
• [email protected]
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster
Getting started as a developer
• http://github.com/cloudera/kudu
  • All commits go here first
• Public gerrit: http://gerrit.cloudera.org
  • All code reviews happen here
• Public JIRA: http://issues.cloudera.org
  • Includes bugs going back to 2013. Come see our dirty laundry!
• Apache 2.0 license open source
  • Contributions are welcome and encouraged!
Demo?
(if we have time and internet gods willing)
http://getkudu.io/
@getkudu