Are you Kudu-ing me?!
Uploaded by: przemek-maciolek
Category: Data & Analytics
Transcript of Are you Kudu-ing me?!
These folks must all be wrong, mustn't they?
uuid         first_name  last_name  dob
ee-c6-47-2c  John        Connor     Feb 28th, 1985
84-ee-ff-d5  Sarah       Connor     May 11th, 1965
57-4f-d9-d8  Kyle        Reese      Mar 1st, 2002
SELECT MIN(dob) FROM characters WHERE last_name = 'Connor'
The same data, stored column by column:
uuid:        ee-c6-47-2c | 84-ee-ff-d5 | 57-4f-d9-d8
last_name:   Connor | Connor | Reese
first_name:  John | Sarah | Kyle
dob:         Feb 28th, 1985 | May 11th, 1965 | Mar 1st, 2002
SELECT MIN(dob) FROM characters WHERE last_name = 'Connor'
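The two layouts above can be compared with a small Python sketch. It is only an illustration, not how any particular engine is implemented: the function names are hypothetical and the in-memory lists stand in for on-disk storage. The point it tries to show is that the query only needs `last_name` and `dob`, so a columnar layout can skip `uuid` and `first_name` entirely.

```python
from datetime import date

# Row-oriented layout: one record per character (values from the slide).
rows = [
    {"uuid": "ee-c6-47-2c", "first_name": "John", "last_name": "Connor", "dob": date(1985, 2, 28)},
    {"uuid": "84-ee-ff-d5", "first_name": "Sarah", "last_name": "Connor", "dob": date(1965, 5, 11)},
    {"uuid": "57-4f-d9-d8", "first_name": "Kyle", "last_name": "Reese", "dob": date(2002, 3, 1)},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def min_dob_row_store(rows, last_name):
    # Every full record is visited, including uuid and first_name.
    return min(r["dob"] for r in rows if r["last_name"] == last_name)


def min_dob_column_store(columns, last_name):
    # Only the two columns named in the query are read.
    matches = [i for i, value in enumerate(columns["last_name"]) if value == last_name]
    return min(columns["dob"][i] for i in matches)


print(min_dob_row_store(rows, "Connor"))
print(min_dob_column_store(columns, "Connor"))
```

Both functions return the same date; the difference is how much data each one has to touch to get there.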
What’s the problem with Apache Parquet then?
Ever implemented Lambda Architecture?
last_name  first_name  movie         actor                  actor_age
Connor     John        Terminator 2  Edward Furlong         14
Connor     John        Terminator 2  Michael Edwards        47
Connor     Sarah       Terminator    Linda Hamilton         28
Connor     Sarah       Terminator 2  Linda Hamilton         35
Reese      Kyle        Terminator 2  Michael Biehn          35
T-800                  Terminator    Arnold Schwarzenegger  37
CREATE TABLE characters (
  last_name STRING,
  first_name STRING,
  movie STRING,
  actor STRING,
  actor_age INT
)
DISTRIBUTE BY HASH (last_name, first_name) INTO 4 BUCKETS
TBLPROPERTIES (
  'kudu.key_columns' = 'last_name, first_name, movie, actor'
)
Somewhere between BigTable/HBase range partitioning and Cassandra’s hash partitioning.
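As a rough illustration of `DISTRIBUTE BY HASH (last_name, first_name) INTO 4 BUCKETS`, the sketch below assigns each row to one of four buckets based on its hash columns. The hash function is an assumption: CRC32 is used here only because it is deterministic and in the Python standard library, and it is not the hash Kudu uses. `bucket_for` is a hypothetical helper name.

```python
import zlib

NUM_BUCKETS = 4  # mirrors "INTO 4 BUCKETS" in the CREATE TABLE statement


def bucket_for(last_name, first_name, num_buckets=NUM_BUCKETS):
    # Stand-in hash (CRC32), chosen for illustration only; not Kudu's hash.
    key = f"{last_name}\x00{first_name}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


# (last_name, first_name, movie, actor, actor_age), values from the slide.
rows = [
    ("Connor", "John", "Terminator 2", "Edward Furlong", 14),
    ("Connor", "John", "Terminator 2", "Michael Edwards", 47),
    ("Connor", "Sarah", "Terminator", "Linda Hamilton", 28),
    ("Connor", "Sarah", "Terminator 2", "Linda Hamilton", 35),
    ("Reese", "Kyle", "Terminator 2", "Michael Biehn", 35),
    ("T-800", "", "Terminator", "Arnold Schwarzenegger", 37),
]

buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_for(row[0], row[1])].append(row)

for bucket_id, members in buckets.items():
    print(bucket_id, [(r[0], r[1], r[2]) for r in members])
```

Rows that share the same `(last_name, first_name)` always land in the same bucket, which is the property the slides rely on; which bucket that is depends entirely on the hash function chosen.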
Tablet 1
last_name:   Connor | Connor | Reese
first_name:  John | John | Kyle
movie:       Terminator 2 | Terminator 2 | Terminator 2
actor:       Edward Furlong | Michael Edwards | Michael Biehn
actor_age:   14 | 47 | 35

Tablet 2
last_name:   Connor | Connor
first_name:  Sarah | Sarah
movie:       Terminator | Terminator 2
actor:       Linda Hamilton | Linda Hamilton
actor_age:   28 | 35

Tablet 3
last_name:   T-800
first_name:  (empty)
movie:       Terminator
actor:       Arnold Schwarzenegger
actor_age:   37
INSERT INTO characters (last_name, first_name, movie, actor, actor_age)
VALUES ('Connor', 'John', 'Terminator Genisys', 'Jason Clarke', 36)
Tablet 1
last_name:   Connor | Connor | Connor | Reese
first_name:  John | John | John | Kyle
movie:       Terminator 2 | Terminator 2 | Terminator Genisys | Terminator 2
actor:       Edward Furlong | Michael Edwards | Jason Clarke | Michael Biehn
actor_age:   14 | 47 | 36 | 35

Tablet 2
last_name:   Connor | Connor
first_name:  Sarah | Sarah
movie:       Terminator | Terminator 2
actor:       Linda Hamilton | Linda Hamilton
actor_age:   28 | 35

Tablet 3
last_name:   T-800
first_name:  (empty)
movie:       Terminator
actor:       Arnold Schwarzenegger
actor_age:   37
Delta
SELECT MAX(actor_age) FROM characters WHERE last_name = 'Connor'
MPP FTW
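One way to read "MPP" for the `last_name = 'Connor'` query: every tablet computes a partial maximum over its own rows in parallel, and the partial results are then combined. The sketch below models that with a thread pool. It is a toy under stated assumptions: `partial_max` and `max_actor_age` are hypothetical names, the three lists stand in for tablets, and the empty `first_name` of the T-800 row is represented as an empty string.

```python
from concurrent.futures import ThreadPoolExecutor

# Tablet contents after the INSERT: (last_name, first_name, movie, actor, actor_age).
tablets = [
    [
        ("Connor", "John", "Terminator 2", "Edward Furlong", 14),
        ("Connor", "John", "Terminator 2", "Michael Edwards", 47),
        ("Connor", "John", "Terminator Genisys", "Jason Clarke", 36),
        ("Reese", "Kyle", "Terminator 2", "Michael Biehn", 35),
    ],
    [
        ("Connor", "Sarah", "Terminator", "Linda Hamilton", 28),
        ("Connor", "Sarah", "Terminator 2", "Linda Hamilton", 35),
    ],
    [
        ("T-800", "", "Terminator", "Arnold Schwarzenegger", 37),
    ],
]


def partial_max(tablet, last_name):
    # Scan phase: each tablet filters and aggregates its own rows.
    ages = [age for (ln, _fn, _movie, _actor, age) in tablet if ln == last_name]
    return max(ages) if ages else None


def max_actor_age(tablets, last_name):
    # Scatter the scan over all tablets, then gather the partial results.
    with ThreadPoolExecutor(max_workers=len(tablets)) as pool:
        partials = list(pool.map(lambda t: partial_max(t, last_name), tablets))
    found = [p for p in partials if p is not None]
    return max(found) if found else None


print(max_actor_age(tablets, "Connor"))
```

Because `MAX` combines cleanly (the maximum of per-tablet maxima is the overall maximum), only one small value per tablet has to travel to the coordinator.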
SELECT MAX(actor_age) FROM characters WHERE movie = 'Terminator 2'
Bloom filters FTW
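The slide does not say where the Bloom filters sit, so the following is only a sketch of the general technique, under an explicit assumption: one small filter per tablet over the `movie` column, consulted before scanning. A Bloom filter can answer "definitely not here" or "possibly here", so a negative answer lets the scan skip a tablet, while a positive answer still requires checking the real rows. `BloomFilter` and `max_age_for_movie` are hypothetical names, and the sizes (256 bits, 3 hashes) are arbitrary.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# (movie, actor_age) per tablet, after the INSERT.
tablets = [
    [("Terminator 2", 14), ("Terminator 2", 47), ("Terminator Genisys", 36), ("Terminator 2", 35)],
    [("Terminator", 28), ("Terminator 2", 35)],
    [("Terminator", 37)],
]

filters = []
for tablet in tablets:
    bloom = BloomFilter()
    for movie, _age in tablet:
        bloom.add(movie)
    filters.append(bloom)


def max_age_for_movie(movie):
    best = None
    for tablet, bloom in zip(tablets, filters):
        if not bloom.might_contain(movie):
            continue  # skip tablets that certainly lack this movie
        for row_movie, age in tablet:  # a hit still needs a real check
            if row_movie == movie and (best is None or age > best):
                best = age
    return best


print(max_age_for_movie("Terminator 2"))
```

False positives only cost an unnecessary scan; they never change the answer, because matching rows are still verified against the actual column values.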
[Diagram: cluster architecture. A Master with Master replicas, and Tablet Servers 1, 2 and 3 hosting tablet replicas, some of them marked Leader.]
Typically 10-100 tablets per machine.
MemRowSet
• Col A, Col B, …
• In-memory concurrent B-tree; keeps all recently inserted rows
DiskRowSet (shown twice in the diagram)
• Base data: Col A, Col B, …
  Each column is written separately in a single contiguous block of data
• [Delta store]
  Deltas organized by rows (until compaction happens)
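The storage layout above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not Kudu's implementation: a plain dict stands in for the concurrent B-tree, lists stand in for column blocks, and the class and method names (`Tablet`, `flush`, `compact`, and so on) are hypothetical. The updated age of 29 is made-up illustration data.

```python
class DiskRowSet:
    def __init__(self, sorted_rows, column_names):
        self.keys = [key for key, _values in sorted_rows]
        # Base data: each column is kept as its own contiguous list.
        self.base = {c: [values[c] for _key, values in sorted_rows] for c in column_names}
        # Delta store: organized by row, row index -> {column: new value}.
        self.deltas = {}

    def update(self, key, column, value):
        # Updates do not touch the base data; they are recorded as deltas.
        self.deltas.setdefault(self.keys.index(key), {})[column] = value

    def read(self, key, column):
        # Reads overlay the row's deltas on top of the base value.
        idx = self.keys.index(key)
        return self.deltas.get(idx, {}).get(column, self.base[column][idx])

    def compact(self):
        # Compaction folds the deltas back into the base columns.
        for idx, changes in self.deltas.items():
            for column, value in changes.items():
                self.base[column][idx] = value
        self.deltas = {}


class Tablet:
    def __init__(self, column_names):
        self.column_names = column_names
        self.mem_row_set = {}  # stand-in for the in-memory B-tree
        self.disk_row_sets = []

    def insert(self, key, values):
        # Recently inserted rows live in memory first.
        self.mem_row_set[key] = dict(values)

    def flush(self):
        # A flush turns the MemRowSet into a new columnar DiskRowSet.
        if self.mem_row_set:
            ordered = sorted(self.mem_row_set.items(), key=lambda kv: kv[0])
            self.disk_row_sets.append(DiskRowSet(ordered, self.column_names))
            self.mem_row_set = {}


tablet = Tablet(["actor", "actor_age"])
key = ("Connor", "Sarah", "Terminator")
tablet.insert(key, {"actor": "Linda Hamilton", "actor_age": 28})
tablet.flush()

row_set = tablet.disk_row_sets[0]
row_set.update(key, "actor_age", 29)  # hypothetical update, for illustration
print(row_set.read(key, "actor_age"), row_set.base["actor_age"])
row_set.compact()
print(row_set.read(key, "actor_age"), row_set.base["actor_age"])
```

Before compaction the base column still holds the old value and the read is answered from the delta; after compaction the delta store is empty and the base column holds the new value.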
Long story short:
- 30% faster than Parquet 1.0 (TPC-H)
- 16-187 times faster than Phoenix or HBase (TPC-H again)
- hundreds of thousands of rows inserted per second on a single tablet server
TPC-H test, scale factor 100, RF 3:
- 75 nodes, each with 64 GB RAM, 12 spinning disks, 2x 6-core Xeon
- Expansion of 62 GB of data (post-replication, compactions done):
  - 570 GB in HBase (9.2x)
  - 227 GB in Kudu (3.7x)
http://getkudu.io/kudu.pdf