Column-Stores vs. Row-Stores: How Different Are They...

26
Daniel Abadi -- Yale University Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008

Transcript of Column-Stores vs. Row-Stores: How Different Are They...

Page 1: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Column-Stores vs. Row-Stores: How Different Are They Really?

Daniel Abadi (Yale), Samuel Madden (MIT),

Nabil Hachem (AvantGarde Consulting)June 12th, 2008

Page 2: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Row vs. Column-Stores

StreetAddressPhone #E-mail

FirstName

LastName

Last Name

FirstName E-mail Phone #

StreetAddress

Row-Store Column-Store

− Might read in unnecessary data

+ Only need to read in relevant data

+ Easy to add a new record

− Tuple writes might require multiple seeks

Page 3: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Column-Stores

• Really good for read-mostly data warehouses� Lot’s of column scans and aggregations� Writes tend to be in batch� [CK85], [SAB+05], [ZBN+05], [HLA+06],

[SBC+07] all verify this� Top 3 in TPC-H rankings (Exasol, ParAccel,

and Kickfire) are column-stores� Factor of 5 faster on performance� Factor of 2 superior on price/performance

Page 4: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Data Warehouse DBMS Software

• $4.5 billion industry (out of total $16 billion DBMS software industry)

• Growing 10% annually

Page 5: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Momentum

• Right solution for growing market � $$$$• Vertica, ParAccel, Kickfire, Calpont,

Infobright, and Exasol new entrants• Sybase IQ’s profits rapidly increasing• Yahoo’s world largest (multi-petabyte)

data warehouse is a column-store (from Mahat Technologies acquisition)

Page 6: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Paper Looks At Key Question

• How much of the buzz around column-stores just marketing hype?� Do you really need to buy Sybase IQ or

Vertica?� How far will your current row-store take you?

� Can you get column-store performance from a row-store?

� Can you simulate a column-store in a row-store?

Page 7: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Paper Methodology

• Comparing row-store vs. column-store is dangerous/borderline meaningless

• Instead, compare row-store vs. row-store and column-store vs. column-store� Simulate a column-store inside of a row-store� Remove column-oriented features from

column-store until it behaves like a row-store

Page 8: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Simulate Column-Store Inside Row-Store

StreetAddressPhone #E-mail

FirstName

LastName

Last Name

FirstName E-mail

1

2

3

1

2

3

1

2

3

Option A: Vertical Partitioning

Option B:Index Every Column

Last Name Index First Name Index

Page 9: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Experiments

• Star Schema Benchmark (SSBM)� Fact table contains 17 columns and 60,000,000 rows� 4 dimension tables, biggest one has 80,000 rows� Queries perform 2-4 joins between fact table and

dimension tables, aggregate 1-2 columns from fact table

� [OOC06]

• Implemented by professional DBA� Original row-store plus 2 column-store simulations on

same row-store product

Page 10: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

SSBM Averages

0.0

50.0

100.0

150.0

200.0

250.0

Time (seconds)

Average 25.7 79.9 221.2

Normal Row-StoreVertically Partitioned

Row-Store

Row-Store With All

Indexes

Page 11: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

What’s Going On?

• Vertically Partitioned Case� Tuple Sizes� Horizontal Partitioning

• All Indexes Case� Tuple Reconstruction

Page 12: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Tuple Size

1

2

3

ColumnData

TID

1

2

3

TID ColumnData

1

2

3

TID ColumnData

TupleHeader

•Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns

•Complete fact table takes up ~4 GB (compressed)

•Vertically partitioned tables take up 0.7-1.1 GB (compressed)

Page 13: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Horizontal Partitioning

• Fact table horizontally partitioned on year� Year is an element of the ‘Date’ dimension

table� Most queries in SSBM have a predicate on

year� Since vertically partitioned tables do not

contain the ‘Date’ foreign key, row-store could not similarly partition them

Page 14: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

What’s Going On?

• Vertically Partitioned Case� Tuple Sizes� Horizontal Partitioning

• All Indexes Case� Tuple Construction

Page 15: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Tuple Construction

• Common type of query:� SELECT store_name, SUM(revenue)

FROM Facts, StoresWHERE fact.store_id = stores.store_id

AND stores.country = “Canada”GROUP BY store_name

Page 16: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Tuple Construction

• Result of lower part of query plan is a set of TIDs that passed all predicates

• Need to extract SELECT attributes at these TIDs� BUT: index maps value to TID� You really want to map TID to value (i.e., a

vertical partition)

�� Tuple construction is SLOW

Page 17: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

So….

• All indexes approach is a poor way to simulate a column-store

• Problems with vertical partitioning are NOT fundamental� Store tuple header in a separate partition� Allow virtual TIDs� Allow HP using a foreign key on a different VP

• So can row-stores simulate column-stores?

Page 18: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Row-Store vs. Column-Store

0.0

5.0

10.0

15.0

20.0

25.0

30.0

Time (seconds)

Average 25.7 11.7 4.4

Row-Store Row-Store (M V) C-Store

Page 19: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Row-Store vs. Column-Store

0.0

5.0

10.0

15.0

20.0

25.0

30.0

Time (seconds)

Average 25.7 11.7 4.4

Row-Store Row-Store (M V) C-Store

Page 20: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Column-Store Experiments

• Start with column-store (C-Store)• Remove column-store-specific

performance optimizations• End with column-store with a row-oriented

query executer

Page 21: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Compression

• Higher data value locality in column-stores� Better ratio � reduced I/O

• Can use schemes like run-length encoding� Easy to operate on directly

for improved performance ([AMF06])

Q1Q1Q1Q1Q1Q1Q1

Q2Q2Q2Q2

Quarter

(Q1, 1, 300)

Quarter

(Q2, 301, 350)

(Q3, 651, 500)

(Q4, 1151, 600)

Page 22: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

• Early Materialization: create rows first. But:� Poor memory bandwidth

utilization� Lose opportunity for

vectorized operation

2131

2333

7134280

Construct

2

3

3

3

7

13

42

80

Select + Aggregate

2

1

3

1

4

4

4

4

prodID storeIDcustID price

QUERY:

SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND

(storeID = 1) ANDGROUP BY custID

Early vs. Late Materialization

4444

Page 23: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Other Column-Store Optimizations

• Invisible join� Column-store specific join� Optimizations for star schemas

� Similar to a semi-join

• Block Processing

Page 24: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Simplified Version of Results

0.0

10.0

20.0

30.0

40.0

50.0

Time (seconds)

Average 4.4 14.9 40.7

Original C-St oreC-St ore, No

Compression

C-St ore, Early

Mat erializat ion

Page 25: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Conclusion

• Might be possible to simulate a row-store in a column-store, BUT:� Need better support for vertical partitioning at

the storage layer� Need support for column-specific

optimizations at the executer level

• Working with HP Labs to find out

Page 26: Column-Stores vs. Row-Stores: How Different Are They Really?abadi/talks/abadi-sigmod08-slides.pdf · Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale),

Daniel Abadi -- Yale University

Come Join the Yale DB Group!