Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)


Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies

George Huo, Google, Inc.

With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)

Two observations

1. Correlations abound

Attributes tend to encode related info (these are soft functional dependencies)

• Geographic: {zip code, city, state, long/latitude}, e.g. 02116, Boston, MA, 71° 05' W

• {manufacturer, model, year}, e.g. Honda, Civic Hybrid, 2007

• {shipdate, receiptdate}

2. Secondary indexes are often useless for range and aggregation queries

SELECT * FROM lineitem WHERE orderdate = '2009-08-26'

• Clustered access pattern: table sorted by orderdate (clustered index on orderdate), one seek

• Unclustered access pattern: table sorted by order_id (secondary index on orderdate), many seeks

How can we improve the access pattern of a secondary index?

Our contribution: Exploiting correlations to improve secondary index performance

lineitem access pattern

Clustered by primary key (uncorrelated)

SELECT * FROM lineitem WHERE orderdate = '2007-01-03'

Clustered by shipdate (correlated)

Correlation determines index performance

[Figure: Query Runtime (s), 0 to 180, versus Number of Clustered Fragments (1, 4950, 34065; fewer fragments = more correlation) for different sort orders, ranging from very correlated to poorly correlated. Series: Real Runtime, DB Cost Estimate.]

Our system:

1. Cost model with correlations

2. Correlation maps

3. Multi-attribute keys

4. Evaluation

1. Cost model with correlations

SELECT * FROM lineitem WHERE receiptdate IN (i, j)

[Figure: receiptdate values i and j (unclustered) against shipdate (clustered)]

c_per_u: average number of clustered attribute values per unclustered attribute value

2 lookups × 3 c_per_u × (10ms × 3 levels + 1ms × 3 pages per shipdate) ≈ 0.20s
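One reading of the example terms on this slide can be sketched in a few lines of Python: the number of clustered values touched (lookups times c_per_u) is multiplied by the cost of one clustered B+Tree descent plus a short sequential read. The function name, the parameter names, and the exact way the terms combine are assumptions made for illustration, not the cost model as stated in the talk.

```python
def cm_lookup_cost_ms(n_lookups, c_per_u, seek_ms, btree_levels,
                      page_read_ms, pages_per_c):
    """Estimated cost in ms of a CM-driven lookup (hypothetical helper)."""
    # Each unclustered value maps to c_per_u clustered values on average.
    clustered_values = n_lookups * c_per_u
    # Assumed cost per clustered value: one B+Tree descent (one seek per
    # level) followed by a sequential read of that value's pages.
    cost_per_value = seek_ms * btree_levels + page_read_ms * pages_per_c
    return clustered_values * cost_per_value


# Example terms from the slide: 2 lookups, c_per_u = 3, 10 ms seeks,
# 3 levels, 1 ms per page, 3 pages per shipdate.
example_ms = cm_lookup_cost_ms(2, 3, 10, 3, 1, 3)
print(example_ms)
```

Under this reading the slide's example terms come to 198 ms. A lower c_per_u (stronger correlation) shrinks the estimate proportionally, which is the point of the model.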

Correlation Map Clustered B+Tree

2. Correlation Maps

CREATE TABLE Salaries(
  State string PRIMARY_KEY,
  City string,
  Salary integer);

SELECT * FROM Salaries WHERE city = 'Boston';

Clustered Attribute: State

Unclustered Attribute: City

CMs: Usage

• Populated using initial scan of the table

• Insertions/deletions: keep a co-occurrence count for each (u, c) pair

• Physically stored as a B+Tree in the DB
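The maintenance scheme in these bullets can be illustrated with a small in-memory sketch: one co-occurrence count per (u, c) pair, incremented on insert and decremented on delete, with a pair dropped once its count reaches zero. The class name `CorrelationMap`, its method names, and the sample rows are hypothetical, and a Python dict stands in for the B+Tree the slide says is used for physical storage.

```python
from collections import defaultdict


class CorrelationMap:
    """Maps an unclustered value u to the clustered values c it co-occurs with."""

    def __init__(self):
        # u -> {c -> number of tuples containing both u and c}
        self.counts = defaultdict(dict)

    @classmethod
    def from_scan(cls, rows):
        """Populate the map with one pass over (u, c) pairs from the table."""
        cm = cls()
        for u, c in rows:
            cm.insert(u, c)
        return cm

    def insert(self, u, c):
        per_u = self.counts[u]
        per_u[c] = per_u.get(c, 0) + 1

    def delete(self, u, c):
        per_u = self.counts.get(u)
        if not per_u or c not in per_u:
            raise KeyError((u, c))
        per_u[c] -= 1
        if per_u[c] == 0:
            # Last tuple with this pair is gone, so the mapping goes too.
            del per_u[c]
            if not per_u:
                del self.counts[u]

    def lookup(self, u):
        """Clustered values to visit for unclustered value u."""
        return sorted(self.counts.get(u, {}))


# Made-up (City, State) rows for illustration.
rows = [("Boston", "MA"), ("Boston", "MA"),
        ("Springfield", "MA"), ("Springfield", "OH")]
cm = CorrelationMap.from_scan(rows)
before = cm.lookup("Springfield")
cm.delete("Springfield", "OH")
after = cm.lookup("Springfield")
```

The counts are what make deletes safe: removing one of the two Boston rows leaves the Boston to MA mapping in place, while removing the only Springfield row in OH drops that pair.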

CMs: Compression

• CMs typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table)

• Achieves compression by mapping values → values, not values → tuples

• Possible to build many CMs; dedicated CM per query

• Improve performance by reducing buffer pool pressure
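To make the values → values point concrete, the sketch below counts entries for a toy table: a secondary index needs one entry per tuple, while a CM needs one entry per distinct (unclustered, clustered) pair. The helper name and the toy rows are hypothetical, and entry counts are only a rough proxy for on-disk size.

```python
def index_entry_counts(rows):
    """Return (secondary index entries, CM entries) for (u, c) rows."""
    rows = list(rows)
    # A secondary index maps each value to every tuple that contains it.
    secondary_entries = len(rows)
    # A CM maps each value only to the distinct clustered values it co-occurs with.
    cm_entries = len(set(rows))
    return secondary_entries, cm_entries


# Made-up table: 1,500 tuples but only two distinct (City, State) pairs.
toy_rows = [("Boston", "MA")] * 1000 + [("Boise", "ID")] * 500
entry_counts = index_entry_counts(toy_rows)
```

The ratio grows with table size as long as the number of distinct pairs stays small, which is the case the slide's size figures describe.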

3. Multi-attribute keys

• Combined attributes may predict the clustered key better than either attr alone

• (longitude, latitude) → zip_code

• Challenges:

– Finding these is non-trivial

– Combining attributes leads to many-valued keys, which in turn leads to large CMs
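A small sketch can show why a combined key may predict the clustered key better: it computes c_per_u (average number of distinct clustered values per unclustered key) for single attributes and for their combination. The function name `c_per_u`, the column names, and the four toy rows are assumptions chosen for illustration.

```python
from collections import defaultdict


def c_per_u(rows, u_cols, c_col):
    """Average number of distinct c_col values per distinct u_cols key."""
    clustered_values = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in u_cols)
        clustered_values[key].add(row[c_col])
    if not clustered_values:
        return 0.0
    total = sum(len(values) for values in clustered_values.values())
    return total / len(clustered_values)


# Toy grid: each (longitude, latitude) cell has its own zip code.
grid = [
    {"longitude": 1, "latitude": 1, "zip_code": "A"},
    {"longitude": 1, "latitude": 2, "zip_code": "B"},
    {"longitude": 2, "latitude": 1, "zip_code": "C"},
    {"longitude": 2, "latitude": 2, "zip_code": "D"},
]
lon_only = c_per_u(grid, ["longitude"], "zip_code")
lat_only = c_per_u(grid, ["latitude"], "zip_code")
combined = c_per_u(grid, ["longitude", "latitude"], "zip_code")
```

In this toy data each single attribute maps to two zip codes on average, while the combined key maps to exactly one. The combined key also has twice as many distinct values, which is the size problem the second challenge describes.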

CM Advisor

• The CM Advisor considers all possible attribute combinations for clustered and unclustered keys given a training set of queries

• Buckets: collapse a range of key values into one

• Bucketing clustered keys

– Leads to longer sequential disk reads

– Boston:MA versus Boston:MA,MI

• Bucketing unclustered keys

– Merging two unclustered buckets may increase disk seeks

– Boston:MA versus Boise,Boston:ID,MA
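The two bucketing examples on this slide can be reproduced with a short sketch that rewrites a CM under a bucketing function for each side. The helper `bucket_cm` and the bucket functions are hypothetical; the CM is modelled as a plain dict from unclustered value to a set of clustered values.

```python
def bucket_cm(cm, u_bucket, c_bucket):
    """Apply bucketing functions to both sides of a CM (dict: u -> set of c)."""
    bucketed = {}
    for u, clustered in cm.items():
        target = bucketed.setdefault(u_bucket(u), set())
        target.update(c_bucket(c) for c in clustered)
    return bucketed


def identity(value):
    return value


def state_bucket(state):
    # Hypothetical clustered bucket that stores MA and MI together.
    return "MA,MI" if state in ("MA", "MI") else state


def city_bucket(city):
    # Hypothetical unclustered bucket that stores Boise and Boston together.
    return "Boise,Boston" if city in ("Boise", "Boston") else city


base_cm = {"Boise": {"ID"}, "Boston": {"MA"}}
clustered_bucketed = bucket_cm(base_cm, identity, state_bucket)
unclustered_bucketed = bucket_cm(base_cm, city_bucket, identity)
```

Bucketing the clustered side keeps one target per city but widens it (a longer sequential read over MA and MI). Bucketing the unclustered side merges two entries into one that points at both ID and MA, so a lookup for Boston alone now visits an extra clustered value.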


4. Experimental evaluation

SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH)

SELECT … WHERE City IN (Boston, Springf)
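The two queries above appear to show the same City lookup with and without an added predicate on the clustered attribute. A minimal sketch of deriving that extra predicate from a CM follows. The function names and the CM contents are hypothetical (chosen so the result matches the State list shown), "Springf" is copied as it appears on the slide, and the string building is for display only, not safe SQL construction.

```python
def clustered_values_for(cm, unclustered_values):
    """Union of the clustered values the CM lists for the given values."""
    found = set()
    for value in unclustered_values:
        found.update(cm.get(value, ()))
    return sorted(found)


def add_clustered_predicate(cities, cm):
    """Return the WHERE clause with a State predicate derived from the CM."""
    states = clustered_values_for(cm, cities)
    city_list = ", ".join(cities)
    state_list = ", ".join(states)
    # The original City predicate stays in place because the CM only narrows
    # the clustered ranges to read; non-matching tuples are still filtered.
    return f"City IN ({city_list}) AND State IN ({state_list})"


# Hypothetical CM contents for the two cities in the query.
city_to_states = {"Boston": {"MA", "NH"}, "Springf": {"MA", "OH"}}
where_clause = add_clustered_predicate(["Boston", "Springf"], city_to_states)
```

Because State is the clustered attribute, the added predicate lets the clustered index restrict the scan to a few contiguous ranges.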

Benefit of correlation

SELECT * FROM lineitem WHERE shipdate IN (2009-01-03, …)

[Figure: Elapsed (s), 0 to 80, versus Number of Shipdate Lookups (1, 2, 4, 6, 8, 10, 20). Series: Full Table Scan, B+Tree (Uncorrelated), B+Tree (Correlated), CostModel (correlated).]

eBay category data

• Hierarchies of products in categories

• antiques→architectural→hardware→locks & keys

• 24,000 categories up to 6 levels deep

• Clustered by catID

• Correlation: catID → price

• Generated unique ItemIDs for 43 million rows (3.5GB)

Maintenance costs: CM vs B+Tree

Index updates fit in memory

Each B+Tree: 1.5GB

Each CM: 300K

Mixed workload performance (5 indexes each)

Selects slow down inserts even more due to buffer pool pressure!

Total B+Tree size: 7.7GB

Total CM size: 1.4MB

SDSS Skyserver data

• Celestial objects and their optical properties

• PhotoObj: right ascension (ra), declination (dec)

• Clustered by fieldID

• Correlation: (ra, dec) → fieldID

• Initial data: 200k tuples

• Copied ra and dec windows 10x to produce 20M tuples, 3GB

Multi-attribute index performance

SELECT COUNT(*) FROM PhotoObj WHERE 193.1 < ra < 194.5 AND 1.41 < dec < 1.55 AND 23 < g+rho < 25

Correlation: (ra, dec) → fieldID

[Figure: two bar charts, Index Size On-Disk (MB) and SDSS Range Query Runtime (s), for CM(ra), CM(dec), CM(ra,dec), BTree(ra,dec). Bar labels in extraction order: sizes 0.67, 0.936, 0.699, 542; runtimes 4, 1.7, 0.21, 1.12.]

Related ideas

• BHUNT/CORDS

– Similar measure of correlation for query opt.

– Doesn't discuss indexing, no cost model

• ADC Clustering

– Proposes reclustering, but no cost model/designer

• Microsoft SQL Server: datetime clustering

– Limited to datetime types

• Index compression (Prefix B+Tree, delta encoding, …)

– Compression rates in the range of 2x

Summary

• Correlations between attributes arise naturally in a variety of applications

• Correlations determine the cost of secondary index lookups

• We presented a correlation-aware cost model and advisor to decide when to build CMs

• Multi-attribute CMs capture more correlations; bucketing keeps them tiny

• Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically 10-1000x smaller than secondary B+Trees

Model accuracy

SELECT Avg(Price) FROM Ebay WHERE Category=X

Isolated CM performance vs. secondary B+Tree

Slightly slower on isolated query; CM must filter unmatching tuples

B+Tree: 860MB

CM: 900KB

Bucketing

Acceptable performance

Smaller size

• Random-sample synopsis from table

• Try unclustered bucket sizes: 2², 2³, …

• Output candidates grouped by size, ordered by c_per_u
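The three steps above can be sketched as a small search loop: draw a random sample, build a CM from it for each power-of-two bucket width, and list the candidates by size and then by c_per_u. Everything named here is an assumption for illustration: the function name, integer unclustered keys bucketed by integer division, CM entry count as the size measure, and the toy data.

```python
import random
from collections import defaultdict


def bucketing_candidates(rows, sample_size, max_exponent, seed=0):
    """Return (bucket_width, cm_entries, c_per_u) tuples for widths 2**2 .. 2**max_exponent.

    rows: list of (u, c) pairs with integer unclustered keys u.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    if not sample:
        return []
    candidates = []
    for exponent in range(2, max_exponent + 1):
        width = 2 ** exponent
        cm = defaultdict(set)
        for u, c in sample:
            cm[u // width].add(c)  # collapse a range of u values into one bucket
        entries = sum(len(values) for values in cm.values())
        candidates.append((width, entries, entries / len(cm)))
    # Group by size (smaller first), then order by c_per_u within a size.
    return sorted(candidates, key=lambda cand: (cand[1], cand[2]))


# Toy data: every run of 16 consecutive u values shares one clustered value.
toy = [(u, u // 16) for u in range(64)]
ranked = bucketing_candidates(toy, sample_size=64, max_exponent=5)
```

On the toy data, widths 16 and 32 both yield four CM entries, but width 32 doubles c_per_u, so width 16 ranks first within that size group.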