Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids...

36
Technische Universität München Workload-Aware Data Partitioning in Community-Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper Department of Computer Science, Technische Universität München Germany

Transcript of Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids...

Page 1: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

Workload-Aware Data Partitioning in Community-

Driven Data Grids

Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper

Department of Computer Science, Technische Universität München

Germany

Page 2: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 2

• Many challenges and opportunities in e-science for database research– High-throughput data management– Correlation of distributed data sources

• Community-driven data grids– Dealing with data skew and query hot spots– Workload-awareness by employing cost model during

partitioningShould I Split or Replicate?

Page 3: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 3

Query Load Balancing via Partitioning

Page 4: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 4

Query Load Balancing via Partitioning

Page 5: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 5

Query Load Balancing via Partitioning

Page 6: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 6

Query Load Balancing via Partitioning

X

Page 7: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 7

Query Load Balancing via Replication

Page 8: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 8

Query Load Balancing via Replication

Page 9: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 9

Query Load Balancing via Replication

Page 10: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 10

The AstroGrid-D Project

• German Astronomy Community Grid http://www.gac-grid.org/

• Funded by the German Ministry of Education and Research

• Part of D-Grid

Page 11: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 11

Up-Coming Data-Intensive Applications

• Alex Szalay, Jim Gray (Nature, 2006):“Science in an exponential world”

• Data rates– Terabytes a day/night– Petabytes a year

• LHC• LSST• LOFAR• Pan-STARRS

LOFAR

LHC

Page 12: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 12

The Multiwavelength Milky Way

http://adc.gsfc.nasa.gov/mw/

Page 13: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 13

Research Challenges

• Directly deal with Terabyte/Petabyte-scale data sets• Integrate with existing community infrastructures• High throughput for growing user communities

Page 14: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 14

Current Sharing in Data Grids

• Data autonomy• Policies allow partners to access data• Each institution ensures

– Availability (replication)– Scalability

• Various organizational structures [Venugopal et al. 2006]:– Centralized– Hierarchical– Federated– Hybrid

Page 15: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 15

Community-Driven Data Grids (HiSbase)

Page 16: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 16

“Distribute by Region – not by Archive!”

Page 17: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 17

“Distribute by Region – not by Archive!”

Page 18: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 18

“Distribute by Region – not by Archive!”

Page 19: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 19

“Distribute by Region – not by Archive!”

Page 20: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 20

Mapping Data to Nodes

Page 21: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 21

Workload-Aware Training Phase

• Incorporate query traces during training phase• Base partitioning scheme on

– Data load– Query load

• Challenges– Balance query load without losing data load balancing– Approximate real query hot spots from query sample

Page 22: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 22

Dealing with Query Hot Spots

• Query skew triggered by increased interest in particular subsets of the data

• Two well-known query load balancing techniques:– Data partitioning– Data replication

• Finding trade-offs between both

Page 23: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 23

When to Split (Partition) or to Replicate

• Considers partition characteristics– Amount of data (few/many data points)– Number of queries (few/many queries)– Extent of regions and queries (small/big queries)

Data points Few Queries Many Queries

Small Big Small Big

Few ─ ─ SPLIT REPLICATE

Many SPLIT SPLIT SPLIT REPLICATE

Page 24: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 24

Region Weight Functions

• Data only (#objects in a region)• Queries only (#queries in a region)• Scaled queries

– Approximate “real” extent of hot spot– Avoid overfitting to training query set

• Heat of a region (#objects * #queries)• Extents of regions and queries

– Replicate when many big queries

big

small

Page 25: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 25

Evaluation

• Weight functions: data, heat, extent• Data sets (observational, simulation)• Workloads (SDSS query log, synthetic)• Partitioning Scheme Properties

– Load distribution– Communication overhead

• Throughput Measurements– Distributed setup– FreePastry simulator

Pobs

Page 26: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 26

Load Distribution

• Uniform data set from the Millennium simulation• Workload with extreme hot spot• In the following:

– 1024 partitions– Heat of a region (#data * #queries)– Normalized across all partitioning schemes

Page 27: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 27

Query-unaware Training

Page 28: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 28

Training with Scaled Queries (scaled 50x)

Page 29: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 29

Training with Scaled Queries (scaled 400x)

Page 30: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 30

Heat-based, Extent-based Training

Page 31: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 31

Communication Overhead for Pobs

Page 32: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 32

Throughput for Pobs

Page 33: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 33

Load Balancing During Runtime

• Complement workload-aware partitioning with runtime load-balancing

• Short-term peaks– Master-slave approach– Load monitoring

• Long-term trends– Based on load monitoring– Histogram evolution

Page 34: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 34

Related Work

• On-line load balancing• Hundreds of thousands to

millions of nodes• Reacting fast• Treating objects

individually HiSbase

Page 35: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 35

Should I Split or Replicate?

• Many challenges and opportunities in e-science for database research– High-throughput data management– Correlation of distributed data sources

• Community-driven data grids– Dealing with data skew and query hot spots– Workload-awareness by employing cost model during

partitioning

Page 36: Technische Universität München Workload-Aware Data Partitioning in Community- Driven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin.

Technische Universität München

2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 36

Get in Touch

• Database systems group, TU München– Web site: http://www-db.in.tum.de– E-mail: [email protected]

• The HiSbase project– http://www-db.in.tum.de/research/projects/hisbase/

Thank You for Your Attention