29/1/2014 Efficient Updates for a Shared Nothing Analytics Platform Katerina Doka, Dimitrios...

26/06/22 Efficient Updates for a Shared Nothing Analytics Platform Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris {katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr Computing Systems Laboratory National Technical University of Athens


Efficient Updates for a Shared Nothing Analytics Platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
{katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr

Computing Systems Laboratory
National Technical University of Athens


Motivation

• Large volumes of data
  • Everyday life, science and business domain
• Time-series data
  • Temporally ordered, organized in hierarchies (Day<Month<Year)
    • E.g., date of a credit card purchase, time of a phone call
  • Important for monitoring a process of interest
• On-line processing
  • Fast retrieval – Point, range, aggregate queries
  • Detection of real time changes in trends
    • Intrusion or DoS detection, effects of product’s promotion
  • Online, cost-efficient updates


Up till now

• Data Warehouses
  • Centralized, off-line approaches
  • Distributed warehousing systems
    • Functionality remains centralized
• Distributed Warehouse-like initiative: Brown Dwarf
  • Distribution of centralized Dwarf
  • Deployed on shared-nothing, commodity hardware
    • Scalability, fault tolerance, performance
  • No special consideration for time-series data
  • Update procedure costly → unfit for frequent updates


Our Goals

• Cloud based DataWarehousing-like system
  • Targeted to time-series data
    • Arriving at high rate
  • Store, update, query data at various granularity levels
    • Multidimensional, hierarchical
  • Shared nothing architecture
    • Commodity nodes
  • Without use of any proprietary tool
    • Java libraries, socket APIs


Our Contribution

• Complete system for multidimensional time-series data
  • Store with one pass
  • Update online
  • Query efficiently
    • Point, aggregate
    • Various levels of granularity
• Adaptive materialization
  • According to data recency
  • Accelerate cube creation/update
  • Minimize storage consumption


Dwarf

• Dwarf computes, stores, indexes and updates materialized cubes

• Eliminates prefix and suffix redundancies

• Any query (point or aggregate) is answered through traversal of structure
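The traversal idea can be sketched as follows. This is a minimal illustration, assuming a plain nested-dictionary cube with an ALL cell at every level. It does not implement Dwarf's prefix and suffix coalescing, and the names `build_cube` and `query` as well as the sample facts are hypothetical.

```python
from collections import defaultdict

ALL = "ALL"  # label of the aggregate cell at each level (illustrative)


def build_cube(facts, num_dims):
    """Materialize every group-by of `facts` as nested dicts.

    `facts` is a list of (dimension_values, measure) pairs. No prefix or
    suffix coalescing is attempted, so this is larger than a real Dwarf.
    """
    def build(rows, depth):
        if depth == num_dims:
            return sum(measure for _, measure in rows)
        groups = defaultdict(list)
        for dims, measure in rows:
            groups[dims[depth]].append((dims, measure))
        node = {value: build(group, depth + 1) for value, group in groups.items()}
        node[ALL] = build(rows, depth + 1)  # aggregate over this dimension
        return node

    return build(list(facts), 0)


def query(cube, key):
    """Answer a point or aggregate query by following one cell per level."""
    node = cube
    for value in key:
        if value not in node:
            return 0  # no fact matches this path
        node = node[value]
    return node


facts = [
    (("S1", "C1", "P1"), 10),
    (("S1", "C2", "P1"), 20),
    (("S2", "C1", "P2"), 5),
]
cube = build_cube(facts, num_dims=3)
print(query(cube, ("S2", "C1", "P2")))  # point query
print(query(cube, ("S1", ALL, "P1")))   # aggregate over the second dimension
```

Both kinds of query cost one lookup per dimension, which is the property the distributed variants rely on.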


Brown Dwarf

• Dwarf nodes mapped to overlay nodes
  • UID for each node
  • Hint tables of the form (currAttr, child)
• Insertion
  • One-pass over the fact table
  • Gradual structure of hint tables
• Queries
  • Overlay path of d hops
• Incremental Updates
• Elasticity through adaptive mirroring
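A rough sketch of how a query could follow hint tables across the overlay, one hop per dimension. It assumes an in-process dictionary standing in for the network, and it assumes that the last level's hint table stores the measure directly. The class `OverlayNode`, the function `resolve` and the hand-built UIDs are hypothetical, not taken from the Brown Dwarf code.

```python
class OverlayNode:
    """One Dwarf node placed on an overlay peer (simulated in-process)."""

    def __init__(self, uid):
        self.uid = uid
        self.hints = {}  # currAttr -> child UID, or the measure at the last level


def resolve(network, root_uid, key):
    """Follow hint tables from the root; return (answer, hops taken)."""
    uid, hops = root_uid, 0
    for level, attr in enumerate(key):
        node = network[uid]       # stands in for a message to the peer holding `uid`
        hops += 1
        target = node.hints.get(attr)
        if target is None:
            return None, hops     # no such cell
        if level == len(key) - 1:
            return target, hops   # last level stores the measure itself
        uid = target
    return None, hops


# Hand-built overlay for two dimensions (d = 2); UIDs are arbitrary integers.
network = {uid: OverlayNode(uid) for uid in (1, 2, 3, 4)}
network[1].hints = {"S1": 2, "S2": 3, "ALL": 4}
network[2].hints = {"P1": 30, "ALL": 30}
network[3].hints = {"P2": 5, "ALL": 5}
network[4].hints = {"P1": 30, "P2": 5, "ALL": 35}

answer, hops = resolve(network, 1, ("S1", "P1"))
print(answer, hops)
```

In a real deployment each `network[uid]` lookup would be a socket message to another machine, so the hop count is the message cost of the query.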


Advantages and Drawbacks

• Store even larger amounts of data!
  • Dwarf reduces but may also blow-up data
    • High dimensional, sparse >1,000 times
• Handle many more requests
• Query the system online
• Accelerate creation (up to 5 times) and querying (up to 60 times)
  • Parallelization
• Update remains costly


Time Series Dwarf (TSD)

• A concept hierarchy characterizes time and any other dimension

• Updates are applied in temporal order

• Temporal granularity of queries relative to the time of querying
  • More detailed queries for recent events
  • More coarse grained queries for past events


TSD Operations - Insertion

• Time first in order

• Lack of ALL cell in Time

• Aggregate created after completion of a level
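The three insertion rules above can be illustrated with a small sketch. It assumes a flat dictionary instead of the actual Dwarf node layout, a two-level time hierarchy (year, month) and a single further dimension (product). The class `TimeSeriesCube` and its fields are hypothetical names.

```python
ALL = "ALL"


class TimeSeriesCube:
    """Flat stand-in for a TSD with time levels (year, month) and one
    further dimension (product). Not the Dwarf node layout."""

    def __init__(self):
        self.detail = {}    # (year, month, product or ALL) -> measure
        self.yearly = {}    # (year, product or ALL) -> measure, completed years only
        self.latest = None  # (year, month) of the most recent tuple

    def insert(self, year, month, product, measure):
        if self.latest is not None:
            if (year, month) < self.latest:
                raise ValueError("tuples must arrive in temporal order")
            if year > self.latest[0]:
                self._close_year(self.latest[0])  # previous year is complete
        self.latest = (year, month)
        # ALL cells exist for the product dimension but never for time.
        for p in (product, ALL):
            key = (year, month, p)
            self.detail[key] = self.detail.get(key, 0) + measure

    def _close_year(self, year):
        """Create the year-level aggregate once no more tuples can arrive."""
        for (y, _month, p), value in self.detail.items():
            if y == year:
                self.yearly[(y, p)] = self.yearly.get((y, p), 0) + value


cube = TimeSeriesCube()
cube.insert(2013, 11, "P1", 4)
cube.insert(2013, 12, "P1", 6)
cube.insert(2013, 12, "P2", 1)
cube.insert(2014, 1, "P1", 3)   # first 2014 tuple closes 2013
print(cube.yearly)
```

Because the year-level view is written only once the year is over, later inserts never have to revisit it.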


TSD Operations - Querying

• Follow path along the structure
• Roll-up query for aggregate already created
  • Within d hops (e.g., <Y1, ALL, P1>)
• Roll-up query for recent records
  • Initial query substituted by multiple lower level queries (e.g., <Y2, S1, P1>)
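The two roll-up cases can be sketched as below, assuming year-level views in one dictionary and month-level cells in another. The function `rollup_query`, the key layout and the sample values are hypothetical and only illustrate the substitution of one query by several lower-level ones.

```python
def rollup_query(yearly, detail, year, product):
    """Return (answer, number of lookups issued) for a year-level roll-up."""
    if (year, product) in yearly:
        return yearly[(year, product)], 1          # aggregate already created
    # Year still open: substitute one query per month that has data.
    months = sorted({m for (y, m, p) in detail if y == year and p == product})
    parts = [detail[(year, m, product)] for m in months]
    return sum(parts), len(parts)


# Hypothetical state: 2013 is complete, 2014 is still receiving tuples.
yearly = {(2013, "P1"): 10}
detail = {(2014, 1, "P1"): 3, (2014, 2, "P1"): 8, (2014, 2, "P2"): 2}

print(rollup_query(yearly, detail, 2013, "P1"))
print(rollup_query(yearly, detail, 2014, "P1"))
```

The second call issues more lookups than the first, which is why roll-ups over recent records cost more messages than roll-ups over completed periods.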


TSD Operations - Updating

• Insertion of a new tuple

• Longest common prefix with existing structure

• Underlying nodes recursively updated

• Lack of ALL cell for Time + temporal ordering = fewer existing cells affected

• Example: 3 TSD nodes vs. 12 Dwarf nodes affected
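A small sketch of why dropping the ALL cell of the first dimension reduces update work. It counts aggregate paths in an uncoalesced cube rather than Dwarf nodes, so it does not reproduce the 3 vs. 12 figure from the slide's example. The helper names and the sample keys are hypothetical.

```python
from itertools import product

ALL = "ALL"


def longest_common_prefix(existing_keys, new_key):
    """Length of the longest prefix `new_key` shares with any stored key."""
    best = 0
    for key in existing_keys:
        shared = 0
        for a, b in zip(key, new_key):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return best


def touched_paths(key, all_in_first_dim):
    """Aggregate paths an update to `key` has to refresh."""
    choices = []
    for i, value in enumerate(key):
        if i == 0 and not all_in_first_dim:
            choices.append((value,))        # time: no ALL cell to maintain
        else:
            choices.append((value, ALL))
    return list(product(*choices))


stored = [("Y1", "S1", "P1"), ("Y2", "S1", "P1")]
new_tuple = ("Y2", "S1", "P2")
print(longest_common_prefix(stored, new_tuple))
print(len(touched_paths(new_tuple, all_in_first_dim=True)))
print(len(touched_paths(new_tuple, all_in_first_dim=False)))
```

With d dimensions the count drops from 2^d to 2^(d-1) paths in this simplified model, since every path that started at the first dimension's ALL cell disappears.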


Adaptive Materialization

• A daemon process asynchronously
  • creates roll-up views
  • deletes corresponding drill-down ones

• The period of this process depends on application

• Tradeoff: cube size vs. response accuracy
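One pass of such a daemon might look like the sketch below. It assumes month-level cells keyed by (year, month, product) and year-level views keyed by (year, product). The function `roll_up_old_periods` and the `keep_years` parameter are hypothetical, and the scheduling itself (for example a background thread that calls the function at the application's chosen period) is left out.

```python
def roll_up_old_periods(detail, yearly, current_year, keep_years=1):
    """One pass of a hypothetical materialization daemon.

    Month-level cells from years at or before `current_year - keep_years`
    are folded into year-level views and then deleted. Returns the number
    of detail cells removed.
    """
    cutoff = current_year - keep_years
    old_keys = [key for key in detail if key[0] <= cutoff]
    for year, month, product in old_keys:
        value = detail.pop((year, month, product))   # delete the drill-down cell
        yearly[(year, product)] = yearly.get((year, product), 0) + value
    return len(old_keys)


detail = {(2013, 11, "P1"): 4, (2013, 12, "P1"): 6, (2014, 1, "P1"): 3}
yearly = {}
removed = roll_up_old_periods(detail, yearly, current_year=2014)
print(removed, yearly, detail)
```

After the pass the cube is smaller, but a month-level query on the rolled-up year can no longer be answered exactly, which is the size vs. accuracy tradeoff named above.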


Experimental Evaluation

• 25 LAN commodity nodes (dual core, 2.0 GHz, 4GB main memory)

• Synthetic and real datasets
  • APB-1 Benchmark generator
    • 4-d, 3 levels for Time, various densities
  • DARPA Intrusion Detection audit data
    • 1M tuples, 7-d, 3 levels for Time

• TSD: static mode

• TSDad: adaptive mode


Cube Construction

• Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)
  • Lack of the ALL cell in the first dimension
• Acceleration of cube creation up to 89% compared to Dwarf
  • Better use of resources through parallelization (BD)
  • Further reduction due to lack of ALL and selective materialization

                   Size (MB)                  Time (sec)
Dataset  #Tuples   Dwarf  BD   TSD  TSDad     Dwarf  BD   TSD  TSDad
APB-A    1.2M      56     59   53   9         485    101  100  57
APB-B    2.5M      102    115  93   24        957    220  198  123
APB-C    3.7M      163    182  146  32        1530   321  289  167
DARPA    1.1M      178    191  156  127       614    222  208  189
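The percentages quoted on this slide can be checked against the table with a few lines of arithmetic. The sketch below copies the table values verbatim. The slide does not say which baseline the 85% size figure uses: the largest TSDad reduction comes out at roughly 85% against BD and roughly 84% against Dwarf. The helper names are hypothetical.

```python
# Values copied from the table: (Dwarf, BD, TSD, TSDad) per dataset.
size_mb = {
    "APB-A": (56, 59, 53, 9),
    "APB-B": (102, 115, 93, 24),
    "APB-C": (163, 182, 146, 32),
    "DARPA": (178, 191, 156, 127),
}
time_sec = {
    "APB-A": (485, 101, 100, 57),
    "APB-B": (957, 220, 198, 123),
    "APB-C": (1530, 321, 289, 167),
    "DARPA": (614, 222, 208, 189),
}
SYSTEMS = ("Dwarf", "BD", "TSD", "TSDad")


def reduction_pct(baseline, value):
    """Percentage by which `value` undercuts `baseline`."""
    return 100.0 * (baseline - value) / baseline


def best_reduction(table, baseline, system):
    """Largest reduction of `system` over `baseline` across datasets."""
    b, s = SYSTEMS.index(baseline), SYSTEMS.index(system)
    return max(reduction_pct(row[b], row[s]) for row in table.values())


print(round(best_reduction(size_mb, "BD", "TSDad"), 1))
print(round(best_reduction(size_mb, "Dwarf", "TSDad"), 1))
print(round(best_reduction(time_sec, "Dwarf", "TSDad"), 1))
```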


Updates

• 10k updates
• TSD up to 3 times faster than Dwarf and 30% faster than BD
  • Ordered updates – do not affect already created views
  • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction)
• TSDad does not include roll-up view creation (asynchronous) → further acceleration ~20%

         Time (sec)                  Msgs/update
Dataset  Dwarf  BD   TSD  TSDad     BD  TSD  TSDad
APB-A    1123   603  404  315       22  9    8
APB-B    1158   611  418  323       23  10   9
APB-C    1203   624  424  328       25  11   9
DARPA    1535   649  458  380       29  13   9
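The speedups claimed for updates follow from the table in the same way. A short sketch with the table values copied verbatim; the function `ratios` and its dictionary keys are hypothetical names.

```python
# Values copied from the table above.
time_sec = {  # (Dwarf, BD, TSD, TSDad), seconds for 10k updates
    "APB-A": (1123, 603, 404, 315),
    "APB-B": (1158, 611, 418, 323),
    "APB-C": (1203, 624, 424, 328),
    "DARPA": (1535, 649, 458, 380),
}
msgs = {  # (BD, TSD, TSDad), messages per update
    "APB-A": (22, 9, 8),
    "APB-B": (23, 10, 9),
    "APB-C": (25, 11, 9),
    "DARPA": (29, 13, 9),
}


def ratios(dataset):
    """Derive the comparison figures for one dataset."""
    dwarf, bd, tsd, tsdad = time_sec[dataset]
    m_bd, _m_tsd, m_tsdad = msgs[dataset]
    return {
        "tsd_vs_dwarf": dwarf / tsd,                    # speedup factor
        "tsd_vs_bd_pct": 100.0 * (bd - tsd) / bd,       # % time saved over BD
        "tsdad_vs_tsd_pct": 100.0 * (tsd - tsdad) / tsd,
        "msgs_bd_vs_tsdad": m_bd / m_tsdad,             # message reduction factor
    }


for name in time_sec:
    print(name, {k: round(v, 2) for k, v in ratios(name).items()})
```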


Queries

• DARPA 10k datasets – 3 kinds of querysets, 50% aggregates
  • Q1: Ideal
  • Q2: Recent records are queried upon in more detail (Zipfian)
  • Q3: Random
• As queryset approximates uniform distribution
  • Message cost increases
  • Accuracy decreases

          Time (sec)         Msgs/query          %Inaccurate  %Resp.
Queryset  BD  TSD  TSDad     BD  TSD  TSDad      queries      Deviation
Q1        5   6    6         7   7    7          0            0
Q2        5   9    8         7   9    9          15           19
Q3        5   24   21        7   32   32         33           32


Questions
