Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA -...

72
Getting Rid of Data Tova Milo Tel Aviv University

Transcript of Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA -...

Page 1: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Getting Rid of Data

Tova Milo

Tel Aviv University

Page 2: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

The Big Data Era

From sports,to health care, to the way we drive our cars,or choose how to invest our money,…Big Data is changing every aspect of our lives.

Tova Milo GETTING RID OF DATA - VLDB’19 2

Page 3: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

The Big Data Era

The data-centered revolution is fueled by the masses of data, but at the same time is at a great risk due to the very same information flood.

Tova Milo GETTING RID OF DATA - VLDB’19 3

Page 4: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Time to stop and rethink the “More Data!” philosophy.

The 3 P’s to worry about:

Tova Milo GETTING RID OF DATA - VLDB’19 4

Production

Privacy

Performance

The Big Data Era

Page 5: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Production ofData & Storage

Tova Milo GETTING RID OF DATA - VLDB’19 5

The size of our digital universe grows exponentially

Forecast [IDC’17]:

“By 2025 the global datasphere will grow to 163 zettabytes (trillion giga), ten times the 16.1 ZB of data generated in 2016.”

Updated forecast [IDC’18]:

“By 2025 the global datasphere will grow to 175 zettabytes, from the 33 ZB in 2018”

Storage demand is estimated to outstrip production by more than double!

Production

Privacy

Performance

Page 6: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Data Size

Tova Milo GETTING RID OF DATA - VLDB’19 6

Production

Privacy

Performance

Page 7: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

How Much is175 ZB?

Tova Milo GETTING RID OF DATA - VLDB’19 7

“If one were able to store 175ZB onto BluRay discs, then you’d have a stack of discs that can get you to the moon 23 times…”

“Even if you could download 175ZB on today’s largest hard drive it would take 12.5 billion drives (and as an industry, we ship a fraction of that today.)”

Production

Privacy

Performance

Page 8: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Storage Production

Tova Milo GETTING RID OF DATA - VLDB’19 8

Production

Privacy

Performance

Page 9: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Data vs. Storage

Tova Milo GETTING RID OF DATA - VLDB’19 9

5 ZB

Production

Privacy

Performance

Page 10: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Performance

Handling exponentially growing data incurs a substantial maintenance and processing overhead

• data cleaning,

• validation,

• enhancement,

• analysis,…

Selective data management is key to performance !

Tova Milo GETTING RID OF DATA - VLDB’19 10

Production

Privacy

Performance

Page 11: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Let’s Think Energy…

Tova Milo GETTING RID OF DATA - VLDB’19 11

Production

Privacy

Performance

Page 12: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Let’s Think Energy…

Tova Milo GETTING RID OF DATA - VLDB’19 12

Production

Privacy

Performance

Page 13: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Energy Optimization ?

Over the last few years:

Development of better ways to cool data centers

Recycling the waste heat

Streamlining computing processes

Switching to renewable energy

Still, even in the best-scenario predictions, if we don’t learn how to dispense of data we’ll stay at the same consumption level (which is already high)

Tova Milo GETTING RID OF DATA - VLDB’19 13

Production

Privacy

Performance

Page 14: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Privacy and Security

Even if we disregard storage and performance constraints, uncontrolled data retention dangers privacy & security

EU Data Protection Regulation (GDPR).

Sarbanes-Oxley, Graham-Leach-Bliley, the Fair and Accurate Credit Transactions Act, HIPAA,…

Data disposal/retention policies must be systematically developed and enforced to benefit and protect organizations and individuals.

Tova Milo GETTING RID OF DATA - VLDB’19 14

Production

Privacy

Performance

Page 15: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1) Not all data is important!

2) People fear of loosing potentially important data

3) Already now, sometimes there is really no choice

4) Like most good ideas, we are not the first to think about this …

Tova Milo GETTING RID OF DATA - VLDB’19 15

Before we continue,4 important notes

Production

Privacy

Performance

Page 16: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1) Not all data is important!

2) People fear of loosing potentially important data

3) Already now, sometimes there is really no choice

4) Like most good ideas, we are not the first to think about this …

Martin Kersten,"The Wildest Idea" Award, CIDR’15 Gong Show, for "Big Data Space Fungus"

Tova Milo GETTING RID OF DATA - VLDB’19 16

Before we continue,4 important notes

Production

Privacy

Performance

Page 17: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Big Data Space Fungus

Tova Milo GETTING RID OF DATA - VLDB’19 17

Production

Privacy

Performance

[CIDR’15]

Page 18: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Big Data Space Fungus

Tova Milo GETTING RID OF DATA - VLDB’19 18

Production

Privacy

Performance

[CIDR’15]

Page 19: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Big Data Space Fungus

Tova Milo GETTING RID OF DATA - VLDB’19 19

Production

Privacy

Performance

[CIDR’15]

Page 20: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Retaining the knowledge hidden in the data while respecting storage, processing and regulatory constraints

Determine an optimal disposal policy (which data to retain, summarize, dispose off) and execute it efficiently

Support full-cycle information processing over the partial data

Incrementally maintain the partial data as new info comes in

Tova Milo GETTING RID OF DATA - VLDB’19 20

Production

Privacy

Performance

The Data Disposal Challenge

Page 21: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

The 7 Criteriafor Disposing Data

What makes a piece of data important?

How importance changes over time?

Which of the data is important?

Which data can (or must) be retained/disposed off? When?

What is the cost of retaining / disposing off the data ?

How can data be summarized / disposed off?

How to process the partial data?

Tova Milo GETTING RID OF DATA - VLDB’19 21

Production

Privacy

Performance

Page 22: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1. Existing tools(and why they are not enough)

2. Understanding the past(provenance)

3. Predicting the future(Deep Reinforcement Learning)

22

The Rest of This Talk

Page 23: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

(Very) Incomplete ListDeduplication

Entity resolution

(Semantic) compression & summarization

Relations

Semi-structured (XML, RDF, graph)

Unstructured (text)

Sampling

Approximate Query Processing

Sketching

Streams

Machine Learning

Dimensionality reduction

Clustering

Features selection

Tova Milo GETTING RID OF DATA - VLDB’19 23

Page 24: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example 1: Relations

Tova Milo GETTING RID OF DATA - VLDB’19 24

[Jagadish, Ng, Ooi, Tung, ICDE'04]

Back to the late 90’s…

Page 25: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example 2: Graphs

Tova Milo GETTING RID OF DATA - VLDB’19 25

[Song, Wu, Lin, Dong, Sun, TKDE‘18]

Page 26: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example 3: Sampling for AQP

Approximate query answers, at a fraction of full execution cost

In query-time sampling, the query is evaluated over samples taken from the database at run time.

For a sharper reduction on response time, draw samples from the data in a pre-processing step

Question 1: Sample also from the data summaries?

Question 2: Use the precomputed samples as data summaries, thereby allowing to discard some (or all) of the remaining items?

Tova Milo GETTING RID OF DATA - VLDB’19 26

[Chaudhuri, Ding, Kandula, SIGMOD‘17]

Page 27: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Common Objectives

Summary properties

Conciseness

Diversification

Coverage

Accuracy w.r.t query results

Concrete queries

Queries class/workload

Information loss

Tova Milo GETTING RID OF DATA - VLDB’19 27

[Orr, Suciu, Balazinska, VLDB‘17]

Page 28: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

But in Practice…

Workloads are far more complex(cleaning, transformation, integration, ML,…)

Tova Milo GETTING RID OF DATA - VLDB’19 28

Page 29: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

But in Practice…

Workloads are far more complex(cleaning, transformation, integration, ML,…)

Need to understand how data is manipulated, summarized, disposed off throughout the entire workload !

Tova Milo GETTING RID OF DATA - VLDB’19 29

Page 30: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1. Existing tools(and why they are not enough)

2. Understanding the past(provenance)

3. Predicting the future(Deep Reinforcement Learning)

30

The Rest of This Talk

Page 31: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Data Provenance

Tracks computation and reveals the “origin” of results

Many different models with different granularities

Can be a key for performing & understanding data reduction

Tova Milo GETTING RID OF DATA - VLDB’19 31

Page 32: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Provenance by Example

Tova Milo GETTING RID OF DATA - VLDB’19 32

Page 33: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Lineage

Tova Milo GETTING RID OF DATA - VLDB’19 33

Page 34: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Provenance Polynomials

Tova Milo GETTING RID OF DATA - VLDB’19 34

Page 35: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Provenance Polynomials

Tova Milo GETTING RID OF DATA - VLDB’19 35

Page 36: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Workflow Provenance

Tova Milo GETTING RID OF DATA - VLDB’19 36

Page 37: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Many Applications

• Results Explanation

• Hypothetical reasoning

• Trust level assessment

• Computation in presence of incomplete/probabilistic info.

• Data reduction [Gershtein, M, Novgorodov, CIKM’19]

• …

Tova Milo GETTING RID OF DATA - VLDB’19 37

Page 38: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

But…

Provenance is HUGE

Tova Milo GETTING RID OF DATA - VLDB’19 38

Page 39: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Provenance Reduction

Lossless

Size reduction via expression simplification/factorization

(e.g. using Boolean circuits)

Lossy

Selective provenance

Compression via abstraction

Tova Milo GETTING RID OF DATA - VLDB’19 39

Page 40: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example:Compression by Abstraction

Tova Milo GETTING RID OF DATA - VLDB’19 40

[Deutch, Moskovitch, Rinetzky SIGMOD’19]

Page 41: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example:Compression by Abstraction

Tova Milo GETTING RID OF DATA - VLDB’19 41

Page 42: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example:Compression by Abstraction

Tova Milo GETTING RID OF DATA - VLDB’19 42

Page 43: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Example:Compression by Abstraction

Tova Milo GETTING RID OF DATA - VLDB’19 43

Page 44: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Optimization Problem

• Choose a cut in the ontology that maximizes expressiveness for a target compression ratio

• NP-hard in general

• Polynomial time complexity for a single ontology

• Practically appealing heuristics for the general case

Tova Milo GETTING RID OF DATA - VLDB’19 44

Expressiveness

Size

Page 45: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1. Existing tools(and why they are not enough)

2. Understanding the past(provenance)

3. Predicting the future(Deep Reinforcement Learning)

45

The Rest of This Talk

Page 46: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Learn what may be interesting in a new dataset

Tova Milo GETTING RID OF DATA - VLDB’19 46

Exploratory data analysis (EDA):

The process of examining & investigating a given dataset

Page 47: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Exploratory Data AnalysisEEDA is an iterative process:

A user u loads a dataset D to an analysis interface.

Performs a sequence of: Su(D)= q1, q2,…qn of actions (e.g. queries)

After executing qi - the user examines the results, and decides if and which action to perform next.

The goal:

Understand the nature of the dataset

Discover its properties

Estimate its quality

Figure our what may be interesting in it

Modern analysis platforms (e.g. Splunk, Kibana-ELK, Tableau, …)

Tova Milo GETTING RID OF DATA - VLDB’19 47

Page 48: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

EDA agent

Can we teach a machine to generate a coherent, meaningful sequence of exploratory queries?

Tova Milo GETTING RID OF DATA - VLDB’19 48

Page 49: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Deep Reinforcement Learning

DRL works surprisingly well for very difficult tasks:

Play Go

Drive a car

Conduct natural language dialogs

……

Tova Milo GETTING RID OF DATA - VLDB’19 49

Page 50: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Can/Should we use DRL?

PROS:

It requires NO training data OR traces of user activity

Once trained - results can be obtained rather FAST.

CONS:

It is a heavy-weight tool, requires lots of computing power.

Currently works mostly on game-like environments

Even when working - it may just overfit to some odd patterns in the data

Tova Milo GETTING RID OF DATA - VLDB’19 50

Page 51: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1. Quick recap of standard RL settings

2. Requirements for RL-EDA environment

3. Our framework (ongoing work)

Tova Milo GETTING RID OF DATA - VLDB’19 51

The Rest of This Talk

Page 52: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

RL Standard Settings

In the (not so simple) Atari environment:

Tova Milo GETTING RID OF DATA - VLDB’19 52

1. Agent observes a “State”

from an “environment”

2. Agent selects an “action”

3. Agent receives “reward”

4. Agent learns (unsupervised)

a “policy” that maximizes

the mean reward

Page 53: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

RL-EDA Settings

Tova Milo GETTING RID OF DATA - VLDB’19 53

Utilizing the RL paradigm for EDA:

1. Agent observes a dataset/results set

2. Agent formulates a query

3. Agent receives reward

4. Agent learns to maximize the reward

Page 54: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1.RL-EDA environment

2. State and action representation

3. Reward Signal

4. Agent NN-Architecture

Tova Milo GETTING RID OF DATA - VLDB’19 54

Outline for an RL-EDA Framework

Page 55: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1.RL-EDA environment

2. State and action representation

3. Reward Signal

4. Agent NN-Architecture

Tova Milo GETTING RID OF DATA - VLDB’19 55

Outline for an RL-EDA Framework

Page 56: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

RL-EDA Environment

RL-EDA environment comprises:

(1) A collection of datasets

(2) Query interface

RL-EDA Episode:

The agent is “given” an arbitrary dataset

The agent performs a “session” (sequence) of N queries.

Tova Milo GETTING RID OF DATA - VLDB’19 56

Page 57: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1.RL-EDA environment

2. State and action representation

3. Reward Signal

4. Agent NN-Architecture

Tova Milo GETTING RID OF DATA - VLDB’19 57

Outline for an RL-EDA Framework

Page 58: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

State Representation

Tova Milo GETTING RID OF DATA - VLDB’19 58

Result displays are often large and complex…

→ Summarize the results display into a numeric vector

Structural features of the data:

Value entropy, # of distinct values, # of Null values

Grouping/Aggregation features:

# of groups, groups size variance, aggr. values, entropy,…

Context:

N previous displays

Page 59: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1.RL-EDA environment

2. State and action representation

3. Reward Signal

4. Agent NN-Architecture

Tova Milo GETTING RID OF DATA - VLDB’19 59

Outline for an RL-EDA Framework

Page 60: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Action RepresentationParameterized Actions (action type + parameters)

• FILTER(attr, op, term) - used to select data tuples that matches a criteria

• GROUP(attr, agg func, agg attr) - groups and aggregates the data

• BACK() - allows the agent to backtrack to a previous display

Our Representation

• [action_type, attr, op, term, agg_func, agg_attr]

• Handle filter terms using the frequency of appearances in the display

Issue: large actions domain

Tova Milo GETTING RID OF DATA - VLDB’19 60

Page 61: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1.RL-EDA environment

2. State and action representation

3. Reward Signal

4. Agent NN-Architecture

Tova Milo GETTING RID OF DATA - VLDB’19 61

Outline for an RL-EDA Framework

Page 62: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Reward Signal

Given a sequence SD= q1, q2,…qn of queries performed by the agent on dataset D. How to determine the reward R(SD)?

We suggest three major components.

1. Interestingness: Actions inducing interesting results set should be encouraged

2. Diversity: Actions in the same session should yield diverseresults describing different aspects of the dataset

3. Coherency: The session is understandable to human analysts

Tova Milo GETTING RID OF DATA - VLDB’19 62

Page 63: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Interestingness

Tova Milo GETTING RID OF DATA - VLDB’19 63

Multitude of interestingness measures are suggested in previous work.

Each captures a different aspect of interestingness:

DiversityMeasures how much the elements of a data

pattern are different from on another

PecularityMeasures how anomalous is a pattern

comparing to the rest of the data patterns

ConcisenessMeasures the size of the pattern compared

to its coverage

NoveltyMeasures how unexpected a data pattern is

w.r.t. known prior knowledge

Page 64: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Diversity

Goal: encourage the agent to choose actions inducing new observations of different parts of the data than those examined so far

Solution: calculate the Euclidean distances between the observation vector of the current results display and the vectors of all previous displays

64Tova Milo GETTING RID OF DATA - VLDB’19

Page 65: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Coherency

Performed using an external classifier:

1. Given the dataset schema & application domain we use a set of heuristic classification-rules composed by domain experts(e.g. “a group-by that is employed on more than 4 attributes is non-coherent”)

2. Then employ Snorkel to build a weak-supervision based classifier

65Tova Milo GETTING RID OF DATA - VLDB’19

Page 66: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

1.RL-EDA environment

2. State and action representation

3. Reward Signal

4. Agent NN-Architecture

Tova Milo GETTING RID OF DATA - VLDB’19 66

Outline for an RL-EDA Framework

Page 67: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

ChallengesLarge # of actions

(in particular due to the Filter parameter)

Exploration challenges: imbalanced action types (BACK, GROUP, FILTER)

Our solution: parameterized softmax with pre-output layer

67Tova Milo GETTING RID OF DATA - VLDB’19

Page 68: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

A few words about experimental evaluation

1. Learning curves and reward

2. Competitors: Greedy, Recommender systems, Human…

3. Measures: BLEU, sessions similarity

“Turing test”

Tova Milo GETTING RID OF DATA - VLDB’19 68

Page 69: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Time to Conclude…

Tova Milo GETTING RID OF DATA - VLDB’19 69

Page 70: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Time to Conclude…

The Data Disposal Challenge

Determine an optimal disposal policy (which data to retain, summarize, dispose off) and execute it efficiently

Support full-cycle information processing over the partial data

Incrementally maintain the partial data as new info comes in

Define formally what makes a disposal policy good…

Tova Milo GETTING RID OF DATA - VLDB’19 70

Page 71: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Time to Conclude…

1. Plenty of relevant tools

2. But still very far from a comprehensive solution

3. ML agents: Still a lot to do here!

Support more data analysis actions

Adaptive disposal policies based on user interaction

Consider potential data exploration goals

Tova Milo GETTING RID OF DATA - VLDB’19 71

Page 72: Getting Rid of Data - VLDB · Production of Data & Storage Tova Milo GETTING RID OF DATA - VLDB’19 5 The size of our digital universe grows exponentially Forecast [IDC’17]: “By

Thank You

72

Ori Bar-El, Naama Boer, Daniel Deutch, Shay Gershtein, Amir Gilad,

Gefen Keinan, Nave Frost, Yuval Moskovitch, Slava Novgorodov, Kathy

Razmadze, Noam Rinetzky, Amit Somech, Brit Youngmann, …