EXPERIMENTAL METHODS AND TECHNIQUES IN ...home.deib.polimi.it/schiaffo/CS/EXPERIMENTAL METHODS...

EXPERIMENTAL METHODS AND TECHNIQUES IN ENGINEERING The Data Management Perspective Fabio A. Schreiber Politecnico di Milano Dipartimento di Elettronica, Informazione e Bioingegneria


EXPERIMENTAL METHODS AND TECHNIQUES IN ENGINEERING

The Data Management Perspective

Fabio A. Schreiber

Politecnico di Milano

Dipartimento di Elettronica, Informazione e Bioingegneria


THE DATA MANAGEMENT PERSPECTIVE

Experiments on Databases and DBMSs

Data organization and management as a service to the experiments of the scientific community

Experimenting with the Database content itself

F. A. Schreiber Experimental Methods ... Data Perspective


EXPERIMENTS & DATA MANAGEMENT

• Experiments for optimizing data structures and management
  • Database Management System (DBMS)
  • Data structures
    • Conceptual/logical schema optimization and evolution
    • Physical structures design
• Data organization and management for collecting experimental results (e-Science)
• Exploring Database content (data mining)
• Assessing Data Quality


EXPERIMENTS & DATA MANAGEMENT

Goals

• Systems performance evaluation and tuning
  • How well does a system perform?
  • How can I improve its performance?
• Benchmarking
  • Comparison among different systems under similar workloads
• System effectiveness
  • How well a system conforms to the user’s needs with respect to a defined metric

WORKLOAD AND FACTORS

• Synthetic vs. real
  • A synthetic workload allows for controlled experiment repeatability; it is useful for comparing systems
  • A real workload can be highly variable; it can be used to assess the overall performance of a single system in its real environment
• Single-user (to test specific algorithms) vs. multi-user (to test system procedures)
  • Multiprogramming level
  • Query mix
  • Degree of data sharing (buffer and cache sizes)

FACTORS IN DBMS PERFORMANCE EVALUATION (Boral & DeWitt 84)[2]

• Multiprogramming level (MPL)
  • Number of concurrent queries in any phase of execution
  • Use precompiled queries and minimize the data volume of the results, so as to expose the true «execution» time as far as possible

FACTORS IN DBMS PERFORMANCE EVALUATION

• Degree of Data Sharing (DDS)
  • Concurrent access affects both data pages (rarely) and index pages (frequently)
  • Expressed as a percentage of the multiprogramming level:
    • 0%: each query references only its own partition
    • 100%: all queries reference the same partition
    • 0% < DDS < 100%:
      • Queries randomly distributed among partitions
      • Application programs uniformly distributed among partitions
• The DDS affects the buffer page replacement algorithm
  • MRU is best for relational operators
  • LRU is best for replacement of shared data pages
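Why the replacement policy matters can be illustrated with a small simulation. This is a minimal sketch, not taken from Boral & DeWitt: the function name `count_hits`, the three-page buffer and the looping reference string (a repeated sequential scan, used here as an illustrative stand-in for a relational operator) are all assumptions.

```python
from collections import OrderedDict


def count_hits(accesses, capacity, policy):
    """Count buffer hits for a page reference string under 'LRU' or 'MRU'."""
    buffer = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for page in accesses:
        if page in buffer:
            hits += 1
            buffer.move_to_end(page)  # the page becomes the most recently used
            continue
        if len(buffer) >= capacity:
            # LRU evicts the oldest entry, MRU evicts the newest one
            buffer.popitem(last=(policy == "MRU"))
        buffer[page] = True
    return hits


# Hypothetical workload: a 4-page relation scanned three times with a 3-page buffer
scan = [1, 2, 3, 4] * 3
lru_hits = count_hits(scan, 3, "LRU")
mru_hits = count_hits(scan, 3, "MRU")
print(lru_hits, mru_hits)
```

On this looping pattern LRU always evicts the page that will be needed next, while MRU keeps most of the relation in the buffer. A workload dominated by shared, frequently re-referenced pages would favour LRU instead.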


FACTORS IN DBMS PERFORMANCE EVALUATION


Query Mix Selection (multiuser) (Boral & DeWitt 84)[2]

Consumed resources
• CPU cycles: actual query execution, access path selection, buffer pool management, OS disk operations
• Disk bandwidth: get/store data, page swapping

Query type   CPU              Disk          Query example
I            Low  (0.18 s)    Low  (2-3)    Select one tuple from 10000 using a clustered index
II           Low  (0.90 s)    High (91)     Select 100 tuples from 10000 using a non-clustered index
III          High (18.96 s)   Low  (206)    Join 10000 tuples with 1000 tuples using a clustered index on the join attribute of the first relation
IV           High (35.62 s)   High (1008)   Aggregate function on a 10000-tuple relation (100 partitions)


METRIC (Schwartz 11) [15]

Measured entities
• Observation interval (OI)
• Number of queries in the observation interval (NQ)
• Busy time (BT): total time of the queries in the system
• Weighted time (WT): total execution time of the queries

Derived variables
• Throughput: NQ/OI
• Execution time: WT/NQ
• Concurrency: WT/OI
• Utilization: BT/OI
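The derived variables are simple ratios of the measured entities. The sketch below only restates the four formulas; the function name `derived_metrics` and the sample measurements are hypothetical.

```python
def derived_metrics(oi, nq, bt, wt):
    """Compute the derived variables from the measured entities.

    oi: observation interval (s), nq: number of queries in the interval,
    bt: busy time (s), wt: weighted time (s).
    """
    return {
        "throughput": nq / oi,      # queries per second
        "execution_time": wt / nq,  # average seconds per query
        "concurrency": wt / oi,     # average number of queries in execution
        "utilization": bt / oi,     # fraction of the interval the system was busy
    }


# Hypothetical measurements: 120 queries observed over 60 s
metrics = derived_metrics(oi=60.0, nq=120, bt=45.0, wt=90.0)
print(metrics)
```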


METRIC

$S_{i,j}$, $E_{i,j}$: starting and ending times of the $j$-th query of the $i$-th concurrent program, with $1 \le i \le MPL$ and $1 \le j \le N$

$T_{\text{last-to-start}} = \max\{S_{i,1},\ 1 \le i \le MPL\}$; $T_{\text{first-to-finish}} = \min\{E_{i,N},\ 1 \le i \le MPL\}$

$NQ$: number of totally executed queries

System throughput: $NQ / (T_{\text{first-to-finish}} - T_{\text{last-to-start}})$
Average response time: $\left(\sum_{k=1}^{NQ} \text{exec\_time}_k\right) / NQ$

[Figure: timeline of the MPL concurrent programs along the time axis t, with T_last-to-start and T_first-to-finish marked; NQ is the number of totally executed queries.]
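The measurement window and the two figures of merit can be computed from the per-program start and end times. This is a minimal sketch with hypothetical data; it assumes that a query counts as "totally executed" when it both starts and ends inside the window, which the slide does not spell out.

```python
def throughput_and_response(programs):
    """programs[i] is the list of (start, end) times of the queries of program i."""
    t_last_to_start = max(p[0][0] for p in programs)     # max over i of S[i][1]
    t_first_to_finish = min(p[-1][1] for p in programs)  # min over i of E[i][N]
    # Assumption: "totally executed" means started and ended inside the window
    done = [(s, e) for p in programs for (s, e) in p
            if s >= t_last_to_start and e <= t_first_to_finish]
    nq = len(done)
    throughput = nq / (t_first_to_finish - t_last_to_start)
    avg_response = sum(e - s for s, e in done) / nq
    return nq, throughput, avg_response


# Hypothetical run with MPL = 2 and N = 3 queries per program
programs = [
    [(0, 2), (2, 5), (5, 9)],
    [(1, 4), (4, 6), (6, 8)],
]
nq, throughput, avg_response = throughput_and_response(programs)
print(nq, throughput, avg_response)
```

Restricting the window in this way keeps only the interval in which all MPL programs are active, so the measurement is not distorted by ramp-up and ramp-down.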


EXPERIMENTAL MODALITY

• Simulation vs. real-life testbeds
  • Simulation often provides imprecise results, because many parameters of the system are not accounted for by the simulation programs
  • Testbeds with a very large number of components are very difficult, if not impossible, to organize
  • Use testbeds to tune and calibrate simulation programs?
• Repeatability is essential for the credibility of an experiment (Manolescu & Al.) [8]

A DBMS QUEUING MODEL

[Figure: DBMS queuing model. Diagram nodes: Users; Transaction request; Priority assignment; Restart / Resubmit / Terminate; Commit / Abort; Request/Release a data object; Concurrency control; Block / Wait; DB operation; Buffer access (hit / miss); Disk access; Computation; CPU.]

SCALABILITY

• Systems complexity
  • The memory and time behaviour of algorithms cannot be inferred by testing systems composed of only a handful of nodes: constants matter!

[Figure: growth curves of O(log n), O(n), O(n^2) and O(2^n) as functions of n.]
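A small numerical experiment shows why a handful of nodes can be misleading. The two cost models below are hypothetical (a linear algorithm with a large constant against a quadratic one with a small constant); they are an illustration, not measurements from any system.

```python
import math


def growth(n):
    """Values of the four growth functions of the figure at size n."""
    return {"log n": math.log2(n), "n": n, "n^2": n ** 2, "2^n": 2 ** n}


def linear_cost(n):
    return 1000 * n  # hypothetical O(n) algorithm with a large constant


def quadratic_cost(n):
    return n * n     # hypothetical O(n^2) algorithm with a small constant


# On a testbed of a few nodes the quadratic algorithm looks cheaper ...
small_sizes = [4, 8, 16]
quadratic_wins_when_small = all(
    quadratic_cost(n) < linear_cost(n) for n in small_sizes
)

# ... and it only becomes the more expensive one beyond the crossover size
crossover = next(n for n in range(1, 10_000) if quadratic_cost(n) > linear_cost(n))
print(growth(16), quadratic_wins_when_small, crossover)
```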


DATABASE SYSTEMS BENCHMARKS

• Useful for comparing different DBMSs
• Systems must be fully installed and operational
• They rely on the effectiveness of synthetic workloads
• Benchmarking is an experimental activity which requires three steps (as usual):
  • Design
  • Execution
  • Analysis

DATABASE SYSTEMS BENCHMARKS


The good thing about standards is that there are so many of them

Unknown

… and each one has so many options that can be chosen at will …


DATABASE SYSTEMS BENCHMARKS

Transaction Processing Performance Council (TPC) [16]
• TPC-C: for OLTP systems. It simulates a multi-user environment making concurrent queries to a central database. Suited for on-line handling of orders and for managing inventories.
• TPC-E: similar to TPC-C, but with transactions designed for brokerage environments such as on-line trading, market research, account inquiries, …
• TPC-H: tuned for Decision Support Systems, complex data mining queries, concurrent data modifications.

Other benchmarks exist for specific products (Oracle, MySQL, …) or functionalities (security, web servers, …)

DATA ORGANIZATION AND MANAGEMENT FOR COLLECTING EXPERIMENTAL RESULTS (e-Science) (Vanschoren & Blockeel 10) [14]

• Experimental data collection
  • Create searchable, community-wide repositories to automatically publish experimental results on-line
  • A formal experiment description language to import a large number of experiments and make them immediately available to everyone
  • Ontologies providing a controlled vocabulary that clearly describes the interpretation of each concept
• Generate a collaborative approach to experimentation
  • Experiments freely shared
  • Linked together
  • Reused by querying and data mining

e-SCIENCES

Computationally intensive sciences, which use the internet as a global collaborative workspace
• Bioinformatics
  • Microarrays (Stoeckert & Al. 02) [12]
  • Proteomics (Masseroli 07) [10]
• Astronomy
  • Virtual observatories (Szalay & Gray 01) [13]
• Physics
  • High energy nuclear physics (Brown & Al. 07) [3]

e-SCIENCES

e-Science applications, like other Web Information Systems (WIS), share the collaborative and distributed nature of their development and content management (Curino & Al. 08) [5,6]

• Evolution in time
  • DB migration, while preserving the past contents of the DB and the history of its schema
  • Application maintenance, while allowing legacy applications to access new contents through old schema versions

DATA MODELS AND STRUCTURES MAINTENANCE (Marche 93) [9], (Sjoberg 93) [11], (Curino & Al. 08) [5,6]

• Conceptual/logical level
  • Schema evolution
  • Restructuring
  • Optimization
  • …
• Observational study (analogous to those of the natural sciences) on the evolution of the Wikipedia database schema. Goal:
  • Create a benchmark for schema evolution (and, in general, a standard relational DB dataset)
  • Extend the analysis to several other open-source WIS (Joomla!, TikiWiki, Slashcode, Zen-Cart, Wordpress)
  • Extend the analysis to public scientific DBs (Genome, HGVS)

EXPERIMENTS ON SCHEMA EVOLUTION: the Wikipedia case (Curino & Al. 08) [5,6]

• Schema evolution:
  • 170+ versions in 4.5 years
  • almost 250% increase
• WIS evolve faster than traditional IS
  • 38% w.r.t. [Sjoberg93]
  • 539% w.r.t. [Marche93]

EXPERIMENTS ON SCHEMA EVOLUTION: the Wikipedia case

[Figure: previous queries on new schema, with the major restructuring marked.]

EXPERIMENTS ON SCHEMA EVOLUTION: the Wikipedia case

[Figure: new queries on all previous schema versions, with the major restructuring marked.]

DATA MODELS AND STRUCTURES MAINTENANCE (Babu & Al. 09)[1], (Davcev & Al. 08) [7]

Physical level
• Tables
  • Sorting
  • Clustering
  • …
• Indexes
  • Trees
  • Hashing
  • …
• Memory
  • Shared buffers
  • Cache size
  • …

EXPLORING DATABASE CONTENT

EVERY HUMAN KNOWLEDGE STARTS FROM INTUITIONS, PROCEEDS THROUGH CONCEPTS, AND REACHES ITS CLIMAX WITH IDEAS

I. Kant


KNOWLEDGE HIERARCHY

[Figure: knowledge hierarchy, from DATA (e.g. invoices) through INFORMATION (e.g. sales trend) and KNOWLEDGE (e.g. market rules) up to WISDOM (e.g. strategic decisions). Other labels in the diagram: elements (volume), variables, value added, procedures, statistical processing, knowledge discovery, experience.]

KNOWLEDGE DISCOVERY AND DATA MINING

• Knowledge Discovery in Databases and Data Warehouses
  • To identify the most significant information
  • To show it to the user in the most convenient way
• Data Mining
  • Application of algorithms to raw data in order to extract knowledge (relations, paths, …)
  • Predictive aim (signal analysis, voice recognition, etc.)
  • Descriptive aim (decision support systems, natural sciences)

WHAT KIND OF INFORMATION DO WE GET?

• Associations
  • Set of rules specifying the joint occurrence of two (or more) elements
• Sequences
  • Possibility of stating temporal sequences of events
• Classifications
  • Grouping of elements into classes following a given model
• Clusters
  • Grouping of elements into classes which have not been defined a priori
• Trends
  • Discovery of peculiar temporal paths having a forecasting value

KNOWLEDGE DISCOVERY PROCESS (1)

• Even if specialized tools are available, it requires
  • Competence in the techniques used
  • Very good knowledge of the application domain
• Sequential steps
  • Selection
    • Choice of the sample data the analysis shall focus on
  • Preprocessing
    • Data sampling in order to reduce their volume
    • Data scrubbing for errors and omissions

KNOWLEDGE DISCOVERY PROCESS (2)

• Transformation
  • Homogenization and/or conversion of data types
• Data mining
  • Choice of the method/algorithm
• Interpretation and evaluation
  • Filtering of the retrieved information
  • Possible refinement by repeating previous steps
  • Visual presentation of the search results (graphical or logical)

KNOWLEDGE DISCOVERY PROCESS (3)

[Figure: the knowledge discovery process. RAW DATA → (selection) → TARGET DATA → (pre-processing) → PRE-PROCESSED DATA → (transformation) → TRANSFORMED DATA → (data mining) → CORRELATIONS AND PATHS → (interpretation) → KNOWLEDGE. Source: G. Piatetsky-Shapiro 1996]

DATA MINING ALGORITHMS

• Model representation
  • Formalisms to represent and describe possible paths
• Model evaluation
  • Statistical or logical estimate of the correspondence of a path to the search criteria
• Search method
  • Of parameters
    • Search for the parameters which optimize the evaluation criteria, given the set of observations and the model representation
  • Of model
    • The parameters are applied to models belonging to the same family, differentiated by the representation type, for quality evaluation

THE “MARKET BASKET” MODEL

• The best-known model to which data mining techniques are applied
• Mainly, but not exclusively, used for retail sale problems
• The goal is to discover recurrent patterns in data (association rules)

THE “MARKET BASKET” MODEL

$I = \{i_1, \dots, i_k\}$: set of $k$ elements (items)
$B = \{b_1, \dots, b_n\}$: set of $n$ subsets (baskets) of $I$, with $b_i \subseteq I$

Examples
• I: goods in a supermarket; each basket in B: a customer’s purchase
• I: words in a dictionary; each basket in B: a document in a corpus

ASSOCIATION RULE $i_1 \Rightarrow i_2$
• Support: $i_1$ and $i_2$ appear together in at least $s\%$ of the $n$ baskets
• Confidence: of all the baskets containing $i_1$, at least $c\%$ also contain $i_2$
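Support and confidence follow directly from these definitions. The sketch below is a minimal illustration; the function names `support` and `confidence` and the four sample baskets are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Among the baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)


# Hypothetical supermarket purchases
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})       # bread and milk together
c = confidence(baskets, {"bread"}, {"milk"})  # rule bread => milk
print(s, c)
```

A rule is usually retained only when both values exceed user-chosen thresholds, which is what the "at least s%" and "at least c%" in the definition express.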


THE “MARKET BASKET” MODEL: ANY PROBLEM?

c: coffee is in the basket; ¬c: no coffee in the basket
t: tea is in the basket; ¬t: no tea in the basket

              c      ¬c     Σ rows
t             20     5      25
¬t            70     5      75
Σ columns     90     10     100

IS $t \Rightarrow c$ TRUE???
Support $s = 20\%$; confidence $P[t \wedge c] / P[t] = 20/25 = 80\%$

PERHAPS, BUT ... THOSE WHO BUY COFFEE ANYHOW REACH 90% !!!

WARNING! A CORRELATION EXISTS BETWEEN TEA AND COFFEE: $r = P[t \wedge c] / (P[t] \times P[c]) = 0.20 / (0.25 \times 0.90) \approx 0.89$
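The arithmetic of the tea and coffee example can be checked directly from the contingency table. The variable names are illustrative; the counts are those of the slide.

```python
# Counts from the contingency table (out of 100 baskets)
n_total = 100
n_tea = 25             # baskets with tea
n_coffee = 90          # baskets with coffee
n_tea_and_coffee = 20  # baskets with both

support = n_tea_and_coffee / n_total  # P[t and c]
confidence = n_tea_and_coffee / n_tea  # P[c | t]
p_coffee = n_coffee / n_total          # P[c], the baseline for coffee
r = support / ((n_tea / n_total) * p_coffee)

print(f"support={support:.2f} confidence={confidence:.2f} "
      f"P[c]={p_coffee:.2f} r={r:.2f}")
```

The confidence of 80% looks high in isolation, but it is lower than the 90% of customers who buy coffee anyway, and r is below 1: buying tea makes buying coffee slightly less likely, not more.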


CLASSIFICATION PROBLEM : AN EXAMPLE

Training set

AGE   CAR TYPE   RISK
17    sports     high
43    family     low
68    family     low
32    truck      low
23    family     high
18    family     high
20    family     high
45    sports     high
50    truck      low
64    truck      high
46    family     low
40    family     low

MINE: classification rules

1. IF Age ≤ 23 THEN Risk IS High;
2. IF CarType = sports THEN Risk IS High;
3. IF CarType IN {family, truck} AND Age > 23 THEN Risk IS Low;
4. DEFAULT Risk IS Low

TEST: classification of new data

AGE   CAR TYPE   CLASS
22    family     high
60    family     low
35    sports     high
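The four rules can be applied mechanically, both to the new records and back to the training table. This is a minimal sketch; the function name `classify` is hypothetical, and the rules are evaluated in the order in which the slide lists them.

```python
def classify(age, car_type):
    """Apply the mined rules in order; the first matching rule decides."""
    if age <= 23:
        return "high"                                   # rule 1
    if car_type == "sports":
        return "high"                                   # rule 2
    if car_type in {"family", "truck"} and age > 23:
        return "low"                                    # rule 3
    return "low"                                        # rule 4 (default)


training = [
    (17, "sports", "high"), (43, "family", "low"), (68, "family", "low"),
    (32, "truck", "low"), (23, "family", "high"), (18, "family", "high"),
    (20, "family", "high"), (45, "sports", "high"), (50, "truck", "low"),
    (64, "truck", "high"), (46, "family", "low"), (40, "family", "low"),
]
test = [(22, "family"), (60, "family"), (35, "sports")]

predictions = [classify(age, car) for age, car in test]
correct = sum(classify(age, car) == risk for age, car, risk in training)
print(predictions, f"{correct}/{len(training)} training records matched")
```

The rules reproduce all training records except (64, truck, high), a reminder that a mined model summarizes the data and need not fit every record.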


EFFECTIVENESS

• No established results on metrics and methodologies
• Application dependent
• Context
  • Physical
  • Sociological
• User psychology

DATA QUALITY DIMENSIONS (Cappiello & Schreiber 12) [4]

ACCURACY: the degree of conformity of a measured or computed quantity to its actual (true) value ($|v_{avg} - v_{ref}| < \varepsilon_{acc}$)

PRECISION: the degree to which repeated measurements show the same or similar results (small variance: $\frac{1}{N}\sum_{n=1}^{N}(v_n - \mu)^2 < \varepsilon_{prec}$)

TIMELINESS
• CURRENCY: the time interval from the instant the value was sampled to the instant at which it is sent to the base station
• VOLATILITY: the amount of time during which data remain valid

$Timeliness = \max(1 - Currency/Volatility,\ 0)^{s}$
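The three dimensions translate into short checks on a batch of sensor readings. This is a sketch under stated assumptions: the function names and sample values are hypothetical, and the trailing s of the timeliness formula is read as an exponent with a default of 1.

```python
from statistics import fmean, pvariance


def is_accurate(values, v_ref, eps_acc):
    """Accuracy: the average stays within eps_acc of the reference value."""
    return abs(fmean(values) - v_ref) < eps_acc


def is_precise(values, eps_prec):
    """Precision: the population variance of the readings is below eps_prec."""
    return pvariance(values) < eps_prec


def timeliness(currency, volatility, s=1.0):
    """Timeliness = max(1 - currency/volatility, 0) ** s (s assumed to be an exponent)."""
    return max(1 - currency / volatility, 0.0) ** s


# Hypothetical readings around a reference value of 1.0
readings = [0.98, 1.02, 1.00, 1.01]
accurate = is_accurate(readings, v_ref=1.0, eps_acc=0.05)
precise = is_precise(readings, eps_prec=0.001)
fresh = timeliness(currency=2.0, volatility=10.0)
stale = timeliness(currency=12.0, volatility=10.0)
print(accurate, precise, fresh, stale)
```

Timeliness drops to 0 as soon as the data are older than their validity period, which is what the max with 0 encodes.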


BASIC PRINCIPLES OF A PROPOSED AGGREGATION ALGORITHM

• Accuracy is represented by the window height
• Values falling within the window can be considered similar enough to be fairly represented by their average
• Values falling outside the window are outliers
• Outliers can be occasional or consecutive: in any case, outlier information must be preserved for further investigation

[Figure: measured values v over time t, with the window of height vref ± εacc around vref and sample points marked x.]

CONSIDERED CASES

[Figure: two panels, (a) and (b), plotting v versus t against vref ± εacc for the considered cases: oscillatory / bursty, and expected trend with slow change. Panel (a) also shows the window width W, the window height H and an outlier.]

Considering Z aggregate values and J outliers out of a set of N measures, the algorithm is considered efficient if the output is composed of (Z+J) values instead of N, where (Z+J) << N

ALGORITHM BANDWIDTH

• Compressing data amounts to lowering the bandwidth of the measurement system
• The window width determines the number of measured values which are aggregated
  • 1-point window: no compression, maximum bandwidth
• The window width also determines the timeliness with which data are delivered to the base station

ALGORITHM INPUT/OUTPUT

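INPUT PARAMETERS
• Time series $V = \langle v_1, v_2, \dots, v_n \rangle$
• Expected value $v_{ref}$
• Accuracy tolerance $\varepsilon_{acc}$
• Precision tolerance $\varepsilon_{prec}$
• Window width $N$
• Continuity interval $C$

OUTPUT PARAMETERS
• Aggregate values $T = \langle a_1, t_1 \rangle; \langle a_2, t_2 \rangle; \dots; \langle a_z, t_z \rangle$
• Outliers $O = \langle o_1, t_1 \rangle; \langle o_2, t_2 \rangle; \dots; \langle o_j, t_j \rangle$

ALGORITHM COMPLEXITY: O(N)
ALGORITHM FOOTPRINT: 11 KB RAM; 1 KB ROM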
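The input/output contract can be made concrete with a deliberately simplified window aggregator. This is not the published algorithm: it ignores the precision tolerance and the continuity interval, uses the sample index as timestamp, and the function name `aggregate` and the sample series are hypothetical. It only illustrates how N measures become Z aggregates plus J outliers.

```python
from statistics import fmean


def aggregate(series, v_ref, eps_acc, width):
    """Simplified sketch: average the in-window values of each block of `width`
    samples and report values outside v_ref +/- eps_acc as outliers."""
    aggregates, outliers = [], []
    for start in range(0, len(series), width):
        window = series[start:start + width]
        inside = []
        for offset, value in enumerate(window):
            if abs(value - v_ref) <= eps_acc:
                inside.append(value)
            else:
                outliers.append((value, start + offset))  # keep the outlier and its time
        if inside:
            # one aggregate per window, stamped with the window's last sample index
            aggregates.append((fmean(inside), start + len(window) - 1))
    return aggregates, outliers


# Hypothetical series of N = 8 measures with one spike
series = [1.0, 1.1, 0.9, 5.0, 1.0, 1.0, 1.2, 0.8]
aggregates, outliers = aggregate(series, v_ref=1.0, eps_acc=0.25, width=4)
print(aggregates, outliers)
```

Each sample is visited once, which is consistent with the O(N) complexity stated above; here Z + J = 3 values are transmitted instead of 8.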


EXPERIMENTAL SET UP

[Figure: measurement circuit with supply (+/−), resistor R, impedance Z(t), current i, and voltages v1 and ΔV = v2.]

• 100 Ω < Z_R(t) < 1000 Ω (measured)
• R = 1 Ω
• 0 mV < ΔV < 30 mV
• 0 mA < i < 30 mA (data sheet)
• R + Z_R ≈ Z_R

ALGORITHM BEHAVIOUR

• With aggregation: 7 transmitted values, 30 mJ
• Without aggregation (bypass): 60 transmitted values, 120 mJ

70% ENERGY SAVINGS

COMPARISON CRITERIA (1/2)

Two real-world data sets have been processed using the proposed algorithm and two other aggregation algorithms:
• I. Lazaridis, S. Mehrotra, Capturing Sensor-Generated Time Series with Quality Guarantees, in: ICDE, 2003, pp. 429–439.
• T. Schoellhammer, E. Osterweil, B. Greenstein, M. Wimbrow, D. Estrin, Lightweight Temporal Compression of Microclimate Datasets, in: LCN, 2004, pp. 516–524.

COMPARISON CRITERIA (2/2)

The comparison among the algorithms has been based on three main criteria:
• Compression rate: the degree to which data have been aggregated.
• Energy savings: the degree to which the aggregation allows sensors to save energy with respect to the case in which all the original values are sent to the base station.
• Correctness: the degree to which the aggregated data allow the base station to retrieve the original trend. Correctness has been evaluated by using the Mean Absolute Error (MAE) and the related Mean Absolute Percentage Error (MA%E).
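The compression and correctness criteria can be written as short functions. This is a sketch under assumptions: the slides do not give the exact formula for the compression rate, so it is taken here as the fraction of values not transmitted; MAE and the percentage error use their standard definitions, and the sample series are hypothetical.

```python
def compression_rate(n_original, n_sent):
    """Assumed definition: fraction of the original values that is not transmitted."""
    return 1 - n_sent / n_original


def mae(original, reconstructed):
    """Mean Absolute Error between the original and the reconstructed series."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)


def mape(original, reconstructed):
    """Mean Absolute Percentage Error (original values must be non-zero)."""
    return 100 * sum(abs((o - r) / o)
                     for o, r in zip(original, reconstructed)) / len(original)


# Hypothetical original series and its reconstruction at the base station
original = [2.0, 4.0, 5.0, 10.0]
reconstructed = [2.5, 4.0, 4.0, 10.0]

rate = compression_rate(n_original=60, n_sent=7)
error = mae(original, reconstructed)
pct_error = mape(original, reconstructed)
print(f"compression={rate:.1%} MAE={error} MA%E={pct_error:.2f}%")
```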


DATA SET (A) RESULTS

[Figure: data set (A), C2N2 absorption spectrum. Output of the Cappiello and Schreiber algorithm; V from 0.13 to 0.19 versus t from 0 to 160; regions a, b, c marked.]

DATA SET (A) RESULTS

[Figure: data set (A). Output of the Lazaridis et al. algorithm; V from 0.13 to 0.19 versus t from 0 to 160; regions a, b, c marked.]

DATA SET (A) RESULTS

[Figure: data set (A). Output of the Schoellhammer et al. algorithm; V from 0.13 to 0.19 versus t from 0 to 160; regions a, b, c marked.]

DATA SET (A) RESULTS

[Figure: data set (A), bar charts comparing [Authors], [Lazaridis et al.] and [Schoellhammer et al.]. Panels: compression rate (axis from 60% to 90%) and energy reduction (axis from 0% to 60%). Annotation: MAE in case of non linear trends.]

DATA SET (B) RESULTS

[Figure: data set (B), C2N2 absorption spectrum FM. Input dataset and output of the Cappiello and Schreiber algorithm; values from −8 to 6 versus t from 0 to 160. Annotation: systematic error due to the processing time shift.]

DATA SET (B) RESULTS

[Figure: data set (B). Input dataset and output of the Lazaridis et al. algorithm; values from −8 to 6 versus t from 0 to 160.]

DATA SET (B) RESULTS

[Figure: data set (B). Input dataset and output of the Schoellhammer et al. algorithm; values from −8 to 6 versus t from 0 to 160.]

DATA SET (B) RESULTS

[Figure: data set (B), bar charts comparing [Authors], [Lazaridis et al.] and [Schoellhammer et al.]. Panels: compression rate (axis from 65% to 95%) and energy savings (axis from 0% to 60%).]

SUMMARY COMPARISON AND COMMENTS

No single algorithm is «the best»

Transmission procedures with packet-based protocols can affect the analysis:
  Higher packing factors improve energy efficiency
  Higher transmission delays negatively affect timeliness

Adaptable procedures should be used on the basis of:
  The specific features of the signals to be processed
  The quality requirements of the applications
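The trade-off between packing factor and timeliness can be sketched with a toy model. It assumes that every packet carries a fixed header, that readings are buffered until the packet is full, and that transmitted bytes are a rough proxy for energy. The function name `packing_tradeoff` and all numeric values are hypothetical, chosen only for illustration.

```python
def packing_tradeoff(packing_factor, header_bytes=12,
                     payload_bytes_per_reading=4, sampling_period_s=1.0):
    """Return (bytes sent per reading, worst-case buffering delay in seconds).

    Toy model: one fixed-size header per packet, `packing_factor` readings
    per packet, one reading produced every `sampling_period_s` seconds.
    """
    if packing_factor < 1:
        raise ValueError("packing_factor must be at least 1")
    # The header cost is shared by all readings packed into the same packet.
    bytes_per_reading = payload_bytes_per_reading + header_bytes / packing_factor
    # The first reading in a packet waits until the last one has been sampled.
    worst_case_delay_s = (packing_factor - 1) * sampling_period_s
    return bytes_per_reading, worst_case_delay_s


for k in (1, 2, 4, 8):
    cost, delay = packing_tradeoff(k)
    print(f"packing factor {k}: {cost:.2f} bytes/reading, up to {delay:.0f} s delay")
```

Under these assumptions the per-reading cost falls as the packing factor grows while the worst-case delay rises linearly, so the right setting depends on how much staleness the application tolerates.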


Programs and Data


Philosophy without Science is empty,
Science without Philosophy is blind
I. Kant

PARAPHRASE
Programs without Data are empty,
Data without Programs are blind
F. A. Schreiber


SUMMARY AND CONCLUSIONS


Experiments on Databases and DBMSs for optimizing data structures and management including Data Quality

Data organization and management as a service to the experiments of the scientific community

Experimenting with the Database content itself (data mining)

Experimentation is both:
  a science, because it requires formal and rigorous methodologies, languages, and instruments
  an art, because it requires intuition, imagination, and … it gives emotions


BIBLIOGRAPHICAL REFERENCES


1. Babu S. et al. – Automated Experiment-Driven Management of (Database) Systems – Proc. 12th HotOS, pp. 1–5, 2009

2. Boral H., DeWitt D. J. – A Methodology for Database Systems Performance Evaluation – SIGMOD Record, Vol. 14, n. 2, pp. 176–185, 1984

3. Brown D. et al. – High energy nuclear database: a testbed for nuclear data information technology – Int. Conf. on Nuclear Data for Science and Technology, art. 250, 2007

4. Cappiello C., Schreiber F. A. – Experiments and analysis of quality and energy-aware data aggregation approaches in WSNs – 10th Int. Workshop on Quality in Databases QDB 2012, Istanbul, Aug. 26, 2012, pp. 1–8 http://www.purdue.edu/discoverypark/cyber/qdb2012/papers/7data%20aggregation.pdf

5. Curino C. et al. – Schema Evolution in Wikipedia: Toward a Web Information System Benchmark – Proc. ICEIS, pp. 323–332, 2008

6. Curino C. et al. – Graceful Database Schema Evolution: the PRISM Workbench – Proc. VLDB'08, pp. 761–772, 2008

7. Davcev D. et al. – Experiments in Data Management for Wireless Sensor Networks – Proc. 2nd Int. Conf. on Sensor Technologies and Applications, pp. 198–202, 2008

8. Manolescu I. et al. – The Repeatability Experiment of SIGMOD 2008 – SIGMOD Record, Vol. 37, n. 1, pp. 39–45, 2008


9. Marche S. – Measuring the stability of data models – European Journal of Information Systems, Vol. 2, n. 1, pp. 37–47, 1993

10. Masseroli M. – Management and Analysis of Genomic Functional and Phenotypic Controlled Annotations to Support Biomedical Investigation and Practice – IEEE Transactions on Information Technology in Biomedicine, Vol. 11, n. 4, pp. 376–385, 2007

11. Sjoberg D. I. – Quantifying schema evolution – Information and Software Technology, Vol. 35, n. 1, pp. 35–44, 1993

12. Stoeckert C. et al. – Microarray databases: standards and ontologies – Nature Genetics, Vol. 32, pp. 469–473, 2002

13. Szalay A., Gray J. – The world-wide telescope – Science, Vol. 293, pp. 2037–2040, 2001

14. Vanschoren J., Blockeel H. – Experiment Databases – In: Dzeroski S., Goethals B., Panov P. (Eds.), Inductive Databases and Queries: Constraint-based Data Mining, Chapt. 14, Springer, pp. 335–360, 2010

15. Schwartz B. – The four fundamental performance metrics – PERCONA, 2011 http://www.mysqlperformanceblog.com/2011/04/27/the-four-fundamental-performance-metrics/

16. http://www.tpc.org/information/benchmarks.asp