Sunita Sarawagi IIT Bombay it.iitb.ernet/~sunita

Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita I 3: Intelligent, Interactive Investigation of multidimensional data


Page 1: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Sunita Sarawagi, IIT Bombay

http://www.it.iitb.ernet.in/~sunita

I3: Intelligent, Interactive Investigation of multidimensional data

Page 2: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Multidimensional OLAP databases

Fast, interactive answers to large aggregate queries.

Multidimensional model: dimensions with hierarchies

Dim 1: Bank location: branch --> city --> state

Dim 2: Customer: sub profession --> profession

Dim 3: Time: month --> quarter --> year

Measures: loan amount, #transactions, balance
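A roll-up along one of these hierarchies can be sketched in a few lines. This is only an illustration: the branch names, the branch-to-city mapping, and the loan amounts are made-up values, and `roll_up` is a hypothetical helper, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical branch --> city mapping and loan amount per branch.
branch_to_city = {"Powai": "Mumbai", "Andheri": "Mumbai", "Kothrud": "Pune"}
loan_amount = {"Powai": 120.0, "Andheri": 80.0, "Kothrud": 50.0}

def roll_up(measure, parent_of):
    """Aggregate a measure from one hierarchy level to its parent level."""
    totals = defaultdict(float)
    for member, value in measure.items():
        totals[parent_of[member]] += value
    return dict(totals)

by_city = roll_up(loan_amount, branch_to_city)
```

Applying `roll_up` again with a city-to-state mapping would give the next level of the same hierarchy.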

Page 3: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

OLAP

Navigational operators: Pivot, drill-down, roll-up, select.

Hypothesis-driven search, e.g. factors affecting defaulters:
  view defaulting rate on age, aggregated over other dimensions
  for a particular age segment, detail along profession

Need interactive response to aggregate queries.

Page 4: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Motivation

OLAP products provide a minimal set of tools for analysis:
  simple aggregates
  selects/drill-downs/roll-ups on the multidimensional structure

Heavy reliance on manual operations for analysis:
  tedious on large data with multiple dimensions and levels of hierarchy

GOAL: automate through complex, mining-like operations integrated with OLAP.

Page 5: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

State of the art in mining-OLAP integration

Decision trees [Information discovery, Cognos] find factors influencing high profits

Clustering [Pilot software] segment customers to define hierarchy on that dimension

Time series analysis: [Seagate’s Holos] Query for various shapes along time: spikes, outliers etc

Multi-level Associations [Han et al.] find association between members of dimensions

Page 6: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

The Diff operator

Page 7: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Unravel aggregate data

Total sales dropped 30% in N. America. Why?

What is the most compact answer that user can quickly assimilate?

Page 8: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Solution

A new DIFF operator added to OLAP systems that provides the answer in a single step; the answer is easy to assimilate and compact (configurable by user).

Obviates use of the lengthy and manual search for reasons in large multidimensional data.

Page 9: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Example query

Plat_User (All)  Plat_Type (All)  Platform (All)  Prod_Group (All)  Prod_Category (All)  Product (All)

Sum of Revenue, by Geography (rows) and Year (columns)
Geography       1990  1991  1992   1993   1994
Asia/Pacific    1440  1947  3454   5576   6310
Rest of World   2170  2154  4577   5204   5510
United States   6545  7524  10947  13545  15817
Western Europe  4552  6061  10053  12578  13501

Page 10: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Compact answer

PRODUCT                       PLAT_USER  PLAT_TYPE  PLATFORM                 1990  1991  RATIO  ERROR
(All)-                        (All)-     (All)      (All)                    1620  1820  1.1    34
Operating Systems             Multi      (All)-     (All)                    254   198   0.8    23
Operating Systems             Multi      Other M.   Multiuser Mainframe IBM  98    2     0.0    0
Operating Systems             Single     Wn16       (All)                    94    11    0.1    0
*Middleware & Oth. Utilities  Multi      Other M.   Multiuser Mainframe IBM  101   10    0.1    0
EDA                           Multi      Unix M.    (All)                    0.4   76    211.7  0
EDA                           Single     Unix S.    (All)                    0.1   13    210.8  0

Page 11: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Example: explaining increases

Plat_Type (All)  Geography (All)  Prod_Group Soln

Sum of Revenue, by Prod_Category (rows) and Year (columns: 1990 1991 1992 1993 1994)
Cross Ind. Apps  1975 2484 4564 7407 8150
Home software  294 575
Other Apps  843 1172 3436
Vertical Apps  898 1461 2827 7947 8663

Page 12: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Compact answer

PRODUCT                   GEOGRAPHY       PLAT_TYPE  PLATFORM  1992  1993  RATIO  ERROR
(All)-                    (All)-          (All)-     (All)     2113  2763  1.3    200
Manufacturing - Process   (All)           (All)      (All)     26    702   27.1   250
Other Vertical Apps       (All)-          (All)-     (All)     20    1858  91.4   251
Other Vertical Apps       United States   Unix S.    (All)     8     77    9.6    0
Other Vertical Apps       Western Europe  Unix S.    (All)     7     96    13.2   0
Manufacturing - Discrete  (All)           (All)      (All)           1135         0
Health Care               (All)-          (All)-     (All)-    7     820   118.2  98
Banking/Finance           United States   Other M.   (All)     341   239   0.7    60
Mechanical CAD            United States   (All)      (All)     328   243   0.7    34

Page 13: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Model for summarization

The two aggregated values correspond to two subcubes in detailed data.

Products (All)  Geography (All)

Year  1990  1991  1992
      2000  1800

Cube-A (Year 1990)
Geography  OS   DBMS  Prog
Asia       100  80    80
USA        100  200   400
UK         140  100   56

Cube-B (Year 1991)
Geography  OS   DBMS  Prog
Asia       80   90    70
USA        120  240   480
UK         140  60    56

Page 14: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Detailed answers

PRODUCT  GEOGRAPHY  PLATFORM  Y1992 Y1993 RATIO
Other Vertical Apps  Western Europe  Multiuser Minicomputer OpenVMS  99.9
Other Vertical Apps  Asia/Pacific  Single-user MAC OS  92.5
Other Vertical Apps  Rest of World  Multiuser Mainframe IBM  88.1
Other Vertical Apps  Western Europe  Single-user UNIX  7.3 96.3 13.2
Other Vertical Apps  United States  Multiuser Minicomputer Other  97.2
Other Vertical Apps  United States  Multiuser Minicomputer OS/400  99.5
Other Vertical Apps  Asia/Pacific  Multiuser Minicomputer OS/400  99.6
EDA  Western Europe  Multiuser UNIX  192.6 277.8 1.4
Manufacturing - Discrete  United States  Multiuser Mainframe IBM  88.4

The detailed rows explain only 15% of the total difference, as against 90% with the compact answer.

Page 15: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Summarizing similar changes

Product Manufacturing - Process

Sum of Revenue, by Plat_Type (rows) and Year (columns)
Plat_Type  1992  1993
Other M.   10    473
Other S.   0     22
Unix M.    7     105
Unix S.    1     17
Wn16       3     85
Wn32       0

PRODUCT                   GEOGRAPHY       PLAT_TYPE  PLATFORM  YEAR_1992  YEAR_1993  RATIO  ERROR
(All)-                    (All)-          (All)-     (All)     2113.0     2763.5     1.3    200
Manufacturing - Process   (All)           (All)      (All)     25.9       702.5      27.1   250
Other Vertical Apps       (All)-          (All)-     (All)     20.3       1858.4     91.4   251
Other Vertical Apps       United States   Unix S.    (All)     8.1        77.5       9.6    0
Other Vertical Apps       Western Europe  Unix S.    (All)     7.3        96.3       13.2   0
Manufacturing - Discrete  (All)           (All)      (All)                1135.2            0
Health Care               (All)-          (All)-     (All)-    6.9        820.4      118.2  98
Banking/Finance           United States   Other M.   (All)     341.3      239.3      0.7    60
Mechanical CAD            United States   (All)      (All)     327.8      243.4      0.7    34

Page 16: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

MDL model for summarization

Given N, find the best N rows of answer such that: if user knows cube-A and answer, number of bits needed to send cube-B is minimized.

Cube-A (Year 1990)
Geography  OS   DBMS  Prog
Asia       100  80    80
USA        100  200   400
UK         140  100   56

Cube-B (Year 1991)
Geography  OS   DBMS  Prog
Asia       90   80    70
USA        89   39    67
UK         140  60    56

N row answer

Page 17: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Transmission cost: MDL-based

Each answer entry has a ratio computed from the sums of the measure values in cube-B and cube-A that are not covered by a more detailed entry in the answer.

For each cell of cube-B not in the answer:
  r: ratio of the closest parent in the answer
  a (b): measure value in cube-A (cube-B)
  Expected value of b = a r
  #bits = -log(prob(b, ar)), where prob(x, u) is the probability of value x under a distribution with mean u
  We use a Poisson distribution when the x are counts, and a normal distribution otherwise
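The per-cell cost can be illustrated with a small sketch for the Poisson case. The function name `poisson_bits`, the choice of base-2 logarithm, and the sample numbers are assumptions made for illustration; the slides do not fix them.

```python
import math

def poisson_bits(b: int, a: float, r: float) -> float:
    """Bits needed to send count b when the receiver expects mean a * r."""
    mean = a * r
    if mean <= 0:
        # Degenerate case: a zero mean can only explain b == 0.
        return 0.0 if b == 0 else math.inf
    # Natural log of the Poisson pmf: b*ln(mean) - mean - ln(b!)
    log_pmf = b * math.log(mean) - mean - math.lgamma(b + 1)
    return -log_pmf / math.log(2)

# A cell whose value matches the parent's ratio is cheap to send ...
close = poisson_bits(b=120, a=100, r=1.2)
# ... while a cell that deviates strongly from it is expensive.
far = poisson_bits(b=400, a=100, r=1.2)
```

Cells that the closest parent's ratio predicts well contribute few bits, so an answer row is worth adding when it lowers the total over the cells it covers.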

Page 18: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Algorithm

Challenges
  Circular dependence on parent's ratio
  Bounded size of answer
  Greedy methods do not work

Bottom-up dynamic programming algorithm

\[ D(T, N, r) = \min_{0 \le n \le N} \big( D(T - T', N - n, r) + D(T', n, r) \big) \]

\[ D(T, N) = \min_{r} D(T, N, r) \]

With 2-level hierarchy:

\[ A(T, N, r) = \min\Big( D(T, N, r),\ \min_{r'} \big( D(T, N - 1, r') + \mathrm{aggr}(T) \big) \Big) \]
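One step of the bottom-up dynamic program can be sketched as a budget split, assuming the recurrence divides the N answer rows between a new group of tuples and the tuples already processed. The function `combine` and the cost tables below are hypothetical; the real costs would come from the MDL transmission cost.

```python
def combine(cost_rest: list[float], cost_group: list[float]) -> list[float]:
    """Min-plus combination of two cost tables indexed by row budget.

    cost_rest[n]  : best cost for the tuples seen so far using n answer rows
    cost_group[n] : best cost for the new group using n answer rows
    Returns the table for the union of both sets of tuples.
    """
    max_rows = len(cost_rest) - 1
    merged = []
    for total in range(max_rows + 1):
        # Try every way of giving n rows to the new group and the rest to the others.
        best = min(cost_rest[total - n] + cost_group[n] for n in range(total + 1))
        merged.append(best)
    return merged

# Hypothetical cost tables for budgets N = 0, 1, 2 (lower is better).
rest = [10.0, 6.0, 5.0]
group = [8.0, 2.0, 1.0]
table = combine(rest, group)
```

Only one table of N + 1 entries per hierarchy level has to be kept while scanning, which is consistent with memory use that does not grow with the number of tuples.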

Page 19: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

[Figure: tuples in detailed data grouped by common parent. Tables with entries N=0, N=1, N=2 are kept at Level 0 and Level 1; tuple i is added to the current table and the minimum is taken; a new group is formed when the run of tuples with the same parent ends.]

Page 20: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Integration

Single pass on data --- all indexing/sorting in the DBMS: interactive.

Low memory usage: independent of number of tuples: O(NL)

Easy to package as a stored procedure on the data server side.

When detailed subcube too large: work off aggregated data.

Page 21: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Performance

80% time spent in data access. Quarter million records processed in 10 seconds

[Figure: time in seconds (0 to 80) versus number of tuples in query subcube (0 to 350,000); series: Alg, Total, DataAccess]

Setup: 333 MHz Pentium, 128 MB memory; data on DB2 UDB, NT 4.0

OLAP benchmark: 1.36 million tuples, 4 dimensions

Page 22: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

The Relax operator

Page 23: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Example query: generalizing drops

Page 24: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita
Page 25: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Ratio generalization

Page 26: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Problem formulation

Inputs
  A specific tuple Ts
  An upper bound N on the answer size
  Error functions
    R(Ts,T) measures the error of including a tuple T in a generalization around Ts
    S(Ts,T) measures the error of excluding T from the generalization

Goal
  To find all possible consistent and maximal generalizations around Ts

Page 27: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Algorithm

Considerations
  Need to exploit the capabilities of the OLAP data source
  Need to reduce the amount of data fetches to the application

2-stage approach
  Finding generalizations
  Getting exceptions

Page 28: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Finding generalizations

n = number of dimensions

Li = levels of hierarchy of dimension Di

Dij = jth level in the ith dimension hierarchy

candidate_set ← {D11, D21, …, Dn1} // all single-dimension candidate generalizations

k ← 1
while (candidate_set ≠ ∅)
    for each g ∈ candidate_set
        if (Σ_{T∈g} S(Ts,T) > Σ_{T∈g} R(Ts,T)) Gk ← Gk ∪ g
    // generate candidates for pass (k+1) from the generalizations of pass k
    candidate_set ← generateCandidates(Gk) // Apriori style
    // if generalization is possible at level j of dimension Di, add its parent level to the candidate set
    candidate_set ← candidate_set ∪ {Di(j+1) | Dij ∈ Gk & j < Li}
    k ← k + 1

Return ∪i Gi
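The level-wise loop above can be sketched in Python under a simplifying assumption: each candidate is just a set of dimensions to generalize, and hierarchy levels are ignored. The function name `find_generalizations`, the `accepts` callback (standing in for the test that the summed S exceeds the summed R), and the dimension names are all hypothetical.

```python
from itertools import combinations

def find_generalizations(dims, accepts):
    """Level-wise search over sets of dimensions to generalize.

    dims    : iterable of dimension names
    accepts : callable(frozenset) -> bool, True when the summed exclusion
              error S exceeds the summed inclusion error R for the candidate
    """
    candidates = [frozenset([d]) for d in dims]
    accepted_all = []
    while candidates:
        accepted = [g for g in candidates if accepts(g)]
        accepted_all.extend(accepted)
        accepted_set = set(accepted)
        # Apriori-style join: a (k+1)-set is a candidate only if every
        # k-subset of it was accepted in this pass.
        next_candidates = set()
        for g1, g2 in combinations(accepted, 2):
            union = g1 | g2
            if len(union) == len(g1) + 1 and all(
                union - {d} in accepted_set for d in union
            ):
                next_candidates.add(union)
        candidates = sorted(next_candidates, key=sorted)
    return accepted_all

# Toy acceptance test: only Geography and Platform can be generalized.
result = find_generalizations(
    ["Geography", "Platform", "Product"],
    lambda g: g <= {"Geography", "Platform"},
)
```

In a real setting `accepts` would issue an aggregate query to the OLAP source for each candidate, so pruning candidates early reduces the number of data fetches.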

Page 29: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Finding Summarized Exceptions

Goal
  Find exceptions to each maximal generalization, compacted to within N rows and yielding the minimum total error

Challenges
  No absolute criteria for determining whether a tuple is an exception or not, for all possible R functions
  Worth of including a child tuple is circularly dependent on its parent tuple
  Bounded size of answer

Solution
  Bottom-up dynamic programming algorithm

Page 30: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Single dimension with multiple levels of hierarchies

Optimal solution for finite domain R functions

soln(l,n,v) : the best solution for subtree l for all n between 0 and N and all possible values of the default rep.

soln(l,n,v,c) : the intermediate value of soln(l,n,v) after the 1st to the cth child of l are scanned

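\[ \mathrm{Err}(\mathrm{soln}(l,n,v,c+1)) = \min_{0 \le k \le n} \big( \mathrm{Err}(\mathrm{soln}(l,k,v,c)) + \mathrm{Err}(\mathrm{soln}(c+1,n-k,v)) \big) \]

\[ \mathrm{Err}(\mathrm{soln}(l,n,v)) = \min\Big( \mathrm{Err}(\mathrm{soln}(l,n,v,*)),\ \min_{v' \ne v} \big( \mathrm{Err}(\mathrm{soln}(l,n-1,v',*)) + \mathrm{rep}(v') \big) \Big) \]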

Page 31: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

[Figure: worked example of the dynamic program with N=3 on a two-level hierarchy. Node 1 has children 1.1 (+), 1.2 (-), 1.3 (+) and 1.4 (+), each with leaf tuples labelled + or -. For every subtree the tables soln(1.1,3,*), soln(1.2,3,*), soln(1.3,3,*) and soln(1.4,3,*) list, for N=0 to N=3, the chosen exception rows and the error under default + and default -; these are combined child by child into the solution at node 1.]

Page 32: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

The Inform operator

Page 33: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

User-cognizant data exploration: overview

Monitor to find regions of data user has visited

Model user's expectation of unseen values

Report most informative unseen values

How to

Model expected values?

Define information content?

Page 34: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Modeling expected values

OS DB Word Prog

Asia 10 10 10 10

Afric 10 10 10 10

USA 10 10 10 10

UK 10 10 10 10

OS DB Word Prog

Asia 5 5 5 5

Afric 20 20 20 20

USA 12 12 12 12

UK 3 3 3 3

OS DB Word Prog

Asia 4 8 7 1

Afric 10 20 30 20

USA 5 9 1 33

UK 1 3 2 6

OS DB Word Prog

Asia

Afric

USA

UK

Database hidden from user

Views seen by user

All

All 160

All

Asia 20

Afric 80

USA 48

UK 12

OS DB Word Prog

ALL 20 40 40 60

Page 35: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

The Maximum Entropy Principle

Choose the most uniform distribution while adhering to all the constraints

E. T. Jaynes [1990]: it agrees with everything that is known but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom…

Characterizing uniformity:

\[ H(p) = -\sum_i p_i \log p_i \]

maximum when all p_i are equal

Solve the constrained optimization problem: maximize H(p) subject to k constraints
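A short numerical check of the uniformity claim, assuming the standard Shannon form H(p) = -sum_i p_i log p_i with natural logarithms. The two example distributions are made up for illustration.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, taking 0 log 0 as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = entropy(uniform)  # equals log(4), the maximum for four outcomes
h_skewed = entropy(skewed)    # strictly smaller
```

Each constraint derived from a visited view restricts the set of admissible distributions, and the expected values are those of the most uniform distribution left in that set.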

Page 36: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Modeling expected values

OS DB Word Prog

Asia 4 8 7 1

Afric 10 20 30 20

USA 5 9 1 33

UK 1 3 2 6

All

All 160

All

Asia 20

Afric 80

USA 48

UK 12

OS DB Word Prog

Asia 5 5 5 5

Afric 20 20 20 20

USA 12 12 12 12

UK 3 3 3 3

OS DB Word Prog

ALL 20 40 40 60

Database

Visited views

OS DB Word Prog

Asia 2.5 5 5 7.5

Afric 10 20 20 30

USA 6 12 12 18

UK 1.5 3 3 4.5

Prog

Usa 33

OS DB Word Prog

Asia 3 6 6 4.8

Afric 12 24 24.3 19.2

USA 3 6 6 33

UK 2 4 3.6 3

OS DB Word Prog

Asia 10 10 10 10

Afric 10 10 10 10

USA 10 10 10 10

UK 10 10 10 10

Page 37: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Change in entropy

[Figure: entropy (axis from 2 to 2.8) of the expected values after visited views View 1, View 2, View 3 and View 4, compared with the entropy of the data]

Page 38: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Finding expected values

Solve the constrained optimization problem: maximize H(p) subject to k constraints

Each constraint is of the form: sum of arbitrary sets of values

Expected values can be expressed as a product of k coefficients, one from each of the k constraints:

\[ p_i = \prod_{j=0}^{k} \mu_j^{I_j(i)} \]

Page 39: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Iterative scaling algorithm

Initially all p values are the same

While convergence not reached
    For each constraint C_i in turn
        Scale p values included in C_i by \( \tilde{p}(C_i) / p(C_i) \), the actual sum for C_i divided by the current expected sum

Converges to optimal solution when all constraints are consistent.
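The scaling loop can be sketched as follows, using the Geography by Product totals from the slides as constraints. The function name `iterative_scaling`, the cell-index representation of a constraint, and the fixed number of sweeps (in place of a convergence test) are assumptions of this sketch.

```python
def iterative_scaling(rows, cols, constraints, start, sweeps=100):
    """Fit expected cell values to sum constraints by iterative scaling.

    constraints : list of (cells, target), where cells is a list of (i, j)
                  indices and target is the actual sum over those cells
    start       : initial value given to every cell
    """
    p = [[start] * cols for _ in range(rows)]
    for _ in range(sweeps):
        for cells, target in constraints:
            current = sum(p[i][j] for i, j in cells)
            if current == 0:
                continue  # nothing to scale
            factor = target / current  # actual sum / current expected sum
            for i, j in cells:
                p[i][j] *= factor
    return p

row_totals = [20, 80, 48, 12]  # Asia, Afric, USA, UK
col_totals = [20, 40, 40, 60]  # OS, DB, Word, Prog
constraints = (
    [([(i, j) for j in range(4)], t) for i, t in enumerate(row_totals)]
    + [([(i, j) for i in range(4)], t) for j, t in enumerate(col_totals)]
)
expected = iterative_scaling(4, 4, constraints, start=10.0)
```

With only row and column totals the result is the product form row total times column total divided by 160, for example 2.5 for Asia/OS and 18 for USA/Prog. A visited cell can be added as a one-cell constraint such as `([(2, 3)], 33.0)`.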

Page 40: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

[Figure: worked example of iterative scaling on the 4 x 4 Geography (Asia, Afric, USA, UK) by Product (OS, DB, Word, Prog) cube. Starting from all expected values equal to 10, the values are scaled in turn to match the row totals (Asia 20, Afric 80, USA 48, UK 12), the column totals (OS 20, DB 40, Word 40, Prog 60) and the visited cell USA/Prog = 33; after each step the current sums are compared with these actual values and the scaling is repeated.]

Page 41: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Information content of an unvisited cell

Defined as how much adding it as a constraint will reduce the distance between actual and expected values

Distance between actual and expected:

\[ D(\tilde{p}, p^k) = \sum_i \tilde{p}_i \log \frac{\tilde{p}_i}{p_i^k} \]

Information content of the (k+1)th constraint C_{k+1}:

\[ D(\tilde{p}, p^k) - D(\tilde{p}, p^{k+1}) \]

Can be approximated as:

\[ \tilde{p}(C_{k+1}) \log \frac{\tilde{p}(C_{k+1})}{p^k(C_{k+1})} \]
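The distance between actual and expected values can be computed directly on the example cube from the slides. This sketch assumes the distance is the Kullback-Leibler divergence over cell values normalized to sum to 1; `kl_divergence` and `normalize` are hypothetical helper names.

```python
import math

def kl_divergence(actual, expected):
    """D(actual, expected) = sum_i actual_i * log(actual_i / expected_i)."""
    return sum(a * math.log(a / e) for a, e in zip(actual, expected) if a > 0)

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

# Actual database (rows Asia, Afric, USA, UK; columns OS, DB, Word, Prog).
actual = normalize([4, 8, 7, 1, 10, 20, 30, 20, 5, 9, 1, 33, 1, 3, 2, 6])
# Expected values when only the grand total of 160 is known.
uniform = normalize([10] * 16)
# Expected values after the row-total and column-total views are visited.
after_views = normalize(
    [2.5, 5, 5, 7.5, 10, 20, 20, 30, 6, 12, 12, 18, 1.5, 3, 3, 4.5]
)

# Information gained from the two views: the drop in distance to the data.
gain = kl_divergence(actual, uniform) - kl_divergence(actual, after_views)
```

The most informative unvisited cell is the one whose addition as a constraint would give the largest such drop.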

Page 42: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Information content of unseen data

[Figure: information content (axis from 0 to 0.03) of each unseen cell, by Geography (Asia, Afric, USA, UK) and Product (OS, DB, Word, Prog)]

OS DB Word Prog

Asia 4 8 7 1

Afric 10 20 30 20

USA 5 9 1 33

UK 1 3 2 6

OS DB Word Prog

Asia 3 6 6 4.8

Afric 12 24 24.3 19.2

USA 3 6 6 33

UK 2 4 3.6 3

Page 43: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Adapting for OLAP data: Optimization 1: Expand expected cube on demand

Single entry for all cells with same expected value

Initially everything is aggregated, but the early constraints touch a lot of data

Later constraints touch limited amount of data.

All

All 160

All

Asia 20

Afric 80

USA 48

UK 12

OS DB Word Prog

ALL 20 40 40 60

OS DB Word Prog

Asia 2.5 5 5 7.5

Afric 10 20 20 30

USA 6 12 12 18

UK 1.5 3 3 4.5

All

All 160

All

Asia 20

Afric 80

USA 48

UK 12

Expected cube Views

Page 44: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Optimization 2: Reduce overlap

Number of iterations depends on overlap between constraints

Remove subsumed constraints from their parents to reduce overlap

OS DB Word Prog

Asia 4 8 7 1

Afric 10 20 30 20

USA 5 9 1 33

UK 1 3 2 6

All

Asia 20

Afric 80

USA 48

UK 12

Prog

Usa 33

OS DB Word Prog

Asia 4 8 7 1

Afric 10 20 30 20

USA 5 9 1 33

UK 1 3 2 6

All

Asia 20

Afric 80

USA 15

UK 12
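The overlap reduction in this example (the USA total of 48 becoming 15 once the visited cell USA/Prog = 33 is taken out) can be sketched as below. The function `remove_subsumed` and the cell naming are hypothetical.

```python
def remove_subsumed(parent_cells, parent_sum, child_cells, child_sum):
    """Shrink a parent constraint so it no longer overlaps a child constraint."""
    child = set(child_cells)
    if not child <= set(parent_cells):
        raise ValueError("child constraint is not contained in the parent")
    remaining = [c for c in parent_cells if c not in child]
    return remaining, parent_sum - child_sum

usa_row = [("USA", prod) for prod in ("OS", "DB", "Word", "Prog")]
cells, total = remove_subsumed(usa_row, 48, [("USA", "Prog")], 33)
```

After this step the USA row constraint and the USA/Prog cell constraint are disjoint, so scaling one no longer disturbs the other.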

Page 45: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Finding N most informative cells

In general, the most informative cells can be any value from any level of aggregation.

Single-pass algorithm that finds the best difference between actual and expected values [VLDB-99]

Page 46: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Information gain with focussed exploration

[Figure: three plots of relative square error versus constraint number; legend: Random, MaxEntropy]

Page 47: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Illustration from Student enrollment data

Student Sex Program Department Year
Category (9) Sex (2) Name (10) Name (28) Year (10) Category (3)

Sum 8206

PROGRAM    Total
2 Yr M.Sc  9.21%
B.Tech     33.87%
M.Tech     37.31%
Ph.D       11.60%
Others     1.60%

SEX  Total
F    10.25%
M    89.75%

CATEGORY             Total
Full time Sponsored  5.28%
Indian               81.55%
Others               1.90%

Year    Total
1989    19.00%
Others  9.00%

35% of information in data captured in 12 out of 4560 cells: 0.25% of data

Page 48: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Top few surprising values

Category  Sex  Program  Dept
Indian    M             Computer Science & Engineering
Indian    M             Metallurgical Engineering & Mat.Sc.
               M.Mgnt.  School of Management
Indian    M    M.Tech   Civil Engineering
Indian    M    M.Tech   Chemical Engineering
          M             Bio-Technology

80% of information in data captured in 50 out of 4560 cells: 1% of data

Page 49: Sunita  Sarawagi IIT Bombay it.iitb.ernet/~sunita

Summary

Our goal: enhance OLAP with a suite of operations that are
  richer than simple OLAP and SQL queries
  more interactive than conventional mining
...and thus reduce the need for manual analysis

Proposed three new operators: Diff, Generalize, Surprise
  Formulations with theoretical basis
  Efficient algorithms for online answering
  Integrate smoothly with existing systems.

Future work: More operators.