Post on 04-Feb-2016
description
QUERY BY EXCEL
A. Witkowski, S. Bellamkonda, T. Bozkaya, B. A. Naimat, L. Sheng,
S. Subramanian, A. Waingold
Oracle Corporation
Spreadsheets
Spreadsheets are established analytical tools:– Attractive user interface
– Easy to use computational model
– Interactivity for what if analysis
But, they do not offer:–Scalability
–Parallelization
–A unified view of the data model
Our proposal
QUERY BY EXCEL (QBX)
Combines Presentational interactive modeling power
of Excel (spreadsheet tools) Computational power and scalability of
RDBMS via analytical extensions
QBX – How it works Analyst builds a model using Excel. The model
is translated to SQL and stored in relational views.
Analysts designate areas in Excel as relational sources (RTables). An RTable can be transformed into another RTable using Excel operations corresponding to Outer Join, Selection, Projection and Aggregation. Analyst does not write any SQL during this process.
Analysts write Excel formulas on samples of relational sources that fit in a spreadsheet. The translated SQL works on the whole data for scalability.
Business Reporting tools can access the relational views for consolidation.
Analytic SQL Extensions for QBX SQL MODEL (Witkowski, et al. Sigmod
2003) SQL PIVOT (Cunningham, et al. Vldb
2004)
QBX Architecture
Excel Analyst
RDBMS Interaction &
Modeling
Persistence
EXCEL
Excel ->
SQL Translation
QBX
Database
Schema
QBX generated
SQL Objects
RDBMS
Application RDBMS User
QBX Metadata
Cells( eid, sheet, row, col, x, f) A B C D
1 Sale Diff
2 10.00
3 12.00
=C3-C2
For this Excel spreadsheet, we store five rows in Cells Table: C1, C2, C3, D1, D3
RTables( eid, RTable, sheet, row, col, sample, RTableView, …)
eid sheet
row col x f
1 1 1 3 ‘Sale’
1 1 1 4 ‘Diff’
1 1 2 3 ’10.00’
1 1 3 3 ’12.00’
1 1 3 4 ‘=C3-C2’
Excels( eid, name, owner, ExcelBinary, SQLView)
QBX Infrastructure
Interaction and Modeling Component (VBA add on)
– Menu interface (QBX)– QBX->Rtables manages Rtables (import, add
column, save as relational view..)– QBX->Spreadsheet translates Excel to SQL,
saves and loads it.
Persistence Component (VBA add-on) Translation Component
Excel to SQL Translation
Fix Frame translation Table Translation Unified Translation
SELECT sheet, row, col, x FROM cellsMODEL DBY (sheet,row,col) MEA (x) RULES AUTOMATIC ORDER( x[1,1,2] = x[1,1,1] + x[1,2,1], -- B1=A1+A2 x[1,2,2] = x[1,3,1] + 1, -- B2=A3+1 x[1,3,2] = sum(x)[1,1<=row<=3,1] –- B3=sum(A1:A3));
Fix Frame TranslationA B C
1 1 =A1+A2
2 2 =A3+1
3 3 =sum(A1:A3)
Fix Frame Translation-VLOOKUP - HLOOKUPWe use REFERENCE SQL MODEL:VLOOKUP(key, (<rs,cs>, <re,ce>), col):
REFERENCE vlookup_ref ON
( SELECT k.x key, v.x value
FROM cells k, cells v
WHERE k.col=cs AND v.col=cs+col-1 AND
k.row >= rs AND k.row <= re AND
v.row=k.row )
DIMENSION BY (key) MEASURES (value)
Fix Frame Translation-VLOOKUP - HLOOKUP
EXAMPLE: A3 =Vlookup(C3, A1:B4, 2)
SELECT row, col, x FROM cellsMODELREFERENCE vlookup_ref ON (SELECT k.x key,v.x value FROM cells k,cells v WHERE k.col = 1 AND v.col = 2 AND k.row >= 0 AND k.row <= 4 AND v.row = k.row) DIMENSION BY(key) MEASURES(value)MAIN DIMENSION BY (row, col) MEASURES (x)RULES
( x[3,1] = vlookup_ref.value[ x[3,3] ] );
A B C D
1 1 2 3 4
2 11 6 7 8
3 9 10 11 12
4 13 14 15 16
5 17 18 19 20
6
C3
Table Translation
Table Translation creates named, protected regions within Excel named RTables.
We remember associated metadata for RTable regions (PK, PK-FK constraints, etc..)
A direct RTable represents a– An RDBMS table (entire table or sample)– An RDBMS view (Direct Rtables can be created through QBX menu)
A derived RTable represents the result of relational operations on other RTables.
Rtable ExampleA B C D … I J
1 FACT TIME_D
2 City Prod Month Sale Month Year
3 LA tv M1.00 10.00 M1.00 Y.00
4 LA radio M2.00 12.00 M2.00 Y.00
5 LA tv M1.01 14.00
6 LA radio M2.01 16.00 PROD_D
7 Boston tv M1.00 20.00 Prod Categ
8 Boston radio M2.00 22.00 tv Video
9 Boston tv M1.01 24.00 radio Audio
10 Boston radio M2.01 26.00
11 REGION_D
12 City State
13 LA CA
14 Boston MA
FACT(A2:D10)
TIME_D(I2:J4)
PROD_D(I7:J9)
REGION_D(I12:J14)
Table Translation - Operations Inter-column calculations
Adding new (calculated) column to an Rtable Projection Joining of Rtables
The closest Excel operation to join is Hlookup/Vlookup,
Which is similar to relational OUTER JOIN.Steps: (R1 LEFT OUTER JOIN R2 ON R1.col1=R2.col2)1. A new column is added.2. The new column is populated with
(VLOOKUP(R1.Col1, R2, R2.Col2) Aggregation
Inter-column Calculations
Computations involving columns of the same row. EX: A1 = B1+D1A B C D
1 10 8 2
2 15 9 6
3 18 2 16
… … … …MODELDBY(row,col) MEASURES (x)RULES( x[ANY, 1] = x[cv(row),2] + x[cv(row),4])
Table Translation – Join ExampleA B C D E F G … I J
1 FACT TIME_D
2 City Prod Month State
Categ
Year Sale Month Year
3 LA tv M1.00 CA video Y.00 10.00
M1.00 Y.00
4 LA radio
M2.00 CA audio Y.00 12.00
M2.00 Y.00
5 LA tv M1.01 CA video Y.01 14.00
6 LA radio
M2.01 CA audio Y.01 16.00
PROD_D
7 Boston
tv M1.00 MA video Y.00 20.00
Prod Categ
8 Boston
radio
M2.00 MA audio Y.00 22.00
tv Video
9 Boston
tv M1.01 MA video Y.01 24.00
radio Audio
10 Boston
radio
M2.01 MA audio Y.01 26.00
E3=VLOOKUP(B3,I8:J9,2)
F3=VLOOKUP(C3,I3:J4,2)
Join SQL
SELECT f.city, f.prod, f.month, g.state, p.categ, t.year, sale,
row_number() over (order by city, prod, month) rn
FROM
fact f outer join time_d t on f.month = t.month
outer join prod_d p on f.prod = p.prod
outer join geog_d g on f.city = g.city
ORDER BY city NULLS LAST, prod NULLS LAST,
month NULLS LAST;
Table Translation- Aggregation Aggregation in Excel can be done
through DATA PIVOTTABLE operation.This corresponds to (in RDBMS):
Aggregation via SQL GROUP BY operator Aggregation via SQL PIVOT operator
Aggregation – SQL GROUP BYL M N
1 AGG_Q
2 State Year
Total
3 CA Y.00 22.00
4 Y.01 30.00
5 52.00
6 MA Y.00 42.00
7 Y.01 50.00
8 92.00
SELECT state, year, sum(amt) amt,
row_number() over (order by state,year) rn
FROM
fact f outer join time_d t on f.month = t.month
outer join prod_d p on f.prod = p.prod
outer join geog_d g on f.city = g.city
GROUP BY
GROUPING SETS ((state,year),(state))
ORDER BY state NULLS LAST,
year NULLS LAST;
Translation of Fix Frame Operations on RTables
Excel computation is possible once we map relational data to 2-D form. This is called linearization.
Assignment Linearization Reference Linearization
Reference Linearization
L M N O
1 AGG_Q
2 State Year Total Ratio
3 CA Y.00 22.00
=N3/N5
4 Y.01 30.00
=N4/N5
5 52.00
=N5/N5
6 MA Y.00 42.00
=N6/N8
7 Y.01 50.00
=N7/N8
8 92.00
=N8/N8
SELECT row, col, x FROM cells
MODEL
REFERENCE r ON
( SELECT rn, state, time, total FROM RT )
DIMENSION BY (rn)
MEASURES (state, time, total)
DIMENSION BY (row, col) MEASURES (x)
( x[3, 15] = r.total[1] / r.total[3], -- =N3/N5
x[4, 15] = r.total[2] / r.total[3], -- =N4/N5
x[5, 15] = r.total[3] / r.total[3], -- =N5/N5
x[6, 15] = r.total[4] / r.total[6], -- =N6/N5
x[7, 15] = r.total[5] / r.total[6], -- =N6/N5
x[8, 15] = r.total[6] / r.total[6] -- =N8/N5
);
Relative Referencing to RTables
Introducing a new lookup function for referencing values in Rtables:
RTLOOKUP(RTREGION, COL, {PKEYS})
L M N O
1 AGG_Q
2 State Year
Total Ratio
3 CA Y.00 22.00
0.42
4 Y.01 30.00
0.58
5 52.00
1.00
6 MA Y.00 42.00
0.48
7 Y.01 50.00
0.52
8 92.00
1.00
O3 = N3/rtlookup(L2:N8,3,L3,NULL)
O4 = N3/rtlookup(L2:N8,3,L3,NULL)
Relative Referencing to RTablesCREATE VIEW AGG_Q ASSELECT state, year, ratio, sum(amt) amt, row_number() over () (order by state, year) rnFROM fact f outer join time_d t on f.month=t.month outer join prod_d p on f.prod = p.prod outer join geog_d g on f.city = g.cityGROUP BY GROUPING SETS ((state,year),(state))MODEL DBY (state, year) MEA (total, 0 ratio)( ratio[ANY,ANY] = total[CV(state),CV(year)]/total[CV(state), null])ORDER BY state nulls last, year nulls last;
Optimizations
Collapsing of Equivalent Rules For Loops vs. Existential Form Existing optimization of SQL Model
functionality– Rule pruning– Filter pushdown– Others…
Optimizations- Collapsing of Rules
MODEL
DBY(row,col) MEA(x)
RULES
(
x[1,1] = x[1,2] + x[1,4],
x[2,1] = x[2,2] + x[2,4],
…
x[20,1] = x[20,2] +x[20,4]
);
MODELDBY(row,col) MEA(x)RULES( x[for row from 1 to 20,1] = x[CV(row),2] + x[CV(row), 4]);
==
A B C D
1 10 8 2
2 15 9 6
3 18 2 16
4 … … …
5
A1=B1+D1
Optimizations- Collapsing of Rules
0
510
15
2025
30
3540
45
500 1000 2500 5000 10000
Number of rules
Co
mp
ilat
ion
tim
e
Optimization – For Loops vs Existential Rules
MODELDBY(row,col) MEA(x)RULES( x[for row from 1 to 20,1] = x[CV(row),2] +
x[CV(row), 4]);
MODELDBY(row,col) MEA(x)RULES( x[1<=row<=20,1] = x[CV(row),2] +
x[CV(row), 4]);
VS
A B C D
1 10 8 2
2 15 9 6
3 18 2 16
4 … … …
5
A1=B1+D1
Optimization – For Loops vs Existential Rules
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
0.1 0.5 1 5 10 20 40
Percentage of cells modified
Lo
ok
up
tim
e/S
ca
n t
ime
Use For Loops Use scan
Conclusion Our goal is to translate Excel computation to SQL
so that Business Models built in Excel can be stored and queried in RDBMS.
We proposed translation techniques for expressing Excel computation in RDBMS SQL using new analytic extensions.
We proposed representation techniques for relational data in Excel by using Rtables and described how Excel operations on RTables can be simulated in SQL.
We discussed how this proposed system would fit into our RDBMS SQL execution engine and benefit from all its capabilities and optimizations.
What is ahead?
Excel– Pivoting and advanced filtering turned out
to be essential, but a few more relational friendly extensions would go a long way, particularly in simulating joins, window function computations.
SQL– RDBMS SQL needs to be extended to cover
the functionality provided in Excel, particularly financial functions.
AQ&Q U E S T I O N SQ U E S T I O N SA N S W E R SA N S W E R S