ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources...

53
ITCS 6163 View Maintenance

Transcript of ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources...

Page 1: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

ITCS 6163

View Maintenance

Page 2: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Implementing a Warehouse

Monitoring: Sending data from sources

Integrating: Loading, cleansing,... Processing: Query processing,

indexing, ... Managing: Metadata, Design, ...

Page 3: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Monitoring Source Types: relational, flat file, IMS,

VSAM, IDMS, WWW, news-wire, … Incremental vs. Refreshcustomer id name address city

53 joe 10 main sfo81 fred 12 main sfo

111 sally 80 willow la new

Page 4: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Monitoring Techniques Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Application level monitoring

A

dvan

tage

s &

Dis

adva

ntag

es!!

Page 5: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Monitoring Issues Frequency

periodic: daily, weekly, … triggered: on “big” change, lots of

changes, ... Data transformation

convert data to uniform format remove & add fields (e.g., add date to get

history) Standards (e.g., ODBC)

Gateways

Page 6: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Monitoring Products Gateways: Info Builders EDA/SQL, Oracle

Open Connect, Informix Enterprise Gateway, …

Data Shipping: Oracle Replication Server, Praxis OmniReplicator, …

Transaction Shipping: Sybase Replication Server, Microsoft SQL Server

Extraction: Aonix, ETI, CrossAccess, DBStar Monitoring/Integration products later on

Page 7: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Integration Data Cleaning Data Loading Derived Data

Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

integration

Query &analysis

Page 8: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Change detection

Detect & send changes to integrator Different classes of sources

Cooperative Queryable Logged Snapshot/dump

Page 9: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Data transformation Convert data to uniform format

Byte ordering, string termination Internal layout

Remove, add, & reorder attributes Add (regeneratable) key Add date to get history

Page 10: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Data transformation (2) Sort tuples May use external utilites

Can be much faster (10x) than SQL engine

E.g., perl script to reorder attributes

Page 11: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

External functions (EFs)

Special transformation functions E.g., Yen_to_dollars

User defined Specified in warehouse table

definition Aid in integration Must be applied to updates, too

Page 12: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Data integration Rules for matching data from

different sources Build composite view of data Eliminate duplicate, unneeded

attributes

Page 13: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Data Cleaning Migration (e.g., yen dollars) Scrubbing: use domain-specific knowledge (e.g., social

security numbers) Fusion (e.g., mail list, customer merging)

Auditing: discover rules & relationships(like data mining)

billing DB

service DB

customer1(Joe)

customer2(Joe)

merged_customer(Joe)

Page 14: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Data cleansing Find (& remove) duplicate tuples

E.g., Jane Doe & Jane Q. Doe Detect inconsistent, wrong data

Attributes that don’t match E.g., city, state and zipcode

Patch missing, unreadable data Want to “backflush” clean data

Notify sources of errors found

Page 15: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Loading Data Incremental vs. refresh Off-line vs. on-line Frequency of loading

At night, 1x a week/month, continuously

Parallel/Partitioned load

Page 16: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Derived Data Derived Warehouse Data

indexes aggregates materialized views (next slide)

When to update derived data? Incremental vs. refresh

Page 17: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

The “everything is a view” view

• Pure programs: e.g., “can queries.” Always the same cost. No data is materialized. (DBMSs)

• Derived data: Materialized views. Data always there but must be updated. (Good for warehouses.)

• Pure data: Snapshot. Procedure is thrown away! Not maintainable.

• Approximate: Snapshot+refresh procedure applied in some conditions. (Quasi-copies). Approximate models (e.g., statistical).

(Quasi-cubes).

Page 18: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Materialized Views Define new warehouse relations using

SQL expressions

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

product id name pricep1 bolt 10p2 nut 5

joinTb prodId name price storeId date amtp1 bolt 10 c1 1 12p2 nut 5 c1 1 11p1 bolt 10 c3 1 50p2 nut 5 c2 1 8p1 bolt 10 c1 2 44p1 bolt 10 c2 2 4

does not existat any source

Page 19: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Integration Products Monitoring & Integration: Apertus,

Informatica, Prism, Sagent, … Merging: DataJoiner, SAS,… Cleaning: Trillum, ... Typically take warehouse off-line Typically refresh

or simple incremental: e.g., Red Brick Table Management Utility, Prism

Page 20: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Managing Metadata Warehouse Design Tools Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

integration

Query &analysis

Page 21: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Metadata Administrative

definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ...

Page 22: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Metadata Business

business terms & definition data ownership, charging

Operational data lineage data currency (e.g., active, archived,

purged) use stats, error reports, audit trails

Page 23: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Tools Development

design & edit: schemas, views, scripts, rules, queries, reports

Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning

Warehouse Management performance monitoring, usage patterns, exception reporting

System & Network Management measure traffic (sources, warehouse, clients)

Workflow Management “reliable scripts” for cleaning & analyzing data

Page 24: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Tools - Products Management Tools

HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager

System & Network Management HP OpenView, IBM NetView, Tivoli

Page 25: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Current State of Industry Extraction and integration done off-line

Usually in large, time-consuming, batches Everything copied at warehouse

Not selective about what is stored Query benefit vs storage & update cost

Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying

anything

Page 26: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Future Directions Better performance Larger warehouses Easier to use What are companies & research

labs working on?

Page 27: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Research (1) Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush)

Page 28: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Research (2) Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational

Data

Page 29: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Make warehouse self-maintainable

Add auxiliary tables to minimize update cost

Original + auxiliary are self-maintainable E.g., auxiliary table of all unsold catalog

items Some updates may still be self-

maintainable E.g., insert into catalog if item (the join

attribute) is a keySales Catalog

Items sold

Page 30: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Detection of self-maintainability

Most algorithms are at table level Most algorithms are compile-time Tuple level at runtime [Huyn 1996,

1997] Use state of tables and update to

determine if self-maintainable E.g., check whether sale is for item

previously sold

Page 31: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Warehouse maintenance Current systems ignore integration

of new data Or assume warehouse can be rebuilt

periodically Depend on long “downtime” to

regenerate warehouse Technology gap: continuous

incremental maintenance

Page 32: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Maintenance research Change detection Data consistency

Single table consistency Multiple table consistency

Expiration of data Crash recovery

Page 33: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Snapshot change detection

Compare old & new snapshots Join-based algorithms

Hash old data, probe with new Window algorithm

Sliding window over snapshots Good for local changes

Page 34: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Integrated data consistency Conventional maintenance

inadequate Sources report changes but: No locking, no global transactions

(sources don’t communicate, coordinate with each other)

Inconsistencies caused by interleaving of updates

Page 35: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Example anomaly table Sold = catalog x sale x emp insert into sale [hat, Sue] delete from catalog [$12, hat]

price item clerk age

item clerk

$12 hat

clerk ageSue 26

price item

catalog sale emp

Sold

Page 36: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Anomaly (2)

insert into sale [hat, Sue]

item clerk

sale

Q1 = catalog [hat, Sue]

$12 hat

price item

catalog

A(Q1)= [$12,hat, Sue]

delete from catalog [$12, hat]

clerk ageSue 26

emp

Q2 = [$12,hat, Sue] emp

A(Q2)= [$12,hat,Sue,26]

price item clerk ageSold

$12,hat,Sue,26

ignored

hat Sue

price item

catalog

Page 37: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Choices to deal with anomalies

• Keep all relations in the DW (storage-expensive!)

• Run all queries as distributed (may not be feasible! --legacy systems-- + poor performance)

• Use specialized algorithms. E.g., Eager Compensation Algorithm (ECA), STROBE.

Page 38: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Another anomaly example

price clerkSold

$12 hat

price item

catalogitem clerk

sale

hat Sue

$12 Sue

Delete(catalog[$12,hat]) Q1= p,c([$12,hat] sale)

A(Q1) =

V = [$12, Sue]-= V

A(Q2) =

V = [$12, Sue]-= V WRONG!

Delete(sale[hat,Sue])price item

catalogitem clerk

sale

Page 39: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Yet another anomaly example

Depts

Depts = Dept(catalog Store)

Insert(catalog[Bags,NY])

Q1= Dept(catalog [NY,Madison Av])

A(Q1) = [[Shoes],[Bags]]

Shoes Bags

A(Q2) = [[Bags]]

Shoes Bags Bags

Insert(Store[NY, Madison Av. City Add.

Store

Shoes NY

Dept City

catalog

Bags NY

Q2= Dept([Bags,NY] Store)

City Add.

Store

NY Madison Ave

Page 40: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Eager Compensating Algorithm(ECA)

Principle: send compensating queries to offset the effect of concurrent updates

ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE).

Page 41: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Anomaly example revisited (ECA)

Depts

Depts = Dept(catalog Store)

Insert(catalog[Bags,NY])

Q1= Dept(catalog [NY,Madison Av])

A(Q1) = [[Shoes],[Bags]]

Shoes Bags

A(Q2) =

Insert(Store[NY, Madison Av. City Add.

Store

Shoes NY

Dept City

catalog

Bags NY

Q2= Dept([Bags,NY]

Store) -Dept([Bags,NY]

[NY,Madison Ave]]

City Add.

Store

NY Madison Ave

Page 42: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

ECA Algorithm SOURCE DATA WAREHOUSE (DW)

S_upi: Execute Ui W_upi: receive Ui send Ui to DW Qi=V(Ui)-

QQSQj(Ui) trigger W_upi at DW UQS = UQS + {Qi} Send Qi to S

trigger S_qui at S

S_qui : Receive Qi W_ansi: Receive Ai let Ai = Qi(ssi) COL = COL + Ai Send Ai to DW UQS = UQS - {Qi} trigger W_ansi at DW if UQS =

MV=MV+COL COL =

ssi = current source state UQS = unanswered query set

Page 43: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

ECA-key

Avoids the need for compensating queries.

Necessary condition: the view contains key attributes for each of the base tables (e.g., star schema)

Page 44: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

UQS = {Q1}

Q1= i,d(catalog [hat,Jane])

Insert(catalog[bag,acc]))

hat acc

Item dept.

catalog

A1 = {[bag,Jane]}

Item clerk

Sells

hat Sue

Example of ECA-key

item clerk

emp

acc Sue

A(Q2) = {[bags,Sue],[bags,Jane]}

Insert (sale[acc,Jane])

acc Jane

Q2= i,c([bags,acc] emp)

hat accbags acc

Item dept.

catalog

bags acc

Item dept.

catalog

Delete(catalog,[hat,acc])

UQS=UQS+{Q2}={Q1,Q2}COL = {[hat,Sue]}COL = COL =

{[bags,Jane]}

UQS = {Q2}UQS = COL = {[bags,Jane],

[bags,Sue]}

bags Sue bagsJane

Page 45: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Strobe algorithm ideas Apply actions only after a set of

interleaving updates are all processed Wait for sources to quiesce

Compensate effects of interleaved updates Subtract effects of later updates before

installing changes Can combine these ideasSTROBE IS A FAMILY OF ALGORITHMS

Page 46: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Strobe Terminology

• The materialized view MV is the current state of the view at the warehouse V(ws).

• Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x,Qi), where x is the next source to contact and Qi the portion of the query that can be answered by x.

Example: if V = r1 r2 r3, and U and update received from r2, then Q = (r1 U r3) and next_source(Q) = (r1, r1 U)

Page 47: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Strobe terminology (2)Source_evaluation(Q): /returns answers to Q/

Begin

i = 0; WQ = Q; A0 = Q;(x,Q1) next_source(WQ);While x is not nil do

Let i = i + 1;Send Qi to source x;When x returns Ai, let WQ = WQ(Ai);Let (x,Qi+1) next_source(WQ);

Return(Ai);

End

Page 48: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Strobe AlgorithmSource DW -After exec. Ui, send Ui to DW AL = -When receiving Qi When Ui is received

Compute Ai over ss[x] if a deletionSend Ai to DW Qj UQS add Ui to pend(Qj)

Add key_del(MV,Ui) to AL if an insertion

Qi = V(Ui), pend(Qi) = Ai = source_evaluate(Qi); Uj pend(Qi), key_del(Ai,Uj);Add insert(MV,Ai) to AL

When UQS = , apply AL to MVas a single transaction, without adding duplicate tuples to MV Reset AL

Page 49: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Example with Strobe

$12 hat

price item

catalog

item clerk

sale

clerk ageSue 26

emp

price item clerk ageSold

hat Sue

AL =

U1=Insert(sale, [hat, Sue])

Q1=catalog [hat,Sue]emp

Q11=(catalog[hat,Sue])A11=[$12,hat,Sue]

Q12 =[$12,hat,Sue] emp

price item

catalog

U2=Del([$12,hat])

Pend(Q1) = U2

AL ={key_del(MV,U2)}

A12=[$12,hat,Sue,26]

Apply key_del(A12,U2) A2 =

Add nothing to AL UQS = MV =

Pend(Q1) =

Page 50: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Transaction-Strobe

item clerk

sale

hat Sue

item clerk

sale

hat Sue

T1 = {delete(sale,[hat,Sue]), insert(sale,[shoes,Jane])}

item clerk

sale

shoes, Jane

AL = {del([hat,Sue]} MV = item clerk

saleAL = {ins([shoes,Jane]} MV = {[shoes,Jane]}

item clerk

sale

shoes Jane

Page 51: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Multiple table consistency More than 1 table at warehouse Multiple tables share source data Updates at source should be

reflected in all warehouse tables at the same time

Customer Sales

V1:Customer-info

V3:Total-sales

V2:Cust-prefs

Page 52: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Multiple table consistency

V1 V2

S1 S2

Vn

Sm

Sources

WarehouseMultiple table consistency

Single table consistency

Source consistency

...

...

Page 53: ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing,

Painting algorithm Use merge process (MP) to coordinate

sending updates to warehouse MP holds update actions for each table MP charts potential table states arising

from each set of update actions MP sends batch of update actions

together when tables will be consistent