ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources...
ITCS 6163
View Maintenance

Implementing a Warehouse
- Monitoring: sending data from sources
- Integrating: loading, cleansing, ...
- Processing: query processing, indexing, ...
- Managing: metadata, design, ...
Monitoring
- Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
- Incremental vs. refresh

customer:  id  | name  | address   | city
           53  | joe   | 10 main   | sfo
           81  | fred  | 12 main   | sfo
  (new)    111 | sally | 80 willow | la
Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Application-level monitoring
Advantages & Disadvantages!!
Monitoring Issues
- Frequency: periodic (daily, weekly, …); triggered (on "big" change, lots of changes, ...)
- Data transformation: convert data to uniform format; remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways
Monitoring Products
- Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, …
- Data shipping: Oracle Replication Server, Praxis OmniReplicator, …
- Transaction shipping: Sybase Replication Server, Microsoft SQL Server
- Extraction: Aonix, ETI, CrossAccess, DBStar
- Monitoring/integration products later on
Integration: Data Cleaning, Data Loading, Derived Data

[Architecture diagram: clients run query & analysis against the warehouse; an
integration layer, with shared metadata, sits between the warehouse and several
sources]
Change detection
- Detect & send changes to the integrator
- Different classes of sources: cooperative, queryable, logged, snapshot/dump
Data transformation
- Convert data to a uniform format: byte ordering, string termination, internal layout
- Remove, add, & reorder attributes
- Add a (regeneratable) key
- Add a date to get history
Data transformation (2)
- Sort tuples
- May use external utilities: can be much faster (10x) than the SQL engine;
  e.g., a perl script to reorder attributes
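The attribute-reordering utility mentioned above can be sketched outside the SQL engine. This is a hypothetical Python stand-in for the slide's perl script; the comma-delimited layout and column order are assumptions for illustration:

```python
import csv

# Hypothetical stand-in for the slide's perl reordering script.
def reorder_attributes(lines, order):
    """Reorder the columns of each comma-delimited record.
    `order` lists source column indexes in the desired output order."""
    return [[row[i] for i in order] for row in csv.reader(lines)]

# Source layout (name, id, city) -> assumed warehouse layout (id, name, city).
rows = reorder_attributes(["joe,53,sfo", "fred,81,sfo"], order=[1, 0, 2])
```

Running such a script over a flat-file dump avoids a round trip through the SQL engine entirely, which is where the claimed speedup comes from.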
External functions (EFs)
- Special transformation functions, e.g., Yen_to_dollars
- User defined
- Specified in the warehouse table definition
- Aid in integration
- Must be applied to updates, too
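A minimal sketch of such an EF, using the slide's Yen_to_dollars as the example. The exchange rate and the (prodId, amount) tuple layout are assumptions; the point is that the same function runs on the initial load and on every later update:

```python
# Hypothetical external function (EF) in the spirit of the slide's
# Yen_to_dollars; the rate is an assumed constant for illustration only.
RATE = 0.0067  # assumed USD per yen

def yen_to_dollars(t):
    # Because the EF is declared in the warehouse table definition, it must
    # be applied both to the initial load and to every incoming update.
    pid, amt_yen = t
    return (pid, round(amt_yen * RATE, 2))

initial_load = [yen_to_dollars(t) for t in [("p1", 1500), ("p2", 3000)]]
update = yen_to_dollars(("p1", 1000))   # the same EF applied to an update
```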
Data integration
- Rules for matching data from different sources
- Build a composite view of the data
- Eliminate duplicate, unneeded attributes
Data Cleaning
- Migration (e.g., yen → dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
- Auditing: discover rules & relationships (like data mining)

[Fusion example: customer1 (Joe) from the billing DB and customer2 (Joe) from
the service DB are fused into merged_customer (Joe)]
Data cleansing
- Find (& remove) duplicate tuples, e.g., Jane Doe & Jane Q. Doe
- Detect inconsistent, wrong data: attributes that don't match (e.g., city, state and zipcode)
- Patch missing, unreadable data
- Want to "backflush" clean data: notify sources of errors found
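The first two cleansing checks can be sketched as follows. The name-matching rule (drop middle initials) and the tiny zip-to-city lookup table are deliberately crude assumptions for illustration, not real matching logic or postal data:

```python
# Assumed toy lookup table; real scrubbing would use a postal database.
ZIP_CITY = {"94105": "sfo", "90001": "la"}

def normalize_name(name):
    # Crude rule: treat "Jane Doe" and "Jane Q. Doe" as the same person
    # by dropping middle initials before comparing.
    return " ".join(p for p in name.lower().split() if not p.endswith("."))

def find_duplicates(names):
    seen, dups = {}, []
    for n in names:
        key = normalize_name(n)
        if key in seen:
            dups.append((seen[key], n))   # candidate duplicate pair
        else:
            seen[key] = n
    return dups

def inconsistent_zip(city, zipcode):
    # Flag tuples whose city and zipcode disagree (unknown zips pass).
    return ZIP_CITY.get(zipcode) not in (None, city)

dups = find_duplicates(["Jane Doe", "John Smith", "Jane Q. Doe"])
bad = inconsistent_zip("sfo", "90001")
```

Tuples flagged this way are candidates for patching at the warehouse and for "backflushing" to the source that produced them.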
Loading Data
- Incremental vs. refresh
- Off-line vs. on-line
- Frequency of loading: at night, 1x a week/month, continuously
- Parallel/partitioned load
Derived Data
- Derived warehouse data: indexes, aggregates, materialized views (next slide)
- When to update derived data? Incremental vs. refresh
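The incremental-vs-refresh choice for one derived aggregate can be sketched in a few lines. The mini-schema (total amt per product, matching the sale table used later in these slides) is a hypothetical example:

```python
# Derived aggregate: total amt per product, over sale(prodId, amt) tuples.
sales = [("p1", 12), ("p2", 11), ("p1", 50)]

def refresh(sales):
    # Refresh: recompute the whole aggregate from scratch.
    agg = {}
    for pid, amt in sales:
        agg[pid] = agg.get(pid, 0) + amt
    return agg

def incremental(agg, delta):
    # Incremental: fold only the new tuple into the stored aggregate.
    pid, amt = delta
    agg[pid] = agg.get(pid, 0) + amt
    return agg

agg = refresh(sales)                  # cost grows with the whole table
agg = incremental(agg, ("p2", 8))     # cost is per-update
```

Refresh is simple but touches everything; incremental maintenance touches only the delta, which is why it matters for large warehouses.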
The "everything is a view" view
• Pure programs: e.g., "canned queries." Always the same cost. No data is materialized. (DBMSs)
• Derived data: materialized views. Data is always there but must be updated. (Good for warehouses.)
• Pure data: snapshot. The procedure is thrown away! Not maintainable.
• Approximate: snapshot + refresh procedure applied under some conditions (quasi-copies); approximate models, e.g., statistical (quasi-cubes).
Materialized Views
Define new warehouse relations using SQL expressions.

sale:    prodId | storeId | date | amt
         p1     | c1      | 1    | 12
         p2     | c1      | 1    | 11
         p1     | c3      | 1    | 50
         p2     | c2      | 1    | 8
         p1     | c1      | 2    | 44
         p1     | c2      | 2    | 4

product: id | name | price
         p1 | bolt | 10
         p2 | nut  | 5

joinTb:  prodId | name | price | storeId | date | amt
         p1     | bolt | 10    | c1      | 1    | 12
         p2     | nut  | 5     | c1      | 1    | 11
         p1     | bolt | 10    | c3      | 1    | 50
         p2     | nut  | 5     | c2      | 1    | 8
         p1     | bolt | 10    | c1      | 2    | 44
         p1     | bolt | 10    | c2      | 2    | 4

joinTb does not exist at any source.
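The joinTb view above can be reproduced with a small sqlite3 session; materializing the join result as a real table mimics a warehouse relation that exists at no source:

```python
import sqlite3

# Rebuild the slides' sale and product tables, then materialize their join.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sale(prodId TEXT, storeId TEXT, date INT, amt INT);
CREATE TABLE product(id TEXT, name TEXT, price INT);
INSERT INTO sale VALUES ('p1','c1',1,12),('p2','c1',1,11),('p1','c3',1,50),
                        ('p2','c2',1,8),('p1','c1',2,44),('p1','c2',2,4);
INSERT INTO product VALUES ('p1','bolt',10),('p2','nut',5);
-- Materialize the view as a real table: it lives only at the warehouse.
CREATE TABLE joinTb AS
  SELECT prodId, name, price, storeId, date, amt
  FROM sale JOIN product ON sale.prodId = product.id;
""")
rows = db.execute(
    "SELECT * FROM joinTb ORDER BY prodId, storeId, date").fetchall()
```

Once materialized, joinTb must be kept consistent as sale and product change, which is exactly the view-maintenance problem the rest of these slides address.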
Integration Products
- Monitoring & integration: Apertus, Informatica, Prism, Sagent, …
- Merging: DataJoiner, SAS, …
- Cleaning: Trillium, ...
- Typically take the warehouse off-line
- Typically refresh, or simple incremental: e.g., Red Brick Table Management Utility, Prism
Managing Metadata; Warehouse Design Tools

[Architecture diagram repeated: sources → integration (with metadata) →
warehouse → query & analysis clients]
Metadata
- Administrative: definition of sources, tools, ...; schemas, dimension hierarchies, …;
  rules for extraction, cleaning, …; refresh, purging policies; user profiles, access control, ...
- Business: business terms & definitions; data ownership, charging
- Operational: data lineage; data currency (e.g., active, archived, purged);
  use stats, error reports, audit trails
Tools
- Development: design & edit schemas, views, scripts, rules, queries, reports
- Planning & analysis: what-if scenarios (schema changes, refresh rates), capacity planning
- Warehouse management: performance monitoring, usage patterns, exception reporting
- System & network management: measure traffic (sources, warehouse, clients)
- Workflow management: "reliable scripts" for cleaning & analyzing data
Tools - Products
- Management tools: HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager
- System & network management: HP OpenView, IBM NetView, Tivoli
Current State of Industry
- Extraction and integration done off-line, usually in large, time-consuming batches
- Everything copied at the warehouse: not selective about what is stored
  (query benefit vs. storage & update cost)
- Query optimization aimed at OLTP: high throughput instead of fast response;
  the whole query is processed before displaying anything
Future Directions
- Better performance
- Larger warehouses
- Easier to use
What are companies & research labs working on?
Research (1)
- Incremental maintenance
- Data consistency
- Data expiration
- Recovery
- Data quality
- Error handling (back flush)

Research (2)
- Rapid monitor construction
- Temporal warehouses
- Materialization & index selection
- Data fusion
- Data mining
- Integration of text & relational data
Make warehouse self-maintainable
- Add auxiliary tables to minimize update cost
- Original + auxiliary tables are self-maintainable; e.g., an auxiliary table of
  all unsold catalog items
- Some updates may still be self-maintainable on their own; e.g., an insert into
  catalog when item (the join attribute) is a key

[Diagram: Sales and Catalog, with "items sold" as their join]
Detection of self-maintainability
- Most algorithms work at the table level
- Most algorithms run at compile time
- Tuple level at runtime [Huyn 1996, 1997]: use the state of the tables and the
  update to determine if it is self-maintainable; e.g., check whether the sale
  is for an item previously sold
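A tuple-level runtime check in this spirit might look like the sketch below. The view layout and the key choice are assumptions for illustration, not Huyn's actual algorithm:

```python
# If the inserted sale is for an item already present in the materialized
# join, the matching catalog attributes can be copied from an existing view
# tuple: the update is self-maintainable and no source query is needed.
view = [("hat", 12, "Sue")]     # (item, price, clerk) -- assumed layout

def self_maintainable_insert(view, new_sale):
    item, clerk = new_sale
    for v_item, v_price, _ in view:
        if v_item == item:
            # Item previously sold: derive the new view tuple locally.
            return (item, v_price, clerk)
    return None                 # must query the source catalog instead

ok = self_maintainable_insert(view, ("hat", "Joe"))
needs_source = self_maintainable_insert(view, ("shoe", "Ann"))
```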
Warehouse maintenance
- Current systems ignore integration of new data, or assume the warehouse can be
  rebuilt periodically
- They depend on long "downtime" to regenerate the warehouse
- Technology gap: continuous incremental maintenance
Maintenance research
- Change detection
- Data consistency: single table consistency; multiple table consistency
- Expiration of data
- Crash recovery
Snapshot change detection
- Compare old & new snapshots
- Join-based algorithms: hash old data, probe with new
- Window algorithm: sliding window over snapshots; good for local changes
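The join-based variant (hash the old snapshot, probe with the new) can be sketched as follows; the toy customer rows keyed by id are an assumption:

```python
# Join-based snapshot change detection: hash old data, probe with new.
def snapshot_diff(old, new):
    old_by_key = {row[0]: row for row in old}      # hash the old snapshot
    inserts, updates = [], []
    for row in new:                                # probe with the new one
        prev = old_by_key.pop(row[0], None)
        if prev is None:
            inserts.append(row)
        elif prev != row:
            updates.append(row)
    deletes = list(old_by_key.values())            # keys never probed
    return inserts, updates, deletes

old = [(53, "joe", "sfo"), (81, "fred", "sfo")]
new = [(53, "joe", "la"), (111, "sally", "la")]
ins, upd, dels = snapshot_diff(old, new)
```

The window algorithm differs by comparing only tuples within a sliding window of each other, which wins when changes are local (records stay near their old positions).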
Integrated data consistency
- Conventional view maintenance is inadequate
- Sources report changes, but there is no locking and there are no global
  transactions (sources don't communicate or coordinate with each other)
- Inconsistencies are caused by the interleaving of updates
Example anomaly

View Sold = catalog ⋈ sale ⋈ emp, with
catalog(price, item) = {[$12, hat]}; sale(item, clerk) = ∅;
emp(clerk, age) = {[Sue, 26]}; Sold(price, item, clerk, age) = ∅.
Two source updates interleave: insert into sale [hat, Sue], then
delete from catalog [$12, hat].

Anomaly (2)

- insert into sale [hat, Sue]: the DW sends Q1 = catalog ⋈ [hat, Sue];
  the source answers A(Q1) = [$12, hat, Sue].
- delete from catalog [$12, hat]: ignored by the query already in flight.
- Q2 = [$12, hat, Sue] ⋈ emp; A(Q2) = [$12, hat, Sue, 26].
- Sold becomes {[$12, hat, Sue, 26]}, even though after the delete the correct
  view is empty.
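The interleaving above can be replayed in a few lines; the relation layouts follow the slides:

```python
# Replay of the anomaly: Q1 is answered against a source state that already
# reflects the sale insert but precedes the catalog delete, so the deleted
# catalog row leaks into Sold.
catalog = [(12, "hat")]          # (price, item)
sale, emp = [], [("Sue", 26)]    # (item, clerk), (clerk, age)

sale.append(("hat", "Sue"))                       # U1: insert into sale
# Warehouse sends Q1 = catalog join [hat, Sue]; the source answers it now:
a_q1 = [(p, i, "Sue") for (p, i) in catalog if i == "hat"]
catalog.remove((12, "hat"))                       # U2: delete from catalog
# Q2 joins the (now stale) answer with emp; the delete is ignored:
a_q2 = [(p, i, c, age) for (p, i, c) in a_q1
        for (cl, age) in emp if cl == c]
sold = a_q2    # wrong: the correct view after U2 is empty
```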
Choices to deal with anomalies
• Keep all relations in the DW (storage-expensive!)
• Run all queries as distributed (may not be feasible -- legacy systems -- and gives poor performance)
• Use specialized algorithms, e.g., the Eager Compensating Algorithm (ECA), STROBE.
Another anomaly example

View Sold(price, clerk) = π_{price,clerk}(catalog ⋈ sale).
Initial state: catalog(price, item) = {[$12, hat]}; sale(item, clerk) = {[hat, Sue]};
Sold = {[$12, Sue]}.

- Delete(catalog[$12, hat]): the DW sends Q1 = π_{p,c}([$12, hat] ⋈ sale).
- Delete(sale[hat, Sue]) executes at the source before Q1 is evaluated.
- A(Q1) = ∅, so V = V − ∅ = {[$12, Sue]}.
- For the second delete, Q2 = π_{p,c}(catalog ⋈ [hat, Sue]); A(Q2) = ∅,
  so V = V − ∅ = {[$12, Sue]}. WRONG! The view should now be empty.
Yet another anomaly example

View Depts = π_Dept(catalog ⋈ Store).
Initial state: catalog(Dept, City) = {[Shoes, NY]}; Store(City, Add) = ∅; Depts = ∅.

- Insert(Store[NY, Madison Ave]): the DW sends Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]).
- Insert(catalog[Bags, NY]) executes at the source before Q1 is evaluated, so
  catalog = {[Shoes, NY], [Bags, NY]} and A(Q1) = [[Shoes], [Bags]];
  Depts becomes {Shoes, Bags}.
- For the catalog insert, Q2 = π_Dept([Bags, NY] ⋈ Store); with
  Store = {[NY, Madison Ave]}, A(Q2) = [[Bags]].
- Depts becomes {Shoes, Bags, Bags}: the Bags tuple is counted twice.
Eager Compensating Algorithm (ECA)
Principle: send compensating queries to offset the effect of concurrent updates.
ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE).
Anomaly example revisited (ECA)

Same setup: Depts = π_Dept(catalog ⋈ Store); catalog(Dept, City) = {[Shoes, NY]};
Store(City, Add) = ∅; Depts = ∅.

- Insert(Store[NY, Madison Ave]): the DW sends Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]).
- Insert(catalog[Bags, NY]) interleaves, so A(Q1) = [[Shoes], [Bags]];
  Depts becomes {Shoes, Bags}.
- ECA compensates: Q2 = π_Dept([Bags, NY] ⋈ Store) − π_Dept([Bags, NY] ⋈ [NY, Madison Ave]).
- A(Q2) = ∅, so Depts stays {Shoes, Bags}: the duplicate is avoided.
ECA Algorithm

At the source:
  S_upi:  execute Ui; send Ui to DW; trigger W_upi at DW
  S_qui:  receive Qi; let Ai = Qi(ssi); send Ai to DW; trigger W_ansi at DW

At the data warehouse (DW):
  W_upi:  receive Ui; Qi = V(Ui) − Σ_{Qj ∈ UQS} Qj(Ui);
          UQS = UQS + {Qi}; send Qi to S; trigger S_qui at S
  W_ansi: receive Ai; COL = COL + Ai; UQS = UQS − {Qi};
          if UQS = ∅ then MV = MV + COL; COL = ∅

ssi = current source state; UQS = unanswered query set
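The compensating-query idea can be replayed on the Depts example; multiset semantics are modeled with plain Python lists, and the trace below is a sketch rather than a full ECA implementation:

```python
# ECA compensation on the Depts example: the answer to Q2 subtracts the
# contribution of the concurrent insert that Q1 already saw, so the
# duplicate [Bags] tuple is offset.
catalog = [("Shoes", "NY")]                  # (Dept, City)
store = []                                   # (City, Add)
depts = []                                   # materialized Depts view

store.append(("NY", "Madison Ave"))          # U1 at the source
# DW sends Q1 = project_Dept(catalog join [NY, Madison Ave]);
# U2 executes at the source before Q1 is evaluated:
catalog.append(("Bags", "NY"))               # U2, concurrent with Q1
a_q1 = [d for (d, c) in catalog if c == "NY"]        # [Shoes, Bags]
# ECA: Q2 = project_Dept([Bags,NY] join Store)
#          - project_Dept([Bags,NY] join [NY, Madison Ave])
plain = ["Bags" for (c, a) in store if c == "NY"]    # [Bags]
compensation = ["Bags"]          # the part Q1 already counted
a_q2 = plain[:]
for d in compensation:
    a_q2.remove(d)                                   # A(Q2) = []
depts = depts + a_q1 + a_q2                          # no duplicate
```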
ECA-key
- Avoids the need for compensating queries.
- Necessary condition: the view contains key attributes for each of the base
  tables (e.g., star schema).
Example of ECA-key

View Sells(Item, Clerk) = π_{Item,Clerk}(catalog ⋈ emp), joining on Dept.
Initial state: catalog(Item, Dept) = {[hat, acc]}; emp(Dept, Clerk) = {[acc, Sue]};
Sells = {[hat, Sue]}.

- Insert(catalog[bags, acc]): the DW sends Q1 = π_{i,c}([bags, acc] ⋈ emp); UQS = {Q1}.
- Insert(emp[acc, Jane]): the DW sends Q2 = π_{i,c}(catalog ⋈ [acc, Jane]);
  UQS = UQS + {Q2} = {Q1, Q2}.
- Delete(catalog[hat, acc]): no query is needed; the key lets the DW simply
  delete [hat, Sue] from the view.
- A(Q1) = {[bags, Sue], [bags, Jane]}; A(Q2) = {[bags, Jane]};
  COL = {[bags, Jane], [bags, Sue]} (duplicates recognized by key and ignored);
  UQS = ∅.
- Final view Sells = {[bags, Sue], [bags, Jane]} -- no compensating queries
  were sent.
Strobe algorithm ideas
- Apply actions only after a set of interleaving updates are all processed:
  wait for the sources to quiesce
- Compensate effects of interleaved updates: subtract effects of later updates
  before installing changes
- Can combine these ideas
STROBE IS A FAMILY OF ALGORITHMS
Strobe Terminology
• The materialized view MV is the current state of the view at the warehouse V(ws).
• Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x,Qi), where x is the next source to contact and Qi the portion of the query that can be answered by x.
Example: if V = r1 ⋈ r2 ⋈ r3, and U is an update received from r2, then Q = r1 ⋈ U ⋈ r3 and next_source(Q) = (r1, r1 ⋈ U).
Strobe terminology (2)

source_evaluate(Q):  /* returns answers to Q */
Begin
  i = 0; WQ = Q; A0 = Q;
  (x, Q1) = next_source(WQ);
  While x is not nil do
    Let i = i + 1;
    Send Qi to source x;
    When x returns Ai, let WQ = WQ⟨Ai⟩;
    Let (x, Qi+1) = next_source(WQ);
  Return(Ai);
End
Strobe Algorithm

Source:
- After executing Ui, send Ui to the DW.
- When receiving Qi, compute Ai over ss[x] and send Ai to the DW.

Data warehouse (DW), initially AL = ∅:
- When update Ui is received:
  - if Ui is a deletion: for each Qj ∈ UQS, add Ui to pend(Qj);
    add key_del(MV, Ui) to AL
  - if Ui is an insertion: Qi = V⟨Ui⟩ with pend(Qi) = ∅;
    Ai = source_evaluate(Qi); for each Uj ∈ pend(Qi), apply key_del(Ai, Uj);
    add insert(MV, Ai) to AL
- When UQS = ∅, apply AL to MV as a single transaction, without adding
  duplicate tuples to MV; reset AL = ∅.
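Strobe's key_del repair of a pending query's answer can be sketched as follows; treating (price, item) as the key is an assumption of this sketch:

```python
# key_del: remove from an answer (or from MV) every tuple matching a
# deletion that arrived while the query was pending.
def key_del(answer, deletion):
    price, item = deletion
    return [t for t in answer if not (t[0] == price and t[1] == item)]

mv, al = [], []                    # materialized view, action list
pend_q1 = []                       # deletions pending behind Q1

u2 = (12, "hat")                   # deletion received while Q1 runs
pend_q1.append(u2)
al.append(("key_del", u2))         # also scheduled against MV

a1 = [(12, "hat", "Sue", 26)]      # answer to Q1, now stale
for u in pend_q1:
    a1 = key_del(a1, u)            # Strobe repairs the answer first
if a1:
    al.append(("insert", a1))
# UQS is now empty: apply AL to MV as one transaction.
for action, arg in al:
    mv = key_del(mv, arg) if action == "key_del" else mv + arg
```

Here the repaired answer is empty, so nothing is installed and MV stays consistent with the post-delete source state.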
Example with Strobe

Initial state: catalog(price, item) = {[$12, hat]}; sale(item, clerk) = ∅;
emp(clerk, age) = {[Sue, 26]}; Sold(price, item, clerk, age) = ∅; AL = ∅.

- U1 = Insert(sale, [hat, Sue]): Q1 = catalog ⋈ [hat, Sue] ⋈ emp.
  Its first portion Q11 = catalog ⋈ [hat, Sue] is answered: A11 = [$12, hat, Sue].
- U2 = Delete(catalog, [$12, hat]) arrives while Q1 is pending:
  Pend(Q1) = {U2}; AL = {key_del(MV, U2)}.
- The second portion Q12 = [$12, hat, Sue] ⋈ emp returns A12 = [$12, hat, Sue, 26].
- Apply key_del(A12, U2): the answer becomes ∅, so nothing is added to AL.
- UQS = ∅: apply AL to MV; MV = ∅ -- the correct result.
Transaction-Strobe

Initial state: sale(item, clerk) = {[hat, Sue]}; MV = {[hat, Sue]}.
T1 = {delete(sale, [hat, Sue]), insert(sale, [shoes, Jane])}

- After the delete: sale = ∅; AL = {del([hat, Sue])}.
- After the insert: sale = {[shoes, Jane]}; AL = {del([hat, Sue]), ins([shoes, Jane])}.
- The action list for the whole transaction is applied to MV as one unit:
  MV = {[shoes, Jane]}.
Multiple table consistency
- More than one table at the warehouse
- Multiple tables share source data
- Updates at a source should be reflected in all warehouse tables at the same time

[Example: source tables Customer and Sales feed V1: Customer-info,
V2: Cust-prefs, and V3: Total-sales]
Multiple table consistency (2)

[Diagram: warehouse tables V1, V2, ..., Vn built over sources S1, S2, ..., Sm;
consistency notions range from single table consistency through multiple table
consistency to source consistency]
Painting algorithm
- Use a merge process (MP) to coordinate sending updates to the warehouse
- MP holds update actions for each table
- MP charts potential table states arising from each set of update actions
- MP sends a batch of update actions together when the tables will be consistent
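One way to sketch the merge process's batching; the transaction-id grouping key and the "chart" of affected tables are assumptions of this sketch, not details of the painting algorithm itself:

```python
# Merge process (MP) sketch: hold per-table update actions and release them
# as one batch only once every affected warehouse table has its actions,
# so the tables become consistent together.
held = {}                                     # txn id -> [(table, action)]
TABLES_PER_TXN = {"t1": {"Customer-info", "Total-sales"}}   # assumed chart

def receive(txn, table, action, send):
    held.setdefault(txn, []).append((table, action))
    covered = {tbl for tbl, _ in held[txn]}
    if covered == TABLES_PER_TXN[txn]:        # all tables will be consistent
        send(held.pop(txn))                   # ship the whole batch at once

batches = []
receive("t1", "Customer-info", "upd cust 53", batches.append)  # held
receive("t1", "Total-sales", "add 12", batches.append)         # released
```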