ITCS 6163 View Maintenance. Implementing a Warehouse Monitoring: Sending data from sources...
ITCS 6163
View Maintenance

Implementing a Warehouse
- Monitoring: sending data from sources
- Integrating: loading, cleansing, ...
- Processing: query processing, indexing, ...
- Managing: metadata, design, ...
Monitoring
- Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
- Incremental vs. refresh

customer:  id  | name  | address   | city
           53  | joe   | 10 main   | sfo
           81  | fred  | 12 main   | sfo
  (new)    111 | sally | 80 willow | la
Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Application-level monitoring
Advantages & Disadvantages!!
Monitoring Issues
- Frequency: periodic (daily, weekly, …); triggered (on "big" change, lots of changes, ...)
- Data transformation: convert data to uniform format; remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways
Monitoring Products
- Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, …
- Data shipping: Oracle Replication Server, Praxis OmniReplicator, …
- Transaction shipping: Sybase Replication Server, Microsoft SQL Server
- Extraction: Aonix, ETI, CrossAccess, DBStar
- Monitoring/integration products later on
Integration: Data Cleaning, Data Loading, Derived Data

[Architecture diagram: clients run query & analysis against the warehouse; an
integration layer, with shared metadata, sits between the warehouse and several
sources]
Change detection
- Detect & send changes to the integrator
- Different classes of sources: cooperative, queryable, logged, snapshot/dump
Data transformation
- Convert data to a uniform format: byte ordering, string termination, internal layout
- Remove, add, & reorder attributes
- Add a (regeneratable) key
- Add a date to get history
Data transformation (2)
- Sort tuples
- May use external utilities: can be much faster (10x) than the SQL engine;
  e.g., a perl script to reorder attributes
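The attribute-reordering utility mentioned above can be sketched outside the SQL engine. This is a hypothetical Python stand-in for the slide's perl script; the comma-delimited layout and column order are assumptions for illustration:

```python
import csv

# Hypothetical stand-in for the slide's perl reordering script.
def reorder_attributes(lines, order):
    """Reorder the columns of each comma-delimited record.
    `order` lists source column indexes in the desired output order."""
    return [[row[i] for i in order] for row in csv.reader(lines)]

# Source layout (name, id, city) -> assumed warehouse layout (id, name, city).
rows = reorder_attributes(["joe,53,sfo", "fred,81,sfo"], order=[1, 0, 2])
```

Running such a script over a flat-file dump avoids a round trip through the SQL engine entirely, which is where the claimed speedup comes from.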
External functions (EFs)
- Special transformation functions, e.g., Yen_to_dollars
- User defined
- Specified in the warehouse table definition
- Aid in integration
- Must be applied to updates, too
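A minimal sketch of such an EF, using the slide's Yen_to_dollars as the example. The exchange rate and the (prodId, amount) tuple layout are assumptions; the point is that the same function runs on the initial load and on every later update:

```python
# Hypothetical external function (EF) in the spirit of the slide's
# Yen_to_dollars; the rate is an assumed constant for illustration only.
RATE = 0.0067  # assumed USD per yen

def yen_to_dollars(t):
    # Because the EF is declared in the warehouse table definition, it must
    # be applied both to the initial load and to every incoming update.
    pid, amt_yen = t
    return (pid, round(amt_yen * RATE, 2))

initial_load = [yen_to_dollars(t) for t in [("p1", 1500), ("p2", 3000)]]
update = yen_to_dollars(("p1", 1000))   # the same EF applied to an update
```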
Data integration
- Rules for matching data from different sources
- Build a composite view of the data
- Eliminate duplicate, unneeded attributes
Data Cleaning
- Migration (e.g., yen → dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
- Auditing: discover rules & relationships (like data mining)

[Fusion example: customer1 (Joe) from the billing DB and customer2 (Joe) from
the service DB are fused into merged_customer (Joe)]
Data cleansing
- Find (& remove) duplicate tuples, e.g., Jane Doe & Jane Q. Doe
- Detect inconsistent, wrong data: attributes that don't match (e.g., city, state and zipcode)
- Patch missing, unreadable data
- Want to "backflush" clean data: notify sources of errors found
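The first two cleansing checks can be sketched as follows. The name-matching rule (drop middle initials) and the tiny zip-to-city lookup table are deliberately crude assumptions for illustration, not real matching logic or postal data:

```python
# Assumed toy lookup table; real scrubbing would use a postal database.
ZIP_CITY = {"94105": "sfo", "90001": "la"}

def normalize_name(name):
    # Crude rule: treat "Jane Doe" and "Jane Q. Doe" as the same person
    # by dropping middle initials before comparing.
    return " ".join(p for p in name.lower().split() if not p.endswith("."))

def find_duplicates(names):
    seen, dups = {}, []
    for n in names:
        key = normalize_name(n)
        if key in seen:
            dups.append((seen[key], n))   # candidate duplicate pair
        else:
            seen[key] = n
    return dups

def inconsistent_zip(city, zipcode):
    # Flag tuples whose city and zipcode disagree (unknown zips pass).
    return ZIP_CITY.get(zipcode) not in (None, city)

dups = find_duplicates(["Jane Doe", "John Smith", "Jane Q. Doe"])
bad = inconsistent_zip("sfo", "90001")
```

Tuples flagged this way are candidates for patching at the warehouse and for "backflushing" to the source that produced them.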
Loading Data
- Incremental vs. refresh
- Off-line vs. on-line
- Frequency of loading: at night, 1x a week/month, continuously
- Parallel/partitioned load
Derived Data
- Derived warehouse data: indexes, aggregates, materialized views (next slide)
- When to update derived data? Incremental vs. refresh
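The incremental-vs-refresh choice for one derived aggregate can be sketched in a few lines. The mini-schema (total amt per product, matching the sale table used later in these slides) is a hypothetical example:

```python
# Derived aggregate: total amt per product, over sale(prodId, amt) tuples.
sales = [("p1", 12), ("p2", 11), ("p1", 50)]

def refresh(sales):
    # Refresh: recompute the whole aggregate from scratch.
    agg = {}
    for pid, amt in sales:
        agg[pid] = agg.get(pid, 0) + amt
    return agg

def incremental(agg, delta):
    # Incremental: fold only the new tuple into the stored aggregate.
    pid, amt = delta
    agg[pid] = agg.get(pid, 0) + amt
    return agg

agg = refresh(sales)                  # cost grows with the whole table
agg = incremental(agg, ("p2", 8))     # cost is per-update
```

Refresh is simple but touches everything; incremental maintenance touches only the delta, which is why it matters for large warehouses.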
The "everything is a view" view
• Pure programs: e.g., "canned queries." Always the same cost. No data is materialized. (DBMSs)
• Derived data: materialized views. Data is always there but must be updated. (Good for warehouses.)
• Pure data: snapshot. The procedure is thrown away! Not maintainable.
• Approximate: snapshot + refresh procedure applied under some conditions (quasi-copies); approximate models, e.g., statistical (quasi-cubes).
Materialized Views
Define new warehouse relations using SQL expressions.

sale:    prodId | storeId | date | amt
         p1     | c1      | 1    | 12
         p2     | c1      | 1    | 11
         p1     | c3      | 1    | 50
         p2     | c2      | 1    | 8
         p1     | c1      | 2    | 44
         p1     | c2      | 2    | 4

product: id | name | price
         p1 | bolt | 10
         p2 | nut  | 5

joinTb:  prodId | name | price | storeId | date | amt
         p1     | bolt | 10    | c1      | 1    | 12
         p2     | nut  | 5     | c1      | 1    | 11
         p1     | bolt | 10    | c3      | 1    | 50
         p2     | nut  | 5     | c2      | 1    | 8
         p1     | bolt | 10    | c1      | 2    | 44
         p1     | bolt | 10    | c2      | 2    | 4

joinTb does not exist at any source.
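The joinTb view above can be reproduced with a small sqlite3 session; materializing the join result as a real table mimics a warehouse relation that exists at no source:

```python
import sqlite3

# Rebuild the slides' sale and product tables, then materialize their join.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sale(prodId TEXT, storeId TEXT, date INT, amt INT);
CREATE TABLE product(id TEXT, name TEXT, price INT);
INSERT INTO sale VALUES ('p1','c1',1,12),('p2','c1',1,11),('p1','c3',1,50),
                        ('p2','c2',1,8),('p1','c1',2,44),('p1','c2',2,4);
INSERT INTO product VALUES ('p1','bolt',10),('p2','nut',5);
-- Materialize the view as a real table: it lives only at the warehouse.
CREATE TABLE joinTb AS
  SELECT prodId, name, price, storeId, date, amt
  FROM sale JOIN product ON sale.prodId = product.id;
""")
rows = db.execute(
    "SELECT * FROM joinTb ORDER BY prodId, storeId, date").fetchall()
```

Once materialized, joinTb must be kept consistent as sale and product change, which is exactly the view-maintenance problem the rest of these slides address.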
Integration Products
- Monitoring & integration: Apertus, Informatica, Prism, Sagent, …
- Merging: DataJoiner, SAS, …
- Cleaning: Trillium, ...
- Typically take the warehouse off-line
- Typically refresh, or simple incremental: e.g., Red Brick Table Management Utility, Prism
Managing Metadata; Warehouse Design Tools

[Architecture diagram repeated: sources → integration (with metadata) →
warehouse → query & analysis clients]
Metadata
- Administrative: definition of sources, tools, ...; schemas, dimension hierarchies, …;
  rules for extraction, cleaning, …; refresh, purging policies; user profiles, access control, ...
- Business: business terms & definitions; data ownership, charging
- Operational: data lineage; data currency (e.g., active, archived, purged);
  use stats, error reports, audit trails
Tools
- Development: design & edit schemas, views, scripts, rules, queries, reports
- Planning & analysis: what-if scenarios (schema changes, refresh rates), capacity planning
- Warehouse management: performance monitoring, usage patterns, exception reporting
- System & network management: measure traffic (sources, warehouse, clients)
- Workflow management: "reliable scripts" for cleaning & analyzing data
Tools - Products
- Management tools: HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager
- System & network management: HP OpenView, IBM NetView, Tivoli
Current State of Industry
- Extraction and integration done off-line, usually in large, time-consuming batches
- Everything copied at the warehouse: not selective about what is stored
  (query benefit vs. storage & update cost)
- Query optimization aimed at OLTP: high throughput instead of fast response;
  the whole query is processed before displaying anything
Future Directions
- Better performance
- Larger warehouses
- Easier to use
What are companies & research labs working on?
Research (1)
- Incremental maintenance
- Data consistency
- Data expiration
- Recovery
- Data quality
- Error handling (back flush)

Research (2)
- Rapid monitor construction
- Temporal warehouses
- Materialization & index selection
- Data fusion
- Data mining
- Integration of text & relational data
Make warehouse self-maintainable
- Add auxiliary tables to minimize update cost
- Original + auxiliary tables are self-maintainable; e.g., an auxiliary table of
  all unsold catalog items
- Some updates may still be self-maintainable on their own; e.g., an insert into
  catalog when item (the join attribute) is a key

[Diagram: Sales and Catalog, with "items sold" as their join]
Detection of self-maintainability
- Most algorithms work at the table level
- Most algorithms run at compile time
- Tuple level at runtime [Huyn 1996, 1997]: use the state of the tables and the
  update to determine if it is self-maintainable; e.g., check whether the sale
  is for an item previously sold
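A tuple-level runtime check in this spirit might look like the sketch below. The view layout and the key choice are assumptions for illustration, not Huyn's actual algorithm:

```python
# If the inserted sale is for an item already present in the materialized
# join, the matching catalog attributes can be copied from an existing view
# tuple: the update is self-maintainable and no source query is needed.
view = [("hat", 12, "Sue")]     # (item, price, clerk) -- assumed layout

def self_maintainable_insert(view, new_sale):
    item, clerk = new_sale
    for v_item, v_price, _ in view:
        if v_item == item:
            # Item previously sold: derive the new view tuple locally.
            return (item, v_price, clerk)
    return None                 # must query the source catalog instead

ok = self_maintainable_insert(view, ("hat", "Joe"))
needs_source = self_maintainable_insert(view, ("shoe", "Ann"))
```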
Warehouse maintenance
- Current systems ignore integration of new data, or assume the warehouse can be
  rebuilt periodically
- They depend on long "downtime" to regenerate the warehouse
- Technology gap: continuous incremental maintenance
Maintenance research
- Change detection
- Data consistency: single table consistency; multiple table consistency
- Expiration of data
- Crash recovery
Snapshot change detection
- Compare old & new snapshots
- Join-based algorithms: hash old data, probe with new
- Window algorithm: sliding window over snapshots; good for local changes
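The join-based variant (hash the old snapshot, probe with the new) can be sketched as follows; the toy customer rows keyed by id are an assumption:

```python
# Join-based snapshot change detection: hash old data, probe with new.
def snapshot_diff(old, new):
    old_by_key = {row[0]: row for row in old}      # hash the old snapshot
    inserts, updates = [], []
    for row in new:                                # probe with the new one
        prev = old_by_key.pop(row[0], None)
        if prev is None:
            inserts.append(row)
        elif prev != row:
            updates.append(row)
    deletes = list(old_by_key.values())            # keys never probed
    return inserts, updates, deletes

old = [(53, "joe", "sfo"), (81, "fred", "sfo")]
new = [(53, "joe", "la"), (111, "sally", "la")]
ins, upd, dels = snapshot_diff(old, new)
```

The window algorithm differs by comparing only tuples within a sliding window of each other, which wins when changes are local (records stay near their old positions).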
Integrated data consistency
- Conventional view maintenance is inadequate
- Sources report changes, but there is no locking and there are no global
  transactions (sources don't communicate or coordinate with each other)
- Inconsistencies are caused by the interleaving of updates
Example anomaly

View Sold = catalog ⋈ sale ⋈ emp, with
catalog(price, item) = {[$12, hat]}; sale(item, clerk) = ∅;
emp(clerk, age) = {[Sue, 26]}; Sold(price, item, clerk, age) = ∅.
Two source updates interleave: insert into sale [hat, Sue], then
delete from catalog [$12, hat].

Anomaly (2)

- insert into sale [hat, Sue]: the DW sends Q1 = catalog ⋈ [hat, Sue];
  the source answers A(Q1) = [$12, hat, Sue].
- delete from catalog [$12, hat]: ignored by the query already in flight.
- Q2 = [$12, hat, Sue] ⋈ emp; A(Q2) = [$12, hat, Sue, 26].
- Sold becomes {[$12, hat, Sue, 26]}, even though after the delete the correct
  view is empty.
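The interleaving above can be replayed in a few lines; the relation layouts follow the slides:

```python
# Replay of the anomaly: Q1 is answered against a source state that already
# reflects the sale insert but precedes the catalog delete, so the deleted
# catalog row leaks into Sold.
catalog = [(12, "hat")]          # (price, item)
sale, emp = [], [("Sue", 26)]    # (item, clerk), (clerk, age)

sale.append(("hat", "Sue"))                       # U1: insert into sale
# Warehouse sends Q1 = catalog join [hat, Sue]; the source answers it now:
a_q1 = [(p, i, "Sue") for (p, i) in catalog if i == "hat"]
catalog.remove((12, "hat"))                       # U2: delete from catalog
# Q2 joins the (now stale) answer with emp; the delete is ignored:
a_q2 = [(p, i, c, age) for (p, i, c) in a_q1
        for (cl, age) in emp if cl == c]
sold = a_q2    # wrong: the correct view after U2 is empty
```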
Choices to deal with anomalies
• Keep all relations in the DW (storage-expensive!)
• Run all queries as distributed (may not be feasible -- legacy systems -- and gives poor performance)
• Use specialized algorithms, e.g., the Eager Compensating Algorithm (ECA), STROBE.
Another anomaly example

View Sold(price, clerk) = π_{price,clerk}(catalog ⋈ sale).
Initial state: catalog(price, item) = {[$12, hat]}; sale(item, clerk) = {[hat, Sue]};
Sold = {[$12, Sue]}.

- Delete(catalog[$12, hat]): the DW sends Q1 = π_{p,c}([$12, hat] ⋈ sale).
- Delete(sale[hat, Sue]) executes at the source before Q1 is evaluated.
- A(Q1) = ∅, so V = V − ∅ = {[$12, Sue]}.
- For the second delete, Q2 = π_{p,c}(catalog ⋈ [hat, Sue]); A(Q2) = ∅,
  so V = V − ∅ = {[$12, Sue]}. WRONG! The view should now be empty.
Yet another anomaly example

View Depts = π_Dept(catalog ⋈ Store).
Initial state: catalog(Dept, City) = {[Shoes, NY]}; Store(City, Add) = ∅; Depts = ∅.

- Insert(Store[NY, Madison Ave]): the DW sends Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]).
- Insert(catalog[Bags, NY]) executes at the source before Q1 is evaluated, so
  catalog = {[Shoes, NY], [Bags, NY]} and A(Q1) = [[Shoes], [Bags]];
  Depts becomes {Shoes, Bags}.
- For the catalog insert, Q2 = π_Dept([Bags, NY] ⋈ Store); with
  Store = {[NY, Madison Ave]}, A(Q2) = [[Bags]].
- Depts becomes {Shoes, Bags, Bags}: the Bags tuple is counted twice.
Eager Compensating Algorithm (ECA)
Principle: send compensating queries to offset the effect of concurrent updates.
ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE).
Anomaly example revisited (ECA)

Same setup: Depts = π_Dept(catalog ⋈ Store); catalog(Dept, City) = {[Shoes, NY]};
Store(City, Add) = ∅; Depts = ∅.

- Insert(Store[NY, Madison Ave]): the DW sends Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]).
- Insert(catalog[Bags, NY]) interleaves, so A(Q1) = [[Shoes], [Bags]];
  Depts becomes {Shoes, Bags}.
- ECA compensates: Q2 = π_Dept([Bags, NY] ⋈ Store) − π_Dept([Bags, NY] ⋈ [NY, Madison Ave]).
- A(Q2) = ∅, so Depts stays {Shoes, Bags}: the duplicate is avoided.
ECA Algorithm

At the source:
  S_upi:  execute Ui; send Ui to DW; trigger W_upi at DW
  S_qui:  receive Qi; let Ai = Qi(ssi); send Ai to DW; trigger W_ansi at DW

At the data warehouse (DW):
  W_upi:  receive Ui; Qi = V(Ui) − Σ_{Qj ∈ UQS} Qj(Ui);
          UQS = UQS + {Qi}; send Qi to S; trigger S_qui at S
  W_ansi: receive Ai; COL = COL + Ai; UQS = UQS − {Qi};
          if UQS = ∅ then MV = MV + COL; COL = ∅

ssi = current source state; UQS = unanswered query set
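The compensating-query idea can be replayed on the Depts example; multiset semantics are modeled with plain Python lists, and the trace below is a sketch rather than a full ECA implementation:

```python
# ECA compensation on the Depts example: the answer to Q2 subtracts the
# contribution of the concurrent insert that Q1 already saw, so the
# duplicate [Bags] tuple is offset.
catalog = [("Shoes", "NY")]                  # (Dept, City)
store = []                                   # (City, Add)
depts = []                                   # materialized Depts view

store.append(("NY", "Madison Ave"))          # U1 at the source
# DW sends Q1 = project_Dept(catalog join [NY, Madison Ave]);
# U2 executes at the source before Q1 is evaluated:
catalog.append(("Bags", "NY"))               # U2, concurrent with Q1
a_q1 = [d for (d, c) in catalog if c == "NY"]        # [Shoes, Bags]
# ECA: Q2 = project_Dept([Bags,NY] join Store)
#          - project_Dept([Bags,NY] join [NY, Madison Ave])
plain = ["Bags" for (c, a) in store if c == "NY"]    # [Bags]
compensation = ["Bags"]          # the part Q1 already counted
a_q2 = plain[:]
for d in compensation:
    a_q2.remove(d)                                   # A(Q2) = []
depts = depts + a_q1 + a_q2                          # no duplicate
```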
ECA-key
- Avoids the need for compensating queries.
- Necessary condition: the view contains key attributes for each of the base
  tables (e.g., star schema).
Example of ECA-key

View Sells(Item, Clerk) = π_{Item,Clerk}(catalog ⋈ emp), joining on Dept.
Initial state: catalog(Item, Dept) = {[hat, acc]}; emp(Dept, Clerk) = {[acc, Sue]};
Sells = {[hat, Sue]}.

- Insert(catalog[bags, acc]): the DW sends Q1 = π_{i,c}([bags, acc] ⋈ emp); UQS = {Q1}.
- Insert(emp[acc, Jane]): the DW sends Q2 = π_{i,c}(catalog ⋈ [acc, Jane]);
  UQS = UQS + {Q2} = {Q1, Q2}.
- Delete(catalog[hat, acc]): no query is needed; the key lets the DW simply
  delete [hat, Sue] from the view.
- A(Q1) = {[bags, Sue], [bags, Jane]}; A(Q2) = {[bags, Jane]};
  COL = {[bags, Jane], [bags, Sue]} (duplicates recognized by key and ignored);
  UQS = ∅.
- Final view Sells = {[bags, Sue], [bags, Jane]} -- no compensating queries
  were sent.
Strobe algorithm ideas
- Apply actions only after a set of interleaving updates are all processed:
  wait for the sources to quiesce
- Compensate effects of interleaved updates: subtract effects of later updates
  before installing changes
- Can combine these ideas
STROBE IS A FAMILY OF ALGORITHMS
Strobe Terminology
• The materialized view MV is the current state of the view at the warehouse V(ws).
• Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x,Qi), where x is the next source to contact and Qi the portion of the query that can be answered by x.
Example: if V = r1 ⋈ r2 ⋈ r3, and U is an update received from r2, then Q = r1 ⋈ U ⋈ r3 and next_source(Q) = (r1, r1 ⋈ U).
Strobe terminology (2)

source_evaluate(Q):  /* returns answers to Q */
Begin
  i = 0; WQ = Q; A0 = Q;
  (x, Q1) = next_source(WQ);
  While x is not nil do
    Let i = i + 1;
    Send Qi to source x;
    When x returns Ai, let WQ = WQ⟨Ai⟩;
    Let (x, Qi+1) = next_source(WQ);
  Return(Ai);
End
Strobe Algorithm

Source:
- After executing Ui, send Ui to the DW.
- When receiving Qi, compute Ai over ss[x] and send Ai to the DW.

Data warehouse (DW), initially AL = ∅:
- When update Ui is received:
  - if Ui is a deletion: for each Qj ∈ UQS, add Ui to pend(Qj);
    add key_del(MV, Ui) to AL
  - if Ui is an insertion: Qi = V⟨Ui⟩ with pend(Qi) = ∅;
    Ai = source_evaluate(Qi); for each Uj ∈ pend(Qi), apply key_del(Ai, Uj);
    add insert(MV, Ai) to AL
- When UQS = ∅, apply AL to MV as a single transaction, without adding
  duplicate tuples to MV; reset AL = ∅.
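Strobe's key_del repair of a pending query's answer can be sketched as follows; treating (price, item) as the key is an assumption of this sketch:

```python
# key_del: remove from an answer (or from MV) every tuple matching a
# deletion that arrived while the query was pending.
def key_del(answer, deletion):
    price, item = deletion
    return [t for t in answer if not (t[0] == price and t[1] == item)]

mv, al = [], []                    # materialized view, action list
pend_q1 = []                       # deletions pending behind Q1

u2 = (12, "hat")                   # deletion received while Q1 runs
pend_q1.append(u2)
al.append(("key_del", u2))         # also scheduled against MV

a1 = [(12, "hat", "Sue", 26)]      # answer to Q1, now stale
for u in pend_q1:
    a1 = key_del(a1, u)            # Strobe repairs the answer first
if a1:
    al.append(("insert", a1))
# UQS is now empty: apply AL to MV as one transaction.
for action, arg in al:
    mv = key_del(mv, arg) if action == "key_del" else mv + arg
```

Here the repaired answer is empty, so nothing is installed and MV stays consistent with the post-delete source state.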
Example with Strobe

Initial state: catalog(price, item) = {[$12, hat]}; sale(item, clerk) = ∅;
emp(clerk, age) = {[Sue, 26]}; Sold(price, item, clerk, age) = ∅; AL = ∅.

- U1 = Insert(sale, [hat, Sue]): Q1 = catalog ⋈ [hat, Sue] ⋈ emp.
  Its first portion Q11 = catalog ⋈ [hat, Sue] is answered: A11 = [$12, hat, Sue].
- U2 = Delete(catalog, [$12, hat]) arrives while Q1 is pending:
  Pend(Q1) = {U2}; AL = {key_del(MV, U2)}.
- The second portion Q12 = [$12, hat, Sue] ⋈ emp returns A12 = [$12, hat, Sue, 26].
- Apply key_del(A12, U2): the answer becomes ∅, so nothing is added to AL.
- UQS = ∅: apply AL to MV; MV = ∅ -- the correct result.
Transaction-Strobe

Initial state: sale(item, clerk) = {[hat, Sue]}; MV = {[hat, Sue]}.
T1 = {delete(sale, [hat, Sue]), insert(sale, [shoes, Jane])}

- After the delete: sale = ∅; AL = {del([hat, Sue])}.
- After the insert: sale = {[shoes, Jane]}; AL = {del([hat, Sue]), ins([shoes, Jane])}.
- The action list for the whole transaction is applied to MV as one unit:
  MV = {[shoes, Jane]}.
Multiple table consistency
- More than one table at the warehouse
- Multiple tables share source data
- Updates at a source should be reflected in all warehouse tables at the same time

[Example: source tables Customer and Sales feed V1: Customer-info,
V2: Cust-prefs, and V3: Total-sales]
Multiple table consistency (2)

[Diagram: warehouse tables V1, V2, ..., Vn built over sources S1, S2, ..., Sm;
consistency notions range from single table consistency through multiple table
consistency to source consistency]
Painting algorithm
- Use a merge process (MP) to coordinate sending updates to the warehouse
- MP holds update actions for each table
- MP charts potential table states arising from each set of update actions
- MP sends a batch of update actions together when the tables will be consistent
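One way to sketch the merge process's batching; the transaction-id grouping key and the "chart" of affected tables are assumptions of this sketch, not details of the painting algorithm itself:

```python
# Merge process (MP) sketch: hold per-table update actions and release them
# as one batch only once every affected warehouse table has its actions,
# so the tables become consistent together.
held = {}                                     # txn id -> [(table, action)]
TABLES_PER_TXN = {"t1": {"Customer-info", "Total-sales"}}   # assumed chart

def receive(txn, table, action, send):
    held.setdefault(txn, []).append((table, action))
    covered = {tbl for tbl, _ in held[txn]}
    if covered == TABLES_PER_TXN[txn]:        # all tables will be consistent
        send(held.pop(txn))                   # ship the whole batch at once

batches = []
receive("t1", "Customer-info", "upd cust 53", batches.append)  # held
receive("t1", "Total-sales", "add 12", batches.append)         # released
```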