Post on 10-Feb-2016
1
Transaction Management
2
Transaction Management
Atomicity
Either all or none of the transaction’s operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results are undone.
Reasons a transaction may not complete: the transaction aborts or the system crashes.
Commitment: completion of a transaction.
Transaction primitives: BEGIN, COMMIT, ABORT.
Goal of transaction management: efficient, reliable, and concurrent execution of transactions.
Agent: a local process which performs some actions on behalf of an application.
Root agent: issues the begin-transaction, commit, and abort primitives, and creates new agents.
3
[Figure: Types of transaction termination. A transaction started with Begin_Transaction ends with Commit, with Abort, or with a system-forced abort.]
4
Failure Recovery
Basic technique: the LOG
A log contains information for undoing or redoing all actions which are performed by transactions.
Undo: reconstruct the database state as it was prior to the transaction's execution (e.g., after an abort).
Redo: perform the transaction's actions again (e.g., after a failure of volatile storage, where the transaction had already committed but its updates had not yet reached stable storage).
Undo and redo must be idempotent: performing them several times must be equivalent to performing them once.
5
Failure Recovery (cont’d)
A log record contains:
1. Transaction ID
2. Record ID
3. Type of action (insert, delete, modify)
4. The old record value (required for undo)
5. The new record value (required for redo)
6. Information for recovery (e.g., a pointer to the previous log record of the same transaction)
7. Transaction status (begin, abort, commit)
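As a sketch of the record structure and of idempotent undo/redo (names are illustrative, not from the source):

```python
# Minimal sketch of a recovery log; not a real DBMS API.
# Each record carries the old and new value, so it supports both undo and redo.

db = {}   # record_id -> value (stands in for the database)
log = []  # append-only list of log records

def write(txn_id, record_id, new_value):
    """Apply an update and append a log record describing it."""
    log.append({
        "txn": txn_id,
        "record": record_id,
        "action": "modify",
        "old": db.get(record_id),   # required for undo
        "new": new_value,           # required for redo
    })
    db[record_id] = new_value

def undo(rec):
    """Restore the state prior to the update; idempotent."""
    db[rec["record"]] = rec["old"]

def redo(rec):
    """Perform the update again; idempotent."""
    db[rec["record"]] = rec["new"]

write("T1", "x", 10)
undo(log[-1]); undo(log[-1])   # repeating undo is equivalent to doing it once
assert db["x"] is None
redo(log[-1]); redo(log[-1])   # repeating redo is equivalent to doing it once
assert db["x"] == 10
```

Idempotence holds here because undo and redo install a stored value rather than, say, applying a delta twice.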
6
Log Write-Ahead Protocol
Before performing a database update, the log record must be recorded on stable storage.
Before committing a transaction, all log records of the transaction must have been recorded on stable storage.
Recovery Procedure via Checkpoints
Checkpoints are operations performed periodically (e.g., every few minutes) that write the following to stable storage:
All log records and all database updates which are still in volatile storage.
A checkpoint record containing the identities of the transactions active at the time the checkpoint is taken.
Failure Recovery (cont’d)
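The write-ahead rule and the checkpoint step can be sketched as follows (a minimal sketch; names such as `wal_update` are illustrative, not from the source):

```python
# Sketch of the write-ahead rule: the log record reaches stable storage
# before the corresponding database update does.

stable_log = []      # stands in for the log on stable storage
volatile_db = {}     # database updates still in volatile storage

def wal_update(txn_id, record_id, old, new):
    stable_log.append((txn_id, record_id, old, new))  # 1. force the log record
    volatile_db[record_id] = new                      # 2. then update the page

def checkpoint(active_txns, stable_db):
    """Periodically flush volatile updates and note the active transactions."""
    stable_db.update(volatile_db)                        # flush pending updates
    stable_log.append(("CHECKPOINT", tuple(active_txns)))

wal_update("T1", "x", None, 5)
stable = {}
checkpoint(["T1"], stable)
assert stable["x"] == 5
assert stable_log[-1][0] == "CHECKPOINT"
```

After a crash, recovery starts from the last checkpoint record and uses the logged old/new values to undo or redo the transactions it names.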
7
Local Transaction Manager (LTM): provides transaction management at a local site, e.g., local begin, local commit, local abort, and execution of the local sub-transaction.
Distributed Transaction Manager (DTM): provides global transaction management.
The LTM has the capabilities of ensuring the atomicity of a sub-transaction and of writing records on stable storage on behalf of the DTM.
Atomicity at the LTM is not sufficient for atomicity at the DTM (a single site vs. all sites).
Transaction Manager
8
Two-Phase Commit Protocol
Coordinator: makes the final commit or abort decision (e.g., the DTM).
Participants: responsible for the local sub-transactions (e.g., the LTMs).
Basic idea: a unique decision for all participants with respect to committing or aborting all the local sub-transactions.
1st phase: reach a common decision.
2nd phase: global commit or global abort (recording the decision on stable storage).
9
Phase 1
The coordinator asks all the participants to prepare for commitment.
Each participant answers READY if it is ready and willing to commit. Before answering, each participant records on stable storage:
1) All information required for locally committing the sub-transaction.
2) A "ready" log record.
The coordinator records a "prepare" log record on stable storage, which contains the identities of all the participants, and activates a time-out mechanism.
10
Phase 2
The coordinator records its decision, "global commit" or "global abort", on stable storage.
The coordinator informs all the participants of its decision.
All participants write a commit or abort record in the log (ensuring the local sub-transaction's outcome will not be lost).
All participants send a final acknowledgment message to the coordinator and perform the actions required for committing or aborting the sub-transaction.
The coordinator writes a "complete" record on stable storage.
11
Basic 2-Phase-Commit Protocol
Coordinator:
  Write a "prepare" record in the log;
  Send PREPARE message and activate time-out.
Participant:
  Wait for PREPARE message;
  If the participant is willing to commit then begin
    Write sub-transaction's records in the log;
    Write "ready" record in the log;
    Send READY answer message to coordinator
  end
  else begin
    Write "abort" record in the log;
    Send ABORT answer message to coordinator
  end
12
Basic 2-Phase-Commit Protocol (cont’d)
Coordinator:
  Wait for ANSWER message (READY or ABORT) from all participants, or time-out;
  If time-out expired or some answer message is ABORT then begin
    Write "global_abort" record in the log;
    Send ABORT command message to all participants
  end
  else (* all answers arrived and were READY *) begin
    Write "global_commit" record in the log;
    Send COMMIT command message to all participants
  end
13
Basic 2-Phase-Commit Protocol (cont’d)
Participant:
  Wait for command message;
  Write "abort" or "commit" record in the log;
  Send ACK message to coordinator;
  Execute command.
Coordinator:
  Wait for ACK messages from all participants;
  Write "complete" record in the log.
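The coordinator's decision rule can be sketched as follows (a minimal sketch; the function and message names are illustrative):

```python
# Sketch of the coordinator's phase-1/phase-2 decision in two-phase commit.
# Participants answer "READY" or "ABORT"; a missing answer models a time-out.

def decide(answers, n_participants):
    """Global commit only if every participant answered READY in time."""
    if len(answers) < n_participants:      # time-out: someone never answered
        return "global_abort"
    if any(a == "ABORT" for a in answers):
        return "global_abort"
    return "global_commit"

assert decide(["READY", "READY", "READY"], 3) == "global_commit"
assert decide(["READY", "ABORT", "READY"], 3) == "global_abort"
assert decide(["READY", "READY"], 3) == "global_abort"   # time-out case
```

Note the asymmetry the slides describe: a single ABORT (or silence) forces a global abort, while a global commit needs a unanimous READY.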
14
Elimination of the PREPARE Message: 2P Commit
Coordinator:
  Write prepare record in the log;
  Request operations from participants, and activate time-out;
  Wait for completion of participants (READY messages) or time-out expiry;
  Write global_commit or global_abort record in the log;
  Send command message to all participants.
Participant:
  Receive request for operation;
  Perform local processing and write log records;
  Send READY message and write ready record in the log;
  Wait for command message;
  Write commit or abort record in the log;
  Execute command.
15
The Consistency Problem in a Distributed Database System
Multiple copies of the same data at different sites improve:
Availability
Response time
Every update results in a local execution and a sequence of updates sent to the various sites where a copy of the data resides.
16
Concurrency Control
Purpose:
To give each user the illusion that he is executing alone on a dedicated system when, in fact, many users are executing simultaneously on a shared system.
Goals:Mutual consistencyInternal consistency
Problems:Data stored at multiple sitesCommunication delays
Concurrency control is a component of a distributed database management system.
17
Criteria For Consistency
Mutual consistency among the redundant copies.
Internal consistency of each copy.
Any alteration of a data item must be performed in all the copies.
Two alterations to a data item must be performed in the same order in all copies.
18
The Problem
Site A and Site B each hold a copy of the record: Part # 102116, Price $10.00.
Two simultaneous transactions update the price, one to $15.00 and one to $12.00.
Possible result: one site ends up with 102116 at $12.00 and the other with 102116 at $15.00.
Mutual consistency is not preserved.
19
The Solution
Mutual consistency can be ensured by the use of timestamps. Given the update message:
TS 87, 6, 1, 12.01: ID 102116, PRICE $4.00; ID 103522, PRICE $7.50
The DB copy was:
102116  $2.50  TS 87, 5, 15, 9.12
103522  $7.90  TS 87, 6, 1, 12.15
After the update:
102116  $4.00  TS 87, 6, 1, 12.01
103522  $7.90  TS 87, 6, 1, 12.15
(102116 is updated because its stored timestamp is older than the message's; 103522 is left unchanged because its stored timestamp is newer.)
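The rule behind the example, discarding any update older than the stored copy, can be sketched as follows (illustrative names; timestamps are tuples in the slide's year, month, day, time order):

```python
# Sketch of timestamp-based mutual consistency: an update is applied to a copy
# only if its timestamp is newer than the timestamp stored with the item.
# Values follow the slide's example.

copy = {
    102116: (2.50, (87, 5, 15, 9.12)),
    103522: (7.90, (87, 6, 1, 12.15)),
}

def apply_update(ts, items):
    for item_id, value in items.items():
        _, stored_ts = copy[item_id]
        if ts > stored_ts:              # discard out-of-date updates
            copy[item_id] = (value, ts)

apply_update((87, 6, 1, 12.01), {102116: 4.00, 103522: 7.50})
assert copy[102116] == (4.00, (87, 6, 1, 12.01))   # older copy: update applied
assert copy[103522] == (7.90, (87, 6, 1, 12.15))   # newer copy: update ignored
```

Because every site applies the same "newest timestamp wins" rule, all copies converge to the same values regardless of message arrival order.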
20
X, Y, and Z are three data fields such that X + Y + Z = 3.
Site 1: X=1, Y=1, Z=1.  Site 2: X=1, Y=1, Z=1.
Suppose Site 1 executes X := -1, Y := 3 and Site 2 executes Y := -1, Z := 3.
Possible result at both sites: X = -1, Y = -1, Z = 3 (the sum is 1, violating the constraint).
Mutual consistency was preserved but internal consistency was not.
21
A Good Solution Must Be
Deadlock free
Speed independent
Partially operable
22
Concurrency Control
Correctness => serializable executions. A serializable execution is equivalent to some serial execution, and a serial execution has no concurrency.
Two operations are said to conflict if they operate on the same data item and at least one is a write.
Two types of conflicts:Write-Write (WW)Read-Write (RW)
Bernstein and Goodman: separate techniques may be used to ensure RW and WW synchronization.
The two techniques can be “glued” together via an interface, which assures one serial order consistent with both.
23
Definitions of Concurrency Control
A schedule (history or log) is a sequence of operations performed by transactions.
S1: Ri(x) Rj(x) Wi(y) Rk(y) Wj(x)
Two transactions Ti and Tj execute serially in a schedule S if the last operation of Ti precedes the first operation of Tj in S; otherwise they execute concurrently in it.
A schedule is serial if no transactions execute concurrently in it.
For example:
S2: Ri(x) Wi(x) Rj(x) Wj(y) Rk(y) Wk(x) = Ti Tj Tk
Given a schedule S, operation Oi precedes Oj (Oi < Oj) if Oi appears to the left of Oj in S.
A schedule is correct if it is serializable, i.e., it is computationally equivalent to a serial schedule.
24
Serializability in a Distributed DatabaseSerializability of local schedules is not sufficient to ensure the correctness of the executions of a set of distributed transactions.
For example: S1: Ti < Tj, S2: Tj < Ti (each locally serializable, but with conflicting serialization orders).
Thus, the execution of T1, ..., Tn is correct if:
1) Each local schedule Sk is serializable.
2) There exists a total ordering of T1, ..., Tn such that if Ti < Tj in this total ordering, then there is a serial schedule Sk' such that Sk is equivalent to Sk' and Ti < Tj in Sk' for each site k.
25
Consistency Control Techniques
Timestamps
Locking
  Primary site locking
Exclusive-Writer
  Exclusive-writer using sequence numbers
  Exclusive-writer using sequence numbers with lock option
26
Two-Phase Locking (2PL)
Read and write locks
Locks may be obtained only if they do not conflict with a lock owned by another transaction.
Conflicts occur only if the locks refer to the same data item and:
RW - one is a read lock and the other is a write lockWW - both are write locks
"Two-phased-ness":
Growing phase
Locked point
Shrinking phase
Once a lock is released, no new locks may be obtained.
The locked point determines the serialization order.
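The two-phase rule itself can be sketched as follows (the class name is illustrative, not from the source):

```python
# Sketch of the two-phase rule: a transaction may acquire locks only while it
# has released none (growing phase); after its first release (the locked
# point), it may only release (shrinking phase).

class TwoPhaseTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock after first release")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True          # the locked point has passed
        self.held.discard(item)

t = TwoPhaseTxn()
t.lock("x"); t.lock("y")               # growing phase
t.unlock("x")                          # shrinking phase begins
try:
    t.lock("z")
    violated = False
except RuntimeError:
    violated = True
assert violated
```

Transactions are serialized in the order of their locked points, which is why the rule guarantees serializability.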
27
Centralized Locking Algorithm
All requests for resources must be sent to one site, called the lock controller.
The lock controller maintains a lock table for all the resources in the system.
Before a site can execute a transaction, it must obtain the required locks from the lock controller.
Advantage:
Fewer messages required for setting locks than in distributed locking algorithms.
Disadvantages:
Poor reliability; a backup system is required.
The lock controller must handle a large traffic volume.
28
Distributed Locking Algorithm
1. Wait for a transaction request from a user.
2. Send n lock-request messages.
3. In case of any lock reject, send lock releases and go to 2 to retry after a random interval of time.
4. Perform the local transaction and send n update messages.
5. Wait for update ACK messages.
6. Send n lock releases, notify the user that the transaction is done, go to 1.
5n computer-to-computer messages
Time consuming
Long delay
29
Solutions with 2n Transmissions
Nodes are organized in a ring structure.
Requests propagate around the loop and are accepted when they return to the sender.
Update messages (propagated in the same manner) are their own completion ACK.
Priorities resolve simultaneous requests.
Serial propagation increases delay; good for small networks.
30
Voting Algorithm
The database manager process (DBMP) sends an update request to the other DBMPs.
The request contains the variables that participate in the query, with their timestamps, and the new values for the updated variables.
Each DBMP votes OK, REJ, or PASS, or defers voting.
The update is rejected if any DBMP rejects it.
The transaction is accepted if a majority of the DBMPs voted OK.
If two requests are both accepted, then at least one DBMP voted OK for both.
Broadcast: 2.5n transmissions. Daisy chain: 1.5n transmissions.
31
Primary Site Locking (PSL)
[Protocol diagram: a transaction sends Lock-Request B to the primary site (PS) and, after an intercomputer synchronization delay, receives Lock-Grant B; Update A and Update B are then propagated to the copies. Legend: task execution; intercomputer synchronization delay.]
32
Characteristics of Primary Site Locking
Serializability
Mutual consistency
Moderate to high complexity
Can cause deadlocks
Inter-computer synchronization delays
33
Variable Level of Synchronization
A global database lock is not required by most transactions.
Different types of transactions need different levels of synchronization.
The level of synchronization can be represented by algorithms (protocols) which are executed when a transaction is requested.
Goal: each transaction should run under the protocol that gives the least delay without compromising consistency.
In SDD-1, four protocols are available; different levels of synchronization yield different delays.
Disadvantage: high overhead costs.
34
[Figure: The Exclusive-Writer approach. Transactions TM1, TM2, ..., TMM send update-request messages to the exclusive writer (EW), which alone reads and updates the shared file F; all other access to F is read-only.]
35
The Exclusive-Writer Protocol
Implementation requirements:
Each copy of a file has an update sequence number (SN).
Operation:
Only the exclusive writer (EW) distributes file updates.
Updating tasks send update-request messages (update and SN) to the EW's site.
Before the EW distributes a file update, it increments the file's SN.
The EW detects a data conflict when the SN in an update-request is less than the SN of the corresponding file at the EW's site.
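The EW's conflict check can be sketched as follows (illustrative names; a single file and a global SN stand in for per-file state):

```python
# Sketch of the exclusive writer's conflict detection: an update-request
# carries the SN of the copy the requester read. If the EW's copy has moved
# past that SN, the request was based on stale data and is discarded.

ew_sn = 7                  # sequence number of the file at the EW's site

def handle_update_request(request_sn):
    global ew_sn
    if request_sn < ew_sn:             # requester read a stale copy
        return ("discard", ew_sn)
    ew_sn += 1                         # increment the SN, then distribute
    return ("distribute", ew_sn)

assert handle_update_request(7) == ("distribute", 8)
assert handle_update_request(7) == ("discard", 8)   # the second writer loses
```

Two tasks that both read the copy at SN = 7 conflict: whichever request reaches the EW first wins, and the other is detected as stale.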
36
[Timing diagram for the Exclusive-Writer Protocol (EWP). Transactions TMA and TMB access only file FK, which is replicated at all sites; SNK,I denotes the update sequence number for the copy of FK at Site I. TMA's update-request (SN = N) reaches the EW first, so the EW increments the SN and distributes Update A with SN = N+1. TMB's later update-request still carries SN = N, so the EW detects the conflict and discards it, optionally sending a notification of discard. Legend: TE = transaction execution response time, TU = update confirmation response time, + = update is written, * = update-request is discarded.]
37
[Figure: A distributed processing system which uses the EWP. Tasks T1, ..., TK and exclusive writers EW1, EW2, ..., EWJ are connected by an interconnection network, with copies of files F1, F2, F3 distributed among the sites. TJ = task J, EWJ = exclusive writer for FJ, FJ = file J.]
38
[Protocol diagram for the exclusive writer with lock option (EWL). TMA's update-request (SN = N) is applied as Update A with SN = N+1. TMB's conflicting update-request is not discarded; instead the file is locked, Lock-Grant B is returned to TMB's site, and TMB's Update B is then distributed with SN = N+2. Legend: TM execution; file is locked; file is unlocked; D = intercomputer synchronization delay.]
39
Comparison of PSL and the EWP
PSL:
No discarded updates
Inter-computer synchronization delays
Can cause deadlocks
EWP:
Conflicting updates are discarded (EWP without lock option)
No inter-computer synchronization delays
Lower message volume than PSL
Design issues:
Selection of the primary site and exclusive-writer site
Limited replication of shared files
Performance analysis:
Volume of messages
Response time
Processing overhead
40
[Timing diagram for basic timestamp concurrency control. Transactions TMA and TMB access only file FK, which is replicated at all sites, and TS(B) < TS(A). Update A with TS(A) reaches Site J late, after Update B with timestamp TS'(B) has been accepted, so A is at first rejected and a database rollback is required before the updates are installed in timestamp order. Legend: TE = transaction execution response time, TU = update confirmation response time, + = update is written, * = database rollback, TS = timestamp.]
41
Escrow Locking (Pat O'Neil)
For updating numerical values, e.g., money or disk space.
Uses a primary site.
Locks in advance only the required amount.
Releases excess resources after execution.
Advantages:
Less lock conflict, therefore more data availability.
Less concurrency-control overhead per update; good for long transactions.
Disadvantage:
Weak data consistency: data usually are inconsistent, but within a certain bound.
EXAMPLE:
Bank account with $50.
Need to clear a check for up to $30.
Escrow-lock $30; the other $20 is still available.
If the check is only for $25, return the remaining $5.
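The bank-account example can be sketched as follows (illustrative class and method names):

```python
# Sketch of escrow locking on a numeric balance (the slide's $50 account):
# only the requested amount is held in escrow; the rest stays available
# to concurrent transactions.

class EscrowAccount:
    def __init__(self, balance):
        self.balance = balance
        self.escrowed = 0

    def available(self):
        return self.balance - self.escrowed

    def escrow(self, amount):
        if amount > self.available():
            raise RuntimeError("insufficient funds to escrow")
        self.escrowed += amount

    def settle(self, escrowed, actual):
        """Spend `actual`, returning the unused part of the escrow."""
        self.escrowed -= escrowed
        self.balance -= actual

acct = EscrowAccount(50)
acct.escrow(30)                 # hold up to $30 for the check
assert acct.available() == 20   # the other $20 is still usable concurrently
acct.settle(30, 25)             # the check was only $25; $5 is released
assert acct.balance == 25 and acct.escrowed == 0
```

Unlike an exclusive lock on the whole balance, the escrow leaves the remaining $20 free, which is the source of the reduced lock conflict.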
42
Escrow Locking Under Partitioning
Similar to PSL: only the primary-site partition can update, and the primary site may be isolated in a small partition.
Further, escrows may be outstanding when partitioning occurs.
Solution: grant an escrow amount to each partition,
based on user profile/history, or
based on the size/importance of the partitions.
43
Escrow Locking Under Partitioning (cont’d)
EXAMPLE: escrow amount = total amount / number of partitions.
Bank account with $50: if two partitions occur, escrow $25 in each partition (for that partition to use). If some update requires $30, that update will be blocked.
Based on historical information, give different escrow portions to different partitions.
E.g., escrow for partition A = $35, escrow for partition B = $15.
Use normal escrow locking in each partition; reconcile the database afterwards.
44
Quasi Copies for Federated Databases
Hector Garcia-Molina, IEEE DE 1989
Every database has a single controlling site.
Updates are propagated to the other (read-only) sites when:
•the value changes (in percentage) by p > w
•the value changes (in absolute value) by a > x
•a timeout period expires, t > y
•a fixed number of updates accumulates, u > z
•some Boolean combination holds, e.g., (t > y) AND (p > w)
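The propagation conditions can be sketched as one predicate (the function name is illustrative; the parameter names w, x, y, z follow the slide):

```python
# Sketch of a quasi-copy refresh predicate: propagate an update to the
# read-only sites only when one of the slide's conditions holds.

def must_refresh(old, new, elapsed, updates_seen, w, x, y, z):
    pct_change = abs(new - old) / abs(old) * 100 if old else float("inf")
    return (pct_change > w            # changed by more than w percent
            or abs(new - old) > x     # changed by more than x absolutely
            or elapsed > y            # timeout period expired
            or updates_seen > z)      # too many updates accumulated

# A 4% price move with w = 10: no refresh yet.
assert not must_refresh(100.0, 104.0, elapsed=10, updates_seen=1,
                        w=10, x=20, y=3600, z=5)
# The same change, but the timeout y has expired: refresh.
assert must_refresh(100.0, 104.0, elapsed=4000, updates_seen=1,
                    w=10, x=20, y=3600, z=5)
```

Tightening w, x, y, z trades propagation overhead against how stale the quasi copies are allowed to become.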
45
Quasi Copies for Federated Databases (cont’d)
Advantage:
•Reduced update overhead
Disadvantages:
•Weak concurrency control
•Remote reads may return out-of-date information, but the staleness is still guaranteed to be within a certain tolerance
Examples:
•Catalog: prices are guaranteed for one year
•Government: old census data might be used to determine current representation
46
A Simple Deadlock
Process 1 holds A and needs B.
Process 2 holds B and needs A.
[Figure: P1 holds resource A and waits for B; P2 holds resource B and waits for A.]
47
Deadlock Prevention Mechanisms
Deadlock Detection Mechanisms
48
Deadlock Prevention
Priority via timestamps:
A transaction's timestamp is the time at which it begins execution.
Old transactions have higher priority than younger ones.
49
Timestamp Deadlock Prevention Schemes
Assume older transaction has higher priority than younger transaction.
Wait-Die: a non-preemptive technique.
If Ti requests a lock on a data item already locked by Tj, and Ti has higher priority than Tj (i.e., Ti is older than Tj), then Ti is permitted to wait. If Ti is younger than Tj, then Ti is aborted ("dies") and restarts with the same timestamp.
"It is always better to restart the younger transaction."
50
Timestamp Deadlock Prevention Schemes (cont'd)
Wound-Wait: the preemptive counterpart to wait-die.
Assume Ti requests a lock on a data item already locked by Tj. If Ti is younger than Tj, then Ti is permitted to wait. If Ti is older than Tj, then Tj is aborted and the lock is granted to Ti.
"Allow older transactions to pre-empt younger ones, and therefore only younger transactions wait for older ones."
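The two schemes differ only in who yields; a sketch (illustrative function names; a smaller timestamp means older, hence higher priority):

```python
# Sketch of the two timestamp-based prevention schemes. In both, Ti requests
# a lock that Tj currently holds.

def wait_die(ts_i, ts_j):
    return "wait" if ts_i < ts_j else "abort Ti"     # younger requester dies

def wound_wait(ts_i, ts_j):
    return "wait" if ts_i > ts_j else "abort Tj"     # older requester wounds Tj

assert wait_die(1, 5) == "wait"        # older Ti may wait
assert wait_die(5, 1) == "abort Ti"    # younger Ti dies (restarts, same ts)
assert wound_wait(5, 1) == "wait"      # younger Ti waits for older Tj
assert wound_wait(1, 5) == "abort Tj"  # older Ti pre-empts younger Tj
```

In both schemes, only younger transactions ever wait for older ones, so the wait-for graph can never contain a cycle and deadlock is prevented.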
51
Transactions
T1: Begin; Read(X); Write(Y); End
T2: Begin; Read(Y); Write(Z); End
T3: Begin; Read(Z); Write(X); End
Database (replicated copies at data managers A, B, C):
DM A holds x1, y1; DM B holds x2, y2, z2; DM C holds z3.
•Suppose the transactions execute concurrently, with each transaction issuing its READ before any transaction issues its END.
•This partial execution could be represented by the following logs:
DM A: r1[x1]   DM B: r2[y2]   DM C: r3[z3]
Deadlock
52
•At this point, T1 has a readlock on x1, T2 has a readlock on y2, and T3 has a readlock on z3.
•Before proceeding, all transactions must obtain writelocks:
T1 requires a writelock on y2,
T2 requires a writelock on z3,
T3 requires a writelock on x1.
•But T1 cannot get the writelock on y2 until T2 releases its readlock; T2 cannot get the writelock on z3 until T3 releases its readlock; T3 cannot get the writelock on x1 until T1 releases its readlock. This is a deadlock.
Deadlock (cont’d)
53
Wait-For Graphs: directed graphs that indicate which transactions are waiting for which other transactions.
Nodes: transactions.
Edge Ti -> Tj: Ti is waiting for a lock currently owned by Tj.
If the wait-for graph contains a cycle, then there is a deadlock.
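The cycle test can be sketched with a depth-first search (the slides only state the cycle condition; DFS is one assumed way to check it, and the names here are illustrative):

```python
# Sketch of deadlock detection: represent the wait-for graph as a mapping and
# look for a cycle with a depth-first search.

def has_deadlock(wait_for):
    """wait_for maps each transaction to the transactions it waits for."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wait_for}

    def dfs(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:     # back edge: cycle found
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wait_for)

# The three-transaction example from these slides: T1 -> T2 -> T3 -> T1.
assert has_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]})
assert not has_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": []})
```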
Dead Lock Detection
54
Waits-for Graph for Figure 6
T1 must wait for T2 to release its read-lock on y2.
T2 must wait for T3 to release its read-lock on z3.
T3 must wait for T1 to release its read-lock on x1.
[Figure: the cycle T1 -> T2 -> T3 -> T1.]
55
Multi-Site Deadlock
•Consider the execution illustrated in the figures.
•Locks are requested at the DMs in the following order:
DM A: readlock x1 for T1; writelock y1 for T1; *writelock x1 for T3
DM B: readlock y2 for T2; writelock z2 for T2; *writelock y2 for T1
DM C: readlock z3 for T3; *writelock z3 for T2
None of the starred locks can be granted, and the system is in deadlock. However, the waits-for graph at each DM is acyclic:
DM A: T3 -> T1; DM B: T1 -> T2; DM C: T2 -> T3.
56
Implementation Approaches
Centralized: periodically (e.g., every few minutes) each scheduler sends its local wait-for graph to the deadlock detector. The deadlock detector combines the local graphs into a system-wide wait-for graph by constructing the union of the local graphs.
Hierarchical: the database sites are organized into a hierarchy (or tree) with a deadlock detector at each node of the hierarchy. Deadlocks local to a single site are detected at that site; deadlocks involving two or more sites are detected by the regional deadlock detector, and so on.
57
Deadlock Detection in Distributed Systems
Local wait-for graphs are not sufficient to characterize all deadlocks in a distributed system. Instead, the local wait-for graphs must be combined into a more global wait-for graph. Centralized 2PL does not have this problem since there is only one lock scheduler. In the case of distributed lock scheduling, however, the coordination task becomes very complex.
58
Periodic transmission of a local wait-for graph can cause the following two problems:
1) Deadlock may not be detected right away.
2) Phantom deadlock: a transaction T may restart for reasons other than concurrency control (e.g., its site crashed). Until T's restart propagates to the deadlock detector, the detector can find a cycle in the wait-for graph that includes T, even though no real deadlock exists.
59
Deadlock Avoidance Method Used in Practice
In the absence of a general analysis that determines the tradeoffs among deadlock avoidance methods, and given the limited understanding of the complex interrelationships of deadlock among protocol levels, commercial systems and real-life applications rely mainly on the most experienced method: two-phase locking with time-out-based deadlock detection.
60
Resilient Commit Protocol
If an update is posted at any operating site, all other operating sites that keep a copy of the file will eventually receive the update regardless of multiple failures.
61
For Loosely Coupled Systems
Conventional techniques:
The two-phase commit protocol has a blocking problem when the coordinator fails.
The three-phase commit protocol is non-blocking, but too time consuming.
Therefore, these techniques are not suitable for real-time system applications.
62
A Low-Cost Commit Protocol
Assumptions:
No network partitions.
A reliable network (i.e., no loss of messages, no out-of-sequence messages).
"I am alive" messages are periodically exchanged among the sites for failure detection (in lieu of time-outs and acknowledgment messages).
Failed sites are not allowed to rejoin the system during a mission time (this could be relaxed).
All the updates required for a transaction reside at the coordinator site.
63
Commit Protocol Procedures
For each file, sites are numbered and updates are sent in this sequence.
Updates are posted immediately after being received.
Each site saves the last update from every other site.
When a site failure is detected, the smallest-numbered surviving site retransmits the last update received from the failed site, in the numbered sequence.
The update sequence number is used to detect duplicates.
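The duplicate-detection step, which makes the recovery retransmissions harmless, can be sketched as follows (illustrative names):

```python
# Sketch of the protocol's duplicate filter: each site posts an update only if
# its sequence number is new, so a retransmission by the recovery site is
# discarded harmlessly at sites that already received it.

class Site:
    def __init__(self):
        self.sn = 0              # SN of the last update posted
        self.last_update = None  # saved for possible retransmission

    def receive(self, sn, update):
        if sn <= self.sn:                # retransmitted duplicate
            return "discarded"
        self.sn = sn
        self.last_update = (sn, update)
        return "posted"

s = Site()
assert s.receive(1, "u1") == "posted"
assert s.receive(1, "u1") == "discarded"   # retransmission after a failure
assert s.receive(2, "u2") == "posted"
```

Because posting is idempotent under this filter, the smallest-numbered survivor can always retransmit without risking double application.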
64
The Resilient Commit Protocol: No-Failure Case
[Diagram: starting from SN = N at Sites 1-4, Site 1 sends the update to Sites 2, 3, and 4 in the numbered sequence, and each site posts it with SN = N+1. SN = sequence number.]
65
The Resilient Commit Protocol: Recovery From a Site Failure
[Diagram: Site 1 fails partway through broadcasting its update (SN = N+1). When the failure is detected, the smallest-numbered surviving site retransmits the last update received from Site 1 in the numbered sequence; sites that have already posted SN = N+1 detect the duplicate by its sequence number and discard it.]
66
Reducing Messages For Failure Recovery
A coordinator sends an update complete (UC) message to the smallest numbered site after completing an update broadcast.
When a site receives the UC message, it discards the saved update.
This will eliminate unnecessary retransmission of completed updates.
67
The Resilient Commit Protocol with Update Complete Message
[Diagram: Site 1 broadcasts the update (SN = N+1) to Sites 2, 3, and 4, then sends an update complete (UC) message to the smallest-numbered site, which discards its saved copy of the update. UC = update complete message.]
68
Resilient EWP Operation
Sites are numbered and the site with the smallest number is selected as the EW.
EW sends an update to other sites in the number sequence (i.e., lowest numbered site first, highest numbered site last).
Each non-EW site should save the last update received from the EW.
When EW fails, the site with the next smallest number becomes the new EW and retransmits the last update received from the old EW in the number sequence.
69
The Resilient EWP: No-Failure Case
[Diagram: a transaction module (TM) at Site 2 sends an update-request to Site 1, the EW; the EW increments the SN and distributes the update with SN = N+1 to Sites 2, 3, and 4 in the number sequence. TM = transaction module.]
70
The Resilient EWP: Recovery From the EW Failure
[Diagram: the old EW (Site 1) fails after sending the update (SN = N+1) to only some of the sites. Site 2, the next-smallest-numbered site, becomes the new EW and retransmits the last update received from the old EW in the number sequence; sites that have already posted SN = N+1 detect the duplicate and discard it. TM = transaction module, * = failure is detected / duplicate update is discarded.]
71
Resilient PSL Operation
Sites are numbered; the site with the smallest number is assigned as the PS.
Updates are broadcast in the number sequence.
Each site saves the last update from every other site.
When a non-PS failure is detected:
If the failed site is holding a lock, the lock is released by the PS.
If the failed site has made a lock-request, the lock-request is discarded by the PS.
Otherwise, the PS broadcasts the last update received from the failed site in the number sequence.
When the PS fails:
The site with the next smallest number becomes the new PS.
The new PS broadcasts the last update received from the old PS in the number sequence.
To resume lock management, the new PS requests the lock status of the other sites.
72
The Resilient PSL: No-Failure Case
[Diagram: a transaction module (TM) at Site 2 sends a lock-request to Site 1, the PS; after the lock-grant, the update with SN = N+1 is broadcast to all sites in the number sequence. TM = transaction module.]
73
The Resilient PSL: A Non-PS Site Failure Case
[Diagram: Site 2 fails after broadcasting its update (SN = N+1) to only some of the sites. When the failure is detected, the PS retransmits the last update received from Site 2 in the number sequence; sites that have already posted SN = N+1 discard the duplicate. TM = transaction module, * = failure is detected / duplicate update is discarded.]
74
The Resilient PSL: PS Site Failure Case
[Diagram: the old PS (Site 1) issues Lock-Grant A and fails while broadcasting Update A (SN = N+1). Site 2 becomes the new PS, retransmits Update A in the number sequence, and sends lock-status requests (LSR) to the other sites to resume lock management; the pending Lock-Request B is held until recovery completes, after which Lock-Grant B is issued. TM = transaction module, LSR = lock-status request.]
75
Site Recovery
The recovering site undoes its last update (along with its SN) and broadcasts an "I am up" message.
The recovering site is given a site number larger than any surviving site's number.
An operating site is selected to provide all lost updates to the recovering site.
The posting of newly incoming updates at the recovering site is postponed until all lost updates have been received.
76
Summary
Updates are broadcast in one phase according to a pre-assigned site sequence.
Parameter: the frequency of "I am alive" messages.
Additional overhead is required only when a failure occurs.
The resilient commit protocol can be incorporated into concurrency control techniques (e.g., PSL, EWP).
The resilient commit protocol is suitable for real-time applications.
77
Distributed Query Processing
78
A Query Processing Example For A Distributed Database System
Database (suppliers, parts, and supply):
S (S#, CITY): 10,000 tuples, stored at site A
P (P#, COLOR): 100,000 tuples, stored at site B
SP (S#, P#): 1,000,000 tuples, stored at site A
Assume that every tuple is 100 bits long.
Query: supplier numbers for London suppliers of red parts.
SELECT S.S#
FROM S, SP, P
WHERE S.CITY = 'LONDON'
  AND S.S# = SP.S#
  AND SP.P# = P.P#
  AND P.COLOR = 'RED'
[Figure: S and SP reside at site A; P resides at site B.]
79
A Query Processing Example For A Distributed Database System (Cont’d)
Estimates (cardinalities of certain intermediate results):Number of red parts = 10Number of shipments by London suppliers = 100,000
Communication assumptions:Data rate = 10,000 bits per secondAccess delay = 1 second
T[i] = total access delay + (total data volume / data rate)
     = (number of messages * 1) + (total number of bits / 10,000)
(measured in seconds).
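Two of the estimates in the table that follows can be reproduced directly from this formula (a small sketch using the stated assumptions: 10,000 bits/s data rate, 1 s access delay per message, 100-bit tuples):

```python
# Communication time under the slide's assumptions:
# 1 second access delay per message, 10,000 bits/second data rate.

def comm_time(n_messages, n_bits):
    return n_messages * 1 + n_bits / 10_000        # seconds

# Strategy 1: move P (100,000 tuples of 100 bits) to site A in one message.
t1 = comm_time(1, 100_000 * 100)
assert round(t1 / 60, 1) == 16.7                   # about 16.7 minutes

# Strategy 6: move the 10 red parts to site A in one message.
t6 = comm_time(1, 10 * 100)
assert round(t6) == 1                              # about 1 second
```

The five-orders-of-magnitude spread between the strategies comes almost entirely from how much data each one ships.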
80
Communication Time for Selected Distributed Query Processing Strategies

Strategy  Technique                                               Communication Time
1         Move P to A                                             16.7 min
2         Move S and SP to B                                      28 hr
3         For each London shipment, check the corresponding part  2.3 days
4         For each red part, check for a London supplier          20 sec
5         Move London shipments to B                              16.7 min
6         Move red parts to A                                     1 sec
81
Distributed Query Processing Problem
Given a query that references information stored in several different sites:
1. Decompose it into a set of sub-queries or operations to be performed at individual sites.
2. Determine the site for performing each operation.
82
Cost Of A Query Processing Policy
Depends on:
Volume of data traffic,
Sequence of operations,
Degree of parallelism,
Sites of operations.
83
Query Tree
A query tree represents a sequence of operations that produces the correct result.
Given an arbitrary query tree, a set of equivalent query trees can be generated using the commutativity, associativity, and distributivity properties of query operations.
84
Properties of Query Operations
Commutativity
Associativity
Distributivity
These apply to: unary operations (e.g., selection, projection); binary operations (e.g., join, union, intersection, difference, division); adjacent unary operations; adjacent binary operations; adjacent unary and binary operations.
85
Query Tree (Example)
(A) A query tree representing (A U B) * C.
(B) An equivalent query tree representing (A * C) U (B * C), obtained by distributing the join over the union.
86
Placement of Unary Operations in a Query Tree
Theorem 1
Placing each unary operation at the lowest possible position in a query tree is a necessary condition to obtain the optimal query processing policy.
CorollaryIf the optimal placement of a unary operation is adjacent to two binary operations, then a necessary condition to obtain the optimal query processing policy is to process the unary operation at the same site as the binary operation that has the lower position in the tree (i.e., processed earlier).
87
88
Query Processing Graph
A query processing graph specifies:
1. The sequence of operations.
2. The groups of operations performed at a single site.
Storage nodes have no inputs and represent initial operations on a file.
Execution nodes have one or more inputs and represent multi-file operations.
89
Theorem 2
For each execution node of the graph, selecting the storage node site that sends the largest amount of data to that execution node as the site for performing its operations yields minimum operating cost for that graph.
90
Theorem 3
For a given query processing graph that contains a multi-operation execution node consisting of a set of operations (i, ..., j), the sequence of operations from this set which has the least processing cost is used by the policy that has the least operating cost for this graph.
Corollary
Theorem 3 also holds for a query processing graph with more than one multi-operation execution node.
[Figure: a node may execute operations 1 and 2 in either order before operation 3; if C12 > C21, then the sequence 2, 1 is the better policy.]
91
Theorem 4
If the processing cost for a given operation is the same at all computers in the distributed database, and if the sequence of operations is fixed, then the processing policy that minimizes the communication cost (the total volume of traffic, when the communication cost between each pair of computers is equal) yields the lowest operating cost among the set of policies that use this fixed sequence of operations.
92
93
Applications of Theorems
Query tree optimization of unary operations: Theorem 1
Site selection: Theorem 2
Computation reduction: Theorem 3
Local optimal query policies: Theorem 4
94
Procedure For Finding The Optimal Query Processing Policy
Decompose the query into operations.
Generate the set of equivalent query trees using properties of query operations and Theorem 1.
Select sites using Theorem 2.
Eliminate certain graphs using Theorem 3.
Compute the communication cost for each graph.
Select local optimal policies based on Theorem 4.
Select the global optimal policy.
95
Example
Generate a listing of
<part number, supplier name, quantity>
for all “wheels” produced in Los Angeles in a quantity greater than 1,000 by any one supplier.
96
Computer Network for the Example
[Figure: a four-node network; files F1, F2, and F3 reside at nodes 1, 2, and 3, and the query originates at node 4.]
97
File Characteristics of the Distributed Database for the Example

File  Location  Contents                            Length (in bytes)
F1    1         (Part #, Part Name)                 10^5
F2    2         (Supplier #, Part #, Quantity)      10^5
F3    3         (Supplier #, Supplier Name, City)   10^4
98
0: Initial restriction and projection operations on files Fi for i = 1, 2, 3.
1: Join operation on part number.
2: Join operation on supplier number.
e: Transmission of the query result from the last operating site to the query originating site.
99
100
Optimal site selection for each query operation; the graph is obtained using Theorem 3.
[Table residue: columns Policy | Site for 1 | Site for 2 | Site for e | CC | PC; the row data is not recoverable.]
101
102
Query Graphs for Query Tree (A)
103
Query Graphs for Query Tree (B)
104
Operating Costs of the Three Cases for the Example

Case  Communication Cost  Processing Cost  Operating Cost
A      6.5                 9.71            16.21
A      5.5                10.21            15.71
B      6.5                 5.96            12.46
B      5.5                 8.46            13.96
C     11.0                 6.96            17.96
C     11.5                 9.08            20.58

C1 has been reduced from $0.5 x 10^-3/byte to $0.25 x 10^-3/byte.
Same as case B, however, the data reduction functions for operations 1 and 2 have been increased. Their new values are:
(1, f2, y) = 0.3
(2, f2, y) = 0.45
(2', f2, y) = (1', f2, y) = 0.4
105
Semi-Join
A semi-join program performs a set of operations that produce the same result as the join
R JN(A=B) S, where A and B are attributes of R and S.

Semi-join program (PJ = projection, SJ = semi-join, JN = join):
R' = R SJ(A=B) (PJ_B S)
R'' = S JN(A=B) R'

The leaves corresponding to S can be merged into the same node even when R and S reside on different sites.

[Figure: query tree with leaves R and S, intermediate result R', and final result R''.]
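A small sketch (names and data invented) showing that the semi-join program computes the same result as the direct join, while only the reduced relation R' needs to cross the network:

```python
# Semi-join identity: R JN(A=B) S == (R SJ(A=B) PJ_B S) JN(A=B) S.

def pj(rel, attr):                    # projection with duplicate removal
    return {t[attr] for t in rel}

def sj(r, a, s_vals):                 # semi-join: keep R-tuples matching S
    return [t for t in r if t[a] in s_vals]

def jn(r, a, s, b):                   # equi-join on r.a = s.b
    return [{**t, **u} for t in r for u in s if t[a] == u[b]]

R = [{"A": 1, "x": "p"}, {"A": 2, "x": "q"}, {"A": 3, "x": "r"}]
S = [{"B": 1, "y": "u"}, {"B": 3, "y": "v"}]

r_prime = sj(R, "A", pj(S, "B"))      # only R' would be shipped to S's site
assert jn(r_prime, "A", S, "B") == jn(R, "A", S, "B")
```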
106
107
108
Query Processing WithDomain Semantics
Wesley W. Chu
Computer Science DepartmentUniversity of California
Los Angeles
109
Query Optimization Problem
To find the sequence of operations that has the minimal processing cost.
110
Conventional Query Optimization (CQO)
For a given query:
Generate a set of queries that are equivalent to the given query.
Determine the processing cost of each such query.
Select the lowest-cost processing strategy among these equivalent queries.
111
Limitations of CQO
There are certain queries that cannot be optimized by Conventional Query Optimization.
For example, given the query:“Which ships have deadweight greater than 200 thousand tons?”
A search of the entire database may be required to answer this query.
112
The Use of Knowledge
ASSUMING THE EXPERT KNOWS THAT:
1. SHIP relation is indexed on ShipType. There are about 10 different ship types, and
2. the ship must be a “SuperTanker” (one of the ShipTypes) if the deadweight is greater than 150K tons.
AUGMENTED QUERY:
“Which SuperTankers have deadweight greater than 200K tons?”

RESULT:
About 90% of the search time is saved in answering the query.
The technique of improving queries with semantic knowledge is called Semantic Query Optimization.
113
Semantic Query Optimization (SQO)
Uses domain knowledge to transform the original query into a more efficient query that still yields the same answer.
Assuming a set of integrity constraints is available as the domain knowledge,
Represent each integrity constraint as P_i -> C_i, where 1 <= i <= n.
Translate (augment) the original query Q into Q' subject to C1, C2, ..., Cn, such that Q' yields a lower processing cost than Q.
Query Optimization Problem: find the C1, C2, ..., Cm that yield the minimal query processing cost; that is,
C(Q') = min over {Ci} of C(Q ^ C1 ^ ... ^ Cm)
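The augmentation step can be sketched as follows (a simplified illustration, with invented names; real SQO tests logical implication of the antecedent, while this sketch just checks for the matching predicate):

```python
# Each integrity constraint P_i -> C_i contributes its consequent C_i
# whenever the query's own predicates imply the antecedent P_i.

def augment(query_preds, constraints):
    """query_preds: set of predicate strings the query asserts.
    constraints: list of (antecedent_test, consequent) pairs, where
    antecedent_test(query_preds) -> bool. Returns the augmented query."""
    extra = {c for test, c in constraints if test(query_preds)}
    return query_preds | extra

# "deadweight > 150K => ShipType = SuperTanker"; the query's own
# predicate "deadweight > 200K" implies the antecedent:
constraints = [
    (lambda q: "deadweight > 200K" in q, "ShipType = 'SuperTanker'"),
]
q2 = augment({"deadweight > 200K"}, constraints)
assert "ShipType = 'SuperTanker'" in q2
```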
114
Semantic Equivalence
Domain knowledge of the database application may be used to transform the original query into semantically equivalent queries.
Semantic Equivalence: Two queries are considered to be semantically equivalent if they result in the same answer in any state of the database that conforms to the Integrity Constraints.
Integrity Constraints: a set of if-then rules that constrain the database to be an accurate instance of the real-world application. Examples of constraints include:
state snapshot constraints:e.g., if deadweight > 150K then ShipType = “SuperTanker.”
state transition constraints:e.g., salary can only be increased,i.e., salary (new) > salary (old).
115
Limitations of Current Approach
Current approach of SQO using :
Integrity constraints as knowledgeConventional data models
116
Limitations of Integrity Constraints
Integrity constraints are often too general to be useful in SQO, because:
Integrity constraints describe every possible database state, while the user is concerned only with the current database content.
Most databases do not provide integrity checking, due to:
Unavailability of integrity constraints
The overhead of checking integrity
Thus, the usefulness of integrity constraints in SQO is quite limited.
117
Limitations Of Conventional Data Models
Conventional data models lack expressive capability for modeling convenience. Many useful semantics are ignored; therefore, only limited knowledge is collected.
FOR EXAMPLE:
“Which employees earn more than 70K a year?”
The integrity constraint:
“The salary range of employees is between 20K and 90K.”
is useless in improving this query.
118
Augmentation Of SQO With Semantic Data Models
If the employees are divided into three categories: MANAGERS, ENGINEERS, and STAFF, and each category is associated with some constraints:
1. The salary range of MANAGERS is from 35K to 90K.
2. The salary range of ENGINEERS is from 25K to 60K.
3. The salary range of STAFF is from 20K to 35K.
A better query can be obtained:
“Which managers earn more than 70K a year?”
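The narrowing step can be sketched as a filter over the category ranges above (ranges are from the slide; the function name is illustrative):

```python
# Only categories whose salary upper bound exceeds the queried minimum
# can contain qualifying employees, so the query can be restricted to them.

RANGES = {"MANAGERS": (35_000, 90_000),
          "ENGINEERS": (25_000, 60_000),
          "STAFF":     (20_000, 35_000)}

def categories_possible(min_salary, ranges=RANGES):
    return [c for c, (lo, hi) in ranges.items() if hi > min_salary]

# Only MANAGERS can earn more than 70K, which yields the better query.
assert categories_possible(70_000) == ["MANAGERS"]
```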
119
Capture Database Semantics
By Rule Induction
120
Database Semantics
Database semantics can be classified into:
Database structure, which describes the interrelationships between database objects.
Database characteristics, which define the characteristics and properties of each object type.
However, only tools for modeling database structure are available; very few tools exist for gathering and maintaining database characteristics.
121
Knowledge Acquisition
A major problem in the development of a knowledge-based data processing system.
Knowledge Engineers: persons skilled in the use of expert-system tools.
Domain Experts: persons with expertise in the application domain.
The Process:
Studying the literature to obtain fundamental background.
Interacting with domain experts to elicit their expertise.
Translating the expertise into a knowledge representation.
Refining the knowledge base through testing and further interaction with domain experts.
A VERY TIME-CONSUMING TASK!
122
Knowledge Acquisition from Database
The database schema is defined according to the database semantics, and
database instances are constrained by the database characteristics.
Thus,
database characteristics can be induced from the database as semantic knowledge, and
the database schema can be a useful tool to guide knowledge acquisition.
123
Knowledge Acquisition By Rule Induction
Given an object hierarchy and a set of database instances contained in the object hierarchy, a set of classification rules can be induced by inductive learning techniques.
Given:
H - an object type hierarchy: H1, ..., Hn
S - the object schema
I - database instances representing H
Find:
D - a set of descriptions D1, ..., Dn such that for all x in I, if Di(x) is true, then x ISA Hi
Example: SUBMARINES contains SSN, SSBN
D_SSN: 2145 < Displacement < 6955
D_SSBN: 7250 < Displacement < 30000
124
Model-Based Knowledge Acquisition Methodology
The methodology consists of:
a Knowledge-based ER (KER) model,
a knowledge acquisition methodology, and
a rule induction algorithm.
KER is used as a knowledge acquisition tool when no knowledge specification is provided, or when the database already exists.
125
Knowledge-Based ER (KER) Model
To capture database characteristics, a Knowledge-based Entity Relationship (KER) model is proposed, extending the basic ER model to provide knowledge specification capability.
A KER schema is defined by the following constructs:
1. has-attribute/with (aggregation)
This construct links an object with other objects and specifies certain properties of the object.
2. isa/with (generalization)This construct specifies a type/subtype relationship between object types.
3. has-instance (classification)This construct links a type to an object that is an instance of that type.
The knowledge specification is represented by the with-constraint specification.
126
Components of the KER Diagram
127
A KER Diagram Example
128
Classification of Semantic Knowledge

Domain Knowledge:
Specifying the static properties of entities and relationships,
e.g., displacement in the range (0 - 30,000).
Intra-Structure Knowledge:
Specifying the relationships between attributes within an object (an entity or a relationship),
e.g., if the displacement is less than 7000, then it is a nuclear submarine.
Inter-Structure Knowledge:
Specifying relationships among the attributes of several entities in an aggregation relationship,
e.g., the instructor's department must be the same as the department of the class offered.
129
Knowledge Acquisition Methodology
To provide a systematic way of collecting domain knowledge guided by the database schema. It consists of three steps:
1. Schema generation, using KER:
a. Identify entities and associated attributes.
b. Identify type hierarchies by determining the class attributes of each type hierarchy.
c. Identify aggregation relationships; define each referential key as a class attribute.
2. Rule induction.
3. Knowledge base refinement.
130
Rule Induction Algorithm
Semantic rules for pair-wise attributes (X -> Y) are induced using relational operations.

Sketch of the Algorithm:
1. Retrieving (X, Y) value pairs.
Retrieve the instances of the (X, Y) pair from the database. Let S be the result.
2. Removing inconsistent (X, Y) value pairs.
Retrieve all the (X, Y) pairs in which the same value of X has multiple values of Y. Let T be the result, and let S = S - T.
3. Constructing rules.
For each distinct value y of Y in S, determine the value range of X and create a rule of the form
if x1 < X < x2 then Y = y.
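The three steps above can be sketched directly in code (the relation is modeled as a list of (X, Y) pairs; names are illustrative):

```python
from collections import defaultdict

def induce_rules(pairs):
    # 1. Retrieve the (X, Y) value pairs; let S be the result.
    s = set(pairs)
    # 2. Remove inconsistent pairs: X values mapped to multiple Y values.
    ys = defaultdict(set)
    for x, y in s:
        ys[x].add(y)
    s = {(x, y) for x, y in s if len(ys[x]) == 1}
    # 3. For each distinct value of Y, build the X range of its rule.
    by_y = defaultdict(list)
    for x, y in s:
        by_y[y].append(x)
    return {y: (min(xs), max(xs)) for y, xs in by_y.items()}

# Displacement -> submarine type, echoing the SSN/SSBN example:
data = [(2145, "SSN"), (6955, "SSN"), (7250, "SSBN"), (30000, "SSBN")]
rules = induce_rules(data)
assert rules == {"SSN": (2145, 6955), "SSBN": (7250, 30000)}
```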
131
132
Examples of Induced Rules
A prototype system was implemented at UCLA using a naval ship database as a test bed. Examples of induced rules are:
Entity: SUBMARINEx isa SUBMARINE
R1: if 0101 < x.Class < 0103 then x isa SSBN
R2: if 0201 < x.Class < 0215 then x isa SSN
R3: if Skate < x.ClassName < Thresher then x isa SSN
R4: if 2145 < x.Displacement < 6955 then x isa SSN
R5: if 7250 < x.Displacement < 30000 then x isa SSBN
133
Examples of Induced Rules (Cont'd)
Relationship: INSTALL
x isa SUBMARINE and y isa SONAR
R1: if SSN582 < x.Id < SSN601 then y isa BQS
R2: if SSN604 < x.Id < SSN671 then y isa BQQ
R3: if x.Class = 0203 then y isa BQQ
R4: if 0205 < x.Class < 0207 then y isa BQQ
R5: if 0208 < x.Class < 0215 then y isa BQS
R6: if y.Sonar = BQS-04 then x isa SSN
134
Pruning the Rule Set
When the number of generated rules becomes too large, the system must reduce the size of the knowledge base.
Two Criteria for Rule Pruning:
1. Coverage.
Keep the rules that are satisfied by more than Nc instances, and drop those satisfied by fewer than Nc instances.
2. Completeness.
Keep a rule schema (X -> Y) only if the total number of instances satisfied by the rules of that schema is greater than a coverage threshold Cc.
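The two criteria can be sketched as successive filters (rules are modeled as (schema, instance_count) pairs; thresholds Nc and Cc are parameters; names are illustrative):

```python
from collections import defaultdict

def prune(rules, nc, cc):
    # 1. Coverage: drop rules satisfied by fewer than nc instances.
    kept = [r for r in rules if r[1] >= nc]
    # 2. Completeness: keep a schema only if its surviving rules
    #    together cover more than cc instances.
    total = defaultdict(int)
    for schema, n in kept:
        total[schema] += n
    return [r for r in kept if total[r[0]] > cc]

rules = [("Class->Type", 40), ("Class->Type", 5), ("Id->Sonar", 8)]
assert prune(rules, nc=6, cc=20) == [("Class->Type", 40)]
```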
135
Induced Rules from Relation “PORT”
136
137
CLASS = (Type, Class, Name, Displacement, Draft, Enlist)
138
139
140
141
Rule Induction
142
143
Generate the Rules
Select targets
Targets are the RHS attributes of rules.

Methods of selection:
Use indices as targets.
Use selectivity:
selectivity = # of tuples with distinct values / total # of tuples
Targets are chosen based on the database schema (e.g., type hierarchy).

Generate rules for each target
144
Summary
Provides a model-based methodology for acquiring knowledge from the database by rule induction.

Applications:
1. Semantic query processing: use semantic knowledge to improve query processing performance.
2. Deductive database systems: use induced rules to provide intensional answers.
3. Data inference applications: use rules to improve data availability by inferring inaccessible data from accessible data.
145
Fault Tolerant DDBMS via Data Inference
146
Fault Tolerant DDBMS via Data Inference
Network Partition
Causes: failures of
channels
nodes
Effects:
Queries cannot be processed if the required data is inaccessible.
Replicated files in different partitions may become inconsistent.
Updates may be allowed in only one partition.
Transactions may be aborted.
147
Conventional Approach for Handling Network Partitioning
Based on syntax to serialize the operations and ensure data consistency:
Not all queries can be processed.
Based on data availability, determine which partition is allowed to perform database updates.
Poor availability!
148
New Approach
Exploit data and transaction semantics
Use the data inference approach
Assumption: data are correlated,
e.g., salary and rank; ship type and weapon
Infer inaccessible data
Use semantic information to permit updates under network partitioning
149
Query Processing System with Data Inference
Consists of:
DDBMS
Knowledge base (rule-based)
Inference engine
150
DDBMS with Data Inference
[Figure: query input enters a query parser and analyzer in the DDBMS, which consults an information module describing database fragments (allocation, availability); an inference system, consisting of a rule-based knowledge-based system and an inference engine, assists in producing the query output.]
151
Fault Tolerant DDBMS with Inference Systems
[Figure: three sites (SF, LA, NY), each with its own database DBi, knowledge base KBi, and inference engine IEi. Example KB rules: SHIP(SID) -> INSTALL(TYPE); INSTALL(TYPE) -> INSTALL(WEAPON).]
152
Architecture of Distributed Database with Inference
153
Motivation of Open Data Inference
Correlated knowledge is incomplete:
Incomplete rules
Incomplete objects
154
Example of Incomplete Object
Type -> Weapon
IF type in {CG, CGN} THEN weapon = SAM01
IF type = DDG THEN weapon = SAM02

TYPE   WEAPON
CG     SAM01
CGN    SAM01
DDG    SAM02
SSGN   ?

Result: incomplete rules generate an incomplete object.
155
Merge of Incomplete Objects
Observation:
The relational join is not adequate for combining incomplete objects; it loses information.

Questions:
What kind of algebraic tools do we need to combine incomplete objects without losing information?
What correctness criteria can evaluate the incomplete results?
156
Merge of Incomplete Objects
TYPE -> WEAPON and WEAPON -> WARFARE

Type   Weapon        Weapon  Warfare
CG     SAM01         SAM01   WF1C
CGN    SAM01         SAM03   WF1D
DDG    SAM02
SSGN   ?

Using the relational join to combine the two paths:
Type   Weapon  Warfare
CG     SAM01   WF1C
CGN    SAM01   WF1C

Another way to combine them:
TYPE   WEAPON  WARFARE
CG     SAM01   WF1C
CGN    SAM01   WF1C
DDG    SAM02   ?
?      SAM03   WF1D
SSGN   ?       ?
157
New Algebraic Tools for Incomplete Objects
S-REDUCTION: reduce redundant tuples in an object.
OPEN S-UNION: combine incomplete objects.
158
S-Reduction
Removes redundant tuples in an object. Object RR with key attribute A is reduced to R:

RR               R
A  B  C          A  B  C
a  1  aa         a  1  aa
b  2  _          b  2  bb
c  _  cc         c  _  cc
a  1  aa
b  _  bb
c  _  _
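A minimal sketch of S-reduction on the table above (None plays the role of the null "_"; the merge rule assumed here is that a null is subsumed by any known value for the same key):

```python
def s_reduce(rel, key=0):
    """Merge tuples that share the key attribute, field by field,
    letting known values subsume nulls (None)."""
    merged = {}
    for t in rel:
        prev = merged.get(t[key])
        if prev is None:
            merged[t[key]] = list(t)
        else:  # subsume nulls field by field
            merged[t[key]] = [p if v is None else v
                              for p, v in zip(prev, t)]
    return [tuple(t) for t in merged.values()]

# The slide's RR reduces to R = {(a,1,aa), (b,2,bb), (c,_,cc)}:
RR = [("a", 1, "aa"), ("b", 2, None), ("c", None, "cc"),
      ("a", 1, "aa"), ("b", None, "bb"), ("c", None, None)]
assert sorted(s_reduce(RR)) == [("a", 1, "aa"), ("b", 2, "bb"),
                                ("c", None, "cc")]
```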
159
Open S-Union
Modifies the join operation to accommodate incomplete information.
Used to combine closed/open objects.
R1 U^ R2 -> R

R1:           R2:              R:
sid   type    type  weapon     sid   type  weapon
s101  DD      DD    SAM01      s101  DD    SAM01
s102  DD      CG    -          s102  DD    SAM01
s103  CG                       s103  CG    -
160
Open S-Union and Toleration
Performing the open union on two objects R1 and R2 generates a third object R that tolerates both R1 and R2.

R1 U^ R2 -> R

R1:           R2:              R:
sid   type    type  weapon     sid   type  weapon
s101  DD      DD    SAM01      s101  DD    SAM01
s102  DD      CG    -          s102  DD    SAM01
s103  CG                       s103  CG    -

R tolerates R1
R tolerates R2
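The example above can be sketched as an outer-join-like operation (a simplified illustration of open s-union for this specific case, not the general definition): every R1 tuple is kept and extended with R2's weapon when the type matches, padded with a null otherwise, so no information is lost.

```python
def open_s_union(r1, r2):
    """r1: (sid, type) tuples; r2: (type, weapon) pairs.
    Extends each r1 tuple with the matching weapon, or None."""
    lookup = {t: w for t, w in r2}
    return [(sid, ty, lookup.get(ty)) for sid, ty in r1]

R1 = [("s101", "DD"), ("s102", "DD"), ("s103", "CG")]
R2 = [("DD", "SAM01"), ("CG", None)]
assert open_s_union(R1, R2) == [("s101", "DD", "SAM01"),
                                ("s102", "DD", "SAM01"),
                                ("s103", "CG", None)]
```

The result tolerates both inputs: projecting it back onto R1's or R2's attributes contradicts neither.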
161
Example of Open Inference
site LA: SHIP(sid, sname, class)
site SF: INSTALL(sid, weapon)
site NY: CLASS(class, type, tname)

Query: find the ship names that carry weapon ‘SAM01’ (assuming site SF is partitioned).

Rule: if SHIP TYPE = DD, then WEAPON = SAM01

[Figure: network of sites LA (SHIP), SF (INSTALL), and NY (CLASS), with SF cut off by a network partition.]
162
Implementation
Derive missing relations from accessible relations and correlated knowledge.
Three types of derivations:
A view mechanism to derive new relations from certain source relations.
Valuation of incomplete relations based on correlated knowledge.
Combining two intermediate results via the open s-union operation.
163
Example of Open Inference
DERIVATION 1: select sid, type from SHIP, CLASS
DERIVATION 2: CLASS(type) -> INSTALL(weapon)

R1 U^ R2 -> INSTALL_INF

R1:           R2:              INSTALL_INF:
sid   type    type  weapon     sid   type  weapon
s101  DD      DD    SAM01      s101  DD    SAM01
s102  DD      CG    -          s102  DD    SAM01
s103  CG                       s103  CG    -

INSTALL_INF can be used to replace the missing relation INSTALL.
164
Fault Tolerant DDBMS via Inference Techniques
Query Processing Under Network Partitioning
Open Inference: inference with incomplete information.
Algebraic tools for manipulating incomplete objects.
Toleration: a weaker correctness criterion for evaluating incomplete information.
165
Conclusion
Data inference is an effective method for providing database fault tolerance during network partitioning.
166
167
Active Database Systems
168
Active Database Systems
Ref: Modern Database Systems, Won Kim (ed.), Addison Wesley, 1994 (ch. 21).
Conventional databases are passive.
Active databases:
Monitor situations of interest.
Trigger a timely response when the situations occur.
e.g., inventory control systems monitor the quantity in stock; when the stock falls below a threshold, a reordering activity may be initiated.
Possible implementations:
Check the inventory level periodically (not timely).
Check the semantics of the condition in every program that updates the inventory database (a poor software engineering approach).
These methods are not general.
169
A General Approach: Active Databases
Express the derived behavior in rules (conditions => actions).
A rule can be shared by many application programs.
The implementation can be optimized.
Contains an inference engine which:
applies all the rules in the system,
matches the condition parts of the rules with the data in working memory,
selects and fires the rule that best matches the conditions.
The action part of a rule may modify the working memory, which may cause further actions; the cycle continues until no more rules match.

Rule form:
on event
if condition
then action

Rules can be triggered by the following events:
Database operations
Occurrence of a database state
Transitions between database states
170
Characteristics of Rules
Rules are:
Defined and stored in the database
Evaluated by the database system
Subject to:
Authorization
Concurrency control
Recovery
Rules can be used to:
Enforce integrity constraints
Implement triggers and actions
Maintain derived data
Enforce access constraints
Implement version control policies
Gather statistics for query optimization
Gather statistics for database reorganization
171
Rule Models and Languages
Rules are defined as metadata in the schema, together with tables, views, and integrity constraints.
Rule operations are provided to:
Add
Drop
Modify
Rules are structured objects with the components:
Events
Conditions
Actions
Special operations on rules are:
Fire (triggers a rule)
Enable (activates a rule)
Disable (deactivates a rule)
172
Triggering an event may be implicit or explicit.
Explicit: flexibility in expressing transactions, e.g., keep Bob and Alice's salaries the same:
If Alice's salary constraint changes, then change Bob's too.
If Bob's constraint is violated by changes from the user transaction, then abort the transaction.
Implicit - any change to the database that can cause the condition to become true is treated as a triggering event.
Languages vary based on the complexity of specified events, conditions and actions.
173
Event Specification
Explicitly triggered by database modification.

Relational database:
define rule monitor new events
on insert to employer
if … then …
where employer is a table of employee information

Object-oriented database:
define rule check raises
on employee salary raises
if … then …
where salary raises is a method defined by objects in an employee class
174
Event Specification (Cont’d)
Rules to be triggered by data retrieval:

define rule monitor sal access
on retrieve salary
from employee
if … then …
175
Temporal Events
absolute (8:00 on January 1, 1994)
relative (5 seconds after take-off)
periodic (17:00 every Friday)

In object-oriented databases, events can be:
Generic database operations (retrieve, insert, delete, update)
Type-specific operations (method invocation)
Operations on rule objects
Transaction operations
External events (messages or signals from devices)
A composition of events (disjunction, sequence, and repetition)

Events are defined to have formal parameters:
salary raise (e: employee, oldsal: integer, newsal: integer)
e is of type employee; oldsal and newsal are of type integer
176
Condition Specification
The condition is expressed by a predicate or query over data in the database.
The condition is satisfied if the predicate is true or if the query returns a non-empty answer.
Rule conditions are arbitrary predicates over database states; thus modified data can be referenced and transition conditions can be specified.
define rule monitor raise
on update to employee.salary
if employee.salary > 1.1 * old employee.salary
then ...
177
Action Specification
The action part of a rule usually performs inserts, deletes, or updates on data in the working memory, based on data matching the rule's condition:

define rule favor employee
on insert to employee
then delete employee e where e.name = employee.name

Whenever a new employee is inserted, this action deletes any existing employee with the same name.
Rule actions can also be:
Arbitrary database operations
Transaction operations
Rule operations
Signals for the occurrence of user-defined events
Calls to application procedures
178
Event-Condition-Action Binding
Database production rule languages may have explicitly specified events.
There is a need to specify different and varied conditions and actions.
The notion of binding is different from AI production rules (e.g., the triggering event of a rule may be parameterized, and the parameters may be referenced in the rule's condition and action).
e.g., if a rule is triggered by salary raise (e: employee; oldsal: integer; newsal: integer), then e in the condition or action refers to the employee object, and oldsal and newsal refer to the integers bound when the event occurred.
A composite event allows events (e.g., salary raise) to occur one or more times (sets of queries).
179
Event-Condition-Action Binding (cont’d)
Rules can be triggered by insertions, deletions, and/or updates on a particular table. For example:
create trigger DepDel
before deletion on department
when department.budget < 100,000
delete employee
where employee.dno = department.dno
The rule is triggered before the deletion of a department. The condition (where clause) and action refer to the department being deleted.
180
Keywords (old and new) can also be used.
define rule form new emps
on append to employee
then delete employee when employee.name = new.name

define rule Ave Too Big
on update to employee.salary
if (select ave(salary) from new updated) > 100
then rollback

Abort the transaction whenever the average of the updated salaries exceeds 100.
181
Rule Ordering
The choice of which rule to execute, when more than one has been triggered, is made:
arbitrarily
by numeric priorities
by partial ordering
For any two rules, one rule can be specified as having higher priority than the other, but a total ordering is not required.
In HiPAC, multiple triggered rules can be executed concurrently. The rules are relatively ordered for serialization (concurrency control).
182
Rule Organization
In relational systems:
Rules are defined in the schema.
Rules refer to particular tables.
Rules are subject to the same controls as other metadata objects (e.g., views, constraints).
If a table is dropped, all rules defined on it are no longer operative.
In HiPAC (object-oriented systems):
Rules are first-class objects.
Rules are organized in types like other objects.
Rules can be included in collections, which can be explicitly named or defined by queries, e.g.,
In Flight-Rule where effective date after 1/1/90
Flight-Rule is a rule type; effective date is an attribute defined for this rule type.
Collections of rules can be selectively activated or deactivated by the enable and disable operations.
183
Rule Execution Semantics
Rule behavior can be complex and unpredictable, so a precise execution semantics is important.
Using an inference engine alone is not adequate in a database system.
Active database rule processing must integrate with conventional database activities:
queries
modifications
transactions
which cause the rules to be triggered and initiate rule processing.
Granularity of rule processing:
sets of tuples: fire the rules after modifying each tuple, or after modifying all the tuples
sets of objects: fire the rules at the end of a transaction
Triggering of more than one rule by the same event: conflict resolution
Nested triggering: termination
184
Error Recovery
Causes of errors during execution of a database rule:
Data has been deleted
Data access privileges have been revoked
Deadlock arises from concurrently executing transactions
The system generates an error
The rule action has uncovered an error condition
Error recovery techniques:
Delete data and revoke access privileges, invalidating the corresponding rules
Abort the current rule processing when an error occurs
Propagate the failure to its parent, which may initiate other sub-transactions to repair the error
Handle error recovery during system crashes:
For recoverable events, their occurrences and parameter bindings must be reliably logged
For decoupled conditions and actions, restart uncommitted transactions
185
Implementation Issues
Active databases must provide mechanisms for
event detection and rule triggeringcondition testingrule action executionuser development of rule applications
186
Characteristics of Representative Systems
ARIEL (built using the Exodus database tool kit; Carey et al., 1991)
A rule manager/rule catalog for handling rule definition and manipulation tasks.
A rule execution monitor for maintaining the set of triggered rules and scheduling their execution.
A rule execution planner to produce an optimized execution strategy.
187
POSTGRES
Tuple-level processing (run-time approach):
Places a marker on each tuple for each rule that has a condition matching that tuple.
When a tuple is modified or retrieved, if the tuple has one or more markers on it, the rule or rules associated with the marker(s) are located and their actions are executed.
Markers are installed when rules are created.
Markers are deleted when rules are deleted.
188
POSTGRES (cont'd)
Query rewrite (compile-time approach):
A module is added between the command parser and the query processor to intercept user commands.
The command is augmented with additional commands reflecting the effect of rules (triggered by the original command), which may trigger other rules.
Recursive triggering may not terminate.
The compile-time approach can be more efficient than the run-time approach.
Which mechanism to use is selected when a rule is created.
189
STARBURST (IBM Research; Haas et al., 1990)
Extensibility:
Uses the attachment feature to monitor data.
Modifications are stored in a transaction log.
Rules are processed at the end of a transaction or a user command.
Triggered rules are indexed according to their priorities.
References to transition tables are implemented.
Mechanisms are provided for concurrency control, authorization, and crash recovery.
190
HiPAC (an object-oriented database system)
Dayal, U., et al., 1988, SIGMOD Record, Vol. 17, No. 1, pp. 51-70.
Rule manager:
coordinates rule processing
stores rule objects
implements operations
implements the coupling modes of the execution model
concurrency control and recovery for rule objects
Event detectors detect the different events.
Bi-directional interaction between application programs and the database rule system:
rules running inside the database can invoke application operations.
191
Rule Programming Support
Trace rule execution.
Display the current set of triggered rules.
Query and browse the set of rules.
Cross-reference rules and data.
Activate and deactivate selected rules for processing the database transactions.
192
Rule Termination
Impose sufficient syntactic restrictions on rule definitions (limit the expressiveness of the rules).
Impose a rule triggering limit. The limit is specified by the user or system default (needs to monitor the number of rules executed during the rule processing).
Detect if the same rule is triggered a second time with the same set of parameters.
193
Future Directions
Support for application development:
Application developers treat database rules as an assembly language.
Rules are generated automatically from high-level specifications.
Increased communication capabilities between database rules and applications.
Increasing the expressive power of rules
Improved algorithms
Distribution and parallelism:
Distribution and fragmentation of rules.
Algorithms that guarantee equivalence with centralized rule processing.
194
195
CoBase: Scalable and Extensible Cooperative
Information System
196
Conventional Query Answering
Needs to know the detailed database schema
Cannot give approximate answers
Cannot answer conceptual queries

Cooperative Query Answering
Derives approximate answers
Answers conceptual queries
197
[Figure: cooperative queries, e.g., “Find a seaport with railway facility in Los Angeles,” “Find a nearby friendly airport that can land an F-15,” and “Find hospitals with facilities similar to St. John's near LAX,” are answered by CoBase servers over heterogeneous information sources using domain knowledge. CoBase provides: relaxation, approximation, association, and explanation.]
198
Generalization and Specialization
[Figure: generalization maps a specific query to a more conceptual query; specialization maps a conceptual query back to a specific query.]
199
Type Abstraction Hierarchy (TAH)
Chemical-Suit Size TAH (a non-numerical TAH)
[Figure: TAH with root All_Sizes; intermediate nodes Small_Size (Very_Small, Small_to_Medium) and Large_Size (Large_to_Extra_Large, Very_Large); leaves XXXS, XXS, S, M, L, XL, XXL.]
Provides multi-level knowledge representations.
200
Type Abstraction Hierarchy (TAH) (Location Example)
[Figure: TAH with root CA; intermediate nodes N. CA, C. CA, and S. CA; leaf cities including San Jose, Palo Alto, Sacramento, Davis, San Diego, Long Beach, and LA.]
201
Relaxation Agent
Uses a knowledge-based approach (generalization and specialization via the Type Abstraction Hierarchy) to relax the following for matching:
query conditions
constraints
202
Query Relaxation
[Figure: a query is posed against the database; if no answers are found, an attribute is relaxed by query modification using TAHs and the query is retried; once answers are found, they are displayed.]
203
204
Visualization of Relaxation Process
Query: find seaports in the given region.
[Figure: a map showing the given region and the relaxed region.]
205
206
Relaxation Control Primitives
not-relaxable (e.g., runway-length)
relaxation-order (runway length, location)
preference-list
unacceptable-list
answer-size
relaxation-level
207
Relaxation Primitives
^ (approximate), e.g., ^9 am
between
near-to (context-sensitive), e.g., airport near-to LAX; restaurant near-to UCLA
similar-to, e.g., airport similar-to LAX based-on (traffic, runway)
within
208
Similar-to
Find all airports in Tunisia similar to the Bizerte airport, based on runway length and (more importantly) runway width.

select aport_name, runway_length, runway_width
from runways, countries
where aport_name similar-to ‘Bizerte’
  based-on ((runway_length 1.0) (runway_width 2.0))
  and country_state_name = ‘Tunisia’
  and countries.glc_cd = runways.glc_cd
209
Similar-to Result

APORT_NM   LENGTH  WIDTH  RANK
Bizerte     8000    148   0.00
El Borma    7200    144   0.09
Monastir    9700    137   0.20
Jerba      10171    148   0.24
Bjedeida    6000    122   0.27

The similar-to module ranks the returned answers according to mean-squared error.
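A hedged sketch of the ranking idea (the exact normalization and rank scale CoBase uses are not given on the slide, so this uses a weighted mean-squared relative error; it reproduces the ordering of the table, though not its rank values):

```python
def rank(target, candidate, weights):
    """Weighted mean-squared relative error of candidate vs. target."""
    return sum(w * ((c - t) / t) ** 2
               for t, c, w in zip(target, candidate, weights)) / sum(weights)

bizerte  = (8000, 148)                 # (runway_length, runway_width)
weights  = (1.0, 2.0)                  # width weighted more, as in the query
airports = {"El Borma": (7200, 144), "Monastir": (9700, 137),
            "Jerba": (10171, 148), "Bjedeida": (6000, 122)}

ranked = sorted(airports, key=lambda a: rank(bizerte, airports[a], weights))
# Same order as the slide's result table.
assert ranked == ["El Borma", "Monastir", "Jerba", "Bjedeida"]
```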
210
Unacceptable List Operator
[Figure: the constraint “Avoid Northern Tunisia!” is given to the CoBase relaxation manager, which trims the Tunisia TAH: the full TAH (NE Tunisia, NW Tunisia, Central Tunisia, SW Tunisia, with cities such as Bizerte, Gafsa, and El Borma) is reduced to a trimmed TAH containing only Central Tunisia and SW Tunisia (Gafsa, El Borma).]
211
TAH Generation for Numerical Attribute Values
Relaxation Error:
The difference between the exact value and the returned approximate value.
The expected error is weighted by the probability of occurrence of each value.
DISC (Distribution-Sensitive Clustering) is based on the attribute values and the frequency distribution of the data.
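The error measure can be sketched as follows (an assumed form, with the cluster representative taken as the expected value; the exact definition used by DISC is not given on the slide):

```python
def relaxation_error(values, freqs):
    """Expected absolute difference between a cluster's representative
    value and its members, weighted by each value's frequency."""
    total = sum(freqs)
    probs = [f / total for f in freqs]
    rep = sum(p * v for p, v in zip(probs, values))   # expected value
    return sum(p * abs(v - rep) for p, v in zip(probs, values))

# A tight cluster yields a lower relaxation error than a spread-out one,
# which is why DISC prefers clusters of nearby, frequent values:
assert relaxation_error([9, 10, 11], [1, 1, 1]) < \
       relaxation_error([1, 10, 19], [1, 1, 1])
```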
212
TAH Generation for Non-numerical Attribute Values
Pattern-Based Knowledge Induction (PBKI)
Rule-based approach: clusters attribute values into a TAH based on other attributes in the relation (i.e., inter-attribute relationships).
Provides an attribute correlation value (measures how well the rules apply to the database).
213
Type Abstraction Hierarchy (TAH)
(Figure: two TAHs. Runway Length: All splits into Short (0 ... 700), Medium (700 ... 1K), and Long (1K ... 5K). Location Name: Tunisia splits into NE Tunisia (Bizerte, Tunis, Djedeida), Central Tunisia, and SW Tunisia (El Borma, ...).)
Provide multi-level knowledge representations
214
Associative Query Answering
Provide relevant information not explicitly asked for by the user.
User Query: List all airports with runway length between 8500 and approximately 10000 feet.

Query Answers:
Airport Name   Runway Length (feet)
Jerba          10171
Monastir        9700
Tunis          10500

Associated Attributes and Answers (User Type = Pilot):
Weather  Runway Quality
Sunny    Good
Rain     Good
Foggy    Damaged

Associated Attributes and Answers (User Type = Planner):
Military or Civilian Flag   Refrigerated Storage Capacity (Tons)
CC                          0.00
C                           1000.00
215
Cooperative Geographical Information System
Features:
Spatial cooperative query answering (e.g., similar-to, near-to, approximate)
Relaxation of spatial constraints
Spatial inference and reasoning
Graphical representation of information on a map (relaxation process, query answers)
Methodology:
CoBase relaxation technology
Rule inference
216
Mediator Inter-Communications via KQML
(Figure: Mediator A (Module A) and Mediator B (Module B) exchange module objects through APIs. Each side stacks a CoBase Content Language (data, actions) on the CoBase Ontology, carried over KQML.)
217
CoBase and GLAD TIE
(Architecture figure: the GLAD side (Report Collection, Report Query Constructor, Filter Editor, Object Cache, Display Generator, Query Collection, Spatial Area Selection) connects to the CoBase side (CoBase Query Editor, CoBase Relaxation Manager, Knowledge Base, Data Cache) and to a Data Source Manager over databases and NSNs.)
218
CoGLAD Functionality
Provide approximate matching: find HETs with a capacity of approximately 5 tons
Provide conceptual query answering: find “Earth Moving” equipment
Provide context-sensitive spatial queries: find storage sites near a selected location (integration with the MATT map server)
Provide relaxation control: relaxation-order, not-relaxable, at-least (answer-set size, quantity on hand)
219
Cooperative Operations Added to GLAD
Implicit Query Relaxation
Explicit Query Relaxation:
approximate operator
similar-to/based-on
spatial relaxation
Relaxation Control:
relaxation-order
not-relaxable
at-least (answer-set size, quantity on hand)
220
CoBase Features Added to GLAD
Enhance GLAD queries with cooperative operators (similar-to, relaxation-order, etc.)
Display the query relaxation process:
modified query conditions (value, spatial)
type abstraction hierarchies
Rank returned answers with similarity measures: e.g., spatial relaxation ranks answers according to their distance from the selected location
221
Example Queries
Find HETs with a capacity of approximately 5 tons
Find seaports in Los Angeles
Find friendly nearby airports which can land an F-15
222
223
224
Query Answers Without CoBase
Query: find chemical suits
225
226
227
228
229
230
Electronic Warfare
Identify and locate sources of radiated electromagnetic energy.
Determine emitter type based on the operating parameters of observed signals:
Radio Frequency (RF)
Pulse Repetition Frequency (PRF)
Pulse Duration (PD)
Scan Period (SP)
other operating parameters
Determine platform sites near the line of bearing of an emitter.
This research is a joint effort between CoBase and Lockheed Martin Communication Systems (Russ Frew, et al.), Camden, NJ
231
Performance Improvement by Using CoBase in EW
             Conventional DB        CoBase
             Case 1    Case 2    Case 1    Case 2
identified   90.00%    30.00%   100.00%    85.90%
id/ranking  100.00%    36.00%   100.00%    98.80%
relaxation    0.00%     0.00%    95.90%    99.80%
Conventional DB: parameter ranges from emitter specificationsCoBase:
DB: peak parameters (RF,PRF) and parameter ranges (PD,SP)KB: TAHs based on RF and PRF peak parameters
TAHs based on PD and SP parameter rangesCase 1: emitter signals without noiseCase 2: add noise - PD & SP (10%), PRF (5%), RF (2.5%)Sample Size: 1000 signals Emitter Types: 75
232
Current CoBase Users and Applications
ARPI members (ISI, Unisys): enhance query capabilities in the transportation domain (ARPI TARGET): query relaxation, association, and explanation
UCLA KMeD Project (Medical School): improve search in medical images (X-rays, MRs): approximate matching of image features and contents, explanation of approximate-matching quality
Hughes Research Lab: integrate schemas in heterogeneous databases: approximate matching of attributes and views
Lockheed/Martin Marietta: emitter and platform identification: approximate matching of observed emitter signals, relaxation of regions to identify emitter platforms
BBN: enhance the DOD Logistics Anchor Desk (GLAD): query relaxation and spatial relaxation
233
Conclusions
Provide user- and context-sensitive query relaxation (structured and unstructured data)
Provide additional information (associative query answering) based on past cases
CoSQL (Cooperative SQL):
similar-to, near-to, approximate
relaxation control operators
GUI: map server, high-level query formation
234
Supervised vs. Unsupervised Learning
Supervised Learning: given instances with known class information, generate rules/decision trees that can be used to infer the class of future instances.
Examples: ID3, statistical pattern recognition
Unsupervised Learning: given instances with unknown class information, generate a concept tree that clusters instances into similar classes.
Examples: COBWEB, TAH Generation (DISC, PBKI)
235
Automatic Construction of TAHs
Necessary for scaling up CoBase.
Sources of Knowledge:
database instance (attribute value distributions, inter-attribute relationships)
domain expert
Approach:
generate an initial TAH with minimal expert effort
edit the hierarchy to suit the application context and user profile
236
Pattern-Based Knowledge Induction (PBKI)

For non-numerical attributes
Rule-based: clusters attribute values into a TAH based on other attributes in the relation
Provides an attribute correlation value
Moderate computational complexity
237
Definitions
The cardinality of a pattern P, denoted |P|, is the number of distinct objects that match P.

The confidence of a rule A → B, denoted conf(A → B), is

conf(A → B) = |A ∧ B| / |A|

Let A → B be a rule that applies to a relation R. The popularity of the rule over R is defined as

pop(A → B) = |A| / |R|
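A minimal sketch of these two measures over a relation stored as a list of dicts; the helper names are hypothetical, and the sample relation is the four-tuple table used in the PBKI example later in the deck.

```python
def matches(row, pattern):
    """True if the row satisfies every attribute = value pair in pattern."""
    return all(row[k] == v for k, v in pattern.items())

def confidence(rows, premise, consequence):
    """conf(A -> B) = |A and B| / |A|."""
    a = [r for r in rows if matches(r, premise)]
    if not a:
        return 0.0
    return sum(1 for r in a if matches(r, consequence)) / len(a)

def popularity(rows, premise):
    """pop(A -> B) = |A| / |R|: the fraction of tuples the rule applies to."""
    return sum(1 for r in rows if matches(r, premise)) / len(rows)

# The four-tuple relation from the PBKI example.
R = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b2", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a3", "B": "b2", "C": "c1"},
]
```

For example, conf(A = a1 → B = b1) = 0.5 and pop(A = a1 → B = b1) = 2/4.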
238
Knowledge Inference: A Three-Step Process
Step 1: Infer Rules
Consider all rules of the basic form A → B.
Calculate confidence and popularity.
Confidence measures how well a rule applies to the database.
A → B has a confidence of 0.75 means that if A holds, B has a 75% chance of holding as well.
Popularity measures how often a rule applies to the database.
A → B has a popularity of 10 means that it applies to 10 tuples in the database (A holds for 10 tuples).
239
Knowledge Inference (cont’d)
Step 2: Combine Rules
If two rules share a consequence and have the same attribute as a premise (with different values), then those values are candidates for clustering.
Color = red → style = “sport” (confidence ρ1)
Color = black → style = “sport” (confidence ρ2)
This suggests red and black should be clustered.
The correlation is the product of the confidences of the two rules:
ρ = ρ1 × ρ2
240
Clustering (Greedy Algorithm)

Algorithm: Binary Cluster
repeat
  induce rules, compute correlations ρ, and sort them in descending order
  for each pair (ai, aj)
    if ai and aj are unclustered
      replace ai and aj in the DB with the joint value Ji,j
until fully clustered

Approximate n-ary clustering using binary clustering: cluster a set of n values if the correlation ρ between all pairs is above a threshold; then decrease the threshold and repeat.
241
Knowledge Inference (cont’d)
Step 3: Combine Correlations
The clustering correlation between two values is the weighted sum of their correlations. It combines all the evidence that two values should be clustered together into a single number cor(a1, a2):

cor(a1, a2) = Σ (i = 1 to m) wi × conf(A = a1 → Bi = bi) × conf(A = a2 → Bi = bi)

where a1, a2 are values of attribute A, and there are m attributes B1, …, Bm in the relation with corresponding weights w1, …, wm.
242
Pattern-Based Knowledge Induction (Example)

A   B   C
a1  b1  c1
a1  b2  c1
a2  b1  c1
a3  b2  c1

Rules:
A = a1 → B = b1  confidence = 0.5
A = a2 → B = b1  confidence = 1.0
A = a1 → C = c1  confidence = 1.0
A = a2 → C = c1  confidence = 1.0

correlation (a1, a2) = (0.5 × 1.0 + 1.0 × 1.0) / 2 = 0.75
correlation (a1, a3) = 0.75
correlation (a2, a3) = 0.5
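The correlation computation above can be sketched as follows, assuming equal weights 1/m over the other attributes, as the example does (function names are hypothetical):

```python
def conf(rows, attr, val, battr, bval):
    """conf(attr = val -> battr = bval) over a list of dict-rows."""
    a = [r for r in rows if r[attr] == val]
    if not a:
        return 0.0
    return sum(1 for r in a if r[battr] == bval) / len(a)

def correlation(rows, attr, v1, v2):
    """Clustering correlation between two values of `attr`: for each other
    attribute B, sum conf(v1 -> b) * conf(v2 -> b) over the values b that
    v1 and v2 share, then average over the m other attributes."""
    others = [k for k in rows[0] if k != attr]
    total = 0.0
    for b in others:
        shared = ({r[b] for r in rows if r[attr] == v1}
                  & {r[b] for r in rows if r[attr] == v2})
        total += sum(conf(rows, attr, v1, b, bv) * conf(rows, attr, v2, b, bv)
                     for bv in shared)
    return total / len(others)  # equal weights 1/m

# The relation from the slide.
R = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b2", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a3", "B": "b2", "C": "c1"},
]
```

This reproduces the slide's values: 0.75, 0.75, and 0.5.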
243
Pattern-Based Knowledge Induction (cont’d)
2nd iteration: a1 and a2 are merged into the joint value a12.

A    B   C
a12  b1  c1
a12  b2  c1
a12  b1  c1
a3   b2  c1

A = a12 → B = b2  confidence = 0.33
A = a3  → B = b2  confidence = 1.0
A = a12 → C = c1  confidence = 1.0
A = a3  → C = c1  confidence = 1.0

correlation (a12, a3) = (0.33 × 1.0 + 1.0 × 1.0) / 2 = 0.67

(Resulting TAH fragment: a1 and a2 cluster at correlation 0.75; the cluster then joins a3 at 0.67.)
244
Example for Non-Numerical Attribute Values: The PEOPLE Relation
245
TAH for People
246
cor(a12, a3) is computed as follows:
Attribute origin: same (Holland), contributes 1.0
Attribute hair: same, contributes 1.0
Attribute eye: different, contributes 0.0
Attribute height: overlap on MEDIUM (5/10 of a12 and 2/2 of a3), contributes 5/10 × 2/2 = 0.5
cor(a12, a3) = 1/4 × (1 + 1 + 0 + 0.5) = 0.63
247
Correlation Computation
Compute the correlation between European and Asian.
Attributes ORIGIN and HAIR COLOR: no overlap between Europe and Asia, so no contribution to the correlation.
Attribute EYE COLOR: BROWN is the only value with overlap. 1 out of 24 Europeans has BROWN; 12 out of 12 Asians have BROWN. EYE COLOR contributes 1/24 × 12/12 = 0.0416.
Attribute HEIGHT: SHORT: 5/24 of Europeans and 8/12 of Asians. MEDIUM: 11/24 and 3/12. TALL: 8/24 and 1/12. HEIGHT contributes 5/24 × 8/12 + 11/24 × 3/12 + 8/24 × 1/12 = 0.2812.
Total contribution = 0.0416 + 0.2812 = 0.3228. Correlation = 1/4 × 0.3228 = 0.0807.
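The arithmetic above can be checked directly with exact fractions (24 Europeans, 12 Asians; equal weights 1/4 over the four attributes; each shared value contributes the product of the fractions of each group carrying it):

```python
from fractions import Fraction as F

origin = hair = F(0)                               # no shared values
eye    = F(1, 24) * F(12, 12)                      # BROWN is the only overlap
height = (F(5, 24) * F(8, 12)                      # SHORT
          + F(11, 24) * F(3, 12)                   # MEDIUM
          + F(8, 24) * F(1, 12))                   # TALL
total = origin + hair + eye + height               # total contribution
corr = total / 4                                   # equal weights 1/4
```

This gives total ≈ 0.3229 and correlation ≈ 0.0807, matching the slide to rounding.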
248
Related Work
Maximum Entropy (ME) method: maximization of entropy (-Σ p log p); only considers the frequency distribution.
Biggest Gap.
Conceptual clustering systems: only allow non-numerical values (COBWEB) or assume a certain distribution (CLASSIT).
249
Conventional Clustering Methods: I. Maximum Entropy (ME)
Maximization of entropy (-Σ p log p); only considers the frequency distribution.
Example: {1,1,2,99,99,100} and {1,1,2,3,100,100} have the same entropy (frequencies 2/6, 1/6, 2/6, 1/6), so ME cannot distinguish between {1,1,2},{99,99,100} (a good partition) and {1,1,2},{3,100,100} (a bad partition).
ME does not consider the value distribution, so its clusters have no semantic meaning.
250
Conventional Clustering Methods: II. Biggest Gap (BG)
Considers only the value distribution: finds cuts at the biggest gaps.
{1,1,1,10,10,20} is partitioned into {1,1,1,10,10} and {20} (bad).
A good partition: {1,1,1} and {10,10,20}.
251
Distribution Sensitive Clustering (DISC)
Goal: automatic generation of TAH for a numerical attribute.
Task: given a numerical attribute and a number s, find the “optimal” s-1 cuts that partition the attribute into s sub-clusters.
Need a measure for optimality of clustering.
252
Clustering Measure: Quality of Approximate Answers
253
DISC
For numeric domains; uses intra-attribute knowledge.
Sensitive to both the frequency and value distributions of the data.
RE = average difference between exact and approximate answers in a cluster.
The quality of approximate answers is measured by the relaxation error (RE): the smaller the RE, the better the approximate answer.
DISC (Distribution Sensitive Clustering) generates TAHs based on minimization of RE.
254
Example of Relaxation Error Computation
(Figure: cluster A = {1, 2, 3, 4, 5} is partitioned into B = {1, 2, 3} and C = {4, 5}.)

For B: (1/3)(0+1+2)/3 = 3/9, (1/3)(1+0+1)/3 = 2/9, (1/3)(2+1+0)/3 = 3/9
255
Relaxation Error:
RE(B) = average pairwise difference = 3/9 + 2/9 + 3/9 = 8/9 ≈ 0.89
RE(C) = 0.5
RE(A) = 2.08

correlation (B) = 1 - RE(B)/RE(A) = 1 - 0.89/2.08 = 0.57
correlation (C) = 1 - 0.5/2.08 = 0.76
correlation (A) = 1 - 2.08/2.08 = 0
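A minimal sketch of the per-cluster relaxation error: the average absolute difference over all ordered pairs of values, which weights each value by its frequency of occurrence. (RE(A) = 2.08 above depends on A's full frequency distribution, which the slide does not list, so it is not reproduced here.)

```python
from itertools import product

def relaxation_error(cluster):
    """RE of a cluster: average |x - y| over all ordered pairs of values.

    Listing a value several times weights it by its frequency.
    """
    n = len(cluster)
    return sum(abs(x - y) for x, y in product(cluster, repeat=2)) / (n * n)
```

Applied to B = {1, 2, 3} and C = {4, 5}, this reproduces 8/9 ≈ 0.89 and 0.5.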
256
Relaxation Error Comparisons Among ME, BG, and DISC
Example 1: {1,1,2,3,100,100}
ME: {1,1,2},{3,100,100}
RE({1,1,2}) = (0+1+0+1+1+1)/9 = 0.44
RE({3,100,100}) = 388/9 = 43.11
RE({1,1,2},{3,100,100}) = 0.44 × 3/6 + 43.11 × 3/6 = 21.78
Ours: RE({1,1,2,3},{100,100}) = 0.58

Example 2: {1,1,1,10,10,20}
BG: {1,1,1,10,10},{20}
RE({1,1,1,10,10},{20}) = 3.6
Ours: RE({1,1,1},{10,10,20}) = 2.22
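These comparisons can be reproduced with a small sketch that weights each cluster's RE by its share of the values:

```python
from itertools import product

def relaxation_error(cluster):
    # Average absolute difference over all ordered pairs in the cluster.
    n = len(cluster)
    return sum(abs(x - y) for x, y in product(cluster, repeat=2)) / (n * n)

def partition_re(clusters):
    # RE of a partition: cluster REs weighted by cluster size, i.e. P(Ck).
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * relaxation_error(c) for c in clusters)
```

This yields 21.78 vs. 0.58 for Example 1 and 3.6 vs. 2.22 for Example 2, matching the slide.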
257
An Example
Example:
The table SHIPS has 153 tuples and the attribute LENGTH has 33 distinct values ranging from 273 to 947. DISC and ME are used to cluster LENGTH into three sub-concepts: SHORT, MEDIUM, and LONG.
258
An Example (cont’d)
Cuts by DISC: between 636 and 652, and between 756 and 791; average gap = 25.5
Cuts by ME: between 540 and 560, and between 681 and 685 (a bad cut); average gap = 12
Optimal cuts by exhaustive search: between 605 and 635, and between 756 and 791; average gap = 32.5
DISC is more effective than ME in discovering relevant concepts in the data.
259
An Example
(Figure: clustering of SHIP.LENGTH by DISC and ME. Cuts by DISC: - - -; cuts by ME: - . - .)
260
Quality of Partitions
If RE(C) is too big, we can partition C into smaller clusters.
The goodness measure for partitioning C into m sub-clusters {C1, …, Cm} is the relaxation error reduction per cluster (category utility, CU):

CU = [ RE(C) - Σ (k = 1 to m) P(Ck) RE(Ck) ] / m

For efficiency, use binary partitions to obtain m-ary partitions.
(Figure: C is partitioned into C1, …, Cm to maximize the RE reduction; sub-clusters are further partitioned.)
261
The Algorithms DISC and BinaryCut

Algorithm DISC(C)
  if the number of distinct values in C < T, return  /* T is a threshold */
  let cut = the best cut returned by BinaryCut(C)
  partition the values in C based on cut
  let the resultant sub-clusters be C1 and C2
  call DISC(C1) and DISC(C2)

Algorithm BinaryCut(C)  /* input cluster C = {x1, …, xn} */
  for h = 1 to n - 1  /* evaluate each cut */
    let P be the partition with clusters C1 = {x1, …, xh} and C2 = {xh+1, …, xn}
    compute the category utility CU for P
    if CU > MaxCU then
      MaxCU = CU, cut = h  /* the best cut */
  return cut as the best cut
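A runnable sketch of BinaryCut and DISC under the definitions above, with CU computed using P(Ck) = |Ck|/|C| and input values assumed sorted. On the earlier examples it picks the "good" partitions.

```python
from itertools import product

def relaxation_error(c):
    # Average absolute difference over all ordered pairs of values.
    n = len(c)
    return sum(abs(x - y) for x, y in product(c, repeat=2)) / (n * n)

def category_utility(parent, parts):
    # RE reduction per sub-cluster, with P(Ck) = |Ck| / |C|.
    n = len(parent)
    reduction = relaxation_error(parent) - sum(
        len(p) / n * relaxation_error(p) for p in parts)
    return reduction / len(parts)

def binary_cut(values):
    # Return the index h whose split {x1..xh}, {xh+1..xn} maximizes CU.
    best_h, best_cu = 1, float("-inf")
    for h in range(1, len(values)):
        cu = category_utility(values, [values[:h], values[h:]])
        if cu > best_cu:
            best_cu, best_h = cu, h
    return best_h

def disc(values, threshold=2):
    # Recursively split until a cluster has fewer distinct values than the
    # threshold T; the nested lists form a binary TAH.
    if len(set(values)) < threshold:
        return values
    h = binary_cut(values)
    return [disc(values[:h], threshold), disc(values[h:], threshold)]
```

For example, binary_cut([1,1,2,3,100,100]) returns 4 (the split {1,1,2,3}, {100,100}) and binary_cut([1,1,1,10,10,20]) returns 3 (the split {1,1,1}, {10,10,20}), the partitions the slides call good.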
262
The N-ary Partition Algorithm

Algorithm N-aryPartition(C)
  let C1 and C2 be the two sub-clusters of C
  compute CU for the partition C1, C2
  for N = 2 to n - 1
    let Ci be the sub-cluster of C with maximum relaxation error
    call BinaryCut to find the best sub-clusters Ci1 and Ci2 of Ci
    compute and store CU for the partition C1, …, Ci-1, Ci1, Ci2, Ci+1, …, CN
    if the current CU is less than the previous CU
      stop
    else
      replace Ci by Ci1 and Ci2
  /* the result is an N-ary partition of C */
263
Using TAHs for Approximate Query Answering
select CARGO-ID
from CARGOS
where SQUARE-FEET = 300
and WEIGHT = 740

→ no answers. The query is relaxed according to TAHs.
264
Approximate Query Answering Example

select CARGO-ID
from CARGOS
where 294 < SQUARE-FEET < 300
and 737 < WEIGHT < 741

CARGO-ID  SQUARE-FEET  WEIGHT
10        296          740

Relaxation error = (4/11.95 + 0)/2 = 0.168

Further Relaxation:
select CARGO-ID
from CARGOS
where 294 < SQUARE-FEET < 306
and 737 < WEIGHT < 749

CARGO-ID  SQUARE-FEET  WEIGHT
10        296          740
21        301          737
30        304          746
44        306          745

Relaxation error = (3.75/11.95 + 3.5/9.88)/2 = 0.334
265
Relationship between DISC and ME
Theorem: Let D and M be the optimal binary cuts by DISC and ME respectively. If the data distribution is symmetrical with respect to the median, then D = M (i.e., the cuts determined by DISC and ME are the same).
For skewed distributions, clusters discovered by DISC have less relaxation error than those by the ME method.
The more skewed the data, the greater the performance difference between DISC and ME.
266
Multi-Attribute TAH (MTAH)
In many applications, concepts need to be characterized by multiple attributes, e.g., nearness of geographical locations.
An MTAH serves:
• as a guidance for query modification
• as a “semantic index”
267
Multi-Attribute TAH (MTAH)
268
Multi-Attribute DISC (M-DISC) Algorithm

Algorithm M-DISC(C)
  if the number of objects in C < T, return  /* T is a threshold */
  for each attribute a = 1 to m
    for each possible binary cut h
      compute CU for h
      if CU > MaxCU then  /* remember the best cut */
        MaxCU = CU, BestAttribute = a, cut = h
  partition C based on cut of the attribute BestAttribute
  let the resultant sub-clusters be C1 and C2
  call M-DISC(C1) and M-DISC(C2)
269
Greedy M-DISC Algorithm: gM-DISC

Algorithm gM-DISC(C)
  if the number of objects in C < T, return  /* T is a threshold */
  for each attribute a = 1 to m
    for each possible binary cut h
      compute REa for h
      if REa > MaxRE then  /* remember the best cut */
        MaxRE = REa, BestAttribute = a, cut = h
  partition C based on cut of the attribute BestAttribute
  let the resultant sub-clusters be C1 and C2
  call gM-DISC(C1) and gM-DISC(C2)
270
MTAH of RECTANGLES (Height, Width)
271
The Database Table AIRCRAFT
272
MTAH for AIRCRAFT
273
Example for Numerical Attribute Value
Motor Data from PartNet (http://PartNet)
274
TAH for Motor Capability
275
TAH for Motor Size and Weight
276
TAHs for Motor
The Motor table was adapted from Housed Torque data from PartNet. After inputting the data, two TAHs were generated automatically by the DISC algorithm.
One TAH was based on peak torque, peak torque power, and motor constant. The other was based on outer diameter, length, and weight. The leaf nodes represent part numbers; the intermediate nodes are classes. The relaxation error (average pairwise distance between the parts) of each node is also given.
277
Application of TAHsThe TAHs can be used jointly to satisfy attributes in both TAHs. For example, find part similar to “T-0716” in terms of peak torque, peak torque power, motor constant, outer diameter, length, and weight. By examining both TAHs, we know that QT-0701 is similar to T-0716 with an expected relaxation error of (0.06 + 0.1)/2 = 0.08
278
Performance of TAH
Performance measures:

accuracy = retrieved relevant answers / all relevant answers

efficiency = retrieved relevant answers / all retrieved answers

where “all relevant answers” are the best n answers determined by exhaustive search.
Compare an MTAH with a traditional 2-d index tree (based on frequency distribution).
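The two measures read directly as set operations; a small sketch (function names are not from the slides):

```python
def accuracy(retrieved, relevant):
    # retrieved relevant answers / all relevant answers (recall-like)
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def efficiency(retrieved, relevant):
    # retrieved relevant answers / all retrieved answers (precision-like)
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))
```

Exhaustive search retrieves everything, so its accuracy is 1.0 while its efficiency is tiny, which matches the E-S column on the next slide.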
279
Performance of MTAHs
Based on the attributes Longitude and Latitude of 972 geographical locations from a transportation database.
500 queries of the form “find the n locations nearest to (long, lat)”, where n is randomly selected from 1 to 20, and long and lat are generated based on the distributions of the geographical locations.

            MTAH   GMTAH  ME-Tree  E-S
efficiency  0.54   0.53   0.64     0.011
accuracy    0.85   0.84   0.68     1.0
error       1.14   1.17   1.57     1.0

MTAH is more accurate than the 2-d tree.
MTAH is more efficient than exhaustive search.