Schema Tuning
Outline
• Database design: Normalization
  – Problem of redundancy – why?
  – Functional dependency – how to solve?
  – Decomposition – objective of the decomposition?
  – Normalized relations; normal forms
• Schema refinement
  – Vertical partitioning and antipartitioning
  – Denormalization
  – Horizontal decomposition
Design Steps
• The design steps:
  1. Real world
  2. ER model
  3. Relational schema
  4. Better relational schema
  5. Relational DBMS
• Guideline for step (3) to step (4): "design theory" for relations, or "normalization theory".
• Why normalization theory?
  – Automatic mappings from ER to relations may not produce the best relational design possible.
  – Database designers may go directly from (1) to (3), in which case the relational design can be really bad.
Database Schema
• A relation schema is a relation name and a set of attributes: R(a int, b varchar(20));
• A relation instance for R is a set of records over the attributes in the schema for R.
The Evils of Redundancy
• Redundancy is the root of several problems associated with relational schemas:
  – redundant storage
  – insert/delete/update anomalies
• Consider the relation Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked).
• Notation: we will denote this relation schema by listing the attributes: SNLRWH. This is really the set of attributes of Hourly_Emps, {S, N, L, R, W, H}.
Example
• Problems due to R → W:
  – Update anomaly: can we change W in just the 1st tuple of SNLRWH?
  – Insertion anomaly: suppose W is not allowed to be null. What if we want to insert an employee and don't know the hourly wage for his rating?
  – Deletion anomaly: if we delete all employees with rating 5, we lose the information about the wage for rating 5!
S            N          L   R  W   H
123-22-3666  Attishoo   48  8  10  40
231-31-5368  Smiley     22  8  10  30
131-24-3650  Smethurst  35  5   7  30
434-26-3751  Guldu      35  5   7  32
612-67-4134  Madayan    35  8  10  40
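The deletion anomaly can be made concrete. The following is a minimal sketch using SQLite with the SNLRWH data above (the table and column names follow the slide; they are illustrative, not a real application schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Hourly_Emps
    (ssn TEXT, name TEXT, lot INT, rating INT, hrly_wages INT, hrs_worked INT)""")
con.executemany("INSERT INTO Hourly_Emps VALUES (?,?,?,?,?,?)", [
    ("123-22-3666", "Attishoo",  48, 8, 10, 40),
    ("231-31-5368", "Smiley",    22, 8, 10, 30),
    ("131-24-3650", "Smethurst", 35, 5,  7, 30),
    ("434-26-3751", "Guldu",     35, 5,  7, 32),
    ("612-67-4134", "Madayan",   35, 8, 10, 40),
])

# Deleting every rating-5 employee also deletes the only record
# of the wage that rating 5 pays -- the deletion anomaly.
con.execute("DELETE FROM Hourly_Emps WHERE rating = 5")
rows = con.execute(
    "SELECT hrly_wages FROM Hourly_Emps WHERE rating = 5").fetchall()
print(rows)  # [] -- the wage for rating 5 is gone
```

Because the rating-to-wage association is stored only inside employee tuples, removing the tuples removes the association.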
Functional Dependencies (FDs)
• A functional dependency X → Y holds over relation R if, for every allowable instance r of R: given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.)
• K is a key for relation R if:
  1. K determines all attributes of R.
  2. For no proper subset of K is (1) true.
• If K satisfies only (1), then K is a superkey.
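The definition translates directly into a check over an instance. Here is a minimal sketch, where a relation instance is a list of dicts and `fd_holds` is an illustrative helper (not from the slides):

```python
# Check whether the FD X -> Y holds on instance r: if two tuples
# agree on the X values, they must also agree on the Y values.
def fd_holds(r, X, Y):
    seen = {}
    for t in r:
        xv = tuple(t[a] for a in X)
        yv = tuple(t[a] for a in Y)
        if xv in seen and seen[xv] != yv:
            return False  # two tuples agree on X but differ on Y
        seen[xv] = yv
    return True

r = [
    {"ssn": "123", "rating": 8, "wage": 10},
    {"ssn": "231", "rating": 8, "wage": 10},
    {"ssn": "131", "rating": 5, "wage": 7},
]
print(fd_holds(r, ["rating"], ["wage"]))  # True: rating -> wage holds here
print(fd_holds(r, ["wage"], ["ssn"]))     # False: two wage-10 tuples differ on ssn
```

Note that an instance can only *refute* an FD; an FD is a statement about all allowable instances, so a single instance that satisfies it is not a proof.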
Example
• Consider the relation Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked).
• An FD whose left side is a key: ssn is the key, so S → SNLRWH.
• FDs give more detail than the mere assertion of a key: rating determines hrly_wages, i.e., R → W.
S            N          L   R  W   H
123-22-3666  Attishoo   48  8  10  40
231-31-5368  Smiley     22  8  10  40
131-24-3650  Smethurst  35  5   7  30
434-26-3751  Guldu      35  5   7  30
612-67-4134  Madayan    35  8  10  40
Decomposition of a Relation Scheme
• A decomposition of R(A1, ..., An) consists of replacing R by two or more relations such that:
  – Each new relation scheme contains a subset of the attributes of R (so no attribute appears that is not in R).
  – Every attribute of R appears as an attribute of at least one of the new relations.
• E.g., SNLRWH can be decomposed into SNLRH and RW.
Problems with Decompositions
• Three potential problems to consider:
  – Performance: some queries become more expensive.
    • e.g., how much did sailor Joe earn? (salary = W*H)
  – Lossless-join decompositions: given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation! (Fortunately, not in the SNLRWH example.)
  – Dependency-preserving decompositions: checking some dependencies may require joining the instances of the decomposed relations. (Fortunately, not in the SNLRWH example.)
• Tradeoff: must weigh these issues against redundancy.
Lossless Join Decompositions
• Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F: πX(r) ⋈ πY(r) = r
• It is always true that r ⊆ πX(r) ⋈ πY(r). In general, the other direction does not hold! If it does, the decomposition is lossless-join.
• The definition extends to decomposition into 3 or more relations in a straightforward way.
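The SNLRWH example can be checked directly: project the instance onto SNLRH and RW, take the natural join, and compare against the original. A minimal sketch (the `project` and `natural_join` helpers are illustrative):

```python
rows = [
    ("123-22-3666", "Attishoo",  48, 8, 10, 40),
    ("231-31-5368", "Smiley",    22, 8, 10, 30),
    ("131-24-3650", "Smethurst", 35, 5,  7, 30),
    ("434-26-3751", "Guldu",     35, 5,  7, 32),
    ("612-67-4134", "Madayan",   35, 8, 10, 40),
]
attrs = ("S", "N", "L", "R", "W", "H")
r = [dict(zip(attrs, t)) for t in rows]

def project(rel, X):
    # pi_X(rel): keep only attributes X; a set, so duplicates collapse
    return {tuple((a, t[a]) for a in X) for t in rel}

def natural_join(r1, r2):
    # join two sets of (attr, value)-tuple records on common attributes
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            common = set(d1) & set(d2)
            if all(d1[a] == d2[a] for a in common):
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

snlrh = project(r, ["S", "N", "L", "R", "H"])
rw    = project(r, ["R", "W"])
joined = natural_join(snlrh, rw)
original = {tuple(sorted(t.items())) for t in r}
print(joined == original)  # True: lossless on this instance
```

The join comes back exactly because R → W holds: each R value in SNLRH matches exactly one W value in RW, so no spurious tuples are created.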
Dependency Preserving Decomposition
• Consider CSJDPQV, C is key, JP →C and SD → P.– BCNF decomposition: CSJDQV and SDP
– Problem: Checking JP → C requires a join!
• Dependency-preserving decomposition (intuitive): if R is decomposed into X, Y and Z, and we enforce the FDs that hold on X, on Y and on Z, then all FDs that were given to hold on R must also hold.
  – No join is necessary to enforce the original FDs.
Normalization
• A relation is normalized if every interesting functional dependency X → A involving attributes in R has the property that X is a key of R.
  – This is BCNF.
  – Kent: "Each attribute must describe [an entity or relationship identified by] the key, the whole key, and nothing but the key."
• No partial dependencies
• No transitive dependencies
BCNF
[Figure: BCNF rules out both partial dependencies and transitive dependencies.]
Example
– Schema 1 (not normalized):
  OnOrder1(supplier_id, part_id, quantity, supplier_address)
– Schema 2 (normalized):
  OnOrder2(supplier_id, part_id, quantity);
  Supplier(supplier_id, supplier_address);
Example #1
• Suppose that a bank associates each customer with his or her home branch. Each branch is in a specific legal jurisdiction.
  – Is the relation R(customer, branch, jurisdiction) normalized?
Example #1
• What are the functional dependencies?
  – customer → branch
  – branch → jurisdiction
  – customer → jurisdiction
• Customer is the key, but a functional dependency (branch → jurisdiction) exists where customer is not involved.
• R is not normalized.
Example #2
• Suppose that a doctor can work in several hospitals and receives a salary from each one. – Is R(doctor, hospital, salary) normalized?
Example #2
• What are the functional dependencies?
  – doctor, hospital → salary
• The key is (doctor, hospital).
• The relation is normalized.
Example #3
• Same relation R(doctor, hospital, salary) and we add the attribute primary_home_address. Each doctor has a primary home address and several doctors can have the same primary home address. Is R(doctor, hospital, salary, primary_home_address) normalized?
Example #3
• What are the functional dependencies?
  – doctor, hospital → salary
  – doctor → primary_home_address
  – doctor, hospital → primary_home_address
• The key is (doctor, hospital); however, doctor (a proper subset of the key) determines an attribute, hence R is not normalized.
• A normalized decomposition would be:
  – R1(doctor, hospital, salary)
  – R2(doctor, primary_home_address)
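A minimal SQLite sketch of this decomposition (the data values are made up for illustration): the address is stored once per doctor, however many hospitals the doctor works in, and the full picture is recovered with a join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE R1 (doctor TEXT, hospital TEXT, salary INT,
                                PRIMARY KEY (doctor, hospital))""")
con.execute("""CREATE TABLE R2 (doctor TEXT PRIMARY KEY,
                                primary_home_address TEXT)""")
con.execute("INSERT INTO R1 VALUES ('Lee','General', 90), ('Lee','Mercy', 40)")
con.execute("INSERT INTO R2 VALUES ('Lee','12 Elm St')")

# One address row serves both hospital rows; updating the address
# touches exactly one tuple (no update anomaly).
row = con.execute("""SELECT hospital, salary, primary_home_address
                     FROM R1 JOIN R2 USING (doctor)
                     WHERE doctor = 'Lee'""").fetchall()
print(row)
```

In the unnormalized R, the address would be repeated in every (doctor, hospital) tuple, reintroducing the redundancy problems above.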
Normal Forms
• Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!
• If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.
Practical Schema Design
• Identify entities in the application (e.g., doctors, hospitals, suppliers).
• Each entity has attributes (a hospital has an address, a jurisdiction, ...).
• There are two constraints on attributes:
  – An attribute cannot have attributes of its own.
  – The entity associated with an attribute must functionally determine that attribute.
Practical Schema Design
• Each entity becomes a relation.
• To those relations, add relations that reflect relationships between entities.
  – WorksIn(doctor_ID, hospital_ID)
• Identify the functional dependencies among all attributes and check that the schema is normalized:
  – If the functional dependency AB → C holds, then A, B and C should be part of the same relation.
Vertical Partitioning
• Three attributes: account_ID, balance, address.
• Functional dependencies:
  – account_ID → balance
  – account_ID → address
• Two normalized schema designs:
  – (account_ID, balance, address)
  or
  – (account_ID, balance)
  – (account_ID, address)
• Which design is better?
Vertical Partitioning
• Which design is better depends on the query pattern:
  – The application that sends a monthly statement is the principal user of the address of the owner of an account.
  – The balance is updated or examined several times a day.
• The second schema might be better because the relation (account_ID, balance) can be made smaller:
  – More (account_ID, balance) pairs fit in memory, thus increasing the hit ratio.
  – A scan performs better because there are fewer pages.
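The partitioned design can be sketched in SQLite (table and column names here are illustrative): balance lookups touch only the narrow table, and the monthly-statement job pays the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE AccountBalance
               (account_ID INT PRIMARY KEY, balance REAL)""")
con.execute("""CREATE TABLE AccountAddress
               (account_ID INT PRIMARY KEY, address TEXT)""")
con.execute("INSERT INTO AccountBalance VALUES (1, 250.0)")
con.execute("INSERT INTO AccountAddress VALUES (1, '9 Oak Ave')")

# Hot path, several times a day: only the narrow table is read.
bal = con.execute(
    "SELECT balance FROM AccountBalance WHERE account_ID = 1").fetchone()[0]

# Cold path, once a month: the statement job joins the two partitions.
stmt = con.execute("""SELECT balance, address
                      FROM AccountBalance JOIN AccountAddress
                           USING (account_ID)
                      WHERE account_ID = 1""").fetchone()
print(bal, stmt)
```

Both designs are normalized; the choice is purely a physical-design tradeoff driven by access frequencies.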
Tuning Normalization
• Rule of thumb:
  – A single normalized relation XYZ is better than two normalized relations XY and XZ if X, Y and Z are frequently accessed together.
    • Why? No costly join operations.
  – The two-relation design is better iff:
    • User accesses tend to partition between the two sets Y and Z most of the time, or
    • Attributes Y or Z have large values (one-third the page size or larger), causing too many accesses across pages.
Vertical Partitioning and Scan
• R(X, Y, Z)
  – X is an integer
  – Y and Z are large strings
• Scan query.
• Vertical partitioning exhibits poor performance when X, Y and Z are accessed together.
• Vertical partitioning provides a speedup if only two of the attributes (X and Y) are accessed.
[Figure: throughput (queries/sec) of the scan query, comparing no partitioning vs. vertical partitioning when accessing XYZ and when accessing only XY.]
Vertical Partitioning and Point Queries
• R(X, Y, Z)
  – X is an integer
  – Y and Z are large strings
• A mix of point queries access either XYZ or XY.
• Vertical partitioning gives a performance advantage if the proportion of queries accessing only XY is greater than 20%.
• The join is not expensive compared to a simple look-up.
[Figure: throughput (queries/sec) vs. percentage of accesses that concern only XY, comparing no vertical partitioning with vertical partitioning.]
Vertical Antipartitioning
• Brokers' bond-buying decisions are based on the price trends of those bonds.
• The database holds the closing price for the last 3000 trading days; however, the 10 most recent trading days are especially important.
  – (bond_id, issue_date, maturity, ...)
    (bond_id, date, price)
  vs.
  – (bond_id, issue_date, maturity, today_price, ..., 10dayago_price)
    (bond_id, date, price)
Tuning Denormalization
• Denormalizing means violating normalization for the sake of performance:
  – Denormalization speeds up performance when attributes from different normalized relations are often accessed together.
  – Denormalization hurts performance for relations that are often updated.
Denormalizing -- data
Settings:
lineitem( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT );
region( R_REGIONKEY, R_NAME, R_COMMENT );
nation( N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT );
supplier( S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT );
– 600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers
Denormalizing -- transactions
lineitemdenormalized( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME );
– 600000 rows in lineitemdenormalized
– Cold Buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Queries on Normalized vs. Denormalized Schemas
Queries:
select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, R_NAME
from LINEITEM, REGION, SUPPLIER, NATION
where L_SUPPKEY = S_SUPPKEY
  and S_NATIONKEY = N_NATIONKEY
  and N_REGIONKEY = R_REGIONKEY
  and R_NAME = 'EUROPE';

select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME
from LINEITEMDENORMALIZED
where L_REGIONNAME = 'EUROPE';
Denormalization
• TPC-H schema.
• Query: find all lineitems whose supplier is in Europe.
• With a normalized schema this query is a 4-way join.
• If we denormalize lineitem and introduce the name of the region for each lineitem, we obtain a 30% throughput improvement.
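The equivalence of the two queries can be sketched on a tiny SQLite instance (TPC-H-like names; the data values are made up, and only the join-relevant columns are kept):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region(R_REGIONKEY INT, R_NAME TEXT);
CREATE TABLE nation(N_NATIONKEY INT, N_NAME TEXT, N_REGIONKEY INT);
CREATE TABLE supplier(S_SUPPKEY INT, S_NATIONKEY INT);
CREATE TABLE lineitem(L_ORDERKEY INT, L_SUPPKEY INT);
CREATE TABLE lineitemdenormalized(L_ORDERKEY INT, L_SUPPKEY INT,
                                  L_REGIONNAME TEXT);
INSERT INTO region VALUES (1,'EUROPE'), (2,'ASIA');
INSERT INTO nation VALUES (10,'FRANCE',1), (20,'JAPAN',2);
INSERT INTO supplier VALUES (100,10), (200,20);
INSERT INTO lineitem VALUES (1,100), (2,200);
INSERT INTO lineitemdenormalized VALUES (1,100,'EUROPE'), (2,200,'ASIA');
""")

# Normalized schema: the 4-way join.
normalized = con.execute("""
  SELECT L_ORDERKEY FROM lineitem, supplier, nation, region
  WHERE L_SUPPKEY = S_SUPPKEY AND S_NATIONKEY = N_NATIONKEY
    AND N_REGIONKEY = R_REGIONKEY AND R_NAME = 'EUROPE'""").fetchall()

# Denormalized schema: a single-table selection.
denorm = con.execute("""
  SELECT L_ORDERKEY FROM lineitemdenormalized
  WHERE L_REGIONNAME = 'EUROPE'""").fetchall()
print(normalized == denorm)  # True: same answer, three fewer joins
```

The cost of the shortcut is the redundant L_REGIONNAME column, which must be kept consistent whenever a supplier's nation or region changes.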
[Figure: throughput (queries/sec) of the query on the normalized vs. the denormalized schema.]
Horizontal Decompositions
• Relation decomposition: the relation is replaced by a collection of relations that are projections.
• Horizontal decomposition: replace the relation by a collection of relations that are selections.
  – Each new relation has the same schema as the original, but a subset of the rows.
  – Collectively, the new relations contain all rows of the original. Typically, the new relations are disjoint.
Horizontal Decompositions (Contd.)
• Suppose that contracts with value > 10000 are subject to different rules. This means that contracts with val > 10000 will often be queried.
• One way to deal with this is to build a clustered B+ tree index on the val field of Contracts.
• A second approach is to replace Contracts by two new relations, LargeContracts and SmallContracts, with the same attributes.
• This performs like the index on such queries, but with no index overhead. Clustered indexes can be built on other attributes, in addition!
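A minimal SQLite sketch of the horizontal decomposition (the relation names and the 10000 cutoff follow the Contracts example; the two-column schema and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Contracts (cid INT, val INT)")
con.executemany("INSERT INTO Contracts VALUES (?,?)",
                [(1, 500), (2, 20000), (3, 9000), (4, 50000)])

# Each new relation is a selection over the original.
con.execute("""CREATE TABLE LargeContracts AS
               SELECT * FROM Contracts WHERE val > 10000""")
con.execute("""CREATE TABLE SmallContracts AS
               SELECT * FROM Contracts WHERE val <= 10000""")

large = con.execute("SELECT cid FROM LargeContracts ORDER BY cid").fetchall()
small = con.execute("SELECT cid FROM SmallContracts ORDER BY cid").fetchall()
print(large, small)  # disjoint, and together they cover every row of Contracts
```

Queries about large contracts now scan only LargeContracts, which behaves like a clustered index on the val predicate without maintaining one.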
Masking Conceptual Schema Changes
• The replacement of Contracts by LargeContracts and SmallContracts can be masked by a view.
• However, queries with the condition val > 10000 must be asked with respect to LargeContracts for efficient execution: users concerned with performance have to be aware of the change.

CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val)
AS SELECT * FROM LargeContracts
UNION
SELECT * FROM SmallContracts