21.entities - courses.cs.washington.edu
Transcript of 21.entities - courses.cs.washington.edu
CSE 344JULY 30TH
DB DESIGN (CH 4)
ADMINISTRIVIA• HW6 due next Thursday
• uses Spark API rather than MapReduce• (a bit higher level)
• be sure to shut down your AWS cluster when not in use
• Still grading midterms...• will hand back on Wednesday
REVIEW
Horizontally partitioned data:• allows for easy scale-up
• OLTP is easy!• but OLAP becomes harder...
• new problem for query execution: needed data may not be local• fix by reshuffling: put required data into the same machine
EXAMPLE 1
QuerySingle Node Plan
Multi-Node Plan
REVIEW
MapReduce• lower-level API for • handles networking, I/O, failure, and straggler issues• main piece provided by the API is shuffle
• read → map → shuffle → reduce → write• fits perfectly with what we need above
• one job contains one shuffle• complex queries becomes multiple jobs
PARALLEL QUERY PLANS• Split into phases at each reshuffle
• Do as much work as possible within each phase
• selections never require a reshuffle• group by & join may require it
• (consider what data is currently partitioned by)
• In general, n-phase plan can be implemented in (n-1) MapReduce jobs
• (since it has n-1 shuffles)
EXAMPLE 1
QuerySingle Node Plan
Multi-Node Plan
MapReduce• saw this before in three parts: left side, right side, join• only 1 reshuffle (for the join), so only 1 job• mapper does first two parts: input is Item i or Order o
• left side: output (i.item, (‘Item’, i))• right side: output (o.item, (‘Order’, o))
• reducer: input is (item, [(table_name, value)])• split list into two: items and orders• output all combinations (two nested loops)
EXAMPLE 2Schema:
Drug(spec VARCHAR(255), compatibility INT)Person(name VARCHAR(100) PK, compatibility INT)
QuerySELECT P.name, count(D.spec)FROM Person AS P, Drug AS DWHERE P.compatibility = D.compatibilityGROUP BY P.name;
Logical PlanƔ name, count(spec) (Person ⋈ Drug)
EXAMPLE 2Logical Plan
Ɣ name, count(spec) (Person ⋈ Drug)
Possible Reshuffles• join: get tuples with same “compatibility” on same machine• group by: get tuples with same “name” on same machine
EXAMPLE 2Logical Plan
Ɣ name, count(spec) (Person ⋈ Drug)
Assume• Drug is block partitioned• Person is hash partitioned by “compatibility”
Physical Plan• join: reshuffle Drug by compatibility (not needed for Person)• group-by: no reshuffle!
• name is a PK so only one tuple with that name• join can produce multiple tuples but all are on same machine
with the Person tuple having that name
DB DESIGN
DATABASE DESIGNWhat it is:Starting from scratch, design the database schema:relation, attributes, keys, foreign keys, constraints etc.Why it’s hardThe database will be in operation for a very long time (years). Updating the schema while in production is very expensive
• computer time• engineering time
DATABASE DESIGN
Consider issues such as:• What entities to model• How entities are related• What constraints exist in the domain
Several formalisms exists• We discuss E/R diagrams
• frequently used to communicate schemas in industry• UML, model-driven architecture
Reading: Sec. 4.1-4.6
DATABASE DESIGN PROCESS
companymakesproduct
name
price name address
Conceptual Model:
Relational Model:Tables + constraintsAnd also functional dep.
Normalization:Eliminates anomalies
Conceptual Schema
Physical SchemaPhysical storage details
ENTITY / RELATIONSHIP DIAGRAMS
Entity set = a class• An entity = an object
Attribute
Relationship
Product
city
makes
Company
Person
CompanyProduct
buys
makes
employs
name CEO
price
address name ssn
address
name
KEYS IN E/R DIAGRAMSEvery entity set must have a key
Product
name
price
MULTIPLICITY OF E/R RELATIONS
one-one:
many-one
many-many
123
abcd
123
abcd
123
abcd
Person
CompanyProduct
buys
makes
employs
name CEO
price
address name ssn
address
name
What doesthis say ?
ATTRIBUTES ON RELATIONSHIPS
ProductPerson Buys
name price
address
date
name
What doesthis say ?
MULTI-WAY RELATIONSHIPSHow do we model a purchase relationship between buyers,products and stores?
Purchase
Product
Person
Store
Can still model as a mathematical set (How?)
As a set of triples ⊆ Person × Product × Store
date
Q: What does the arrow mean ?
ARROWS IN MULTIWAY RELATIONSHIPS
A: Any person buys a given product from at most one store
Purchase
Product
Person
Store
[Fine print: Arrow pointing to E means that if we select one entity from each of the other entity sets in the relationship, those entities are related to at most one entity in E]
date
Q: What does the arrow mean ?
ARROWS IN MULTIWAY RELATIONSHIPS
A: Any person buys a given product from at most one storeAND every store sells to every person at most one product
Purchase
Product
Person
Store
date
CONVERTING MULTI-WAY RELATIONSHIPS TO BINARY
Purchase
Person
Store
Product
StoreOf
ProductOf
BuyerOf
date
Arrows go in which direction?
CONVERTING MULTI-WAY RELATIONSHIPS TO BINARY
Purchase
Person
Store
Product
StoreOf
ProductOf
BuyerOf
date
Make sure you understand why!
PurchaseProduct Person
President PersonCountry
Moral: Be faithful to the specifications of the application!
DESIGN PRINCIPLES:WHAT’S WRONG?
DESIGN PRINCIPLES:WHAT’S WRONG?
Purchase
Product
Store
date
personNamepersonAddr
Moral: pick the rightkind of entities.
DESIGN PRINCIPLES:WHAT’S WRONG?
Purchase
Product
Person
Store
dateDates
Moral: don’t complicate life morethan it already is.
FROM E/R DIAGRAMSTO RELATIONAL SCHEMA
Entity set à relationRelationship à relation
ENTITY SET TO RELATION
Product
prod-ID category
price
Product(prod-ID, category, price)
prod-ID category priceGizmo55 Camera 99.99Pokemn19 Toy 29.99
N-N RELATIONSHIPS TO RELATIONS
Orders
prod-ID cust-ID
date
Shipment Shipping-Co
address
namedate
How to represent Shipment in relations?
prod-ID cust-ID name date
Gizmo55 Joe12 UPS 4/10/2011
Gizmo55 Joe12 FEDEX 4/9/2011
N-N RELATIONSHIPS TO RELATIONS
Orders
prod-ID cust-ID
date
Shipment Shipping-Co
address
name
Orders(prod-ID,cust-ID, date)Shipment(prod-ID,cust-ID, name, date)Shipping-Co(name, address)
date
N-1 RELATIONSHIPS TO RELATIONS
Orders
prod-ID cust-ID
date
Shipment Shipping-Co
address
namedate
How to represent Shipment in relations?
N-1 RELATIONSHIPS TO RELATIONS
Orders
prod-ID cust-ID
date
Shipment Shipping-Co
address
name
Orders(prod-ID,cust-ID, date1, name, date2) Shipping-Co(name, address)
date
Remember: no separate relations for many-one relationship
MULTI-WAY RELATIONSHIPS TO RELATIONS
Purchase
Product
Person
Storeprod-ID price
ssn name
name address
Purchase(prod-ID, ssn, name)
MODELING SUBCLASSES
Some objects in a class may be special• define a new class• better: define a subclass
Products
Software products
Educational products
So --- we define subclasses in E/R
Product
name category
price
isa isa
Educational ProductSoftware Product
Age Groupplatforms
MODELING SUBCLASSES
Product
name category
price
isa isa
Educational ProductSoftware Product
Age Groupplatforms
Name Price Category
Gizmo 99 gadget
Camera 49 photo
Toy 39 gadget
Name platforms
Gizmo unix
Product
Sw.Product
Ed.Product
Other ways to convert are possible...
Name Age Group
Gizmo toddler
Toy retired
MODELING SUBCLASSES
Is this representation subclassing in Java sense?
MODELING UNION TYPES WITH SUBCLASSES
FurniturePiece
PersonCompany
Say: each piece of furniture is owned either by a person or by a company
MODELING UNION TYPES WITH SUBCLASSES
Say: each piece of furniture is owned either by a person or by a companySolution 1. Acceptable but imperfect (What’s wrong ?)
FurniturePiecePerson Company
ownedByPerson ownedByComp.
MODELING UNION TYPES WITH SUBCLASSES
Solution 2: better, more laborious
isa
FurniturePiece
Person CompanyownedBy
Owner
isa
WEAK ENTITY SETSEntity sets are weak when their key comes from otherclasses to which they are related.
UniversityTeam affiliation
numbersport name
Team(sport, number, universityName)University(name)
WHAT ARE THE KEYS OF R ?
R
A
B
S
T
V
Q
UW
V
Z
C
DE G
K
H
FL
INTEGRITY CONSTRAINTS MOTIVATION
ICs help prevent entry of incorrect informationHow? DBMS enforces integrity constraints
• Allows only legal database instances (i.e., those that satisfy all constraints) to exist
• Ensures that all necessary checks are always performed and avoids duplicating the verification logic in each application
An integrity constraint is a condition specified on a database schema that restricts the data that can be stored in an instance of the database.
CONSTRAINTS IN E/R DIAGRAMSFinding constraints is part of the modeling process. Commonly used constraints:
Keys: social security number uniquely identifies a person.
Single-value constraints: a person can have only one biological father.
Referential integrity constraints: if you work for a company, itmust exist in the database.
Other constraints: peoples’ ages are between 0 and 120
KEYS IN E/R DIAGRAMS
address name ssn
Person
Product
name category
price
No formal way to specify multiplekeys in E/R diagrams
Underline:
SINGLE VALUE CONSTRAINTS
makes
makes
vs.
REFERENTIAL INTEGRITY CONSTRAINTS
CompanyProduct makes
CompanyProduct makes
Each product made by at most one company.Some products made by no company
Each product made by exactly one company.
OTHER CONSTRAINTS
CompanyProduct makes<100
Q: What does this mean ?A: A Company entity cannot be connectedby relationship to more than 99 Product entities
CONSTRAINTS IN SQLConstraints in SQL:Keys, foreign keysAttribute-level constraintsTuple-level constraintsGlobal constraints: assertions
The more complex the constraint, the harder it is to check and to enforce
simplest
Mostcomplex
KEY CONSTRAINTS
OR:
CREATE TABLE Product (name CHAR(30) PRIMARY KEY,category VARCHAR(20))
CREATE TABLE Product (name CHAR(30),category VARCHAR(20),PRIMARY KEY (name))
Product(name, category)
KEYS WITH MULTIPLE ATTRIBUTES
CREATE TABLE Product (name CHAR(30),category VARCHAR(20),price INT,
PRIMARY KEY (name, category))
Name Category Price
Gizmo Gadget 10
Camera Photo 20
Gizmo Photo 30
Gizmo Gadget 40
Product(name, category, price)
OTHER KEYSCREATE TABLE Product (
productID CHAR(10),name CHAR(30),category VARCHAR(20),price INT,PRIMARY KEY (productID),UNIQUE (name, category))
There is at most one PRIMARY KEY;there can be many UNIQUE
FOREIGN KEY CONSTRAINTS
CREATE TABLE Purchase (prodName CHAR(30)REFERENCES Product(name),date DATETIME)
prodName is a foreign key to Product(name)name must be a key in Product
Referentialintegrity
constraints
May writejust Product
if name is PK
FOREIGN KEY CONSTRAINTS
Example with multi-attribute primary key
(name, category) must be a KEY in Product
CREATE TABLE Purchase (prodName CHAR(30),category VARCHAR(20),date DATETIME,FOREIGN KEY (prodName, category)
REFERENCES Product(name, category)
Name Category
Gizmo gadget
Camera Photo
OneClick Photo
ProdName Store
Gizmo Wiz
Camera Ritz
Camera Wiz
Product Purchase
WHAT HAPPENS WHEN DATA CHANGES?
Types of updates:In Purchase: insert/updateIn Product: delete/update
SQL has three policies for maintaining referential integrity:NO ACTION reject violating modifications (default)CASCADE after delete/update do delete/updateSET NULL set foreign-key field to NULLSET DEFAULT set foreign-key field to default value
• need to be declared with column, e.g., CREATE TABLE Product (pid INT DEFAULT 42)
WHAT HAPPENS WHEN DATA CHANGES?
MAINTAINING REFERENTIAL INTEGRITY
CREATE TABLE Purchase (prodName CHAR(30),category VARCHAR(20),date DATETIME,FOREIGN KEY (prodName, category)
REFERENCES Product(name, category)ON UPDATE CASCADEON DELETE SET NULL )
Name Category
Gizmo gadget
Camera Photo
OneClick Photo
ProdName Category
Gizmo Gizmo
Snap Camera
EasyShoot Camera
Product Purchase
CONSTRAINTS ON ATTRIBUTES AND TUPLES
Constraints on attributes:NOT NULL -- obvious meaning...CHECK condition -- any condition !
Constraints on tuplesCHECK condition
CONSTRAINTS ON ATTRIBUTES AND TUPLES
CREATE TABLE R (A int NOT NULL, B int CHECK (B > 50 and B < 100), C varchar(20),
D int,CHECK (C >= 'd' or D > 0))
CONSTRAINTS ON ATTRIBUTES AND TUPLES
CREATE TABLE Product (productID CHAR(10),name CHAR(30),category VARCHAR(20),price INT CHECK (price > 0),PRIMARY KEY (productID),UNIQUE (name, category))
CREATE TABLE Purchase (prodName CHAR(30)
CHECK (prodName IN(SELECT Product.nameFROM Product),
date DATETIME NOT NULL)
Constraints on Attributes and Tuples
Whatis the difference from
Foreign-Key ?
What does this constraint do?
GENERAL ASSERTIONS
CREATE ASSERTION myAssert CHECK(NOT EXISTS(
SELECT Product.nameFROM Product, PurchaseWHERE Product.name = Purchase.prodNameGROUP BY Product.nameHAVING count(*) > 200) )
But most DBMSs do not implement assertionsBecause it is hard to support them efficientlyInstead, they provide triggers