Advanced Database Systems
Floris Geerts
University of Antwerp
Floris Geerts (University of Antwerp) Advanced Database Systems 1 / 384
Introduction Overview
What is a database?
Database = a very large, integrated collection of data.
Models real-world organisations (e.g. enterprise, university, genome, ...):
  - entities (e.g. students, modules, genes)
  - relationships (e.g. Joe is taking AD)
A DBMS is a software package designed to store, manage and query databases.
Database prehistory
[Figure labels: data entry; storage and retrieval; query processing; sorting]
Why use a database?
Why?
A DBMS provides generic functionality that otherwise would have to be implemented over and over again.
Data independence;
Efficient access;
Data integrity and security;
Uniform data administration;
Concurrent access, recovery from crashes; and
Reduced application development time.
Why study databases?
Everybody needs them.
They are connected to most other areas of computer science:
  - programming languages and software engineering;
  - algorithms;
  - logic, discrete math, and theory of computation (essential for data organization and query languages); and
  - systems issues: concurrency, operating systems, file organization and networks.
There are lots of interesting problems, both in database research and in implementation.
Good design is always a challenge.
Modeling data
How to model the data?
[Figure label: DBMS]
Data models
Data model = a collection of concepts for describing data:
  - Relations, attributes, tuples (relational model)
  - Classes, subclasses, attributes, objects (object oriented)
  - Entities, relationships, attributes (entity-relationship)
A schema is a description of a particular collection of data using a given data model.
The relational model of data is the most widely used model today:
  - Main concept: relation/table with rows and columns
  - Every relation has a schema which describes the table.
Munros:
  MId  MName          Lat     Long   Height  Rating
  1    The Saddle     57.167  5.384  1010    4
  2    Ladhar Bheinn  57.067  5.750  1020    4
  3    Schiehallion   56.667  4.098  1083    2.5
  4    Ben Nevis      56.780  5.002  1343    1.5
Munros
Sir Hugh Thomas Munro (1856–1919)
Scottish mountaineer
List of mountains in Scotland over 3,000 feet (914.4 m), known as the Munros.
283 Munros in total (in 2009)
Levels of abstraction
Data in a DBMS is described at three levels of abstraction:
  - Views (external schemas) describe how users see the data.
  - The conceptual schema defines the logical structure.
  - The physical schema describes the files and indexes used.
[Figure: External Schema 1, External Schema 2 and External Schema 3 on top of the Conceptual Schema, which is on top of the Physical Schema, which is on top of the Disk]
Schemas are defined using a data definition language (DDL); data is modified and queried using a data manipulation language (DML).
Example database
External Schema (View): All Munros that are not climbed
NotClimbed (MId: integer, MName: char(30))
Conceptual Schema:
Hikers (HId: integer, HName: char(30), Skill: char(3), Age: integer)
Munros (MId: integer, MName: char(30), Lat: real, Long: real, Height: integer, Rating: real)
Climbs (HId: integer, MId: integer, Date: date, Time: integer)
Physical Schema:
  - which relations are stored as unordered files;
  - which index structures are used (e.g., on the first attributes).
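As a rough illustration of the external/conceptual split, the NotClimbed view can be defined on top of Munros and Climbs. This is only a sketch: it uses Python's built-in sqlite3 module, keeps just the columns the view needs, and the view definition itself (a NOT IN subquery) is an assumption, since the slides give only the view's schema.

```python
import sqlite3

# In-memory database holding a cut-down part of the conceptual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Munros (MId INTEGER PRIMARY KEY, MName CHAR(30));
    CREATE TABLE Climbs (HId INTEGER, MId INTEGER);
    INSERT INTO Munros VALUES (1, 'The Saddle'), (2, 'Ladhar Bheinn'),
                              (3, 'Schiehallion'), (4, 'Ben Nevis');
    INSERT INTO Climbs VALUES (123, 1), (123, 3), (313, 1), (214, 2), (313, 2);

    -- External schema: users of the view never query Climbs directly.
    CREATE VIEW NotClimbed AS
        SELECT MId, MName FROM Munros
        WHERE MId NOT IN (SELECT MId FROM Climbs);
""")

not_climbed = conn.execute("SELECT MId, MName FROM NotClimbed").fetchall()
print(not_climbed)  # [(4, 'Ben Nevis')]
```

If the conceptual schema changes later, only the view definition needs to be rewritten; queries against NotClimbed stay the same.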
Data independence
Applications insulated from how data is structured and stored
Logical data independence: protection from changes in the logical structure of the data
  - When the conceptual schema changes, views can be redefined
  - Users can query in the same way as before
Physical data independence: protection from changes in the physical structure of the data
  - When the physical schema changes, the conceptual schema stays the same
  - Storage details are hidden from upper layers
Efficiency
There are things that we like to do quickly and efficiently:
  - Give me all Munros higher than 1000 m
  - Who climbed Ben Nevis?
We would like to program these as quickly as possible.
Such questions involving data stored in a DBMS are called queries.
A DBMS ensures that such queries can be answered efficiently using powerful query languages.
Concurrency control
Concurrent execution of user queries is essential for good DBMS performance:
  - Disk access is slow, so the system is used most efficiently when several users are served concurrently
Interleaving actions of different user programs/requests can lead to inconsistency:
  - e.g. money being transferred out of an account twice at the same time when the funds cover only one of the transactions
DBMS ensures such problems do not occur!
DBMS structure
A typical DBMS has a layered architecture
Concurrency control and recovery not shown
One of several possible variations
[Figure: layers of the architecture, from top to bottom]
  Query optimization & execution
  Relational operators
  File access
  Buffer management
  Disk management
  Disk
Some “real” DBMS
MySQL: www.mysql.org, open source, quite powerful
PostgreSQL: www.postgresql.org, open source, powerful
Microsoft Access: simple system, lots of nice GUI wrappers
Commercial systems:
  - Oracle 11g (www.oracle.com/database)
  - SQL Server 2008 (www.microsoft.com/sql)
  - DB2 (www.ibm.com/db2)
Introduction Relational model
Outline
1 Introduction
    Overview
    Relational model
    Relational query languages
2 Storage and indexing
3 Query evaluation
4 Query optimisation
5 Transactions, concurrency, and recovery
6 Parallel data management
Why study the relational model?
It's the dominant model in the marketplace
  - Vendors: Microsoft, Oracle, IBM, ...
  - Open source: PostgreSQL, MySQL, ...
SQL is the industrial realisation of the relational model
SQL has been standardised (several times)
Most of the commercial systems have substantially extended the standard!
SQL
SQL = Structured Query Language
The relational model: early history
Proposed by E.F. Codd (IBM San Jose), 1970
  - Prior to this the dominant model was the network model (CODASYL)
Mid 70s: prototypes
  - Sequel at IBM San Jose
  - INGRES at UC Berkeley
1976-: System R at IBM San Jose
  - Transactions
  - Query optimiser
  - Extended β-testing
Then... commercial systems...
[Figure: E.F. Codd]
The relational model: basics
A relational database is a collection of relations.
A relation consists of two parts:
  - Relation instance: a table, with columns and rows.
  - Relation schema: specifies the name of the relation, plus the name and type of each column.
You can think of a relation instance as a set of rows or tuples.
Example
Relation schema:
Climbs (HId: integer, MId: integer, Date: date, Time: integer)
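Here Climbs is the relation name; HId, MId, Date and Time are the field names (attribute names); integer and date are the domains.
In general (and more formally):
$R(f_1 : D_1, \ldots, f_n : D_n)$
where $R$ is the relation name, each $f_i$ is a field name (attribute name) and each $D_i$ is the domain of $f_i$.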
Example
Relation instance:
Munros:
  MId  MName          Lat     Long   Height  Rating
  1    The Saddle     57.167  5.384  1010    4
  2    Ladhar Bheinn  57.067  5.750  1020    4
  3    Schiehallion   56.667  4.098  1083    2.5
  4    Ben Nevis      56.780  5.002  1343    1.5
Hikers:
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  214  Arnold   BEG    25
  313  Bridget  EXP    33
  212  James    MED    27
Climbs:
  HId  MId  Date      Time
  123  1    10/10/88  5
  123  3    11/08/87  2.5
  313  1    12/08/89  4
  214  2    08/07/92  7
  313  2    06/07/94  5
[Figure annotations: each table has a relation name and field names; its rows are also called tuples or records; its columns are also called fields or attributes]
Some terminology
A domain is a set of values. All domains in a relation must be atomic (indivisible).
Given a relation $R(f_1 : D_1, \ldots, f_n : D_n)$, $R$ is said to have arity (degree) $n$.
Given a relation instance, its cardinality is the number of rows.
  - For example, in Climbs, cardinality = 5 and arity = 4; the domain of HId is integer and that of Date is date.
Beware:
Attributes within a table have different names; and
Tables have different names.
Relations and sets
An instance of a relation $R(f_1 : D_1, \ldots, f_n : D_n)$ can be defined more formally as a subset of
$\{\langle f_1 : d_1, \ldots, f_n : d_n \rangle \mid d_1 \in D_1, \ldots, d_n \in D_n\}$.
Thus a relation is a set of tuples:
  - There is no ordering of the tuples in the table; and
  - There are no duplicate rows in the table.
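The two consequences of the set view can be checked with a small sketch. Modelling a relation instance as a Python set of tuples is an assumption made purely for illustration; a DBMS stores relations very differently.

```python
# A relation instance modelled as a set of tuples (illustration only).
climbs = {
    (123, 1, "10/10/88", 5),
    (313, 1, "12/08/89", 4),
}

# Same rows, listed in another order and with one row repeated.
same_rows_other_order = {
    (313, 1, "12/08/89", 4),
    (123, 1, "10/10/88", 5),
    (123, 1, "10/10/88", 5),  # the repeated row is absorbed by the set
}

order_is_irrelevant = (climbs == same_rows_other_order)
cardinality = len(same_rows_other_order)
print(order_is_irrelevant, cardinality)  # True 2
```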
SQL
SQL is the ubiquitous language for relational databases.
Standardised by ANSI/ISO in 1986, 89, 92 and 1999.
Most DBMSs support SQL-92, and currently most features of SQL-99 are covered as well.
Part of SQL is a Data Definition Language (DDL) that supports:
  - creation of tables;
  - deletion of tables; and
  - modification of tables.
Creating tables
Consider
Munros(MId:int, MName:string, Lat:real, Long:real, Height:int, Rating:real)
Hikers(HId:int, HName:string, Skill:string, Age:int)
Climbs(HId:int, MId:int, Date:date, Time:int)
In its simplest use, SQL's DDL provides a name and a type for each column of a table.
    CREATE TABLE Hikers ( HId   INTEGER,
                          HName CHAR(40),
                          Skill CHAR(3),
                          Age   INTEGER );
Note that the domain of each field is specified and enforced by the DBMS.
Removing and altering tables
We can delete both the schema information and all the tuples, e.g.
DROP TABLE Hikers;
We can alter existing schemas, e.g. adding an extra field
ALTER TABLE Hikers
ADD COLUMN gender CHAR(2);
(every existing tuple is extended with a so-called null value in the new field),
or change the domain of a field:
    ALTER TABLE Hikers
      ALTER COLUMN gender CHAR(1);
or remove a field:
    ALTER TABLE Hikers
      DROP COLUMN gender;
Adding and deleting tuples
Can insert tuples into a table, e.g.
    INSERT INTO Hikers (HId, HName, Skill, Age)
    VALUES (314, 'Sam', 'Exp', 26);
Can remove tuples satisfying certain conditions, e.g.
    DELETE
    FROM Hikers
    WHERE HName = 'Arnold';
Can update tuples satisfying certain conditions, e.g.
    UPDATE Hikers
    SET Age = Age + 1
    WHERE HName = 'Arnold';
More ways of changing things will be considered in the labs.
Updating tuples: old value semantics
Consider the following update:
    UPDATE Hikers
    SET Age = Age + 1
    WHERE Age <= 25;
and the instance:
Hikers:
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  214  Arnold   BEG    25
  313  Bridget  EXP    33
  212  James    MED    27
The WHERE clause is evaluated first, on the old values; the update (SET) is applied second, to the tuples that were selected.
Updating tuples: old value semantics (result)
After the update, only Arnold's tuple has changed (Age 25 becomes 26):
Hikers:
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  214  Arnold   BEG    26
  313  Bridget  EXP    33
  212  James    MED    27
Arnold's new age is not tested against the WHERE clause again.
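The update above can be replayed in a few lines. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in DBMS; other systems should behave the same way for this statement, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Hikers (HId INTEGER, HName CHAR(30), Skill CHAR(3), Age INTEGER);
    INSERT INTO Hikers VALUES (123, 'Edmund', 'EXP', 80), (214, 'Arnold', 'BEG', 25),
                              (313, 'Bridget', 'EXP', 33), (212, 'James', 'MED', 27);
""")

# The WHERE clause picks the tuples using the old Age values;
# each selected tuple is then updated exactly once.
cur = conn.execute("UPDATE Hikers SET Age = Age + 1 WHERE Age <= 25")
updated_rows = cur.rowcount
ages = dict(conn.execute("SELECT HName, Age FROM Hikers"))
print(updated_rows, ages)
```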
Integrity constraints (IC)
IC: condition that must be true for any instance of the database, e.g., domain constraints.
  - ICs are specified when the schema is defined.
  - ICs are checked when relations are modified.
A legal instance of a relation is one that satisfies all specified ICs.
  - The DBMS should not allow illegal instances.
If the DBMS checks ICs, stored data is more faithful to real-world meaning.
  - Avoids data entry errors, too!
Primary key constraints
A set of fields is a key for a relation if:
  1 No two distinct tuples can have the same values in all key fields, and
  2 This is not true for any proper subset of the key.
If part 2 is false, the set of fields is a superkey.
If there is more than one key for a relation, one of the keys is chosen (by the DBA) to be the primary key.
E.g., HId is a key for Hikers. (What about HName?) The set {HId, HName} is a superkey.
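Condition 1 can be tested on a concrete instance with a short sketch. The helper name `is_unique_on` and the tuple-based encoding are assumptions for illustration. The check can only show that an instance violates a uniqueness condition, never that a key constraint holds for all instances.

```python
FIELDS = ("HId", "HName", "Skill", "Age")
HIKERS = [
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
]

def is_unique_on(rows, fields, candidate):
    """True if no two distinct rows agree on every field in `candidate`."""
    idx = [fields.index(f) for f in candidate]
    seen = set()
    for row in set(rows):  # compare distinct tuples only
        key = tuple(row[i] for i in idx)
        if key in seen:
            return False
        seen.add(key)
    return True

print(is_unique_on(HIKERS, FIELDS, ["HId"]))           # True
print(is_unique_on(HIKERS, FIELDS, ["HId", "HName"]))  # True (a superkey)
print(is_unique_on(HIKERS, FIELDS, ["Skill"]))         # False: EXP occurs twice
```

In this particular instance the HName values also happen to be distinct, which says nothing about whether HName should be declared a key.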
Key constraints
    CREATE TABLE Hikers ( HId   INTEGER,
                          HName CHAR(30),
                          Skill CHAR(3),
                          Age   INTEGER,
                          CONSTRAINT Blah PRIMARY KEY (HId) );
    CREATE TABLE Climbs ( HId  INTEGER,
                          MId  INTEGER,
                          Date DATE,
                          Time INTEGER,
                          PRIMARY KEY (HId, MId) );
The CONSTRAINT keyword is optional and only provides a name for the constraint.
Updates that violate key constraints are rejected (and if constraints are named, the error message will include those names).
Do you think the key in the second example is the right choice? Be careful when assigning primary keys...
Key constraints – continued
With (HId, MId) as the key, a hiker could not be recorded as climbing the same Munro twice. Adding Date to the key allows repeated climbs:
    CREATE TABLE Climbs ( HId  INTEGER,
                          MId  INTEGER,
                          Date DATE,
                          Time INTEGER,
                          PRIMARY KEY (HId, MId, Date) );
Key constraints – continued
    CREATE TABLE Hikers ( HId   INTEGER,
                          HName CHAR(30),
                          Skill CHAR(3),
                          Age   INTEGER,
                          UNIQUE (HName, Age),
                          PRIMARY KEY (HId) );
Other keys can be specified using UNIQUE.
A tuple can only be referred to from elsewhere by storing its primarykey fields.
Index can be built on top of primary key fields to optimize access.
Foreign keys
Foreign key: set of fields in one relation that is used to “refer” to a tuple in another relation.
  - Must correspond to the primary key of the second relation.
  - Like a “logical pointer”.
E.g., we expect any MId value in the Climbs table to be included in the MId column of the Munros table. Similarly for HId.
    CREATE TABLE Climbs ( HId  INTEGER,
                          MId  INTEGER,
                          Date DATE,
                          Time INTEGER,
                          PRIMARY KEY (HId, MId, Date),
                          FOREIGN KEY (HId) REFERENCES Hikers(HId),
                          FOREIGN KEY (MId) REFERENCES Munros(MId) );
A foreign key can refer to the same relation.
E.g., extend Hikers with a Partner field containing the partner's HId. Declare this field as a foreign key referring to Hikers.
Hikers:
  HId  HName    Skill  Age  Partner
  123  Edmund   EXP    80   214
  214  Arnold   BEG    25   123
  313  Bridget  EXP    33   null
  212  James    MED    27   null
[Figure annotations: the null values in Partner mark non-existing partners; the HId column contains no null values]
No null values are allowed in primary key fields (they are used to identify tuples).
Enforcing integrity constraints
Consider Climbs and Munros; MId in Climbs is a foreign key that references Munros.
What should be done if a Climbs tuple with a non-existent Munro id is inserted? (Reject it!)
What should be done if a Munro tuple is deleted?
  - Also delete all Climbs tuples that refer to it.
  - Disallow deletion of a Munro tuple that is referred to.
  - Set MId in Climbs tuples that refer to it to a default MId (e.g., null in case it is not a primary key field).
Similar if the primary key of a Munro tuple is updated.
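The "reject it" and "disallow deletion" behaviours can be observed with a small experiment. This sketch uses Python's built-in sqlite3 module and cut-down tables; note that SQLite only checks foreign keys after `PRAGMA foreign_keys = ON`, which is specific to that system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checking
conn.executescript("""
    CREATE TABLE Munros (MId INTEGER PRIMARY KEY, MName CHAR(30));
    CREATE TABLE Climbs (HId INTEGER, MId INTEGER REFERENCES Munros(MId));
    INSERT INTO Munros VALUES (1, 'The Saddle');
    INSERT INTO Climbs VALUES (123, 1);
""")

try:
    conn.execute("INSERT INTO Climbs VALUES (123, 99)")  # no Munro has MId 99
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

try:
    conn.execute("DELETE FROM Munros WHERE MId = 1")  # still referenced by Climbs
    delete_rejected = False
except sqlite3.IntegrityError:
    delete_rejected = True

print(insert_rejected, delete_rejected)  # True True
```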
Integrity in SQL-99
SQL-99 supports all 4 options on deletes and updates.
  - Default is NO ACTION (the delete/update is rejected)
  - CASCADE (also delete all tuples that refer to the deleted tuple)
  - SET NULL / SET DEFAULT (sets the foreign key value of the referencing tuple)
A default value has to be specified when creating the table.
    CREATE TABLE Climbs ( HId  INTEGER,
                          MId  INTEGER,
                          Date DATE DEFAULT '7/10/2009',
                          Time INTEGER,
                          PRIMARY KEY (HId, MId, Date),
                          FOREIGN KEY (HId) REFERENCES Hikers(HId)
                            ON DELETE NO ACTION
                            ON UPDATE SET DEFAULT,
                          FOREIGN KEY (MId) REFERENCES Munros(MId)
                            ON DELETE CASCADE
                            ON UPDATE SET DEFAULT );
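The effect of ON DELETE CASCADE can be seen in a minimal experiment. The sketch below uses Python's built-in sqlite3 module with cut-down tables (only the columns needed), so it illustrates the option rather than reproducing the full Climbs definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checking
conn.executescript("""
    CREATE TABLE Munros (MId INTEGER PRIMARY KEY, MName CHAR(30));
    CREATE TABLE Climbs (
        HId INTEGER,
        MId INTEGER,
        FOREIGN KEY (MId) REFERENCES Munros(MId) ON DELETE CASCADE
    );
    INSERT INTO Munros VALUES (1, 'The Saddle'), (2, 'Ladhar Bheinn');
    INSERT INTO Climbs VALUES (123, 1), (313, 1), (214, 2);
""")

# Deleting Munro 1 also deletes every Climbs tuple that refers to it.
conn.execute("DELETE FROM Munros WHERE MId = 1")
remaining = conn.execute("SELECT HId, MId FROM Climbs").fetchall()
print(remaining)  # [(214, 2)]
```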
Where do ICs come from?
ICs are based upon the semantics of the real-world enterprise that is being described in the database relations.
We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true by looking at an instance.
  - An IC is a statement about all possible instances!
  - From the example, we know HName is not a key, but the assertion that HId is a key is given to us.
Key and foreign key ICs are the most common; more general ICs are supported too.
Introduction Relational query languages
Relational query languages
Query languages allow the manipulation and retrieval of data from a database.
The relational model supports simple, powerful query languages:
  - strong formal foundation; and
  - allows for much (provably correct) optimisation.
NOTE: Query languages are not (necessarily) programming languages.
Formal relational query languages
Relational Algebra
Simple “operational” model, useful for expressing execution plans.
Relational Calculus
Logical model (declarative), useful for theoretical results.
Both languages were introduced by Codd in a series of papers.
They have equivalent expressive power.
They are the key to understanding SQL query processing!
Preliminaries
A query is applied to relation instances, and the result of a query is also a relation instance.
[Figure: input instance → query → output instance]
For a given query, the schemas of the input relations are fixed.
The query will then execute over any valid instance.
The schema of the result can also be determined (and is fixed for the given query).
Relational algebra
Basic operations:
  - Selection ($\sigma$): selects a subset of rows from a relation.
  - Projection ($\pi$): deletes unwanted columns from a relation.
  - Cross-product ($\times$): allows us to combine two relations.
  - Set-difference ($-$): allows us to subtract relations.
  - Union ($\cup$): allows us to union relations.
  - Renaming ($\rho$): allows us to rename relations and fields.
Additional operations:
  - Intersection, join, division.
  - Not essential, but (very!) useful (especially join).
Closure
Since each operation returns a relation, operations can be composed! (One says that the algebra is closed.)
Projection
Choose a set of field names $A$ and a table $R$.
$\pi_A(R)$ extracts the columns in $A$ from the table.
Example: given Munros =
  MId  MName          Lat     Long   Height  Rating
  1    The Saddle     57.167  5.384  1010    4
  2    Ladhar Bheinn  57.067  5.750  1020    4
  3    Schiehallion   56.667  4.098  1083    2.5
  4    Ben Nevis      56.780  5.002  1343    1.5
$\pi_{MId,Rating}(Munros)$ is
  MId  Rating
  1    4
  2    4
  3    2.5
  4    1.5
Provides the user with a view by hiding some attributes.
Projection – continued
Suppose the result of a projection has a repeated value, how do we treat it?
$\pi_{Rating}(Munros)$ is
  Rating
  4
  4
  2.5
  1.5
or
  Rating
  4
  2.5
  1.5
?
In “pure” relational algebra the answer is always a set (recall that we defined a relation instance as a set).
However, SQL and some other languages return a multiset for some operations, from which duplicates may be eliminated by a further operation. (Why? Eliminating duplicates is expensive in practice.)
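The two possible answers correspond to a multiset (bag) projection and a set projection. A minimal sketch, assuming a relation is encoded as a list of tuples plus a tuple of field names; the helper names `project_bag` and `project_set` are made up for this example.

```python
FIELDS = ("MId", "MName", "Lat", "Long", "Height", "Rating")
MUNROS = [
    (1, "The Saddle", 57.167, 5.384, 1010, 4),
    (2, "Ladhar Bheinn", 57.067, 5.750, 1020, 4),
    (3, "Schiehallion", 56.667, 4.098, 1083, 2.5),
    (4, "Ben Nevis", 56.780, 5.002, 1343, 1.5),
]

def project_bag(rows, fields, wanted):
    """Keep only the wanted columns; duplicates survive (multiset semantics)."""
    idx = [fields.index(f) for f in wanted]
    return [tuple(row[i] for i in idx) for row in rows]

def project_set(rows, fields, wanted):
    """Pure relational algebra: the result is a set, so duplicates vanish."""
    return set(project_bag(rows, fields, wanted))

bag = project_bag(MUNROS, FIELDS, ["Rating"])
ratings = project_set(MUNROS, FIELDS, ["Rating"])
print(bag)           # [(4,), (4,), (2.5,), (1.5,)]
print(len(ratings))  # 3
```

The set version needs an extra pass (here, hashing every row) to remove duplicates, which hints at why systems skip it unless asked.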
Selection
Chooses tuples that satisfy some condition.
Selection $\sigma_C(R)$ takes a table $R$ and extracts those rows from it that satisfy the condition $C$.
For example, $\sigma_{Height > 1050}(Munros)$ =
  MId  MName         Lat     Long   Height  Rating
  3    Schiehallion  56.667  4.098  1083    2.5
  4    Ben Nevis     56.780  5.002  1343    1.5
What can go into a condition C?
Selection - continued
Conditions are built up from:
  - Comparisons on attributes: $R.A = R.A'$, $R.A \neq R.A'$.
  - Comparisons on values, e.g. Height > 1000, MName = "Ben Nevis".
  - Predicates constructed from these using $\lor$ (or), $\land$ (and), $\lnot$ (not), e.g. $(Lat > 57 \land Height > 1000) \lor (Height = Lat)$.
A selection provides the user with a view of the data by hiding tuples that do not satisfy the condition the user wants.
Combining selection and projection
Find the names and ages of all hikers of age > 30.
Relational algebra query
$Q_1 = \pi_{HName,Age}(\sigma_{Age>30}(Hikers))$
An equivalent relational algebra query
$Q_2 = \sigma_{Age>30}(\pi_{HName,Age}(Hikers))$
The same declarative query can be translated into more than one procedural query.
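That the two plans agree can be checked on the Hikers instance. A minimal sketch, assuming rows are encoded as dictionaries; `select` and `project` are illustrative helpers, not part of any library.

```python
HIKERS = [
    {"HId": 123, "HName": "Edmund", "Skill": "EXP", "Age": 80},
    {"HId": 214, "HName": "Arnold", "Skill": "BEG", "Age": 25},
    {"HId": 313, "HName": "Bridget", "Skill": "EXP", "Age": 33},
    {"HId": 212, "HName": "James", "Skill": "MED", "Age": 27},
]

def select(rows, condition):
    """sigma: keep the rows that satisfy the condition."""
    return [r for r in rows if condition(r)]

def project(rows, fields):
    """pi: keep the listed fields and drop duplicate rows."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[f] for f in fields)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(fields, key)))
    return out

def older_than_30(r):
    return r["Age"] > 30

q1 = project(select(HIKERS, older_than_30), ["HName", "Age"])  # select, then project
q2 = select(project(HIKERS, ["HName", "Age"]), older_than_30)  # project, then select
print(q1 == q2, q1)
```

Q2 only works because the selection condition mentions Age, which survives the projection.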
Combining selection and projection
Are Q1 and Q2 the same?
They are semantically, as they produce the same result.
But they differ in terms of efficiency:
  - Q1 scans Hikers, selects some tuples, and then scans only the selected tuples.
  - Q2 scans Hikers, projects out two attributes, and then scans the whole result again.
Q1 is likely to be more efficient than Q2.
Procedural languages can be optimized...
Set operations – union
If two tables have the same structure, we can perform set operations.
Same structure means union-compatible:
  - Same number of fields; and
  - Corresponding fields (taken from left to right) have the same domains.
Example:
Hikers =
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  214  Arnold   BEG    25
  313  Bridget  EXP    33
  212  James    MED    27
Climbers =
  HId  HName   Expertise  Age
  214  Arnold  BEG        25
  898  Jane    MED        39
$Hikers \cup Climbers$ =
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  214  Arnold   BEG    25
  313  Bridget  EXP    33
  212  James    MED    27
  898  Jane     MED    39
Output schema is that of the first relation (Hikers in the Example).
Set operations – set difference
We can also take the difference of two union-compatible tables:
$Hikers - Climbers$ =
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  313  Bridget  EXP    33
  212  James    MED    27
Again, output schema is that of the first relation.
Set operations – intersection
It turns out we can implement intersection in terms of other operations:
$R \cap S = R - (R - S)$
Although it is mathematically nice to have fewer operators, this may not be an efficient way to implement intersection.
Intersection is also a special case of a join, which we’ll shortly discuss.
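The identity $R \cap S = R - (R - S)$ can be checked on the Hikers and Climbers instances from the union example. A minimal sketch, assuming relations are encoded as Python sets of tuples:

```python
# (HId, HName, Skill, Age) tuples from the union example.
hikers = {
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
}
climbers = {
    (214, "Arnold", "BEG", 25),
    (898, "Jane", "MED", 39),
}

via_difference = hikers - (hikers - climbers)  # R - (R - S)
built_in = hikers & climbers                   # Python's own set intersection
print(via_difference == built_in, via_difference)
```

The formula computes two differences where one pass over the smaller relation would do, which is the efficiency concern mentioned above.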
Cross product (Cartesian product)
The basic operation is the Cartesian product, $R \times S$, which concatenates every tuple in $R$ with every tuple in $S$.
Example:
  A   B
  a1  b1
  a2  b2
$\times$
  C   D
  c1  d1
  c2  d2
  c3  d3
$=$
  A   B   C   D
  a1  b1  c1  d1
  a1  b1  c2  d2
  a1  b1  c3  d3
  a2  b2  c1  d1
  a2  b2  c2  d2
  a2  b2  c3  d3
Cartesian product – continued
What happens when we form a product of two tables with columns with the same name?
Recall the schemas: Hikers(HId, HName, Skill, Age) and Climbs(HId, MId, Date, Time). What is the schema of $Hikers \times Climbs$?
Various possibilities including:
  - Forget the conflicting name (as in R&G): ((HId), HName, Skill, Age, (HId), MId, Date, Time). Allow positional references (by number) to columns.
  - Label the conflicting columns with 1, 2, ...: (HId.1, HName, Skill, Age, HId.2, MId, Date, Time).
Neither of these is satisfactory: with either choice the product operation is no longer commutative (a property that is useful in optimization).
Cartesian product – continued
If $R_1$ has $n$ tuples and $R_2$ has $m$ tuples then $R_1 \times R_2$ has $n \times m$ tuples.
This is an expensive operation: if $R_1$ and $R_2$ both have 1 000 tuples (small relations) then $R_1 \times R_2$ has 1 000 000 tuples (a large relation).
Query processors try to avoid building products; instead they attempt to build only subsets which contain relevant information.
Renaming
To avoid confusion about attribute names, one can use the renaming operator $\rho$:
$\rho(C(1 \to sid1, 5 \to sid2), Hikers \times Climbs)$
This operator
  - names the result relation $C$; and
  - explicitly renames the fields at positions 1 and 5 to sid1 and sid2.
In general,
$\rho(R(oldname \to newname, \ldots, position \to newname), E)$,
where $E$ is a relational algebra expression.
Natural join
For obvious reasons of efficiency we rarely use unconstrained cross products in practice.
A natural join ($\bowtie$) produces the set of all merges of tuples that agree on their commonly named fields.
Example:
  HId  MId  Date      Time
  123  1    10/10/88  5
  123  3    11/08/87  2.5
  313  1    12/08/89  4
  214  2    08/07/92  7
  313  2    06/07/94  5
$\bowtie$
  HId  HName    Skill  Age
  123  Edmund   EXP    80
  214  Arnold   BEG    25
  313  Bridget  EXP    33
  212  James    MED    27
$=$
  HId  MId  Date      Time  HName    Skill  Age
  123  1    10/10/88  5     Edmund   EXP    80
  123  3    11/08/87  2.5   Edmund   EXP    80
  313  1    12/08/89  4     Bridget  EXP    33
  214  2    08/07/92  7     Arnold   BEG    25
  313  2    06/07/94  5     Bridget  EXP    33
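The join above can be reproduced with a naive nested-loop sketch. The function `natural_join` and the encoding (a tuple of field names plus a collection of tuples) are assumptions for illustration; real query processors use far better algorithms.

```python
CLIMBS_FIELDS = ("HId", "MId", "Date", "Time")
CLIMBS = [
    (123, 1, "10/10/88", 5),
    (123, 3, "11/08/87", 2.5),
    (313, 1, "12/08/89", 4),
    (214, 2, "08/07/92", 7),
    (313, 2, "06/07/94", 5),
]
HIKERS_FIELDS = ("HId", "HName", "Skill", "Age")
HIKERS = [
    (123, "Edmund", "EXP", 80),
    (214, "Arnold", "BEG", 25),
    (313, "Bridget", "EXP", 33),
    (212, "James", "MED", 27),
]

def natural_join(r_fields, r_rows, s_fields, s_rows):
    """Merge every pair of tuples that agree on all commonly named fields."""
    common = [f for f in r_fields if f in s_fields]
    s_extra = [f for f in s_fields if f not in common]
    out_fields = tuple(r_fields) + tuple(s_extra)
    out_rows = set()
    for r in r_rows:
        for s in s_rows:
            if all(r[r_fields.index(f)] == s[s_fields.index(f)] for f in common):
                out_rows.add(tuple(r) + tuple(s[s_fields.index(f)] for f in s_extra))
    return out_fields, out_rows

fields, rows = natural_join(CLIMBS_FIELDS, CLIMBS, HIKERS_FIELDS, HIKERS)
print(fields)
print(len(rows))  # 5
```

James (HId 212) has no matching Climbs tuple, so he does not appear in the result.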
Natural Join – cont.
Natural join has interesting relationships with other operations. What is $R \bowtie S$ when
  - $R = S$
  - $R$ and $S$ have no column names in common
  - $R$ and $S$ have all column names in common, i.e., they are union compatible
Natural join has nice properties (assuming fields are identified by names):
  - Commutative: $R \bowtie S = S \bowtie R$
  - Associative: $R \bowtie (S \bowtie T) = (R \bowtie S) \bowtie T$
  - Hence we can always simply write $R_1 \bowtie R_2 \bowtie \cdots \bowtie R_k$.
Conditional Join
Extension of natural join in which a join condition is specified:
$R \bowtie_C S$ stands for $\sigma_C(R \times S)$.
The special case in which the join condition consists of equality conditions is called the equijoin.
A natural join is an equijoin in which equalities are specified on all common fields.
Interaction of the relational algebra operators
$\pi_A(R \cup S) = \pi_A(R) \cup \pi_A(S)$
$\sigma_C(R \cup S) = \sigma_C(R) \cup \sigma_C(S)$
$(R \cup S) \bowtie T = (R \bowtie T) \cup (S \bowtie T)$
$T \bowtie (R \cup S) = (T \bowtie R) \cup (T \bowtie S)$
Division
Suppose we have two tables with schemas $R(A,B)$ and $S(B)$. $R/S$ is defined to be the set of $A$ values in $R$ which are paired (in $R$) with all $B$ values in $S$.
That is, the set of all $x$ for which $\pi_B(S) \subseteq \pi_B(\sigma_{A=x}(R))$.
$R/S = \pi_A(R) - \pi_A((\pi_A(R) \bowtie \pi_B(S)) - R)$
The general definition of division extends this idea to more than one attribute.
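The division formula can be followed step by step in a short sketch. Here $R$ is taken to be $\pi_{HId,MId}(Climbs)$ and $S$ is a hypothetical two-element set of MIds chosen only for this example; the function name `divide` is also made up.

```python
def divide(r, s):
    """R/S for R(A, B) given as a set of pairs and S(B) given as a set of B values."""
    a_values = {a for (a, _) in r}                     # pi_A(R)
    all_pairs = {(a, b) for a in a_values for b in s}  # pi_A(R) joined with S (a product)
    missing = all_pairs - r                            # pairs that R lacks
    disqualified = {a for (a, _) in missing}           # pi_A of the missing pairs
    return a_values - disqualified

climbed = {(123, 1), (123, 3), (313, 1), (214, 2), (313, 2)}  # pi_{HId,MId}(Climbs)
targets = {1, 2}  # hypothetical S: the MIds of The Saddle and Ladhar Bheinn

result = divide(climbed, targets)
print(result)  # {313}

# Cross-check against the definition: x qualifies if S is a subset of x's B values.
by_definition = {a for (a, _) in climbed
                 if targets <= {b for (a2, b) in climbed if a2 == a}}
```

Only hiker 313 (Bridget) is paired with both target Munros.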
Examples
The names of people who have climbed The Saddle.
$\pi_{HName}(\sigma_{MName = \text{"The Saddle"}}(Munros \bowtie Hikers \bowtie Climbs))$
Note the optimization to:
$\pi_{HName}(\sigma_{MName = \text{"The Saddle"}}(Munros) \bowtie Hikers \bowtie Climbs)$
In what order would you perform the joins?
Examples – cont.
The highest Munro(s)
This is more tricky. We first find the peaks (their MIds) that are lower than some other peak.
$LowerIds = \pi_{MId}(\sigma_{Height < Height'}(Munros \bowtie \pi_{Height'}(\rho(Height \to Height', Munros))))$
(we could have used $\times$ instead of $\bowtie$ here)
Now we find the MIds of peaks that are not in this set (they must be the peaks with maximum height):
$MaxIds = \pi_{MId}(Munros) - LowerIds$
Finally we get the names:
$\pi_{MName}(MaxIds \bowtie Munros)$
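The three steps can be traced on the Munros instance with Python set comprehensions. This is only a sketch: the relation is cut down to (MId, MName, Height), and the comprehensions stand in for the algebra operators.

```python
# (MId, MName, Height) triples from the Munros instance.
MUNROS = {
    (1, "The Saddle", 1010),
    (2, "Ladhar Bheinn", 1020),
    (3, "Schiehallion", 1083),
    (4, "Ben Nevis", 1343),
}

# LowerIds: peaks that are lower than some other peak
# (a selection Height < Height' over the product of Munros with its renamed heights).
lower_ids = {m[0] for m in MUNROS for other in MUNROS if m[2] < other[2]}

# MaxIds = pi_MId(Munros) - LowerIds
max_ids = {m[0] for m in MUNROS} - lower_ids

# Join back with Munros to get the names.
max_names = {m[1] for m in MUNROS if m[0] in max_ids}
print(lower_ids, max_ids, max_names)
```

The maximum is found without any aggregate operator, using only selection, product, projection and difference.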
Examples – cont.
The names of hikers who have climbed all Munros
We start by finding the set of HId, MId pairs for which the hiker has not climbed that peak.
We do this by subtracting part of the Climbs table from the set of all HId, MId pairs.
$NotClimbed = (\pi_{HId}(Hikers) \bowtie \pi_{MId}(Munros)) - \pi_{HId,MId}(Climbs)$
(we could have used $\times$ instead of $\bowtie$ here)
The HIds in this table identify the hikers who have not climbed some peak. By subtraction we get the HIds of hikers who have climbed all peaks:
$ClimbedAll = \pi_{HId}(Hikers) - \pi_{HId}(NotClimbed)$
A join gets us the desired information:
$\pi_{HName}(Hikers \bowtie ClimbedAll)$
What we cannot compute with relational algebra
Aggregate operations. E.g. “The number of hikers who have climbed Schiehallion” or “The average age of hikers”. These are possible in SQL, which has numerous extensions to the relational algebra.
Recursive queries. Given a table Parent(Parent, Child), compute the Ancestor table. This appears to call for an arbitrary number of joins.
Non-relational data. For example, lists, arrays, multisets (bags); or relations that are nested. These are ruled out by the relational data model, but they are important and are the province of object-oriented databases and “complex-object”/XML query languages.
Of course, we can always compute such things if we can talk to a database from a full-blown (Turing complete) programming language.
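For the recursive case, a host programming language can simply repeat the join until nothing new appears. A minimal sketch, assuming Parent is encoded as a set of (parent, child) pairs; the names in the data are made up.

```python
def ancestors(parent):
    """Repeat the join Ancestor(a, b), Parent(b, c) until no new pair is produced."""
    ancestor = set(parent)
    while True:
        step = {(a, c) for (a, b) in ancestor for (b2, c) in parent if b == b2}
        if step <= ancestor:  # fixpoint reached: nothing new
            return ancestor
        ancestor |= step

# Hypothetical data: Ann is Bob's parent, Bob is Cat's parent, Cat is Dan's parent.
parent = {("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Dan")}
ancestor = ancestors(parent)
print(sorted(ancestor))
```

The number of loop iterations depends on the data (the length of the longest chain), which is why no fixed relational algebra expression can express it.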
Relational calculus
Declarative way of writing queries;
Ignorant of how things are computed;
Equivalent to relational algebra: every query that can be expressed in the relational algebra can be expressed in the calculus, and vice versa.