STAT 598W Lecture 26 Relational Database Fundamentals.

51
STAT 598W Lecture 26 Relational Database Fundamentals

Transcript of STAT 598W Lecture 26 Relational Database Fundamentals.

STAT 598W Lecture 26

Relational Database Fundamentals

Let’s Define a Few Terms

A database is a collection of persistent data, collected and stored for its usefulness in some endeavor.

A database management system (DBMS) is database and a set of procedures (usually computer programs) whose purpose is to extract utility from the database.

Example Databases My “to do” list, including

Referee three papers Prepare for algebraic statistics seminar Paint the kitchen

My fish collection Name Species Habitat

Stock market transactions 1950—2011 Stock symbol Date Closing price Shares traded Etc.

Example DBMSs

My “to do” list A 35 index card

My fish collection MS Access

Stock market transactions 1950—2011 An Oracle or MySQL system, along

with lots of other stuff…

Computer DBMSs “Flat-file” systems, like dBASE (forever) “Hierarchical” systems like IDMS (late 1960s) “Network” systems (usually called CODASYL, as in

COBOL) like IMS (1971) “Relational” systems like DB2, mySQL, and Oracle (and

yes, MS Access). Original papers published by Codd, 1974, but it took 20 years before computers were fast enough to implement.

Object-oriented DBMS, like ObjectStore Data mining software, open source and from e.g., SAS

and Oracle Hadoop

Some Elementary Definitions Data item—the smallest named

unit of data with meaning: last name, first name, ID number, etc. (a.k.a. attribute)

Record—a group of related data items: salesperson, customer, product, etc. (a.k.a. row)

File—a collection of records of a single type (a.k.a. table)

The Database Lifecycle

Requirements analysis Logical design

Entity-relationship (E-R) modeling View integration Transform to SQL tables Normalization

Physical design

Components of the ER Model Entity: person, place, thing, event, idea,

etc. The principal data objects of interest. A particular occurrence is an entity instance

Relationship: an association between one or more entities. Relationships are described in terms of Degree Connectivity Existence

Components of the ER Model

Attribute: a characteristic of entity Employee has emp_id, name, age,

position, etc. Department has dept_id, name, main

office address, manager’s name, etc. An attribute may be an identifier or a

descriptor.

Relationships Degree: the number of entities associated

in the relationship: binary, ternary, n -ary. Connectivity: a constraint on the mapping

of one entity to another; one or many are the choices

Existence: a condition on the association describing whether or not an instance of one entity must be associated with an instance of another entity; mandatory and optional are the choices

A Bigger E-R Diagram

ER Diagrams Encode “Business Rules” (Not “truth”)

department employeeis-managed-by

0,1 1

“A department is managed by exactly one employee, andan employee manages zero or one departments.”

dept_num dept_name mgr_id

10 Tresuries 330

11 Corp. bonds 15

12 Morgage-backed 221

emp_id emp_name school salary

15 Dave CMU Big

110 Dan NYU Ok

560 Xue Haas Bigger

330 Rohit CMU Ok

221 Xin CMU Really big

451 Eduardo Sloan Ok

Another Business Rule

department divisionis-subunit-of

11..N

“A department is a subunit of exactly one division, anda division consists of many departments.”

dept_num dept_name mgr_id div_num

10 Tresuries 330 1

11 Corp. bonds 15 1

12 Morgage-backed

221 1

32 Swaptions 533 2

div_num div_name mgr_id

1 Fixed income 224

2 Options 114

A Different Way to Organize Managers

employee is-manager-of1

“An employee is managed by exactly one employee, andan employee is the manager of zero or more employees.”

0..N

This is implies a recursive table: mgr_id refers to emp_id

emp_id emp_name

school salary mgr_id

15 Dave CMU Big 554

110 Dan NYU Ok 15

560 Xue Haas Bigger 321

330 Rohit CMU Ok 560

221 Xin CMU Really big 554

451 Eduardo Sloan Ok 221

More ER Diagrams

project employeeworks-on

1…4 N

“An employee works on at least one but at most four projects, anda project is worked on by many employees”

employee project

skill

skill-usedM N

P

“Employees can use many skills on any one of many projects,and each project has many employees with various skills”

More ER Diagrams

technician project

notebook

uses-notebook

1 1

1

“A technician uses exactly one notebook for each project. Eachnotebook belongs to one technician for each project.” Note thata technician may still work on many projects and maintain differentnotebooks for each project.

More ER Diagrams

project employee

location

assigned-to1 N

1

“Each employee assigned to a project works at only onelocation for that project but can be at different locations for different projects. At a particular location, an employeeworks on only one project and many employees can beassigned to a given project.”

Adding Attributes

For small ER diagrams, or for fragments of larger ones, it may be useful to include attribute names.

department employeeis-managed-by

1 1

dept_id

dept_name

dept_address

emp_id

emp_name

job_class

Fancier Ideas Composite attributes have several “sub-

attributes”. Ex: address has street_address, city, state, zip

Multivalued attributes may have a set of values. Ex. college_degree may have several values

Derived attributes can be computed from others. Ex: age can be computed from date_of_birth

Null values indicate that either the attribute value is not applicable (apartment_number), or unknown (either missing but we know it exists, like a person’s height, or not known, like a cell phone number).

Key Attributes The rows in a table need to be unique. Uniqueness is determined over some

set of attributes. Often we introduce an “artificial”

attribute to enforce uniqueness (e.g., dept_id).

A key is one or more attributes which, together, are guaranteed to be unique for rows in the table.

Weak Entities Entities that don’t have key attributes of

their own are called weak entities. “Normal” entities are called strong. Ex: Dependent (i.e., son or daughter or

spouse) is related to Employee, but may have attributes that are not unique amongst all dependents.

Uniqueness is achieved through the attachment to the employee (who has a unique key).

Weak Entities The ER diagram notation is this:

dependent employeeis-dependent-of

N 1

name

birth_date

relationship

emp_id

emp_name

job_class

A Larger Example A flight has a flight number, and is associated

with a particular airline. A flight can have one or more legs, which are

trips between two airports. A leg instance is a leg of a flight, flown on a

particular day. Each leg instance has some seats available. A particular airplane is assigned to each leg

instance. Each airplane is of a particular airplane type. A particular airplane type can land at some,

but not all, airports.

airport flight_leg

departure-airport

arrival-airport

1

1 N

N

flight

leg_no

N

1

number

airline

airplane_type

can-land

N

M

airplane

airplane_id num_seats

type_name max_seats

1

N

airport_code

type

leg_instance

date

assigned N1

instance-of

N

1

seatseat_no

reservation 1

N

More Details

What if a relationship needs attributes?

leg_instanceseat

seat_no date

reservation

cust_name cust_phone

N 1

Basic Transformation Rules Transform each entity into a table

containing its key and non-key attributes. Transform every many-to-many binary or

binary recursive relationship into a relationship table with the keys of the entities and the attributes of the relationship.

Transform every ternary or n -ary relationship into a relationship table.

The Basic Relational Database Move (Tattoo On Your Arm!)

student enrolled_in courseN M

student_id student_name course_id course_name

student course1 N

student_id student_name course_id course_name

enroll

student_id course_id

N 1

student(student_id, student_name,…)course(course_id, course_name,…)enroll(student_id*, course_id*,…)

This Works for Recursive Relationships Too

professor is-coauthor-withN

professor

coauthor

M

2

N

professor( prof_id, prof_name)coauthor( author_id*, coauthor_id*)

Transformation to Actual Tables Here come a series of examples, showing

An ER diagram fragment, and SQL to create an actual set of tables, and Some alternate notation we’ll use later

We don’t use the create table statements in MS Access (we would use the ACCESS GUI instead), but we will use them when declaring MySQL tables.

1-1, Both Entities Mandatory

report

has-abbr

abbreviation

Every report has exactly one abbreviation, andevery abbreviation represents exactly one report

1

1

create table report (report_no integer, report_name char(256), primary key(report_no));create table abbreviation (abbr char(6), report_no integer not null unique, primary key(abbr), foreign key(report_no) references report on delete cascade on update cascade);

report(report_no, report_name)abbreviation(abbr, report_no*)

If the referencedreport is deleted,delete the abbr

If the referencedreport has its primarykey updated, update thisforeign key

1-1, One Mandatory, One Not

department

managed-by

employee

Every department must have a manager, but anemployee can be a manager of at most one dept.

1

1

create table department (dept_no integer, dept_name char(20), mgr_id char(10) not null unique, primary key(dept_no); foreign key(mgr_id) references employee on delete set default on update cascade);create table employee (emp_id char(10), emp_name char(20), primary key(emp_id));

department(dept_no, dept_name, mgr_id*)employee(emp_id, emp_name)

default is nullby default

1-1, Both Entities Optional

employee

has-allocated

desktop

Some desktop computers are allocated to employees,but some aren’t; and some employees might not get one.

1

1

create table employee (emp_id char(10), desktop_no integer, primary key(emp_id));create table desktop (desktop_no integer, emp_id char(10), primary key(desktop_no), foreign key(emp_id) references employee on delete set null on update cascade);

employee(emp_id, desktop_no)desktop(desktop_no, emp_id*)

Here?

Or here?

1-N, Both Entities Mandatory

department

has

employee

Every employee works in exactly one department,and every department has at least one employee

1

N

create table department (dept_no integer, dept_name char(20), primary key(dept_no));create table employee (emp_id char(10), emp_name char(20), dept_no integer not null, primary key(emp_id), foreign key(dept_no) references department on delete set default on update cascade);department(dept_no, dept_name)

employee(emp_name, dept_no*)

1-N, One Mandatory, One Optional

department

publishes

report

Each department publishes one or more reports. Agiven report may not necessarily be published bya department.

1

N

create table department (dept_no integer, dept_name char(20), primary key(dept_no));create table report (report_no integer, dept_no integer, primary key(report_no), foreign key(dept_no) references department on delete set null on update cascade);

department(dept_no, dept_name)report(report_no, dept_no*)

N-N, Both Entities Optional

engineer

belongs-to

prof-assoc

Every professional association has none, one or manyengineer members. Likewise for engineers and associations.

N

M

create table engineer (emp_id char(10), primary key(emp_id));create table prof_assoc (assoc_name varchar(256), primary key(assoc_name));create table belongs_to (emp_id char(10), assoc_name varchar(256), primary key(emp_id, assoc_name), foreign key(emp_id) references engineer on delete cascade on update cascade foreign key(assoc_name) references prof_assoc on delete cascade on update cascade);

engineer(emp_id)prof_assoc(assoc_name)belongs_to(emp_id*, assoc_name*)

Recursive Relationships

professor is-coauthor-withN

professor

coauthor

M

2

N

create table professor (prof_id char(10), prof_name char(20), primary key(prof_id));create table coauthor (author_id (char(10), coauthor_id char(10), primary key(author_id, coauthor_id), foreign key(author_id) references professor on delete cascade on update cascade foreign key(coauthor_id) references professor on delete cascade on update cascade);

Weak Entities

dependent employeeis-dependent-of

N 1

create table employee (emp_id char(10), emp_name char(20), primary key(emp_id));create table dependent (dep_id char(10), dep_name(char(20), emp_id char(10), primary key(dep_id, emp_id), foreign key(emp_id) references employee on delete cascade on update cascade); employee(emp_id, emp_name)

dependent(dep_id, emp_id*, dep_name)

Normalization

The issues here are database Integrity Performance Maintainability

“Normalization” is a theory that pushes integrity, but it recognizes the other virtues.

Normalization & Redundancy

product_name order_no

cust_name

cust_addr credit

date sales_name

vacuum cleaner

1458 Bachmann

Austin 6 5-5-02 Bloch

computer 2730 Huang Mt. View 10 5-6-02 Hanss

refrigerator 2460 Stolarchuck

Ann Arbor 8 7-3-02 Phillips

television 519 Honeyman

Detroit 3 9-5-02 Remley

radio 1986 Antonelli Chicago 7 9-18-02 Metz

CD player 1817 Raviman Bombay 8 1-3-03 Basile

vacuum cleaner

1865 Antonelli Chicago 7 4-18-03 Bloch

vacuum cleaner

1885 Blower Detroit 8 5-13-03 Bloch

refrigerator 1943 Bachmann

Austin 6 6-19-03 Phillips

television 2315 Bachmann

Austin 6 7-15-03 Remley

A “flat file” database

Some Problems If we ask “Which customers ordered vacuum

cleaners last month?” we have to search the entire table.

If Bachmann’s address changes, there are lots of updates to be done.

If we delete the order of Huang (who buys expensive things), we lose her address and credit rating.

Can we do better? Yes! Let’s break up the table into non-redundant

chunks, without losing information. This is “normalization”.

First Normal Form: No Repeating Groups

paper_num

paper_title

auth_id auth_name editor dept_no dept_name

03-01-02 Buy Low 001 Black Woolf 15 Design

03-01-02 Buy Low 002 Scholes Woolf 15 Design

04-21-02 Sell High 002 Scholes Koenig 27 Analysis

04-21-02 Sell High 001 Black koenig 27 Analysis

paper_num

paper_title

auth1_id

auth1_name

auth2_id

auth2_name

03-01-02 Buy Low 001 Black 002 Scholes

04-21-02 Sell High 002 Scholes 001 Black

editor dept_no dept_name

Woolf 15 Design

Keonig 27 Analysis

Non-First Normal Form

Equivalent First Normal Form

This is a different,simpler example.

Primary Keys A superkey is a set of one or

more attributes that, collectively, uniquely identifies each row.

A candidate key is a minimal (w.r.t. inclusion) superkey.

A primary key is selected arbitrarily from the collection of candidate keys; it is used as an index for the table.

Primary Keys

In this example, the pair {paper_num, auth_id} is a candidate key.

Is it the only one, i.e., must it be the primary key? (What assumptions?)

paper_num

paper_title

auth_id auth_name

editor dept_no dept_name

03-01-02 Buy Low 001 Black Woolf 15 Design

03-01-02 Buy Low 002 Scholes Woolf 15 Design

04-21-02 Sell High 002 Scholes Koenig 27 Analysis

04-21-02 Sell High 001 Black Koenig 27 Analysis

Second Normal Form If attribute A uniquely determines the

value of attribute B, then B is functionally dependent on A; we write AB

A table is in second normal form (2NF) if it is in first normal form and every non-key attribute is fully dependent on the primary key (transitive dependencies allowed).

We “decompose” a table into several new tables in order to achieve 2NF.

Conversion to 2NF

paper_num

paper_title

editor dept_no dept_name

03-01-02 Buy Low woolf 15 Design

04-21-02 Sell High koenig 27 Analysisauth_id auth_nam

e

001 Black

002 Scholes

paper_num

auth_id

03-01-02 001

03-01-02 002

04-21-02 001

04-21-02 002

paper_num

paper_title

auth_id auth_name

editor dept_no dept_name

03-01-02 Buy Low 001 Black woolf 15 Design

03-01-02 Buy Low 002 Scholes woolf 15 Design

04-21-02 Sell High 002 Scholes koenig 27 Analysis

04-21-02 Sell High 001 Black koenig 27 Analysis

Third Normal Form

A table is in third normal form (3NF) if it is in second normal form and there are no transitive dependencies.

In our example, we have

So decompose further

report_no editor, dept_nodept_no dept_name

Example in 3NF

paper_num paper_title

editor dept_no

03-01-02 Buy Low woolf 15

04-21-02 Sell High koenig

27

auth_id auth_name

001 Black

002 Scholes

paper_num

auth_id

03-01-02 001

03-01-02 002

04-21-02 001

04-21-02 002dept_no dept_nam

e

15 Design

27 Analysis There is absolutely no loss ofinformation here. It’s just abit more work to answer somekinds of questions.

Why 3NF? It avoids update anomalies

No duplication of report number, editor and department number, so we can’t forget to update every instance

It avoids insert anomalies In the 1NF table, a new department can’t be

added unless it has originated a paper It avoids delete anomalies

In the 1NF table, if we delete a row, information about, e.g., department number and department name may be lost.

“Steve, this is fun. We want more normal forms!”

Boyce-Codd normal form Fourth normal form Fifth normal form Even higher! Nobody ever really uses them, it

seems Examples violating them are

contrived.

Generalization

Just like the O-O idea of inheritance.

A manager is an employee, as are engineers, technicians and secretaries.

employee

manager engineer technician secretary

d

“d” means “distinct.There is also “o” foroverlapping

Generalization

individual

employee customer

o

create table individual (indiv_id char(10), indiv_name char(20), indiv_addr char(20), primary key(indiv_id));create table employee (emp_id char(10), job_title char(15), primary key (emp_id), foreign key (emp_id) references individual on delete cascade on update cascade);create table customer (cust_no char(10), cust_credit char(12), primary key (cust_no), foreign key (cust_no) references individual on delete cascade on update cascade);

“overlapping”