CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662,...

39
CSE544 Introduction Monday, March 27, 2006

Transcript of CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662,...

Page 1: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

CSE544Introduction

Monday, March 27, 2006

Page 2: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Staff• Instructor: Dan Suciu

– CSE 662, [email protected]

– Office hours: Wednesdays, 12pm-1pm

• TA: Bhushan Mandhani– Office hours: TBA

• Mailing list: [email protected]– http://mailman.cs.washington.edu/mailman/private/cse544

• Web page: (a lot of stuff already there) http://www.cs.washington.edu/544

Page 3: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Course Times

• Mon, Wed, 10:30-12

• Final: – 8:30-10:20 a.m. Monday, Jun. 5, 2006– In this room

Page 4: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Goals of the Course

• Using database systems

• Foundations of data management.

• Issues in building database systems.

• Current research topics in databases.

Page 5: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Format

Basic structure:

• Lectures on Wednesdays

• Paper discussions on Mondays (reviews !)

Page 6: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Content

• Data modeling basics (3weeks):– SQL/XQuery with homeworks on Postgres/Galax– Logical foundations of databases

• Transactions (2weeks):– concurrency control (locks, timestamps)– recovery (undo, redo, undo/redo)

• Topics in query execution/optimization (3weeks)• Database security (1 week)• Databases + IR, Probabilistic databases (1 week)

Page 7: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Textbooks

Won’t follow any book, but you may want to consult them if you need more details to understand a topic

• Database Management Systems, Ramakrishnan• The Complete Book, GarciaMolina, Ullman, Widom• Xquery from the Experts, Howard Katz, Ed.• Data on the Web, Abiteboul, Buneman, Suciu• Theory of database systems, Abiteboul, Hull, Vianu

Page 8: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Grading

• Homework: 20%

• Paper reviews: 20%

• Participation in the discussions: 10%

• Project: 30%

• Final: 20%

Page 9: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Homework: 20%

HW1:

minor programming in SQL and Xquery

HW2:

problem sets, no programming

theory, optimizations, query execution, transactions

Page 10: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Project: 30%

• Choose from a list of mini-research topics, or come up with your own

• Open ended

• Write short research paper (2-3 pages)

• Conference-style presentation

Page 11: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Project: 30%

• Goals: apply database principles to a new problem– Understand and model the problem

– Research and understand related work (1-2 papers)

– Propose some new approach (creativity will be evaluated)

– Implement some part

• NOT intended to be a major software development

• Amount of work may vary widely between groups

Page 12: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Project: 30%

Milestones:• Groups of 1-3 assembled by 4/5

• Proposals due by 4/10

• Short research papers (2-3pages) due by 5/30

• Presentations on 5/31 in class (MAY TAKE LONGER THAN 12pm)

Page 13: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Paper Reviews: 20%

• There will be reading assignments

• Papers are discussed Mondays

• You have to write the reviews by Sunday night

Page 14: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Final: 20%

• June 5, 8:30-10:30, same room

• Challenging and fun

Page 15: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Database

What is a database ?

Give examples of databases

Page 16: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Database

What is a database ?

• A collection of files storing related data

Give examples of databases

• Accounts database; payroll database; UW’s students database; Amazon’s products database; airline reservation database

Page 17: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Database Management System

What is a DBMS ?

Give examples of DBMS

Page 18: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Database Management System

What is a DBMS ?

• A big C program written by someone else that allows us to manage efficiently a large database and allows it to persist over long periods of time

Give examples of DBMS

• DB2 (IBM), SQL Server (MS), Oracle, Sybase

• MySQL, Postgres, …

Page 19: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Market Shares

From 2004 www.computerworld.com

• IMB: 35% market with $2.5BN in sales

• Oracle: 33% market with $2.3BN in sales

• Microsoft: 19% market with $1.3BN in sales

Page 20: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

An Example

The Internet Movie Databasehttp://www.imdb.com

• Entities: Actors (800k), Movies (400k), Directors, …

• Relationships:who played where, who directed what, …

Want to store and process locally; what functions do we need ?

Page 21: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Functionality

1. Create/store large datasets

2. Search/query/update

3. Change the structure

4. Concurrent access to many user

5. Recover from crashes

6. Security (not here, but in other apps)

Page 22: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Possible Organizations

• Files

• Spreadsheets

• DBMS

Page 23: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

1. Create/store Large Datasets

• Files

• Spreadsheets

• DBMS

Yes, but…

Not really…

Yes

Page 24: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

2. Search/Query/Update

• Simple query:– In what year was ‘Rain man’ produced ?

• Multi-table query:– Find all movies by ‘Coppola’

• Complex query:– For each actor, count her/his movies

• Updating– Insert a new movie; add an actor to a movie; etc

Page 25: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

2. Search/Query/Update

• Files

• Spreadsheets

• DBMS

Simple queries

Multi-table queries(maybe)

All

Updates: generally OK

Page 26: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

3. Change the Structure

Add Address to each Actor

• Files

• Spreadsheets

• DBMS

Very hard

Yes

Yes

Page 27: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

4. Concurrent Access

Multiple users access/update the data concurrently

• What can go wrong ?

• How do we protect against that in OS ?

• This is insufficient in databases; why ?

Page 28: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

4. Concurrent Access

Multiple users access/update the data concurrently

• What can go wrong ?– Lost update; resulting in inconsistent data

• How do we protect against that in OS ?– Locks

Page 29: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

4. Concurrent Access

X = Read(Accounts, A);X.amount = X.amount - 100;Write(Accounts, A, X);

Y = Read(Accounts, B);Y.amount = Y.amount + 100;Write(Accounts, B, Y);

X = Read(Accounts, A);X.amount = X.amount - 100;Write(Accounts, A, X);

Y = Read(Accounts, B);Y.amount = Y.amount + 100;Write(Accounts, B, Y);

Transfer $100 fromaccount A to B:

Find total amountin A and B:

X = Read(Accounts, A);Y = Read(Accounts, B);S = X.amount + Y.amountreturn S

X = Read(Accounts, A);Y = Read(Accounts, B);S = X.amount + Y.amountreturn S

What can go wrong ? Do locks help ?

Page 30: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

5. Recover from crashes

X = Read(Accounts, A);X.amount = X.amount - 100;Write(Accounts, A, X);

Y = Read(Accounts, B);Y.amount = Y.amount + 100;Write(Accounts, B, Y);

X = Read(Accounts, A);X.amount = X.amount - 100;Write(Accounts, A, X);

Y = Read(Accounts, B);Y.amount = Y.amount + 100;Write(Accounts, B, Y);

CRASH !

What is the problem ?

Page 31: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Enters a DMBS

Data files

Database server(someone else’s

C program) Applications

connection

(ODBC, JDBC)

“Two tier system” or “client-server”

Page 32: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

DBMS = Collection of Tables

Still implemented as files,but behind the scenes can be quite complex

Directors: Movie_Directors:

Movies:

“data independence”

id fName lName

15901 Francis Ford Coppola

. . .

mid Title Year

130128 The Godfather 1972

. . .

id mid

15901 130128

. . .

Page 33: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

1. Create/store Large Datasets

Use SQL to create and populate tables:

CREATE TABLE Actors ( Name CHAR(30) DateOfBirth CHAR(20)) . . .

CREATE TABLE Actors ( Name CHAR(30) DateOfBirth CHAR(20)) . . .

INSERT INTO ActorsVALUES(‘Tom Hanks’, . . .)INSERT INTO ActorsVALUES(‘Tom Hanks’, . . .)

Size and physical organization is handled by DBMS

We focus on modeling the database

Will study data modeling in this course

Page 34: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

2. Searching/Querying/Updating

• Find all movies by ‘Coppola’

• What happens behind the scene ?

SELECT titleFROM Movies, Directors, Movie_DirectorsWHERE Directors.lname = ‘Coppola’ and Movies.mid = Movie_Directors.mid and Movie_Directors.id = Directors.id

SELECT titleFROM Movies, Directors, Movie_DirectorsWHERE Directors.lname = ‘Coppola’ and Movies.mid = Movie_Directors.mid and Movie_Directors.id = Directors.id

We will study SQL in gory details in this course

We will discuss the query optimizer in class.

Page 35: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

3. Changing the Structure

Add Address to each Actor

ALTER TABLE Actor ADD address CHAR(50) DEFAULT ‘unknown’

ALTER TABLE Actor ADD address CHAR(50) DEFAULT ‘unknown’

Lots of cleverness goes on behind the scenes

Page 36: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

3&4 Concurrency&Recovery:Transactions

• A transaction = sequence of statements that either all succeed, or all fail

• E.g. Transfer $100 BEGIN TRANSACTION;

UPDATE AccountsSET amount = amount - 100WHERE number = 4662

UPDATE AccountsSET amount = amount + 100WHERE number = 7199

COMMIT

BEGIN TRANSACTION;

UPDATE AccountsSET amount = amount - 100WHERE number = 4662

UPDATE AccountsSET amount = amount + 100WHERE number = 7199

COMMIT

Page 37: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

Transactions

• Transactions have the ACID properties:A = atomicity

C = consistency

I = isolation

D = durability

Page 38: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

4. Concurrent Access

• Serializable execution of transactions– The I (=isolation) in ACID

We study three techniques in this course

Locks

Timestamps

Validation

Page 39: CSE544 Introduction Monday, March 27, 2006. Staff Instructor: Dan Suciu –CSE 662, suciu@cs.washington.edu –Office hours: Wednesdays, 12pm-1pm TA: Bhushan.

5. Recovery from crashes

• Every transaction either executes completely, or doesn’t execute at all– The A (=atomicity) in ACID

We study three types of log files in this course

Undo log file

Redo log file

Undo/Redo log file