Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management...

22
Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    216
  • download

    0

Transcript of Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management...

Page 1: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

Introduction

Zachary G. IvesUniversity of Pennsylvania

CIS 650 – Implementing Data Management Systems

January 10, 2005

Page 2: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

2

Welcome to CIS 650, Database and Information Systems!

Instructor: Zachary Ives, zives@cis 576 Levine Hall North Office hours: Tuesday, 2:30-3:30PM (before colloquium)

Home page: www.seas.upenn.edu/~zives/cis650/

Texts and readings: Hellerstein and Stonebraker: Readings in Database

Systems, 4th ed. (Should be available soon)

Supplementary papers (will be linked via schedule)

Page 3: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

3

Course Format and Grading

Very discussion-oriented; about one topic area per week or two Readings in the text & other research papers –

summaries/commentary on papers (20%)“Midterm report” (25%)

You’ll take one of the topics we’ve discussed and write a summary and synthesis paper

Graded for organization, clarity, grammar, etc. as well as contentProject (50%) -- may choose to work in teams:

Implementation Experimentation / validation Project report (should be in the style of a research paper) Brief (~15-minute) presentation

Participation, discussion, intangibles (5%)

At the end, you should be equipped to do research in this field, or to take ideas from databases and apply them to your field

Page 4: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

4

So What Is This Course About?

Not how to build an Oracle-driven Web site…

… nor even how to build Oracle…

Page 5: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

5

What Is Unique about Data Management?

It’s been said that databases and data management focus on scalability to huge volumes of data

What is it that makes this possible – and what makes the work interesting if NOT at huge scale?

Why are data management techniques useful in situations where scale isn’t the bottleneck?

Page 6: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

6

The Key Principle: Data Independence

Most methods of programming don’t separate the logical and physical representations of data The data structures, access methods, etc. are

all given via interfaces!

The relational data model was the first model for data that is independent of its data structures and implementation

Page 7: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

7

What Is Data Independence?

Codd points out that previous methods had: Order dependence Index dependence Access path dependence

Still true in today’s Java/C#: what is the drawback?

What might you be able to do in removing those?

Page 8: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

8

The Relational Data Model

More than just tables! True relations: sets of tuples The only data representation a user/programmer “sees” Explicit encoding of everything in values

Additional integrity constraints Key constraints, functional dependencies, …

General and universal means of encoding everything! (Semantics are pushed to queries)

A secondary concept: views Define virtual, derived relations that are always “live” A way of encapsulating, abstracting data

Page 9: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

9

Constraints and Normalization

Fundamental idea: we don’t want to build semantics into the data model, but we want to be able to encode certain constraints Functional dependencies, key constraints, foreign-key

constraints, multivalued dependencies, join dependencies, etc. Allows limited data validation, plus opportunities for

optimization

The theory of normalization (see CSE 330, CIS 550) makes use of known constraints Idea: eliminate redundancy, in order to maintain consistency in

the presence of updates (Note that there’s no reason for normalization of data in views!)

Ergo, XML???

Page 10: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

10

Relational Completeness(Plus Extensions): Declarativity

What is special about relational query languages that makes them amenable to scalability? Limited expressiveness – particularly when we consider

conjunctive queries (even with recursion) Guaranteed polytime execution in size of data Can reason about containment, invert them, etc. “Magic sets” (What about XQuery’s Turing-completeness???)

Equivalence between relational calculus and algebra Calculus fully declarative, basis of query languages Algebra imperative but polytime, basis of runtime

systems Predictability of operations cost models Ability to supplement data with auxiliary structures for

performance

Page 11: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

11

Concurrency and Reliability(Generally requires full control)

Another key element of databases – ACID properties Atomicity, Consistency, Isolation, Durability

Transaction : an atomic sequence of database actions (read/write) on data items (e.g. calendar entry)

Recoverability via a log: keeping track of all actions carried out by the database

How do distributed systems, Web services, service-oriented architectures, and the like affect these properties?

Page 12: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

12

Other Data Models

Concepts from the relational data model have been adapted to form object-oriented data models (with classes and subclasses), XML models, etc. But doesn’t this result in some loss of logical-

physical independence?

GMAP and answering queries using views?

Page 13: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

13

What Is a Data Management System?

Of course, there are traditional databases The focus of most work in the past 25 years “Tight loops” due to locally controlled data Indexing, transactions, concurrency, recovery,

optimization

But…

Page 14: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

14

80% of the World’s Data is Not in Databases!

Examples: Scientific data

(large images, complex programs that analyze the data) Personal data WWW and email

(some of it is stored in something resembling a DBMS) Network traffic logs Sensor data

Are there benefits to declarative techniques and data independence in tackling these issues? XML is a great way to make this data available Also need to deal with data we don’t control and can’t

guarantee consistency over

Page 15: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

15

An Example of Data Management with Heterogeneity: Data Integration

A layer above heterogeneous sources, to combine them under a unified logical abstraction Some of these are databases over which we have no

control Some must be accessed in special ways Data integration system translates queries over

mediated schema to the languages of the sources; converts answers to mediated schema

XML

“Mediated Schema”

Page 16: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

16

Other Interesting Points

Data streams and sensor dataHow do we process infinite amounts of data?

Peer-to-peer architecturesWhat’s the best way of finding data here?

Personal information managementCan we use integration-style concepts and a bit of AI to

manage associations between our data?

Web searchWhat’s the back-end behind Google?

Semantic WebHow do we semantically interrelate data to build a better

Web?

Page 17: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

17

Layers of a Typical Data Management System

API/GUI

Optimizer

Physical retrieval

Exec. Engine

Source

Catalog

Query

Physical plan

Pages

RequestsData

Pages

Stats

Schemas

(Simplification!)

Buffer Mgr

Access MethodsData/etc Requests

RequestsData/etc

Logging, recovery

Red = logicalBlue = physical

Page 18: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

18

Query Answering in a Data Management System

Based on declarative query languages Based on restricted first-order logic expressions over relations Not procedural – defines constraints on the output

Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine Data may be local or remote Data may be heterogeneous or homogeneous Data sources may have different interfaces, access methods,

etc. Most common query languages:

SQL (based on tuple relational calculus) Datalog (based on domain relational calculus, plus fixpoint) XQuery (functional language; has an XML calculus core)

Page 19: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

19

Processing the Query

SELECT * FROM STUDENT, Takes, COURSE

WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid

STUDENT

Takes COURSE

Merge

Hash

by cid by cidOptimizer

ExecutionEngine

StorageSubsystem

Web Server / UI / etc

Page 20: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

20

DBMSs in the Real World

Big, mature relational databases IBM, Oracle, Microsoft

“Middleware” above these SAP, PeopleSoft, dozens of special-purpose apps

“Application servers” Integration and warehousing systems Current trends:

Web services; XML everywhere Smarter, self-tuning systems Stream systems

Page 21: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

21

Our Agenda this Semester

Reading the canonical papers in the data management literature Some are very systems-y Some are very experimental Some are highly algorithmic, complexity-

oriented

Gaining an understanding of the principles of building systems to handle declarative queries over large volumes of data

Page 22: Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005.

22

For Next Time

Skim Codd if you haven’t already Read the overview papers of the two first

database systems: Astrahan et al., pp. 117- Wong et al. (skip Section 2; focus on pp. 200-)

Write a summary of your assigned paper and email it to me at zives@cis Key question: how well did this system mesh

with Codd’s relational model? (You may need to skim through other aspects of your assigned paper to help answer that question)