
Page 1: CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: 845-4259 Email: chen@cse.tamu.edu Notes #15.

CPSC-608 Database Systems

Fall 2011

Instructor: Jianer Chen

Office: HRBB 315C

Phone: 845-4259

Email: chen@cse.tamu.edu

Notes #15

Page 2:

Brief Overview on

• Data/information integration (data warehouse)

• Data mining

Page 3:

Data Warehouse (Overview)

A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. [Wikipedia]

Page 4:

What is a Warehouse?

• Collection of (possibly diverse) data

– subject oriented

– aimed at executives, decision makers, analysts

– often a copy of operational data

– with value-added data (e.g., summaries, history)

– integrated schema

– time-varying

– non-volatile

Page 5:

What is a Warehouse?

• Collection of tools/services

– gathering data

– cleansing, integrating, ...

– querying, reporting, aggregation, analysis

– data mining

– monitoring, administration

Page 6:

Why a Warehouse?

• Ship and integrate data from different sources to the analyst

• Three Approaches:

– Database federations (legacy)

– Query-driven (lazy)

– Warehouse (eager)

Page 7:

Database Federations

• An application program for each connection

• Simple, good if DB communications are limited

• Requires writing many application programs

Page 8:

Warehouse Architecture

[Figure: warehouse architecture. Labels: Client, Metadata; annotation: "SQL & data stored in unified DB schema"]

Each source has a wrapper/extractor that consists of a collection of predefined queries on the source, and communication mechanisms

Page 9:

Query-Driven Approach

[Figure: query-driven architecture. Arrows labeled "query" and "result"; annotation: "SQL, but no data stored"]

Each source has a wrapper, which classifies queries into templates, and translates them into queries for the source. The wrapper can be generated from templates using modern compiler techniques.
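The wrapper idea can be illustrated with a small template matcher. This is only a sketch under stated assumptions: the `cars` and `autos` relations, the column names, the regular-expression templates, and the `translate` function are all hypothetical and do not come from the notes.

```python
import re

# Hypothetical templates: each pairs a pattern over incoming queries with a
# query string in the source's own vocabulary. The "{0}" slot is filled from
# the captured regex group. All relation and column names are assumptions.
TEMPLATES = [
    (re.compile(r"^SELECT \* FROM cars WHERE color = '(\w+)'$", re.IGNORECASE),
     "SELECT serialNo, model FROM autos WHERE paint = '{0}'"),
    (re.compile(r"^SELECT \* FROM cars WHERE age < (\d+)$", re.IGNORECASE),
     "SELECT serialNo, model FROM autos WHERE years < {0}"),
]


def translate(query):
    """Classify a query by template and return the source query, or None."""
    for pattern, source_query in TEMPLATES:
        match = pattern.match(query.strip())
        if match:
            return source_query.format(*match.groups())
    return None  # no template covers this query


if __name__ == "__main__":
    print(translate("SELECT * FROM cars WHERE color = 'red'"))
    print(translate("SELECT * FROM cars WHERE price > 10"))
```

A query that matches no template is rejected rather than guessed at, which is one reason a template-based wrapper only supports the query shapes it was built for.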

Page 10:

Advantages of Query-Driven

• No need to copy data

– less storage

– no need to purchase data

• More up-to-date data

• Query needs can be unknown

• Only query interface needed at sources

• May be less draining on sources

Page 11:

Advantages of Warehousing

• High query performance

• Queries not visible outside warehouse

• Local processing at sources unaffected

• Can operate when sources unavailable

• Can query data not stored in a DBMS

• Extra information at warehouse

– Modify, summarize (store aggregates)

– Add historical information

Page 12:

OLTP vs. OLAP

• OLTP: On Line Transaction Processing

– Describes processing at operational sites (sources)

• OLAP: On Line Analytical Processing

– Describes processing at warehouse

Page 13:

OLTP vs. OLAP

OLTP

• Mostly updates

• Many small transactions

• Megabytes to terabytes of data

• Raw data

• Up-to-date data

• Consistency, recoverability critical

• Clerical users

OLAP

• Mostly reads

• Queries long, typically complex aggregations

• Gigabytes to terabytes of data

• Summarized, consolidated data

• Decision-makers, analysts as users
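The contrast between the two workloads can be made concrete with a small sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the helper names `record_sale` and `total_by_city` are assumptions chosen for illustration; a real warehouse would be a separate system rather than the same in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")


def record_sale(conn, city, amount):
    """OLTP style: one small update transaction touching a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO sales (city, amount) VALUES (?, ?)", (city, amount))


def total_by_city(conn):
    """OLAP style: a read-only query that scans and aggregates many rows."""
    rows = conn.execute(
        "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
    ).fetchall()
    return dict(rows)


# Many small transactions, as an operational site would issue them.
for city, amount in [("sf", 10.0), ("la", 20.0), ("sf", 5.0), ("la", 7.5)]:
    record_sale(conn, city, amount)

# One summarizing query, as an analyst at the warehouse would issue it.
totals = total_by_city(conn)
print(totals)
```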

Page 14:

Implementing a Warehouse

• Monitoring: Sending data from sources

• Integrating: Data loading, cleaning,...

• Processing: Query processing, indexing, ...

• Managing: Metadata, Design, ...


Page 15:

Monitoring Issues

• Frequency

– periodic: daily, weekly, …

– triggered: on “big” change, lots of changes, ...

• Data transformation/normalization

– convert data to uniform format

– remove & add fields (e.g., add date to get history)

• Standards

• Gateways (Intranet/internet, firewalls, VPN, etc.)
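The data transformation step can be sketched as follows. The two source formats (`store_a`, `store_b`), their field names, and the `normalize` function are hypothetical; the point is only to show conversion to a uniform format, dropping a source-specific field, and adding a date so the warehouse can keep history.

```python
from datetime import date


def normalize(record, source, load_date):
    """Map a source-specific record to a uniform warehouse row."""
    if source == "store_a":
        # Hypothetical source A: customer key "cust", amount in cents.
        row = {"customer": record["cust"], "amount": record["cents"] / 100}
    elif source == "store_b":
        # Hypothetical source B: customer key "customer_id", amount as text.
        row = {"customer": record["customer_id"], "amount": float(record["total"])}
    else:
        raise ValueError(f"unknown source: {source}")
    # Added field: the load date lets the warehouse keep history.
    row["load_date"] = load_date.isoformat()
    return row


rows = [
    # "internal_flag" is a source-only field and is dropped by normalize().
    normalize({"cust": "c1", "cents": 1250, "internal_flag": 1}, "store_a", date(2011, 12, 1)),
    normalize({"customer_id": "c2", "total": "7.50"}, "store_b", date(2011, 12, 1)),
]
print(rows)
```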

Page 16:

Integration

• Data Cleaning

• Data Loading

• Derived Data


Page 17:

Processing

• Index Structures

• What to Materialize?

• Algorithms

[Figure: warehouse architecture with components Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

Page 18:

Managing

• Metadata

• Warehouse Design

• Tools

[Figure: warehouse architecture with components Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

Page 19:

Warehouse Design

• What data is needed?

• Where does it come from?

• How to clean data?

• How to represent in warehouse (schema)?

• What to summarize?

• What to materialize?

• What to index?


Page 20:

Conclusions

• Massive amounts of data and complexity of queries will push limits of current warehouses

• Need better systems:

– easier to use

– provide quality information

– scalability

Page 21:

Data Mining (Overview)

What is data mining?

A process of examining data and finding simple rules or models that summarize the data.

Mining Techniques:

• Decision Trees

• Clustering

• Association Rules


Page 22:

Decision Trees

Example:

• Conducted survey to see what customers were interested in new model car

• Want to select customers for advertising campaign

sale (training set)

custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no

Page 23:

One Possibility

sale

custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no

[Figure: decision tree. Root test: car=taurus; second-level tests: city=sf, age<45; branches labeled Y/N; leaves: likely, unlikely]

Page 24:

Another Possibility

sale

custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no

[Figure: decision tree. Root test: age<30; second-level tests: city=sf, car=van; branches labeled Y/N; leaves: likely, unlikely]
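The two candidate trees ("One Possibility" and "Another Possibility") can be written as plain functions and checked against the training set. The wiring of the branches is an assumption, since the figures did not survive extraction: the Y branch of each root is taken to lead to the city=sf test, the N branch to the other test, and each second-level Y leaf to "likely". The function names are hypothetical.

```python
# Training set from the slides: (custId, car, age, city, newCar).
TRAINING_SET = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van", 35, "la", "yes"),
    ("c3", "van", 40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc", 50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]


def tree_one(car, age, city):
    """Root car=taurus; assumed wiring: Y -> city=sf, N -> age<45."""
    if car == "taurus":
        return "likely" if city == "sf" else "unlikely"
    return "likely" if age < 45 else "unlikely"


def tree_two(car, age, city):
    """Root age<30; assumed wiring: Y -> city=sf, N -> car=van."""
    if age < 30:
        return "likely" if city == "sf" else "unlikely"
    return "likely" if car == "van" else "unlikely"


def accuracy(tree):
    """Fraction of training rows where 'likely' agrees with newCar == 'yes'."""
    hits = 0
    for _cust, car, age, city, new_car in TRAINING_SET:
        predicted_yes = tree(car, age, city) == "likely"
        if predicted_yes == (new_car == "yes"):
            hits += 1
    return hits / len(TRAINING_SET)


if __name__ == "__main__":
    print(accuracy(tree_one), accuracy(tree_two))
```

Under these assumptions both trees classify all six training rows correctly, so the training set alone cannot tell them apart.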

Page 25:

Issues

• Decision tree should not be “too deep”

– lower decisions would not have statistically significant amounts of data

• Need to select the tree that most reliably predicts outcomes

– automatic decision tree construction from labeled training data (“supervised learning”)

– exploit training data statistics to detect the most “discriminative” attribute/value conditions at each level
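One common way to score how "discriminative" a condition is uses information gain, the reduction in label entropy after splitting on the condition. The slides do not name a specific measure, so this choice is an assumption, as are the function names. The sketch scores the five conditions that appear in the two example trees against the training set.

```python
from math import log2

ROWS = [
    {"car": "taurus", "age": 27, "city": "sf", "newCar": "yes"},
    {"car": "van", "age": 35, "city": "la", "newCar": "yes"},
    {"car": "van", "age": 40, "city": "sf", "newCar": "yes"},
    {"car": "taurus", "age": 22, "city": "sf", "newCar": "yes"},
    {"car": "merc", "age": 50, "city": "la", "newCar": "no"},
    {"car": "taurus", "age": 25, "city": "la", "newCar": "no"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * log2(p)
    return result


def information_gain(rows, condition):
    """Entropy of newCar minus the weighted entropy after a Y/N split."""
    labels = [r["newCar"] for r in rows]
    yes_side = [r["newCar"] for r in rows if condition(r)]
    no_side = [r["newCar"] for r in rows if not condition(r)]
    remainder = sum(
        len(side) / len(rows) * entropy(side)
        for side in (yes_side, no_side)
        if side  # skip an empty side to avoid dividing by zero
    )
    return entropy(labels) - remainder


CONDITIONS = {
    "car=taurus": lambda r: r["car"] == "taurus",
    "car=van": lambda r: r["car"] == "van",
    "city=sf": lambda r: r["city"] == "sf",
    "age<30": lambda r: r["age"] < 30,
    "age<45": lambda r: r["age"] < 45,
}

gains = {name: information_gain(ROWS, cond) for name, cond in CONDITIONS.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

On this tiny training set the highest-gain condition is city=sf, while car=taurus and age<30 (the roots of the two example trees) have zero gain, which illustrates why the choice of root matters.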

Page 26:

Clustering

[Figure: clusters of data points plotted on three axes: age, income, education]

Page 27:

Another Example: Text

• Each document is a vector

• Clusters contain “similar” documents

• Useful for understanding, searching documents

[Figure: document clusters labeled international news, sports, business]
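A minimal sketch of "each document is a vector", assuming a bag-of-words representation and cosine similarity as the notion of "similar". Neither choice is stated in the slides; the sample documents and function names are made up for illustration.

```python
from collections import Counter
from math import sqrt


def to_vector(text):
    """Bag-of-words vector: maps each lowercase word to its count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


DOCS = {
    "d1": "team wins final match",
    "d2": "team loses final match",
    "d3": "stocks fall as markets close",
}
VECTORS = {name: to_vector(text) for name, text in DOCS.items()}

sim_sports = cosine_similarity(VECTORS["d1"], VECTORS["d2"])
sim_mixed = cosine_similarity(VECTORS["d1"], VECTORS["d3"])
print(sim_sports, sim_mixed)
```

The two sports-like documents share most of their words and score high, while the business-like document shares none with them, so a clustering method built on this similarity would tend to separate them.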

Page 28:

Issues

• Given desired number of clusters?

• Finding “best” clusters

• Are clusters semantically meaningful?

• Using clusters for disk storage
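For the case where the desired number of clusters is given, one widely used method is k-means. The slides do not name an algorithm, so this is an assumed example: a plain k-means on two-dimensional points with hypothetical (age, income in thousands) values and hand-picked initial centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; k is the number of initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep a centroid with no points
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical (age, income in thousands) points.
POINTS = [(25, 30), (27, 32), (23, 28), (50, 80), (52, 85), (48, 75)]
centroids, clusters = kmeans(POINTS, [(25, 30), (50, 80)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids and on k, which is exactly the "finding best clusters" issue: k-means finds a local optimum, and whether the clusters are meaningful still has to be judged separately.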


Page 29:

Association Rule Mining

sales records (market-basket data):

transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9

• Trend 1) Products p5, p8 often bought together

• Trend 2) Customer 12 likes product p9

Page 30:

Association Rule

• Rule: {p5, p8}, {cust12, p9}, …

• Support: number of “baskets” where these products appear

• High-support set: a set whose support is at least a threshold s

• Problem: find all high support sets
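Support can be computed directly on the sales records from the "Association Rule Mining" slide. In this sketch the customer id is placed in the basket as an item so that a set such as {cust12, p9} can be counted; that modeling choice and the function names are assumptions.

```python
# Baskets from the sales records; the customer id is included as an item
# (an assumption) so that sets like {cust12, p9} can be counted.
BASKETS = {
    "tran1": {"cust33", "p2", "p5", "p8"},
    "tran2": {"cust45", "p5", "p8", "p11"},
    "tran3": {"cust12", "p1", "p9"},
    "tran4": {"cust40", "p5", "p8", "p11"},
    "tran5": {"cust12", "p2", "p9"},
    "tran6": {"cust12", "p9"},
}


def support(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for items in baskets.values() if wanted <= items)


def is_high_support(itemset, baskets, s):
    """True if the itemset appears in at least s baskets."""
    return support(itemset, baskets) >= s


if __name__ == "__main__":
    print(support({"p5", "p8"}, BASKETS))
    print(support({"cust12", "p9"}, BASKETS))
```

Counting every possible set this way is exponential in the number of items, which is the efficiency problem the next slide addresses.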

Page 31:

Association Rules

• How do we perform rule mining efficiently?

• Observation:

– If set X has support t, then each subset of X must have support at least t

• For 2-sets:

– if we need support s for {i, j}

– then each of i, j must appear in at least s baskets

• A-Priori Algorithm
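The observation for 2-sets leads to a two-pass procedure: first count single items, then count only pairs whose members are both frequent. The sketch below covers just these two passes of the A-Priori idea, restricted to pairs; the function name and the use of the products-only baskets are assumptions.

```python
from collections import Counter
from itertools import combinations

# Products bought in tran1..tran6 from the sales records.
BASKETS = [
    {"p2", "p5", "p8"},
    {"p5", "p8", "p11"},
    {"p1", "p9"},
    {"p5", "p8", "p11"},
    {"p2", "p9"},
    {"p9"},
]


def frequent_pairs(baskets, s):
    """Return {pair: support} for all pairs with support at least s."""
    # Pass 1: count single items and keep those in at least s baskets.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count only candidate pairs made of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent_items)  # sorted for a canonical pair order
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= s}


if __name__ == "__main__":
    print(frequent_pairs(BASKETS, 3))
    print(frequent_pairs(BASKETS, 2))
```

With s = 3 only p5, p8, and p9 survive the first pass, so the second pass counts a single candidate pair instead of every pair of products.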

Page 32:

CSCE-608 Course Summary

• Overview of DB and DBMS systems;

• The memory architecture;

• Indexing and hashing;

• Query processing;

• Crash recovery;

• Concurrency control;

• Transaction processing;

• Data integrity and data mining;


Page 34:

Indexing and Hashing

• B+ trees

– structure

– operations: search, insert, delete

• Hashing

– hash table and hash function

– operations: search, insert, delete

– extensible hashing

– linear hashing

Page 35:

Query Processing

• Query compiler, parse tree

• Logical query plan, physical query plan

• Disk I/O efficient algorithms

• Cost estimation of query plans

Page 36:

Crash Recovery

• Undo logging

• Redo logging

• Undo/redo logging

• Recovery algorithms

• Checkpoints

Page 37:

Concurrency Control

• Serialization

• Locking systems

• Timestamp

• Validation

Page 38:

Transaction processing

• Recoverability

• Handling deadlocks