Week2

Data Warehouse: Understanding Data Warehouse
Amalia Anjani A. [email protected]


Page 1: Week2

Data Warehouse: Understanding Data Warehouse

Amalia Anjani A. [email protected]

Page 2: Week2

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.
  • A decision support database that is maintained separately from the organization’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A single, complete, and consistent source of data obtained from a variety of sources and made available to end users in a way that they can understand and use in business context” – Barry Devlin
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon
• Data warehousing:
  • The process of constructing and using data warehouses

Page 3: Week2

DW – Subject Oriented

[Diagram] Operational systems (sales system, payroll system, purchasing system) vs. data warehouse system (customer data, employee data, vendor data)

Page 4: Week2

DW – Subject Oriented

• Oriented to the major subject areas of the organization defined in the data model.
  • Insurance company: customer, product, claim, account, etc.
• Operational database organized differently
  • Based on type of insurance: auto, life, medical, etc.
• Gives information about a particular subject rather than the details of the company's ongoing operations
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
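
As a minimal sketch of the insurance example, the Python below regroups records held by type of insurance (the operational view) around the customer subject (the warehouse view). The record layout, the customer IDs, and the function name `build_customer_subject` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical records from three application-oriented operational systems,
# one per type of insurance. All names and values are illustrative.
auto_policies = [{"customer": "C1", "premium": 500}, {"customer": "C2", "premium": 300}]
life_policies = [{"customer": "C1", "premium": 200}]
medical_policies = [{"customer": "C2", "premium": 150}, {"customer": "C3", "premium": 400}]


def build_customer_subject(sources):
    """Regroup application-oriented records around the customer subject."""
    by_customer = defaultdict(list)
    for line_of_business, policies in sources.items():
        for policy in policies:
            by_customer[policy["customer"]].append(
                {"line": line_of_business, "premium": policy["premium"]}
            )
    return dict(by_customer)


customer_subject = build_customer_subject(
    {"auto": auto_policies, "life": life_policies, "medical": medical_policies}
)

# A decision-support question that spans all three operational systems.
total_premium_c1 = sum(p["premium"] for p in customer_subject["C1"])
```

With the data regrouped, a question about one customer no longer requires visiting each insurance system separately.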

Page 5: Week2

DW – Integrated

[Diagram] Operational systems (marketing system, order system, billing system) vs. data warehouse system (customer data)

Page 6: Week2

DW – Integrated

• Constructed by integrating multiple, heterogeneous data sources
  • relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, breakfast covered, etc.
  • When data is moved to the warehouse, it is converted.
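
The hotel price example can be sketched as a conversion step applied while loading. This is only an illustration: the two source layouts, the exchange rates, the flat tax rate, and every field name below are assumptions, not part of any real system.

```python
from decimal import Decimal

# Assumed fixed exchange rates and tax rate, for illustration only.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.25")}
TAX_RATE = Decimal("0.10")


def convert_source_a(row):
    """Source A (hypothetical): tax-inclusive price, breakfast coded 'Y'/'N'."""
    price = Decimal(row["price"]) * RATES_TO_USD[row["currency"]]
    return {
        "hotel": row["hotel"],
        "price_usd_incl_tax": price,
        "breakfast": row["breakfast"] == "Y",
    }


def convert_source_b(row):
    """Source B (hypothetical): tax-exclusive rate, breakfast coded 1/0."""
    price = Decimal(row["rate"]) * RATES_TO_USD[row["curr"]] * (1 + TAX_RATE)
    return {
        "hotel": row["hotel_name"],
        "price_usd_incl_tax": price,
        "breakfast": bool(row["bfast"]),
    }


# Both sources end up with one naming convention, one currency,
# one tax treatment, and one encoding for the breakfast flag.
warehouse_rows = [
    convert_source_a({"hotel": "H1", "price": "100", "currency": "EUR", "breakfast": "Y"}),
    convert_source_b({"hotel_name": "H2", "rate": "80", "curr": "USD", "bfast": 0}),
]
```

`Decimal` is used instead of floats so that the converted prices stay exact.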

Page 7: Week2

DW – Time Variant

[Diagram] Operational systems (order system): 60-90 days of data vs. data warehouse system (customer data): 5-10 years of data

Page 8: Week2

DW – Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
  • Operational database: current value data.
  • Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  • Contains an element of time, explicitly or implicitly
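
One way to picture a key that contains an element of time is the sketch below. The operational store is keyed by customer only, so it holds the current value; the warehouse store is keyed by customer and snapshot date, so it keeps history. The dictionaries and function names are hypothetical.

```python
from datetime import date

operational_balance = {}  # customer_id -> current value only
warehouse_balance = {}    # (customer_id, snapshot_date) -> value at that date


def record_balance(customer_id, snapshot_date, balance):
    operational_balance[customer_id] = balance                  # overwrites the old value
    warehouse_balance[(customer_id, snapshot_date)] = balance   # keeps every snapshot


def balance_as_of(customer_id, as_of):
    """Return the latest warehoused value on or before as_of, or None."""
    candidates = [
        (snap_date, value)
        for (cust, snap_date), value in warehouse_balance.items()
        if cust == customer_id and snap_date <= as_of
    ]
    return max(candidates)[1] if candidates else None


record_balance("C1", date(2020, 1, 31), 100)
record_balance("C1", date(2021, 1, 31), 250)

current_value = operational_balance["C1"]
historical_value = balance_as_of("C1", date(2020, 6, 30))
```

Only the warehouse structure can answer "what was the value in mid-2020", because the operational store has already overwritten it.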

Page 9: Week2

DW – Non Volatile

[Diagram] Operational systems (order system): create, insert, delete, update vs. data warehouse system (customer data): load, access

Page 10: Week2

DW – Non Volatile

• A physically separate store of data transformed from the operational environment.
• Operational update of data does not necessarily occur in the data warehouse environment.
• Often requires only two operations in data accessing:
  • initial loading of data and access of data.
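
The two operations can be sketched as a store that exposes only `load` and `query`. The class `NonVolatileStore` is a hypothetical illustration of the idea, not how any particular warehouse product is built.

```python
class NonVolatileStore:
    """Toy store offering only the two operations named above: load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends copies; existing rows are never modified.
        self._rows.extend(dict(row) for row in rows)

    def query(self, predicate):
        # Access returns copies, so callers cannot change stored data in place.
        return [dict(row) for row in self._rows if predicate(row)]


store = NonVolatileStore()
store.load([{"order_id": 1, "amount": 50}, {"order_id": 2, "amount": 75}])

large_orders = store.query(lambda row: row["amount"] > 60)
large_orders[0]["amount"] = 0  # changes only the returned copy

stored_amount = store.query(lambda row: row["order_id"] == 2)[0]["amount"]
```

There is deliberately no update or delete method, which mirrors the contrast with the operational system in the diagram.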

Page 11: Week2

Why Use a Data Warehouse?

• “We collect tons of data, but we can’t access it.”
• “We need to slice and dice the data every which way.”
• “Business people need to get at data easily.”
• “Just show me what is important.”
• “We spend entire meetings arguing about who has the right number rather than making decisions.”
• “We want people to use information to support more fact-based decision making.”

Page 12: Week2

The Data Related Problem

• Data in organizations often has the following characteristics:
  • Massive volume
  • Dispersed
  • Difficult to access
  • Badly integrated
  • Complex data structures
  • Not suitable for high level business queries

Page 13: Week2

The Information Needs Behind the Data Warehouse

• Organizations need information that is:
  • More holistic in its coverage of the business
  • Selected and enriched
  • Easily accessible
  • More easily understandable
  • High quality
  • Directly applicable to the decision situation

Page 14: Week2

Data Sources

• Production data
  Data from transactional processes
• Internal data
  Spreadsheets, documents, customer profiles, transactional databases
• Archived data
• External data
  Data from external systems

Page 15: Week2

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration:
  • Build wrappers/mediators on top of heterogeneous databases
  • Query-driven approach
    • A query posed to a client site is translated into queries appropriate for individual heterogeneous sites; the results are integrated into a global answer set
    • Involves complex information filtering
    • Competition for resources at local sources
• Data warehouse: update-driven, high performance
  • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
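
The difference between the two approaches can be sketched with two hypothetical sources and a counter that records how often each source is contacted. The source layouts, wrapper functions, and counter are all assumptions made for illustration.

```python
# Two hypothetical heterogeneous sources with different record layouts.
SOURCE_A = [{"cust": "C1", "total": 10}, {"cust": "C2", "total": 20}]
SOURCE_B = [("C1", 5), ("C3", 7)]

source_hits = {"A": 0, "B": 0}  # how often each local source is contacted


def wrap_a():
    source_hits["A"] += 1
    return [(row["cust"], row["total"]) for row in SOURCE_A]


def wrap_b():
    source_hits["B"] += 1
    return list(SOURCE_B)


def mediator_total(customer):
    """Query-driven: each query is sent to every source and the results merged."""
    return sum(
        total
        for wrapper in (wrap_a, wrap_b)
        for cust, total in wrapper()
        if cust == customer
    )


def build_warehouse():
    """Update-driven: integrate all sources once, in advance."""
    totals = {}
    for wrapper in (wrap_a, wrap_b):
        for cust, total in wrapper():
            totals[cust] = totals.get(cust, 0) + total
    return totals


mediator_answers = [mediator_total("C1"), mediator_total("C3")]
hits_after_mediator = dict(source_hits)

warehouse = build_warehouse()
warehouse_answers = [warehouse["C1"], warehouse["C3"]]
hits_after_warehouse = dict(source_hits)
```

Both approaches give the same answers, but the mediator touches every source once per query, while the warehouse touches each source once per load and then answers queries locally.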

Page 16: Week2


The integration problem

Page 17: Week2

The Integrated Data Warehouse

[Diagram label: Data Warehouse]

Page 18: Week2

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  • User and system orientation: customer vs. market
  • Data contents: current, detailed vs. historical, consolidated
  • Database design: ER + application vs. star + subject
  • View: current, local vs. evolutionary, integrated
  • Access patterns: update vs. read-only

Page 19: Week2

OLTP vs. OLAP

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day to day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad-hoc
access              read/write,                           lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response
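
The "unit of work" and "access" rows can be illustrated with Python's built-in `sqlite3` module. This is only a sketch on a single in-memory table with made-up data; a real deployment would run the two workloads on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "east", 2022, 100),
        (2, "west", 2022, 50),
        (3, "east", 2023, 70),
        (4, "west", 2023, 30),
    ],
)

# OLTP-style unit of work: a short read/write that touches one row via its primary key.
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (60, 2))
oltp_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style unit of work: a read-only query that scans and summarizes many rows.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

conn.close()
```

The first query benefits from the primary-key index, while the second has to read every row, which is why the table above lists "index/hash on prim. key" against "lots of scans".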

Page 20: Week2

Why a Separate Data Warehouse?

• High performance for both systems
  • DBMS – tuned for OLTP: access methods, indexing, concurrency control, recovery
  • Warehouse – tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
  • Missing data: decision support requires historical data which operational DBs do not typically maintain
  • Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  • Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Page 21: Week2

Requirements

• The DW system must make information easily accessible.
• The DW system must present information consistently.
• The DW system must adapt to change.
• The DW system must present information in a timely way.
• The DW system must be a secure bastion that protects the information assets.
• The DW system must serve as the authoritative and trustworthy foundation for improved decision making.
• The business community must accept the DW system to deem it successful.


Page 23: Week2

What to learn next?

• Multi-dimensional data model
  • Cube
  • Schema
• DW architecture

Page 24: Week2

Individual Assignment

Create a report (A4 page) that explains:
• Multidimensional data model (cube, fact table, dimension, etc.)
• Schema (star, snowflake, etc.)
• Data warehouse system architecture
• Print the report and bring it to the next class.