Datawarehouse Tutorials

Post on 01-Nov-2014

348 views 2 download

Tags:

description

Datawarehouse Tutorials

Transcript of Datawarehouse Tutorials

Case study

Insurance domain

Agenda

• Introduction to Insurance domain

• What is Health insurance

• Identify Dimensions

• ETL

• Reporting

• Identify Facts

• Historical data maintanance

3

Introduction to Insurance

• Insurance provides protection against financial losses resulting from a variety of threats. By purchasing insurance policies, individuals and businesses can receive reimbursement for losses due to car accidents, theft of property, and fire and storm damage, medical expenses, and loss of income due to disability or death.

• Types of Insurance: Auto Insurance Home Insurance Health Insurance Life Insurance etc…

4

Health Insurance

• Health insurance (popularly known as Medical Insurance or Mediclaim) protects one against any financial constraints arising on account of a medical emergency. It sometimes includes disability and long term medical needs.

• In Mediclaim, you pay a premium and in return the insurer commits to pay a predetermined sum of money to meet the claims.

• Health insurance is available to individual and groups.

5

Health Insurance

• Health insurance (popularly known as Medical Insurance or Mediclaim) protects one against any financial constraints arising on account of a medical emergency. It sometimes includes disability and long term medical needs.

• In Mediclaim, you pay a premium and in return the insurer commits to pay a predetermined sum of money to meet the claims.

• Health insurance is available to individual and groups.

Which are our lowest/highest margin

customers ?

Which are our lowest/highest margin

customers ?

Who are my customers and what products are they buying?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition ?

Which customers are most likely to go to the competition ?

What impact will new products/services

have on revenue and margins?

What impact will new products/services

have on revenue and margins?

What product prom--otions have the biggest

impact on revenue?

What product prom--otions have the biggest

impact on revenue?

What is the most effective distribution

channel?

What is the most effective distribution

channel?

A customer wants to know….

Data, data everywhere yet...

• I can’t find the data I need data is scattered over the network many versions, subtle differences

• I can’t get the data I need need an expert to get the data

• I can’t understand the data I found available data poorly documented

• I can’t use the data I found results are unexpected data needs to be transformed from

one form to other

What the knowledge users want is...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required

But the Present Systems are…

• Run mission critical applications

• Need to work with stringent performance requirements for routine tasks

• Used to run a business

• Online Transaction Processing (OLTP) systems a.k.a. Operational Systems

Operational Systems…

• Run the business in real time

• Optimized to handle large numbers of simple read/write transactions

• Based on up-to-the-second data

• Optimized for fast response to predefined transactions

• Used by people who deal with customers, products - clerks, salespeople etc.

…What they need is a ‘Data Warehouse’

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context

[Barry Devlin]

12

Evolution

• 60’s: Batch reports• hard to find and analyze information• inflexible and expensive, reprogram every new

request• 70’s: Terminal-based DSS and EIS (executive

information systems)• still inflexible, not integrated with desktop tools

• 80’s: Desktop data access and analysis tools• query tools, spreadsheets, GUIs• easier to use, but only access operational databases

• 90’s: Data warehousing with integrated OLAP engines and tools

Data Warehouse

A data warehouse is a

• subject-oriented

• integrated

• time-varying

• non-volatile

collection of data that is used primarily in organizational decision-making

- Bill Inmon, Building the Data Warehouse 1996

14

Application-Orientation vs. Subject Orientation

Application-Orientation

Operational Database

LoansCredit Card

Trust

Savings

Subject-Orientation

DataWarehouse

Customer

VendorProduct

Activity

OLTP vs. Data Warehouse

OLTP Warehouse (DSS)

• Application Oriented • Subject Oriented

• Used to run business • Used to analyze business

• Current up to date • Summarized and refined

• Detailed data • Snapshot data

• Isolated Data • Integrated Data

• Repetitive access • Ad-hoc access

• Performance Sensitive • Performance relaxed

• Few Records accessed at a time (tens)

• Large volumes accessed at a time (millions)

• Read/Update Access • Mostly Read (Batch Update)

OLTP vs. Data Warehouse

OLTP Warehouse (DSS)

• Clerical User • Knowledge User (Manager)

• No data redundancy • Redundancy present

• DB Size (100MB -100GB) • DB Size (100GB - few terabytes)

• Transaction throughput is the performance

metric

• Query throughput is the performance metric

• 100s – 1000s of users • 10s – 100s of users

To summarize ...

OLTP Systems are used to “run” a business

The Data Warehouse helps to “optimize” the business

Data Warehouse Architecture

activities

19

Data Warehouse Architecture

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

Architecture

• Operational database layer The source data for the data warehouse — An organization's Enterprise Resource Planning systems fall into this layer.

• Data access layer The interface between the operational and informational access layer — Tools to extract, transform, load data into the warehouse fall into this layer

• Metadata layer The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

• Informational access layer The data accessed for reporting and analyzing data— Business intelligence tools fall into this layer.

Dimensional Modeling

Database organization• must look like business• must be recognizable by business user• approachable by business user

Schema Types• Star Schema• Snowflake schema

Dimension Tables

• Define business in terms already familiar to users

• Wide rows with lots of descriptive text• Small tables (generally, few thousands)• Joined to fact table by a foreign key• Typical dimensions include time periods,

geographic regions (markets, cities), products, customers, salesperson etc.

Dimensions are similar set of members upon which the user wants to base an analysis.

• Definition Descriptions of the business. The “which, who, how, where, or when that describes or explains the fact.”

• Characteristics - Discretely valued descriptions of variables which are more or less constant. - Enables “slicing and dicing” the facts by different variables.

• Examples Time Customer Product

Sales Fact

RevenueQtyCostGross margin

CustomerDimension

TimeDimension

BranchDimension

ProductDimension

Dimensions

Fact Table

• Central table• Mostly raw numeric items• Narrow rows, a few columns at most• Large number of rows (millions to billions)• Accessed via dimensions

Facts are measurements that users will apply arithmetic calculations to.• What is the average of…

• What is the total of…

Dimensional Model Schemas

• Dimensional Data Models majorly fall into two types of models:• Star Schema• Snowflake Schema

• Several factors influence schema choice:• Presentation restrictions• Inconsistency of data• Complex queries and analysis

Star Schema

• Star schema represents a compromise between the fully normalized model and the denormalized model.

• Descriptive ‘dimension’ information is maintained in a set of denormalized dimension tables.

STAR SCHEMA

A database design that stores a central fact table surrounded by multiple dimension tables.

Star Schema

• A single fact table and for each dimension one dimension table

• Does not capture hierarchies directly

T ime

prod

cust

city

fact

date, custno, prodno, city, ...

Star Schema

Snowflake Schema

• Snowflake schemas are most often used when dealing with large hierarchies that are static.

• Snow flaked tables (look-up tables) may increase the speed of queries depending on the presentation tool

SNOWFLAKE SCHEMA

A database design that stores a central fact table surrounded by multiple dimension tables decomposed or normalized into one or more hierarchies.

Snowflake schema

• Represents dimensional hierarchy directly by normalizing tables.

T ime

prod

cust

city

fact

date, custno, prodno, city, ...

region

Snowflake Schema

Slowly-Changing Dimensions

• Most dimensions change over time. • Products change offered coverage or limits

and deductibles. • Employees are promoted, fired, or change

departments.• Customers change names and addresses.

• What are our choices for tracking these changes over time?

Slowly-Changing Dimensions

• There are three types of slowly changing dimensions:• Type 1: Overwrites the old data for a

record with new data. This eliminates the ability to track history over time.

• Type 2: Creates a new record with the new data at the type of the change. Accurately tracks history, but requires generalized key.

• Type 3: Tracks new and original values in separate fields at time of change. Intermediate values are lost.

Type 1 - Overwrite Old Values

• Customer Lynnette Groves is changing her name to ?

• If there is no value in tracking this change, we will overwrite the First Name and Last Name fields with the new values.

• ‘UPDATE’ statement; 1 record is maintained.

Type 2 - Create New Record

• Lynnette Groves is changing her name and we want to track both values

• Add a second record with a new Customer Key and make it the active row

• ‘INSERT’ statement for new, ‘UPDATE’ for active; 2 records are maintained

• New record for each change up to n records

Type 3 - Original and Current

• We decide that no matter how many times she changes her name, we only want to track the original and the current.

• Before any changes, original and current are the same. Any name change updates ‘current’ fields.

• UPDATE’ statement; 1 record is maintained

Degenerate Dimensions

• Certain attributes are tracked that don’t necessarily belong in their own dimension - orphan attributes.

• This may occur when fact tables are designed to reflect the actual working document.

DEGENERATE DIMENSIONS

Dimensions that are so small and have no attributes of their own that they have been added to the fact table.

Degenerate Dimensions

• Examples include ‘order_number’, ‘bill_of_lading_num’, and ‘invoice_number’.

• While these fields seem very transaction oriented, they are helpful in grouping things such as all line items on an invoice.

• Including these fields on the fact table amounts to denormalizing the attribute due to the granularity of the fact table being the document itself or a line item of the document.

Types of Facts

• Understanding which facts can be added across which dimensions is an important data design issue.

• Three Types of Facts:• Additive• Non-Additive• Semi-Additive

Additive Facts

• Since aggregation is a key element in the usefulness of the dimensional model, its best utilized for facts that are additive, numeric values.

• We can add revenue, cost, and quantity sold for all products, all stores, and any time period.

ADDITIVE FACTS

Measurements in a fact table that can be added across all dimensions.

Semi-Additive Facts

• Current Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?)

• It does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information

SEMI-ADDITIVE FACTS

Measurements in a fact table that can be added across some dimensions but not others.

Non-Additive Facts

• A new value will need to be calculated at each level, for each set of data.

• Ratios Averages & Variance facts.

NON-ADDITIVE FACTS

Measurements in a fact table that cannot be added across any dimensions, like ratios.

Dimensional Modeling Process

Step 1: Choose the grain of each fact table.

• Granularity defines the level of detailed data.

• It must be determined prior to going forward in the modeling process.

• Typical grains are individual transactions, time-based aggregation, and/or aggregations along a commonly used dimension.

Dimensional Modeling Process

Step 2: Choose the dimension attributes.

• For example, what should our time dimension look like? Should it have just ‘January for month’, or also ‘Jan’ and ‘1’?

• Should we store the code and the description, just the code, or just the description?

• What values will our users need to filter or report on?

Dimensional Modeling Process

Step 3: Identify dimensional hierarchies.

• A dimension such as time may have days rolling into months and then quarters, as well as days rolling into weeks which may cross months and quarters.

• Sales geography may differ from physical geography.

• Zip codes can cross city boundaries and cities are made up of multiple zip codes.

Dimensional Modeling Process

Step 4: Choose the dimensions that apply to each fact table.

• Typical dimensions include time, product, policyholder, agent, and geography.

• Remember to evaluate granularity when applying dimensions to facts.

Dimensional Modeling Process

Step 5: Choose the measured facts, including pre calculated facts.

• Each aggregated and derived fact will need to be evaluated for inclusion in the model or calculation in the application.

• Trade-offs include storage and indexing and must be weighed against the access requirements.

Dimensional Modeling Process

Step 6: Determine slowly changing dimensions

• These are the dimensions that change over time.

• If tracking these changes is important, the method must be decided.

• Options: overwrite the existing record, store all records with effective dates, or a historical and current value tables.