
DATA WAREHOUSING FUNDAMENTALS

 


Definition of a Data Warehouse

Inmon:

 

"A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions."

OR 

A data warehouse is an informational environment that

• Provides an integrated and total view of the enterprise

• Makes the enterprise's current and historical information easily available for decision making

• Makes the decision-support transactions possible without hindering operational systems

• Renders the organization’s information consistent

• Presents a flexible and interactive source of strategic information

OR 

Kimball:

"A copy of the transactional data specially structured for reporting and analysis"


How Organizations Use Data Warehousing

• Retail: Customer Loyalty, Market Planning

• Financial: Risk Management, Fraud Detection

• Manufacturing: Cost Reduction, Logistics Management

• Utilities: Asset Management, Resource Management

• Airlines: Route Profitability, Yield Management

 


Data Warehouse – Subject Oriented

Organized around major subjects, such as Customer, Sales, Account.

Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

[Diagram: operational systems (Customer Billing, Order Processing, Accounts Receivable) feed the data warehouse, where data is organized by subject (Customer Data, Sales, Account).]


Data Warehouse – Integrated

Constructed by integrating multiple, heterogeneous data sources: relational or other databases, flat files, external data.

Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources.

When data is moved to the warehouse, it is converted.

[Diagram: operational systems (Savings Account, Loans Account, Checking Account) are integrated in the data warehouse under a single subject: Account.]


Data Warehouse – Non-Volatile

A physically separate store of data transformed from the operational environment. Operational updates of data do not occur in the data warehouse environment.

Does not require transaction processing, recovery, or concurrency control mechanisms.

Requires only two operations: loading of data and access of data.

[Diagram: operational systems (e.g., Order Processing on Sales Data) perform Create, Update, Delete, and Insert; the data warehouse supports only Load and Access.]


Data Warehouse – Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems. An operational database holds current-value data; data warehouse data provides information from a historical perspective (e.g., the past 5-10 years).

Every key structure in the data warehouse contains an element of time, but the key of operational data may or may not contain a "time element".

[Diagram: the operational Deposit System keeps 60-90 days of Customer Data; the data warehouse keeps 5-10 years.]


Data Warehouse – OLTP vs. OLAP

OLTP (On-line Transaction Processing):

• holds current data
• useful for end users
• stores detailed data
• data is dynamic
• repetitive processing (one record processed at a time)
• high level of transaction throughput
• predictable pattern of usage
• transaction driven, application oriented
• supports day-to-day decisions
• response time is very quick
• serves a large number of operational users

OLAP (On-line Analytical Processing):

• holds historic and integrated data
• useful for EIS and DSS
• stores detailed and summarized data
• data is largely static (a group of records processed in a batch)
• ad-hoc, unstructured, and heuristic processing
• medium or low level of transaction throughput
• unpredictable pattern of usage
• analysis driven, subject oriented
• supports strategic decisions
• response time is optimum
• serves a relatively small number of managerial users


Data Warehouse Architecture

[Architecture diagram; the only label recoverable from the transcript is "Staging Area".]


Data Warehouse vs. Data Mart

Data Warehouse:

• Corporate/enterprise-wide
• Union of all data marts
• Data received from the staging area
• Structured for a corporate view of data
• Queries on the presentation resource
• Organized on an ER model

Data Mart:

• Departmental
• A single business process
• Star join (facts & dimensions)
• Structured to suit the departmental view of data
• Technology optimal for data access and analysis


Meeting Requirements within the Data Warehouse

• The data is organized differently in the data warehouse (e.g., multidimensional)

-Star Schema

-Snowflake Schema

• The data is viewed differently

• Data is stored differently

-Vector (array) storage

• Data is indexed differently (a toy sketch of a bitmap index follows this list)

-Bitmap indexes

-Join indexes
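
To make the bitmap idea concrete, here is a toy Python sketch (not from the original slides): one bitmap per distinct value of a low-cardinality column, so equality filters become bitwise operations. The column and its values are invented for illustration.

    # Toy bitmap index: one integer-as-bitmap per distinct column value.
    rows = ["M", "F", "F", "M", "F"]          # a low-cardinality column
    bitmaps = {}
    for i, value in enumerate(rows):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)   # set bit i

    # Rows where the column equals "F": decode the set bits.
    f_bitmap = bitmaps["F"]
    matches = [i for i in range(len(rows)) if (f_bitmap >> i) & 1]
    print(matches)   # [1, 2, 4]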


Star Schema

Star Schema: "A modeling technique used to map multidimensional decision support data into a relational database, with the purpose of performing advanced data analysis"

OR 

 

"A relational database schema organized around a central table (fact table) joined to a few smaller tables (dimension tables) using foreign key references"

Types of star schema:

1) Basic star schema or Star Schema
2) Extended star schema or Snowflake schema

 


Multidimensional modeling

Multidimensional modeling is based on the concept of star schema.

 

Star schema consists of two types of tables.

 

1) Fact table
2) Dimension table

Fact Table:

“Fact table contains the transactional data generated out of business transactions”

Dimension Table:

"Dimension table contains master data or referential data used to analyze transactional data"


Fact table contains two types of columns:

1) Measures
2) Key section

A data warehouse has 3 types of measures:

1) Additive measures
2) Non-additive measures
3) Semi-additive measures

 

Additive measures:

"Measures that can participate in calculations in order to derive new measures"

Non-additive measures:

"Measures that cannot participate in calculations"

Semi-additive measures:

"Measures whose participation in calculations depends on the context; they can be added across a few dimensions but not across others"

Example fact table:

Key section: Date, Prod_id, Cust_id
Measures: Sales_revenue, Tot_quantity, Unit_cost, Sale_price
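
A minimal pandas sketch of the three measure types, using invented rows shaped like the fact table above (the values, and the stock_level column used to show semi-additivity, are assumptions added for illustration):

    import pandas as pd

    fact = pd.DataFrame({
        "date":          ["2024-01-01", "2024-01-01", "2024-01-02"],
        "prod_id":       [1, 2, 1],
        "sales_revenue": [150.0, 250.0, 300.0],  # additive: sums along any dimension
        "unit_cost":     [50.0, 80.0, 50.0],     # non-additive: per-unit value, summing is meaningless
        "stock_level":   [10, 20, 8],            # semi-additive: sums across products, not across time
    })

    # Additive: total revenue rolls up across both date and product.
    print(fact["sales_revenue"].sum())                    # 700.0

    # Non-additive: average (not sum) is the sensible aggregate.
    print(fact.groupby("prod_id")["unit_cost"].mean())

    # Semi-additive: sum across products per date, but take the latest
    # value (not the sum) when aggregating across dates.
    print(fact.groupby("date")["stock_level"].sum())
    print(fact.sort_values("date").groupby("prod_id")["stock_level"].last())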


Types of Star Schema

A data warehouse supports 2 types of star schemas:

1) Basic star schema or Star schema
2) Extended star schema or Snowflake schema

Star Schema:

"Fact tables exist in normalized form, whereas dimension tables exist in denormalized form"

Snowflake Schema:

"Both fact and dimension tables exist in normalized form"

Factless fact table or coverage table:

"Events or transactions can occur without measures, resulting in a fact table without measures"


Example of Star Schema
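
The slide's diagram is not reproduced in this transcript. As a stand-in, here is a minimal sketch of a star schema consistent with the fact table described earlier; the table names and dimension attributes are illustrative assumptions, not taken from the slide.

    import sqlite3

    # One central fact table joined to three dimension tables via
    # foreign keys (illustrative names).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
    CREATE TABLE dim_product  (prod_id INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
    CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, city TEXT);

    CREATE TABLE fact_sales (
        date_id       INTEGER REFERENCES dim_date(date_id),
        prod_id       INTEGER REFERENCES dim_product(prod_id),
        cust_id       INTEGER REFERENCES dim_customer(cust_id),
        sales_revenue REAL,
        tot_quantity  INTEGER
    );
    """)

    # A typical star join: aggregate fact measures grouped by a
    # dimension attribute.
    conn.execute("""
        SELECT p.prod_name, SUM(f.sales_revenue)
        FROM fact_sales f JOIN dim_product p ON p.prod_id = f.prod_id
        GROUP BY p.prod_name
    """)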

 


Example of Snowflake Schema
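
This diagram is also missing from the transcript. As a stand-in, a minimal sketch of snowflaking the same schema: the product dimension is normalized by moving its category into a separate table (all names are illustrative).

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- The category attribute is split out of dim_product into its own
    -- normalized table, giving the schema its snowflake shape.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);

    CREATE TABLE dim_product (
        prod_id     INTEGER PRIMARY KEY,
        prod_name   TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );

    CREATE TABLE fact_sales (
        prod_id       INTEGER REFERENCES dim_product(prod_id),
        sales_revenue REAL
    );
    """)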

 


Data Warehouse – Slowly Changing Dimensions

Slowly Changing Dimensions:

 

Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; people change their names for some reason; country and state names may change over time. These are a few examples of Slowly Changing Dimensions, since changes happen to them over a period of time.

Type 1: Overwriting the old values

Type 2: Creating an additional record

Type 3: Creating new fields


SCD Type1

Type 1: Overwriting the old values

Product price in 2004:

Product ID (PK)   Year   Prod Name   Price
1                 2004   Product1    150

In the year 2005, if the price of the product changes to $250, the old values of the "Year" and "Price" columns are updated and replaced with the new values. With Type 1 there is no way to find out the old 2004 price of "Product1", since the table now contains only the new price and year information:

Product ID (PK)   Year   Prod Name   Price
1                 2005   Product1    250
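
A minimal runnable SQLite sketch of Type 1; the table follows the slide, and the SQL is an illustration rather than the original author's code.

    import sqlite3

    # SCD Type 1: update in place; the 2004 price is lost afterwards.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, year INTEGER, prod_name TEXT, price REAL)")
    conn.execute("INSERT INTO product VALUES (1, 2004, 'Product1', 150)")

    # 2005 price change: the old values are simply overwritten.
    conn.execute("UPDATE product SET year = 2005, price = 250 WHERE product_id = 1")
    print(conn.execute("SELECT * FROM product").fetchall())   # [(1, 2005, 'Product1', 250.0)]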


SCD Type2

Type 2: Creating an additional record

PRODUCT

Product ID (PK)   Effective Datetime (PK)   Year   Product Name   Price   Expiry Datetime
1                 01-01-2004 12.00AM        2004   Product1       150     12-31-2004 11.59PM
1                 01-01-2005 12.00AM        2005   Product1       250
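
A minimal SQLite sketch of Type 2, matching the table above: the current row is expired and a new version row is inserted, so history is preserved (illustrative code; timestamps simplified to ISO strings).

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE product (
        product_id INTEGER, effective_dt TEXT, year INTEGER,
        prod_name TEXT, price REAL, expiry_dt TEXT,
        PRIMARY KEY (product_id, effective_dt))""")
    conn.execute("INSERT INTO product VALUES (1, '2004-01-01 00:00', 2004, 'Product1', 150, NULL)")

    # 2005 price change: expire the current row, then add the new version.
    conn.execute("UPDATE product SET expiry_dt = '2004-12-31 23:59' "
                 "WHERE product_id = 1 AND expiry_dt IS NULL")
    conn.execute("INSERT INTO product VALUES (1, '2005-01-01 00:00', 2005, 'Product1', 250, NULL)")
    for row in conn.execute("SELECT * FROM product"):
        print(row)   # both the 2004 and 2005 versions survive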


SCD Type3

Type 3: Creating new fields

In Type 3, only the latest change to the values can be seen. The example below illustrates how to add new columns and keep track of the changes; from them, we can see the current price and the previous price of the product, Product1.

The problem with the Type 3 approach is that, over the years, if the product price changes continuously, the complete history is not stored; only the latest change is kept. For example, if in 2006 Product1's price changes to $350, we are no longer able to see the 2004 price, since the old values were overwritten with the 2005 product information.

Product ID (PK)   Current Year   Product Name   Current Product Price   Old Product Price   Old Year
1                 2005           Product1       250                     150                 2004

After the 2006 price change:

Product ID (PK)   Year   Product Name   Product Price   Old Product Price   Old Year
1                 2006   Product1       350             250                 2005
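
A minimal SQLite sketch of Type 3: on each change the current values shift into the "old" columns, which is exactly why the 2004 price disappears after the 2006 update (illustrative code).

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE product (
        product_id INTEGER PRIMARY KEY, year INTEGER, prod_name TEXT,
        price REAL, old_price REAL, old_year INTEGER)""")
    conn.execute("INSERT INTO product VALUES (1, 2005, 'Product1', 250, 150, 2004)")

    # 2006 change: SET reads the pre-update row, so price/year slide
    # into the old_* columns and the 2004 values are gone for good.
    conn.execute("""UPDATE product
        SET old_price = price, old_year = year, price = 350, year = 2006
        WHERE product_id = 1""")
    print(conn.execute("SELECT * FROM product").fetchall())
    # [(1, 2006, 'Product1', 350.0, 250.0, 2005)]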


• Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

• Extracting data from outside sources

• Transforming it to fit operational needs (which can include quality levels)

• Loading it into the end target (database or data warehouse)

Extract:

• The first part of an ETL process involves extracting the data from the source systems.

• Most data warehousing projects consolidate data from different source systems. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures.


Transform

• The transform stage applies a series of rules or functions to the extracted data to derive the data for loading into the end target. Some data sources require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (a minimal sketch follows this list):

• Generating surrogate-key values

• Transposing or pivoting (turning multiple columns into multiple rows or vice versa)

• Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)

• Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)

• Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
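
A minimal pandas sketch of three of these transformations: surrogate-key generation, splitting a column, and disaggregating a repeating column into a linked detail table (the data and column names are invented):

    import pandas as pd

    src = pd.DataFrame({
        "cust_name": ["Ann", "Bob"],
        "phones":    ["111-222,333-444", "555-666"],  # comma-separated list in one column
    })

    # Generating surrogate-key values (a simple monotonic key).
    src["cust_sk"] = range(1, len(src) + 1)

    # Splitting a column into multiple columns.
    src[["phone_1", "phone_2"]] = src["phones"].str.split(",", expand=True)

    # Disaggregation into a separate detail table: one row per phone,
    # linked back to the customer by the surrogate key.
    phone_detail = (
        src.assign(phone=src["phones"].str.split(","))
           .explode("phone")[["cust_sk", "phone"]]
    )
    print(src)
    print(phone_detail)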


Load

• The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely.

• Some data warehouses may overwrite existing information with cumulative data; refreshing with freshly extracted data is often done daily, weekly, or monthly, while other DWs (or even other parts of the same DW) may add new data in a historicized form, for example, hourly.

• As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields); these also contribute to the overall data quality performance of the ETL process. A small sketch follows.
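
A minimal SQLite sketch of schema constraints applying during the load: the primary key rejects a duplicate row, and the error surfaces as a data-quality signal the ETL process can log (names and rows are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT NOT NULL)")

    rows = [(1, "Ann"), (2, "Bob"), (1, "Ann")]   # third row violates uniqueness
    for row in rows:
        try:
            conn.execute("INSERT INTO dim_customer VALUES (?, ?)", row)
        except sqlite3.IntegrityError as exc:
            print("rejected:", row, "-", exc)     # log instead of aborting the load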


Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps:

• Cycle initiation

• Build reference data

• Extract (from sources)

• Validate

• Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)

• Stage (load into staging tables, if used)

• Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)


A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data. ETL applications implement three main types of parallelism (a small sketch of data parallelism follows this list):

• Data: splitting a single sequential file into smaller data files to provide parallel access.

• Pipeline: allowing the simultaneous running of several components on the same data stream, for example, looking up a value on record 1 at the same time as adding two fields on record 2.

• Component: the simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
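
A minimal Python sketch of data parallelism: one input is split into chunks that a process pool transforms concurrently (the transform is a trivial stand-in; tools such as Ab Initio implement this at the component level):

    from multiprocessing import Pool

    def transform(record):
        return record.upper()        # stand-in for real ETL work

    def process_chunk(chunk):
        return [transform(r) for r in chunk]

    if __name__ == "__main__":
        records = ["a", "b", "c", "d", "e", "f"]       # the "sequential file"
        chunks = [records[i::4] for i in range(4)]     # 4-way split
        with Pool(4) as pool:
            results = pool.map(process_chunk, chunks)
        print([r for chunk in results for r in chunk])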


AB INITIO INTRODUCTION

• Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)

• Latin for "from the beginning"

• Designed to support the largest and most complex business applications

• Graphical, intuitive, and "fits the way your business works"


Importance of Ab Initio Compared to Other ETL Tools

1) Able to process huge amounts of data in a short span of time.

2) Easy to write complex and custom ETL logic, especially for banking and financial applications. Ex: amortization.

3) Ab Initio provides all three types of parallelism that an ETL tool needs to handle.

4) The data parallelism of Ab Initio is one feature that makes it distinct from other ETL tools.

5) When handling complex logic, you can write custom code, as it is Pro C based.