d Wh Concepts
Transcript of d Wh Concepts
-
7/30/2019 d Wh Concepts
1/79
Data Warehousing Concepts
Course Overview
What is Data Warehouse
OLTP Vs. Data Warehousing
Data Warehousing Architecture
Data Warehousing Schemas & Objects
Physical Design in Data Warehouse
Definition of Data Warehousing
Course Overview
Data Warehousing Basic Design Approaches
Data Warehousing Operational Processes
Technical Problems in Data Warehousing
Representative DSS Tools
Business Intelligence
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data.
A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
It is a series of processes, procedures and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers and the market it serves.
Facts!
It is NOT possible to purchase a data warehouse, but it is possible to build one.
A data warehouse is NOT a specific technology.
Why Data Warehousing?
Need for intelligent information in a competitive market:
Who are the potential customers?
Which products are sold the most?
What are the region-wise preferences?
What are the competitor products?
What are the projected sales?
What if you sell more of a particular product?
What will be the impact on revenue?
What are the results of the promotion schemes introduced?
William Inmon
Defining Data warehouse
Subject Oriented
The data in the data warehouse is organized around the major subjects of the enterprise (i.e. the high-level entities).
The orientation around the major subject areas causes the data warehouse design to be data driven.
Operational systems are designed around applications and functions, e.g. loans, savings and credit cards in the case of a bank, whereas a data warehouse is designed around subjects such as Customer, Product, Vendor, etc.
Operational systems: organized by processes or tasks
Data warehouse: organized by subject (Customer, Supplier, Product)
Time Variant
Figure: data warehouse data, where time data is part of the key.
Data is stored as a series of snapshots or views which record how it is collected across time.
It helps in business trend analysis.
In contrast to the OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.
Integrated
Data is stored once in a single integrated location.
Figure: customer data stored in several databases (Auto Policy Processing System; Fire Policy Processing System; FACTS, LIFE; Commercial, Accounting Applications) is brought together in the Data Warehouse Database under Subject = Customer.
It is closely related to subject orientation.
Data from disparate sources needs to be put in a consistent format.
This involves resolving problems such as naming conflicts and inconsistencies.
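As a rough illustration of putting data from disparate sources into a consistent format, the following Python sketch renames fields and normalizes values coming from two source systems. The field names, mapping tables and sample rows are hypothetical and are not taken from the slides.

```python
# Hypothetical source rows: each system names and formats customer fields differently.
auto_policy_rows = [{"cust_nm": "christina ", "sex": "F"}]
fire_policy_rows = [{"CustomerName": "Christina", "gender": "female"}]

# Assumed mapping from each source's field names to the warehouse's field names.
FIELD_MAPS = {
    "auto": {"cust_nm": "customer_name", "sex": "gender"},
    "fire": {"CustomerName": "customer_name", "gender": "gender"},
}
# Assumed set of codes found in the sources, mapped to one warehouse code.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def to_warehouse_format(source, row):
    """Rename fields (naming conflicts) and normalize values (inconsistencies)."""
    out = {FIELD_MAPS[source][key]: value for key, value in row.items()}
    out["customer_name"] = out["customer_name"].strip().title()
    out["gender"] = GENDER_CODES[out["gender"].strip().lower()]
    return out


integrated = [to_warehouse_format("auto", r) for r in auto_policy_rows]
integrated += [to_warehouse_format("fire", r) for r in fire_policy_rows]
print(integrated)
```

Both source rows end up with the same field names and the same value coding, which is the property the integrated warehouse relies on.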
Non-Volatile
Existing data in the warehouse is not overwritten or updated.
Figure: production applications update, insert and delete data in the production databases; data from the production databases and external sources is loaded into the data warehouse environment, where the data warehouse database is read-only.
This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.
So, what's different between OLTP and Data Warehouse?
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while the workload is not known in advance in a data warehouse.
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., the average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December.
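The example query above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name calls, its columns and the sample rows are assumptions, and a real warehouse would use its own schema and engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical call-detail table: one row per phone call.
conn.execute(
    "CREATE TABLE calls (city TEXT, call_date TEXT, call_hour INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("Pune", "2018-12-03", 10, 20.0),
        ("Pune", "2018-12-15", 16, 40.0),
        ("Pune", "2018-12-15", 20, 90.0),    # outside 9 AM to 5 PM
        ("Pune", "2018-11-30", 11, 70.0),    # not in December
        ("Mumbai", "2018-12-03", 10, 55.0),  # different city
    ],
)

# Filter on three dimensions (place, month, time of day), then aggregate the measure.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', call_date) = '12'
      AND call_hour >= 9 AND call_hour < 17
    """
).fetchone()[0]
print(avg_amount)
```

The query restricts several dimensions at once and aggregates one measure, which is the access pattern that warehouse-specific data organization is meant to support.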
OLTP vs. Data Warehouse
OLTP | WAREHOUSE (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
OLTP vs Data Warehouse
OLTP | DATA WAREHOUSE
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
OLTP vs Data Warehouse
OLTP | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
To summarize ...
OLTP systems are used to run a business.
The data warehouse helps to optimize the business.
Data Warehouse Architectures
Centralized
In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. As already shown in the previous section, the disadvantage is a loss of performance compared with distributed approaches.
Central Architecture
Federated
In a federated architecture the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the relevant information for a department. The amount of data is reduced in contrast to a central data warehouse. The level of detail is enhanced.
Federated Architecture
Data Warehouse Architectures Contd
Tiered:
A tiered architecture is a distributed data approach. The integration cannot be done in one step because many sources have to be integrated into the warehouse. On the first level, the data of all branches in one region is collected; on the second level, the data from the regions is integrated into one data warehouse.
Advantages:
Faster response time, because the data is located closer to the client applications
Reduced volume of data to be searched
Tiered Architecture
Complete Warehouse Solution Architecture
Figure: three stages (Data Sources, Data Management, Access) supported by Metadata. Operational data, legacy data and external data sources (The Post, VISA) pass through Extract, Transform, Load into the organizationally structured Enterprise Data Warehouse, which feeds departmentally structured data marts (Sales, Inventory, Purchase). The flow runs from data to information to knowledge, and from asset assembly (and management) to asset exploitation.
Data Warehouse Architecture Components
Data Sources (disparate data sources):
Legacy data
Operational data
External data resources
Data Management:
Metadata: At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.
Data Mart: A data mart is a subject-oriented data warehouse.
Introduction To Data Marts
What is a Data Mart
From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. The departments can also use the analytical software they find convenient. The cost of processing becomes very low.
Data Mart Overview
Figure: the Data Warehouse feeds the data marts DM Sales, DM HR, DM Marketing and DM Finance, which serve sales representatives and analysts, human resources, and financial analysts, strategic planners and executives.
Data marts satisfy 80% of the local end users' requests.
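From the Data Warehouse to Data Marts
Figure: a progression from the organizationally structured data warehouse, through departmentally structured data marts, to individually structured data marts. History, normalization and detail go from more to less along the way, and the content moves from data towards information.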
Operational Data Store (ODS)
What is an ODS
An Operational Data Store (ODS) integrates data from multiple business operation sources to address operational problems that span one or more business functions.
An ODS has the following features:
Subject-oriented: Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).
Integrated: Presents an integrated image of subject-oriented data which is pulled from fragmented operational source systems.
Current: Contains a snapshot of the current content of legacy source systems. History is not kept, and might be moved to the data warehouse for analysis.
Volatile: Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.
Detailed: ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.
Operational Data Store (ODS) Contd
The ODS provides an integrated view of data in operational systems.
As the figure below indicates, there is a clear separation between the ODS and the data warehouse.
Figure: source systems A, B and C feed the Operational Data Store (current or near-current data, detailed data, updates allowed) and the Data Warehouse (historical data, summary and detail, non-volatile snapshots only); EIS, DSS, Apps and PC appear on the access side.
Benefits Of ODS
Supports operational reporting needs of the organization.
Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This data can include data from an organization's internal systems, as well as external data from third-party vendors.
Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.
Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.
Provides more current data than a data warehouse and more integrated data than an OLTP system.
Feeds other operational systems in addition to the data warehouse.
Data Warehousing SCHEMAS & OBJECTS
A schema is a collection of database objects, including tables, views, indexes, and synonyms.
There is a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:
Star Schema
Snowflake Schema
Galaxy Schema
Star Schema: It consists of a fact table connected to a set of dimension tables. Data in the dimension tables is denormalized.
Snowflake Schema: It is a refinement of the star schema where some dimensional hierarchy is normalized into a set of dimension tables.
Galaxy Schema: Multiple fact tables share dimension tables. It is viewed as a collection of stars, and is therefore called a galaxy schema.
Star Schema
A star schema is a highly denormalized, query-centric model where information is broken into two groups: facts and dimensions.
Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics or measures), ...
Time_Dim: TimeKey, TheDate, ...
Employee_Dim: EmployeeKey, EmployeeID, ...
Branch_Dim: BranchID, Branchno, ...
Customer_Dim: CustomerKey, CustomerID, ...
Shipper_Dim: ShipperKey, ShipperID, ...
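To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in sqlite3 module. It keeps only two of the dimensions listed above, and the SalesAmount measure and all sample rows are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Time_Dim     (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
    CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
    -- The fact table holds foreign keys to the dimensions plus the measures.
    CREATE TABLE Sales_Fact (
        TimeKey     INTEGER REFERENCES Time_Dim(TimeKey),
        CustomerKey INTEGER REFERENCES Customer_Dim(CustomerKey),
        SalesAmount REAL  -- hypothetical measure, not named in the slides
    );
    INSERT INTO Time_Dim VALUES (1, '2019-07-29'), (2, '2019-07-30');
    INSERT INTO Customer_Dim VALUES (10, 'C-10'), (11, 'C-11');
    INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 25.0);
    """
)

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_customer = conn.execute(
    """
    SELECT c.CustomerID, SUM(f.SalesAmount)
    FROM Sales_Fact f
    JOIN Customer_Dim c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
    """
).fetchall()
print(sales_by_customer)
```

Each dimension is reached with a single join from the fact table, which is why the model is described as query-centric.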
Snowflake Schema
Figure 32.2
Fact table:
Sales_fact: timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}, staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue
Dimension tables:
Branch_Dim: branchID {PK}, branchNo, branchType, city {FK}
City: city {PK}, region {FK}
Region: region {PK}, country
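The normalized Branch_Dim, City and Region chain can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden illustration: it keeps only branchID and saleRevenue in the fact table, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Region     (region TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE City       (city TEXT PRIMARY KEY,
                             region TEXT REFERENCES Region(region));
    CREATE TABLE Branch_Dim (branchID INTEGER PRIMARY KEY, branchNo TEXT,
                             branchType TEXT, city TEXT REFERENCES City(city));
    CREATE TABLE Sales_fact (branchID INTEGER REFERENCES Branch_Dim(branchID),
                             saleRevenue REAL);
    INSERT INTO Region VALUES ('West', 'India');
    INSERT INTO City VALUES ('Pune', 'West'), ('Mumbai', 'West');
    INSERT INTO Branch_Dim VALUES (1, 'B001', 'Main', 'Pune'),
                                  (2, 'B002', 'Sub', 'Mumbai');
    INSERT INTO Sales_fact VALUES (1, 300.0), (2, 200.0), (1, 100.0);
    """
)

# Rolling revenue up to region level walks the normalized hierarchy join by join.
revenue_by_region = conn.execute(
    """
    SELECT r.region, SUM(f.saleRevenue)
    FROM Sales_fact f
    JOIN Branch_Dim b ON b.branchID = f.branchID
    JOIN City c       ON c.city = b.city
    JOIN Region r     ON r.region = c.region
    GROUP BY r.region
    """
).fetchall()
print(revenue_by_region)
```

Compared with a star schema, the region name is stored once per region rather than once per branch, at the cost of extra joins in the query.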
Galaxy Schema
Multiple groups of facts linked by a few common dimensions.
Figure: fact tables Fact1, Fact2 and Fact3 share the dimension tables Dimension1 to Dimension7.
Data Warehousing Objects
All three types of schemas are described in the Data Modeling section.
The various objects used in data warehousing are:
Fact Tables
Dimension Tables
Hierarchies
Unique Identifiers
Relationships
Data Warehousing Objects
Fact Tables:
Represent a business process, i.e., model the business process as an artifact in the data model
Contain the measurements, metrics or facts of business processes, e.g. "monthly sales number" in the Sales business process
Most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price)
The level of detail is called the grain of the table
Contain foreign keys for the dimension tables
Fact Types:
Additive facts:
Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-additive facts:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-additive facts:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Examples to illustrate additive, semi-additive and non-additive facts:
Fact table:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact.
In this case, Sales_Amount is an additive fact, because we can sum up this fact along any of the three dimensions present in the fact table: date, store, and product.
Example for semi-additive and non-additive facts:
Fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day.
Current_Balance and Profit_Margin are the facts.
Current_Balance is a semi-additive fact, as it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time.
Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.
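The difference between the two facts can be shown with a few lines of Python. The rows and function names below are hypothetical, and the plain average used for the margin is only one simple choice; a weighted calculation from the underlying revenue and cost would usually be more accurate.

```python
# Hypothetical daily snapshot rows: (date, account, current_balance, profit_margin)
rows = [
    ("2019-07-29", "A", 100.0, 0.10),
    ("2019-07-29", "B", 200.0, 0.30),
    ("2019-07-30", "A", 150.0, 0.20),
    ("2019-07-30", "B", 250.0, 0.40),
]


def total_balance_on(date):
    """Semi-additive: summing Current_Balance across accounts for one day is valid."""
    return sum(balance for d, _, balance, _ in rows if d == date)


def closing_balance(account):
    """Across time, take the latest snapshot instead of summing the days."""
    return max((d, balance) for d, acc, balance, _ in rows if acc == account)[1]


def average_margin(date):
    """Non-additive: Profit_Margin is a ratio, so it is averaged here, not summed."""
    margins = [margin for d, _, _, margin in rows if d == date]
    return sum(margins) / len(margins)


print(total_balance_on("2019-07-30"), closing_balance("A"), average_margin("2019-07-29"))
```

Summing the balance of account A over both days would give 250.0, a number that corresponds to nothing in the business, which is why the time dimension is handled differently.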
Types of fact tables:
Based on the above classifications, there are two types of fact tables: cumulative and snapshot.
Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive. The first example is a cumulative fact table.
Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented is a snapshot fact table.
Data Warehousing Objects Contd.
Dimension Tables:
Define business in terms already familiar to users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to the fact table by a foreign key
Heavily indexed
Typical dimensions: time periods, geographic regions (markets, cities), products, customers, salespersons, etc.
Dimension Table Types
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
Slowly Changing Dimensions (SCD):
Various data elements in the dimension undergo changes (e.g. changes in attributes or hierarchical structures) which need to be captured for analysis.
The SCD problem is a common one particular to data warehousing.
In a nutshell, this applies to cases where the attribute for a record varies over time.
For example:
Customer key | Name | State
1001 | Christina | Illinois
Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change?
This is a Slowly Changing Dimension problem.
Types of SCD
There are in general three ways to solve this type of problem, categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added to the customer dimension table.
Type 3: The original record is modified to reflect the change.
TYPE 1:
The new record replaces the original record. No trace of the old record exists.
Eg:
Customer key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, the new information replaces the original record and we have the following table:
Customer key | Name | State
1001 | Christina | California
Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages: All the history is lost. By applying this methodology, it is not possible to trace back in history. For example, in the above case, the company would not be able to know that Christina lived in Illinois before.
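A Type 1 change amounts to an in-place overwrite. The following Python sketch uses a plain dictionary as a stand-in for the customer dimension; the function name and data layout are illustrative assumptions.

```python
# Stand-in for the customer dimension, keyed by customer key.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def scd_type1_update(dim, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    dim[customer_key]["state"] = new_state


scd_type1_update(customer_dim, 1001, "California")
print(customer_dim)
```

After the update nothing in the structure records that the state was ever Illinois, which is the disadvantage noted above.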
TYPE 2:
In Type 2 SCD a new record is added to the table to represent the new information. Therefore both the original and the new record will be present.
Eg:
After Christina moved from Illinois to California, we add the new information as a new row in the table:
Customer key | Name | State
1001 | Christina | Illinois
1005 | Christina | California
Advantages:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. Where the number of rows for the table is very high to start with, storage and performance can become a concern.
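A Type 2 change appends a row under a new key. In this Python sketch the dimension is a list of dictionaries, and the caller supplies the new key (1005, as in the table above); both choices are assumptions made for illustration.

```python
customer_dim = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]


def scd_type2_update(dim, name, new_state, new_key):
    """Type 2: append a new row with a new key; earlier rows stay untouched."""
    dim.append({"customer_key": new_key, "name": name, "state": new_state})


scd_type2_update(customer_dim, "Christina", "California", new_key=1005)

# Full history remains available by reading every row for the customer.
history = [row["state"] for row in customer_dim if row["name"] == "Christina"]
print(history)
```

Every change adds a row, so a frequently changing attribute makes the table grow quickly, matching the disadvantage listed above.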
TYPE 3:
In Type 3 SCD there are two columns for the particular attribute of interest, one indicating the original value and one indicating the current value. There is also a column that indicates when the current value became active.
Eg:
Customer key | Name | Original State | Current State | Effective Date
1001 | Christina | Illinois | California | 15-Jan-03
After Christina moved from Illinois to California, the original information gets updated and we have the above table (assuming the effective date of the change is January 15, 2003).
Advantages:
This does not increase the size of the table, since the new information is updated in place.
This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
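The Type 3 behaviour, including the loss of the intermediate value, can be reproduced with a short Python sketch. The dictionary layout and function name are illustrative assumptions.

```python
from datetime import date

row = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def scd_type3_update(rec, new_state, effective):
    """Type 3: keep the original value; overwrite only the current value and its date."""
    rec["current_state"] = new_state
    rec["effective_date"] = effective


scd_type3_update(row, "California", date(2003, 1, 15))
scd_type3_update(row, "Texas", date(2003, 12, 15))  # California is no longer recorded
print(row)
```

Only the first and the latest value survive, so the row size stays fixed but the history is partial.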
Degenerate Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.
Degenerate dimensions are often used when a fact table's grain represents transaction-level data and one wishes to maintain system-specific identifiers such as order numbers, invoice numbers and the like without forcing their inclusion in their own dimension.
Conformed Dimensions:
A conformed dimension (written "confirmed dimension" in some course material) is a dimension that is defined once and reused, with the same keys and meaning, across fact tables and data marts.
It is also called a fixed dimension, because its definition does not vary from one use to the next.
Ex: if the name of a city is changed from Bombay to Mumbai, the change is made once in the shared dimension and is permanent, so every fact table that uses the dimension sees the same city name.
Junk Dimensions:
A dimension where one can store random transactional codes, flags and text attributes that are not related to other dimensions, and which provides a simple way for users to easily find those unrelated attributes.
Ex: Marital Status (Yes or No), Gender (M or F), etc.
Data Warehousing Objects Contd.
Hierarchies:
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A level represents a position in a hierarchy.
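The month to quarter to year aggregation can be sketched in Python as repeated roll-ups, each mapping a key to its parent level. The monthly figures and the roll_up helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical monthly sales, keyed by (year, month).
monthly = {(2019, 1): 10, (2019, 2): 20, (2019, 4): 5, (2019, 12): 15}


def roll_up(values, parent_of):
    """Aggregate values to the parent level returned by parent_of(key)."""
    totals = defaultdict(int)
    for key, amount in values.items():
        totals[parent_of(key)] += amount
    return dict(totals)


# Month level -> quarter level -> year level.
quarterly = roll_up(monthly, lambda ym: (ym[0], (ym[1] - 1) // 3 + 1))
yearly = roll_up(quarterly, lambda yq: yq[0])
print(quarterly, yearly)
```

Each level of the hierarchy is just a function from a child key to its parent key, so adding a level means adding one more roll-up step.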
Unique Identifiers:
Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing.
Relationships:
Relationships guarantee business integrity. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.
Physical Design in Data Warehouse
Physical design is the creation of the database with SQL statements. During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure.
Physical Design Structures:
Tablespaces: A tablespace consists of one or more data files, which are physical structures within the operating system you are using. A data file is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.
Tables and Partitioned Tables: Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. Using partitioned tables instead of non-partitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces.
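The idea behind partitioning can be illustrated outside any database with a small Python sketch that splits rows by month. This is a conceptual model only, with invented data; a real DBMS implements partitioning internally and with its own syntax.

```python
from collections import defaultdict

# Hypothetical sales rows: (sale_date, amount)
sales = [("2019-06-30", 10.0), ("2019-07-01", 20.0), ("2019-07-30", 5.0)]


def partition_by_month(rows):
    """Decompose one large table into smaller pieces keyed by 'YYYY-MM'."""
    parts = defaultdict(list)
    for sale_date, amount in rows:
        parts[sale_date[:7]].append((sale_date, amount))
    return dict(parts)


partitions = partition_by_month(sales)

# A query for July only needs to scan the July piece, not the whole table.
july_total = sum(amount for _, amount in partitions["2019-07"])
print(sorted(partitions), july_total)
```

The smaller pieces can also be loaded, archived or dropped independently, which is part of what makes large volumes more manageable.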
Physical Design In Data Warehouse Contd.
Views:
A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.
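A small experiment with Python's built-in sqlite3 module shows that a view stores a query rather than data. The sales table, the sales_by_product view and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (product TEXT, amount REAL);
    INSERT INTO sales VALUES ('pen', 2.0), ('pen', 3.0), ('book', 10.0);
    -- The view stores only the query text, not a copy of the rows.
    CREATE VIEW sales_by_product AS
        SELECT product, SUM(amount) AS total FROM sales GROUP BY product;
    """
)

query = "SELECT total FROM sales_by_product WHERE product = 'pen'"
before = conn.execute(query).fetchone()[0]

# New base-table rows are visible through the view immediately.
conn.execute("INSERT INTO sales VALUES ('pen', 5.0)")
after = conn.execute(query).fetchone()[0]
print(before, after)
```

Because the view is re-evaluated on each query, its result changes as soon as the underlying table changes.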
Integrity Constraints:
Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed.
Indexes:
Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments.
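The principle of a bitmap index can be sketched in a few lines of Python, using integers as bit vectors. This is a teaching model with invented column values, not how any particular database stores its bitmaps.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set if row i holds it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality columns of a five-row table.
gender = ["M", "F", "F", "M", "F"]
state = ["IL", "CA", "IL", "CA", "CA"]

gender_idx = build_bitmap_index(gender)
state_idx = build_bitmap_index(state)

# Combining two predicates is a single bitwise AND of two bitmaps.
rows = matching_rows(gender_idx["F"] & state_idx["CA"])
print(rows)
```

Columns with few distinct values, which are common in dimension-style attributes, need only a handful of bitmaps, and multi-condition filters reduce to cheap bitwise operations.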
Definition Of Data Warehouse
Ralph Kimball's paradigm:
Data warehouse is the conglomerate of all data marts within the
enterprise. Information is always stored in the dimensional model.
Bill Inmon's paradigm:
Data warehouse is one part of the overall business intelligence system.
An enterprise has one data warehouse, and data marts source their
information from the data warehouse. In the data warehouse, information
is stored in 3rd normal form
Basic Design Approaches of Data Warehouse
There are two major types of approaches to building or designing the
Data Warehouse.
The Top-Down Approach
The Bottom-Up Approach
The Top Down Approach
The Dependent Data Mart structure or Hub & Spoke: The Top-Down Approach
Inmon advocated a dependent data mart structure
The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS).
Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and then extracted and loaded into the data warehouse.
Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles will extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This helps organize the data in the particular structures required by the data marts. Then the data marts can be loaded with the data and the OLAP environment becomes available to the users.
The Top Down Approach Contd
Inmon Approach
The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of the particular department for which it is created.
The Bottom-Up Approach
1. The Data warehouse Bus Structure: The Bottom-Up Approach
Ralph Kimball designed the data warehouse with the data marts connected to it with a bus structure.
The bus structure contains all the common elements that are used by data marts, such as conformed dimensions, measures, etc., defined for the enterprise as a whole.
This architecture makes the data warehouse more of a virtual reality than a physical reality.
All data marts could be located on one server or on different servers across the enterprise, while the data warehouse would be a virtual entity, being nothing more than the sum total of all the data marts.
In this context, even the cubes constructed by using OLAP tools could be considered as data marts.
The Bottom-Up Approach Contd
Kimball Approach
The bottom-up approach reverses the positions of the data warehouse and the data marts. Data marts are directly loaded with the data from the operational systems through the staging area.
The data flow in the bottom-up approach starts with extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.
The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit into the data mart structure. The data from the data mart is then extracted to the staging area, aggregated, summarized and so on, loaded into the data warehouse, and made available to the end user for analysis.
DW Operational Processes (Overview of Extraction, Transformation & Loading)
Operational/Source Data
Source data formats: sequential, legacy, relational, external
Typically host-based legacy applications: customized applications, COBOL, 3GL, 4GL
Point-of-contact devices: POS, ATM, call switches
External sources: Nielsen's, Acxiom, CMIE, vendors, partners
DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd
These tools try to automate or support tasks such as:-
Data Extraction (accessing different source databases)
Data Cleansing (finding and resolving inconsistencies in the source data)
Data Transformation (between different data formats, languages, etc.)
Data Loading
Replication (replicating source databases into the data warehouse)
Analyzing & Checking of Data Quality (for correctness and completeness)
Building derived data & views
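The tasks listed above can be strung together in a toy Python pipeline. Everything here is hypothetical (the source rows, the cleansing rules and the in-memory list standing in for the warehouse); it only shows how extraction, cleansing, transformation and loading hand data to one another.

```python
# Hypothetical raw rows as they might arrive from a source system.
source_rows = [
    {"id": "1", "city": " pune ", "amount": "100.50"},
    {"id": "2", "city": "PUNE", "amount": "n/a"},       # inconsistent amount
    {"id": "1", "city": " pune ", "amount": "100.50"},  # duplicate record
]


def extract():
    """Extraction: read the rows from the source."""
    return list(source_rows)


def cleanse(rows):
    """Cleansing: drop rows with non-numeric amounts and duplicate ids."""
    seen, clean = set(), []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        if row["id"] not in seen:
            seen.add(row["id"])
            clean.append(row)
    return clean


def transform(rows):
    """Transformation: convert types and normalize text formats."""
    return [
        {"id": int(r["id"]), "city": r["city"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]


def load(rows, warehouse):
    """Loading: append the prepared rows to the target and report the count."""
    warehouse.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(cleanse(extract())), warehouse)
print(loaded, warehouse)
```

Keeping each task as a separate function mirrors the way ETL tools let the stages be configured and monitored independently.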
Elements of a Data Warehouse
Loading the Warehouse
Cleaning the data before it is loaded
These processes are discussed in detail in the ETL section.
Some important definitions:
Data Scrubbing: http://www.wisegeek.com/what-is-data-scrubbing.htm
Data Cleansing: http://www.wisegeek.com/what-is-data-cleansing.htm
Row level security: http://www.securityfocus.com/infocus/1743
Staging Types: http://esj.com/Columns/article.aspx?EditorialsID=55
Technical Problems in Data Warehouse
Managing large amounts of data:
The explosion of data volume came about because the data warehouse required that both detail and history be mixed in the same environment. Large amounts of data need to be managed in many ways: through flexibility of addressability of data stored inside the processor and stored inside disk storage, through indexing, through extensions of data, through the efficient management of overflow, and so forth. To be effective, the technology used must satisfy the requirements for both volume and efficiency.
Index/Monitor Data:
If data in the warehouse cannot be easily and efficiently indexed, the data warehouse will not be a success. Monitoring data warehouse data determines such factors as the following:
If a reorganization needs to be done
If an index is poorly structured
If too much or not enough data is in overflow
The statistical composition of the access of the data
Available remaining space
Technical Problems in Data Warehouse Contd
Interfaces to many technologies:
Data passes into the data warehouse from the operational environment and the ODS, and from the data warehouse into data marts, DSS applications, exploration and data mining warehouses, and alternate storage.
This passage must be smooth and easy.
The interface to different technologies requires several considerations:
Does the data pass from one DBMS to another easily?
Does it pass from one operating system to another easily?
Does it change its basic format in passage (EBCDIC, ASCII, etc.)?
Meta Data Management:
The data warehouse operates under a heuristic, iterative development life cycle. To be effective, the user of the data warehouse must have access to meta data that is accurate and up-to-date.
Several types of meta data need to be managed in the data warehouse: distributed meta data, central meta data, technical meta data, and business meta data.
Efficient Loading of Data
Data is loaded into a data warehouse in two fundamental ways: a record at a time through a language interface, or en masse with a utility.
Indexes must be efficiently loaded at the same time the data is loaded. As the burden of the volume of loading becomes an issue, the load is often parallelized.
Another related approach to the efficient loading of very large amounts of data is staging the data prior to loading.
As a rule, large amounts of data are gathered into a buffer area before being processed by extract/transform/load (ETL) software. The staged data is merged, perhaps edited, summarized, and so forth, before it passes into the ETL layer.
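The two loading styles and the staging step can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the `daily_sales` table, the sample rows and the helper functions are hypothetical, and `executemany` merely stands in for a vendor bulk-load utility.

```python
import sqlite3

# Hypothetical staged rows held in a buffer area: (sale_date, product, amount).
staged = [
    ("2019-07-01", "widget", 10.0),
    ("2019-07-01", "widget", 5.0),
    ("2019-07-01", "gadget", 7.5),
    ("2019-07-02", "widget", 2.5),
]

INSERT = "INSERT INTO daily_sales VALUES (?, ?, ?)"


def summarize(rows):
    """Merge staged rows into one total per (sale_date, product)."""
    totals = {}
    for sale_date, product, amount in rows:
        key = (sale_date, product)
        totals[key] = totals.get(key, 0.0) + amount
    return [(d, p, t) for (d, p), t in sorted(totals.items())]


def new_warehouse():
    """Create an in-memory table whose index is maintained as rows load."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_sales (sale_date TEXT, product TEXT, total REAL)"
    )
    conn.execute("CREATE INDEX idx_daily_sales ON daily_sales (sale_date, product)")
    return conn


def load_record_at_a_time(conn, rows):
    """Language-interface style: one INSERT and one commit per record."""
    for row in rows:
        with conn:
            conn.execute(INSERT, row)


def load_en_masse(conn, rows):
    """Utility style: the whole batch goes in as a single transaction."""
    with conn:
        conn.executemany(INSERT, rows)


def contents(conn):
    query = "SELECT * FROM daily_sales ORDER BY sale_date, product"
    return conn.execute(query).fetchall()


summary = summarize(staged)  # staging: merge and summarize before loading

slow = new_warehouse()
load_record_at_a_time(slow, summary)

fast = new_warehouse()
load_en_masse(fast, summary)

print(contents(fast))
```

Both paths produce the same table; the difference is the per-record overhead, which is what makes the en masse path preferable as volume grows.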
Technical Problems in Data Warehouse Contd
Lock Management:
The lock manager ensures that two or more people are not updating the same record at the same time. But update is not done in the data warehouse; instead, data is stored in a series of snapshot records. When a change occurs, a new snapshot record is added, rather than an update being done.
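A minimal sketch of the snapshot approach, again using the standard `sqlite3` module. The `customer_snapshot` table, its columns and the helper functions are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE customer_snapshot (
           customer_id   INTEGER NOT NULL,
           snapshot_date TEXT    NOT NULL,  -- ISO dates sort chronologically
           city          TEXT    NOT NULL,
           PRIMARY KEY (customer_id, snapshot_date)
       )"""
)


def record_change(conn, customer_id, snapshot_date, city):
    """Append a new snapshot; existing rows are never updated."""
    with conn:
        conn.execute(
            "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
            (customer_id, snapshot_date, city),
        )


def city_as_of(conn, customer_id, as_of_date):
    """Return the city from the latest snapshot on or before as_of_date."""
    row = conn.execute(
        "SELECT city FROM customer_snapshot "
        "WHERE customer_id = ? AND snapshot_date <= ? "
        "ORDER BY snapshot_date DESC LIMIT 1",
        (customer_id, as_of_date),
    ).fetchone()
    return row[0] if row else None


# The customer moves; the change is stored as a second snapshot.
record_change(conn, 1, "2019-01-01", "Pune")
record_change(conn, 1, "2019-06-15", "Mumbai")

snapshot_count = conn.execute(
    "SELECT COUNT(*) FROM customer_snapshot"
).fetchone()[0]

print(city_as_of(conn, 1, "2019-03-01"))
print(city_as_of(conn, 1, "2019-07-30"))
```

Because rows are only ever appended, writers never contend for the same record, and history remains queryable for any past date.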
Steps in Building a Data Warehouse:
Identify key business drivers, sponsorship, risks, ROI
Survey information needs, identify desired functionality and define functional requirements for the initial subject area
Architect the long-term data warehousing architecture
Evaluate and Finalize DW tool & technology
Conduct Proof-of-Concept
Design target data base schema
Build data mapping, extract, transformation, cleansing and aggregation/summarization rules
Build the initial data mart using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases
Maintain and administer data warehouse
Representative DSS Tools
Tool Category: Products
ETL Tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
OLAP Server: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
OLAP Tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
Data Warehouse: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick
Data Mining & Analysis: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools
Business Intelligence
How intelligent can you make your business processes?
What insight can you gain into your business?
How integrated can your business processes be?
How much more interactive can your business be with customers, partners, employees and managers?
What is Business Intelligence (BI)?
Business Intelligence is a generalized term applied to a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help enterprise users make better business decisions.
Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.
An alternative way of describing BI is: the technology required to turn raw data into information to support decision-making within corporations and business processes.
Why BI?
BI technologies help bring decision-makers the data in a form they can quickly digest and apply to their decision making.
BI turns data into information for managers, executives and, in general, anyone making decisions in a company.
Companies want to use technology tactically to make their operations more effective and more efficient. Business intelligence can be the catalyst for that efficiency and effectiveness.
Benefits
The benefits of a well-planned BI implementation are going to be closely tied to the business objectives driving the project.
Identify trends and anomalies in business operations more quickly, allowing for more accurate and timelier decisions.
Deliver actionable insight and information to the right place with less effort.
Identify and operate based on a single version of the truth, allowing all analysis to be completed on a core foundation with confidence.
Business Intelligence Platform Requirements
Data Warehouse Databases
OLAP
Data Mining
Interfaces
Build and Manage Capabilities
The business intelligence platform should provide good integration across these technologies. It should be a coherent platform, not a set of diverse and heterogeneous technologies.
Business Intelligence Components
[Diagram: Operational Data, Extract, Transform, Load, Data Warehouse, OLAP, Data Mining]
Business Intelligence Architecture
Business Intelligence Technologies
[Figure: layered diagram, listed from data sources to decision making; the potential to support business decisions increases toward decision making]
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Warehouses / Data Marts
Data Exploration: OLAP, DSS, EIS, Querying and Reporting
Data Mining: Information discovery
Data Presentation: Visualization Techniques
Decision Making
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DB Admin