Data Warehouse Overview Done
Transcript of Data Warehouse Overview Done
-
8/6/2019 Data Warehouse Overview Done
1/112
1
Introduction to Data Warehouse
(Slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
2
Introduction to Data Warehousing and Data Mining
1) Data Warehouse Introduction
2) Engineering Conflicts
3) OLTP and DSS
4) Stovepipe vs. Integration
5) Data Warehouse Solution
6) Enterprise Information System
7) Security in a Data Warehouse
8) Moving Data to a Data Warehouse
9) Data Marts
10) Data Mining
3
Introduction
Key topics for this course include:
Data Warehouse
Data Mart
Data Mining
Background and review of relational database
systems
Main focus on data warehouse and data mining
4
Data Warehouse Introduction
A data warehouse is a single source for key, corporate information needed to enable business decisions
A database application is a piece of software that provides a user interface for users to add, delete, query and update data
Typically, a database management system is used to actually do the work of adding, deleting, querying or updating data
[Diagram: Application, Database System, Data]
5
Engineering Conflicts, Query and Update
It is often an engineering problem when data is updated and long-running queries occur at the same time
In some cases, the users who are doing updates must wait for queries to complete
One way to avoid this is to make a read-only copy of data
[Diagram: Application, Database System, Data for update, Data for query]
6
OLTP and DSS Defined
An application that updates is called an on-line transaction processing (OLTP) application
An application that issues queries to the read-only database is called a decision support system (DSS)
[Diagram: OLTP Application, DSS Application, Database System, OLTP Data, DSS Data]
7
Applications in a Typical Enterprise
Most organizations have several disparate OLTP/DSS applications in several databases
[Diagram: Inventory, Finance, and Sales each have an OLTP Application and a DSS Application, with separate OLTP Data and DSS Data for each, inside the DATABASE SYSTEM]
8
Stovepipe vs Integration
When systems stand by themselves they are often referred to as stovepipes
Systems that easily share data are called well-integrated systems
[Diagram: Finance DSS Application, Inventory OLTP Application, Finance OLTP Application, Inventory DSS Application]
9
Problems with Stovepipe Architecture
Problems:
Users who wish to access data must query several different DSS to find it
Data may have fundamental conflicts between DSS
a department code table in one DSS may differ in another DSS
a measurement may be stored in meters in one DSS and yards in another
Solution:
Use a data warehouse, where data is integrated from the several different stovepipe systems
A data warehouse is really sharing-lite -- you don't have to coordinate as much when applications are built and you still reap the benefits of data sharing
10
Data Warehouse Solution
A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions
A data warehouse has the key, corporate data in the organization
A data warehouse tracks historical data
11
Data Warehouse - A Success Story
Largest data warehouse is Wal-Mart (9 TB)
Uses for Wal-Mart data warehouse
Identifies where a new store should be built based on customer demand
Identifies how stores are performing across the nation
Contains every scan from every purchase
Benefits Wal-Mart gained from their data warehouse
Provided competitive advantage over K-Mart
Reduced excess inventory in individual stores
Avoided wasted funds in building stores which would fail
12
Selling the Data Warehouse
A data warehouse project will fail without corporate sponsorship
Preferably, the project should be sponsored by the CEO
The CEO must be sold on the value to the business to improve competitive advantage by deploying a data warehouse
If an active, corporate sponsor does not exist, data sources will be very difficult to identify
Only add data to the warehouse that will answer key, corporate questions asked by the corporate sponsor.
Otherwise, you will have a data dump
13
Building a Useful Data Warehouse
You really need:
strong executive sponsorship
good knowledge of the data
sound software engineering
stability from source systems
users who want a success
A 75 percent failure rate is often cited
It is WORTH the effort!!!
14
Enterprise Information System
[Diagram: Enterprise Information System, Data Warehouse]
An EIS (Enterprise Information System) allows users to query data in a data warehouse
Users can access key, corporate data in the data warehouse
15
Users of an Enterprise Information System
Frequently, multiple EIS are needed to satisfy different types of users
Some users only want a system that has pre-defined reports so they only need to click one button to see data they need. These users want the system to be no harder to use than a coffee pot
Other users want to delve into the data and build their own queries
Executives want high-level, summary data and a simple tool
Must be VERY easy to use; users want to click a few buttons and get data they want
Results must be graphs
Users should be able to drill down into key areas.
16
Users of an Enterprise Information System
Analysts want a flexible, more detailed tool
Often very knowledgeable about the data
Willing to do more work to learn about the data
Sometimes even learn SQL to issue their own ad-hoc queries
General users want a tool that provides detailed data, but is very easy to use
Want access to the data warehouse to do routine tasks such as "Find me Hank's phone number", etc.
Simple application, but not so focused on large reports
17
Data Warehouse / EIS
[Diagram: the Finance, Inventory, and Sales OLTP Applications with their OLTP Data; the Data Warehouse with Finance, Inventory, and Sales Subject Areas; the Enterprise Information System]
18
Need for Data Warehouses
Data warehouses provide a single place to store key corporate data
The idea is that users can go to one place to find this key data using an enterprise information system (EIS)
Data warehouse is also a place to store and access historical data
Users measure performance goals for their company over a period of time
Company statistics are available
Data not stored in the same place is difficult to locate and compare, and is easily lost
Single query can be used to access key data
19
Security in Data Warehouse
Building a data warehouse does increase security risk because key, corporate information is all in one place
To mitigate that risk, database system components can be used to protect the data warehouse. These include
Views
Access control
Security Administration
Encryption
Audit
20
Moving Data into the Data Warehouse
Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing
Updates to the data warehouse are performed periodically
weekly
nightly
monthly
Occasionally, real-time data is needed in a data warehouse, but this is not very common
21
Using Middleware to Move Data
[Diagram: Source OLTP System, Data Warehouse Migration Software (Middleware), Data Warehouse]
Data can be moved to the warehouse via data migration software
This is often called middleware because it sits between the source OLTP and the data warehouse
22
Need for a Data Mart
A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data
Sometimes, users only need a piece of data from the data warehouse
The data mart is typically fed from the data warehouse
[Diagram: Data Warehouse with Finance, Inventory, and Sales Subject Areas; New York Data Mart; California Data Mart]
23
Data Mart in Action
[Diagram: the Finance, Inventory, and Sales OLTP Applications with their OLTP Data; the Data Warehouse with Finance, Inventory, and Sales Subject Areas; the New York and California Data Marts; the Enterprise Information System]
24
Data Mining Introduction
Data Mining is done by running software that examines a database and looks for patterns in the data
A data warehouse by itself will respond to queries from users
It will not tell users about patterns in data that users may not have thought about
To find patterns in data, data mining is used to try and mine key information from a data warehouse
25
Advantages of Data Mining
Data mining allows companies to collect information that makes them more productive and helps them beat their competition
Data mining helps identify
why customers buy certain products
ideas for very direct marketing
ideas for shelf placement
training of employees vs. employee retention
employee benefits vs. employee retention
26
Implementing Data Mining
Apply data mining tools to run data mining algorithms against data
There are two approaches:
Copy data from the Data Warehouse and mine it
Mine the data in the Data Warehouse
Popular tools use a variety of different data mining algorithms:
association rules
genetic algorithms
decision trees
neural networks
27
Data Mining using Separate Data
You can move data from the data warehouse to data mining tools
Advantages
Data mining tools may organize data so they can run faster
Disadvantages
Could be very expensive to move large amounts of data
[Diagram: Data Warehouse, Data Mining Tool, Copy of data made by the Data Mining Tool]
28
Data Mining Against the Data Warehouse
Data mining tools can access data directly in the Data Warehouse
Advantages
No copy of data is needed for data mining
Disadvantages
Data may not be organized in a way that is efficient for the tool
[Diagram: Data Warehouse, Data Mining Tool]
29
Data Mining: Summary
Data mining attempts to find patterns in data that we did not know about
Often data mining is just a new buzzword for statistics
Data mining differs from statistics in that large volumes of data are used
Many different data mining algorithms exist and we will discuss them in the course
Examples
identify users who are most likely to commit credit card fraud
identify which attributes of a person most often result in them buying product x.
30
SQL Review
(Slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
31
Introduction to SQL
1) Introduction to SQL
2) Data Definition Language (DDL)
3) Data Manipulation Language (DML)
4) SELECT Construct
5) SELECT Operators
6) Wildcard Searches
7) Aggregate Operators
8) Calculated Attributes
9) Sorting Results
32
Introduction to Structured Query Language
Structured Query Language (SQL) is the language used to communicate with a relational database
Industry standard
Based on set theory
SQL composed of two types of constructs:
Data Definition Language (DDL)
Defines the structure of the database
Data Manipulation Language (DML)
Provides the constructs to input and retrieve data
33
SQL Overview - DDL
Data Definition Language (DDL) is used to describe the structure of the database
Create tables, indexes, etc.
Typical Operations are:
CREATE TABLE defines what columns are in the table and
establishes the table
CREATE INDEX defines an index for the table. Indexes are used
to improve database performance
34
SQL Overview - DML
Data Manipulation Language (DML) is used for storing, updating, and retrieving data.
Typical operations include:
SELECT is used to retrieve data.
Ex: SELECT * FROM PRODUCTS
INSERT is used to add new rows to the database.
Ex: INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares')
UPDATE is used to change rows that already exist in the database.
Ex: UPDATE PRODUCTS SET PRICE = PRICE + 4
DELETE is used to eliminate rows of data from the database.
Ex: DELETE FROM PRODUCTS
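The four DML operations can be tried end to end with a small script. This is a minimal sketch using Python's built-in sqlite3 module; the two-column PRODUCTS(Name, Price) layout and the sample rows are assumptions made for illustration, not the course's actual table.

```python
import sqlite3

# In-memory database; PRODUCTS(Name, Price) is an assumed layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCTS (Name TEXT, Price REAL)")

# INSERT adds new rows.
conn.executemany(
    "INSERT INTO PRODUCTS VALUES (?, ?)",
    [("Cereal", 3.0), ("Fruit", 1.0), ("Hammer", 9.0)],
)

# UPDATE with no WHERE clause changes every row in the table.
conn.execute("UPDATE PRODUCTS SET Price = Price + 4")
after_update = conn.execute(
    "SELECT Name, Price FROM PRODUCTS ORDER BY Name"
).fetchall()

# DELETE with a WHERE clause removes only the matching rows ...
conn.execute("DELETE FROM PRODUCTS WHERE Price > 10")
after_filtered_delete = conn.execute(
    "SELECT Name FROM PRODUCTS ORDER BY Name"
).fetchall()

# ... while DELETE with no WHERE clause empties the table.
conn.execute("DELETE FROM PRODUCTS")
remaining = conn.execute("SELECT COUNT(*) FROM PRODUCTS").fetchone()[0]

print(after_update, after_filtered_delete, remaining)
```

Note that UPDATE and DELETE without a WHERE clause act on every row, so in practice they are usually paired with a filter.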
35
SELECT Overview
SELECT is used to retrieve records from the database.
Single table SELECT constructs:
WHERE
IN
BETWEEN
LIKE
Aggregate Operators
DISTINCT
ORDER BY
36
SELECT Examples
Query Purpose: Retrieve names and prices of all products
SELECT ProductName, Price
FROM TinyProducts
Query Purpose: Retrieve all information for all products from the TinyProducts table
SELECT *
FROM TinyProducts
37
SELECT with WHERE
The WHERE clause is used to filter which information is returned from a SELECT
Query Purpose: Retrieve all information only for product type of food
SELECT *
FROM TinyProducts
WHERE ProductType = 'Food'
38
Use of Boolean Operators
Conditions can be combined with Boolean operators: AND, OR, NOT
Query Purpose: List all information about food products that are either cereal or fruit
SELECT *
FROM TinyProducts
WHERE (ProductName = 'Cereal')
OR (ProductName = 'Fruit')
39
Boolean Operator Example
Query Purpose: List the type and name of all fruit products whose price is less than $2.00
SELECT ProductType, ProductName
FROM TinyProducts
WHERE Price < 2
AND ProductName = 'Fruit'
40
IN Operator
The IN operator allows a search for records that match one value in a set of unordered values
Example questions to use IN:
'Find all products whose type is Food, Hardware, or Housewares'
'Find all food whose type is Meat, Fish, Vegetables, or Fruit'
41
IN Example
Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes
SELECT ProductName, ProductType
FROM TinyProducts
WHERE ProductName IN
('Cookware', 'Linens', 'Dishes')
instead of:
SELECT ProductName, ProductType
FROM TinyProducts
WHERE (ProductName = 'Cookware')
OR (ProductName = 'Linens')
OR (ProductName = 'Dishes')
42
BETWEEN Operator
The BETWEEN operator allows a search for a range of values
Example Queries:
'Find all fruit between Bananas and Grapes'
'Find all cereals whose price is between $1.50 and $4.00 a box'
43
BETWEEN Example
Query Purpose: Find all products whose price is between $2.00 and $8.00
SELECT ProductName, Price
FROM TinyProducts
WHERE Price BETWEEN 2.00 AND 8.00
instead of:
SELECT ProductName, Price
FROM TinyProducts
WHERE (Price >= 2.00) AND (Price <= 8.00)
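BETWEEN includes both endpoints, so it is equivalent to two comparisons joined with AND. The sketch below checks this with Python's sqlite3 module; the TinyProducts rows are made-up sample data. It also shows that joining the comparisons with OR would match every row, because any price satisfies at least one of the two conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, Price REAL)")
# Sample rows (assumed for illustration); two sit exactly on the range endpoints.
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("Cereal", 2.00), ("Fruit", 1.50), ("Cookware", 8.00), ("Linens", 9.25)],
)

def names(where_clause):
    """Return product names matching the given WHERE clause, sorted by name."""
    sql = (
        "SELECT ProductName FROM TinyProducts WHERE "
        + where_clause
        + " ORDER BY ProductName"
    )
    return [row[0] for row in conn.execute(sql)]

with_between = names("Price BETWEEN 2.00 AND 8.00")
with_and = names("(Price >= 2.00) AND (Price <= 8.00)")
with_or = names("(Price >= 2.00) OR (Price <= 8.00)")

print(with_between)  # endpoints 2.00 and 8.00 are included
print(with_or)       # OR does not restrict the range at all
```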
45
Wildcard Search Examples
Query Purpose: List all products whose name starts with a 'C'
SELECT *
FROM TinyProducts
WHERE ProductName LIKE 'C%'
Query Purpose: List all products that have a SKU number with the last 2 characters of '23' when you don't know the first character
SELECT *
FROM TinyProducts
WHERE SKUNumber LIKE '_23'
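The difference between the two wildcards is easy to see on a few rows: % matches any run of characters (including none), while _ matches exactly one character. A minimal sketch with Python's sqlite3 module follows; the SKU values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, SKUNumber TEXT)")
# Hypothetical SKU numbers chosen to exercise both wildcards.
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("Cereal", "123"), ("Cookware", "A23"), ("Fruit", "4423"), ("Dishes", "231")],
)

def names(where_clause):
    """Return product names matching the given WHERE clause, sorted by name."""
    sql = (
        "SELECT ProductName FROM TinyProducts WHERE "
        + where_clause
        + " ORDER BY ProductName"
    )
    return [row[0] for row in conn.execute(sql)]

starts_with_c = names("ProductName LIKE 'C%'")
one_char_then_23 = names("SKUNumber LIKE '_23'")   # exactly three characters
anything_then_23 = names("SKUNumber LIKE '%23'")   # any length ending in 23

print(starts_with_c, one_char_then_23, anything_then_23)
```

Whether LIKE is case sensitive depends on the database system and its settings, so a pattern such as 'C%' may or may not match lowercase names.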
46
Aggregate Operators
MIN, MAX, and AVG are used when computing statistics on a range of data
Query Examples:
'What is the highest batting average on the team?'
'What is the average number of hits for all the little league teams in
the National League?'
'What are the names of the players that had the lowest average on
the little league team?'
47
Aggregate Operators Example
Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League
SELECT MIN(Average), MAX(Average),
AVG(Average)
FROM PLAYERS
WHERE League = 'National'
48
SUM and COUNT Operators
Use the SUM operator to total the results of a query
COUNT will count the total number of occurrences of an item in a search
1 + 2 + 3 + 4
49
SUM and COUNT Examples
Query Purpose: Find the total number of home runs hit by all players in the American League
SELECT SUM(HomeRuns)
FROM PLAYERS
WHERE League = 'American'
Query Purpose: Count the players that have hit 3 home runs in the National League
SELECT COUNT(*)
FROM PLAYERS
WHERE HomeRuns = 3
AND League = 'National'
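SUM adds up a column's values, while COUNT(*) returns how many rows matched, not which rows. The sketch below runs both against a small PLAYERS table in Python's sqlite3 module; the player names and numbers are invented sample data. To get the names rather than the count, select the Name column with the same WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAYERS (Name TEXT, League TEXT, HomeRuns INTEGER)")
# Invented sample data.
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?)",
    [
        ("Ann", "American", 2),
        ("Bob", "American", 5),
        ("Cal", "National", 3),
        ("Dee", "National", 3),
        ("Eli", "National", 1),
    ],
)

# SUM totals the HomeRuns column over the matching rows.
total_american = conn.execute(
    "SELECT SUM(HomeRuns) FROM PLAYERS WHERE League = 'American'"
).fetchone()[0]

# COUNT(*) returns the number of matching rows.
three_hr_count = conn.execute(
    "SELECT COUNT(*) FROM PLAYERS WHERE HomeRuns = 3 AND League = 'National'"
).fetchone()[0]

# Selecting Name with the same filter lists the players themselves.
three_hr_names = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM PLAYERS "
        "WHERE HomeRuns = 3 AND League = 'National' ORDER BY Name"
    )
]

print(total_american, three_hr_count, three_hr_names)
```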
50
Calculated Attributes
A new attribute can be obtained by using arithmetic operators (+, -, *, /) on other numeric attributes
All operators follow standard precedence:
Multiplication and division are computed first left to right
Addition and subtraction are computed last left to right
Use parentheses to override the standard precedence
51
Calculated Attributes Example
Query Purpose: List all players with their hits, at bats, and their batting average
SELECT Name, Hits, AtBats,
(Hits / AtBats)
FROM PLAYERS
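One caveat when computing a ratio such as Hits / AtBats: if both columns are stored as integers, some database systems perform integer division and truncate the result. The sketch below shows this behavior in SQLite through Python's sqlite3 module, along with one common workaround (multiplying by 1.0 first). The column types and sample rows are assumptions, and other systems may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Both counting columns are integers (an assumption for this sketch).
conn.execute("CREATE TABLE PLAYERS (Name TEXT, Hits INTEGER, AtBats INTEGER)")
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?)",
    [("Ann", 3, 10), ("Bob", 5, 20)],
)

# Integer / integer truncates toward zero in SQLite.
truncated = conn.execute(
    "SELECT Name, Hits / AtBats FROM PLAYERS ORDER BY Name"
).fetchall()

# Multiplying by 1.0 first forces floating-point division.
exact = conn.execute(
    "SELECT Name, 1.0 * Hits / AtBats FROM PLAYERS ORDER BY Name"
).fetchall()

print(truncated)
print(exact)
```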
52
DISTINCT Operator
DISTINCT is used to exclude duplicate occurrences in the result of a query
Query Purpose: List all distinct batting averages
SELECT DISTINCT(Average)
FROM PLAYERS
53
Sorting Query Results
The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query
Use DESC on the end of the ORDER BY clause to sort the data in descending order. Otherwise, the result will be in ascending order
54
Sorting Example
Query Purpose: List all players in ascending order of their batting average
SELECT Name, Average
FROM PLAYERS
ORDER BY Average
For descending order add the keyword DESC
SELECT Name, Average
FROM PLAYERS
ORDER BY Average DESC
55
Sorting Calculated Attributes
To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT
Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query)
SELECT Name, Hits, AtBats,
Hits / AtBats
FROM PLAYERS
ORDER BY 4 DESC
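The position used in ORDER BY counts select-list items from 1. With Name, Hits, AtBats, and the computed ratio, the ratio is item 4, while position 3 would sort by AtBats. The sketch below (Python's sqlite3 module, invented sample rows) sorts by position and, as an alternative that many systems also accept, by a column alias; the alias name BattingAverage is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAYERS (Name TEXT, Hits INTEGER, AtBats INTEGER)")
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?)",
    [("Ann", 3, 10), ("Bob", 5, 20), ("Cal", 4, 10)],
)

# Sort by the fourth select-list item (the computed batting average).
by_position = [
    row[0]
    for row in conn.execute(
        "SELECT Name, Hits, AtBats, 1.0 * Hits / AtBats "
        "FROM PLAYERS ORDER BY 4 DESC"
    )
]

# Same ordering using a (hypothetical) alias for the computed column.
by_alias = [
    row[0]
    for row in conn.execute(
        "SELECT Name, Hits, AtBats, 1.0 * Hits / AtBats AS BattingAverage "
        "FROM PLAYERS ORDER BY BattingAverage DESC"
    )
]

print(by_position, by_alias)
```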
56
More SQL
1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins
5) Equijoins
6) Cartesian Product
7) Nulls
8) OUTER JOIN
57
GROUP BY Clause
GROUP BY will partition a table into multiple groups of related rows.
As an example, consider the EMPLOYEE table where Department partitions the EMPLOYEE set into subsets:
Engineering
Marketing Customer
Finance
58
GROUP BY Example
Query Purpose: For each department, list the average salary using the EMPLOYEE table
SELECT Department, AVG(Salary)
FROM EMPLOYEE
GROUP BY Department
59
GROUP BY With WHERE
To filter data further, we can use the WHERE clause with the GROUP BY clause
Query Purpose: For each department, list the highest salary of their administrative assistants.
SELECT Department, MAX(Salary)
FROM EMPLOYEE
WHERE Title='administrative assistant'
GROUP BY Department
60
HAVING Construct
HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX and AVG, to only those groups of rows that meet some condition.
Query Purpose: List the average salary for all departments that have more than three employees.
SELECT Department, AVG(Salary)
FROM EMPLOYEE
GROUP BY Department
HAVING COUNT(*) > 3
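GROUP BY forms the groups and HAVING then filters whole groups, in contrast to WHERE, which filters individual rows before grouping. A minimal sketch with Python's sqlite3 module, using an invented EMPLOYEE table with ten rows across three departments:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Name TEXT, Department TEXT, Salary REAL)")
# Invented sample data: Engineering and Finance have four employees, Marketing two.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [
        ("e1", "Engineering", 100), ("e2", "Engineering", 200),
        ("e3", "Engineering", 300), ("e4", "Engineering", 400),
        ("m1", "Marketing", 150), ("m2", "Marketing", 250),
        ("f1", "Finance", 100), ("f2", "Finance", 100),
        ("f3", "Finance", 200), ("f4", "Finance", 200),
    ],
)

# One result row per department.
avg_by_dept = conn.execute(
    "SELECT Department, AVG(Salary) FROM EMPLOYEE "
    "GROUP BY Department ORDER BY Department"
).fetchall()

# HAVING keeps only the groups with more than three rows.
large_depts = conn.execute(
    "SELECT Department, AVG(Salary) FROM EMPLOYEE "
    "GROUP BY Department HAVING COUNT(*) > 3 ORDER BY Department"
).fetchall()

print(avg_by_dept)
print(large_depts)
```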
61
Multi-Table SQL
It is often necessary to combine data from multiple tables.
EMPLOYEE
EmpID Name Salary
1 Fred 200
2 Ethel 300
3 Mike 400
4 David 100
ATTENDS
EmpID Name
1 Harvard
2 GMU
2 Yale
3 MIT
3 Stanford
3 GMU
62
Joins
Joins are the means by which multiple tables can be combined.
A join allows us to combine data from different tables. A join operation is done through the SELECT construct.
Types of Joins: Equijoin, Outer Join, Inner Join
63
Equijoin
Joins only those rows where a foreign key matches the primary key
Allows information from multiple tables to be linked together in a single query
Can be used to link as many tables as needed in a single query
64
Equijoin Query Example
Query Purpose: List the names of all colleges attended by Ethel
SELECT b.Name
FROM EMPLOYEE a, ATTENDS b
WHERE a.EmpID = b.EmpID
AND a.Name = 'Ethel'
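The equijoin for Ethel can be run against the EMPLOYEE and ATTENDS rows shown on the Multi-Table SQL slide. The sketch below uses Python's sqlite3 module; the column types are assumptions. It also shows the same query written with the explicit JOIN ... ON syntax, which is an equivalent way to state the linking condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)],
)
conn.executemany(
    "INSERT INTO ATTENDS VALUES (?, ?)",
    [(1, "Harvard"), (2, "GMU"), (2, "Yale"), (3, "MIT"), (3, "Stanford"), (3, "GMU")],
)

# Equijoin written with the linking condition in the WHERE clause.
where_style = [
    row[0]
    for row in conn.execute(
        "SELECT b.Name FROM EMPLOYEE a, ATTENDS b "
        "WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel' ORDER BY b.Name"
    )
]

# The same join written with explicit JOIN ... ON syntax.
join_style = [
    row[0]
    for row in conn.execute(
        "SELECT b.Name FROM EMPLOYEE a JOIN ATTENDS b ON a.EmpID = b.EmpID "
        "WHERE a.Name = 'Ethel' ORDER BY b.Name"
    )
]

print(where_style, join_style)
```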
65
Equijoin Example
EMPLOYEE
EmpID Name Salary
1 Fred 200
2 Ethel 300
3 Mike 400
ATTENDS
EmpID College GPA
1 Harvard 2.45
2 GMU 3.79
2 Nova 3.65
3 Yale 2.85
3 Nova 2.65
3 GMU 4.0
66
Warning about Joining Tables
A join is really just a subset of a cartesian product. When no fields are 'joined' in the WHERE clause, a cartesian product is produced
Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.
Sample Query:
SELECT b.Name
FROM EMPLOYEE a, ATTENDS b
WHERE a.Name = 'Ethel'
67
Cartesian Product
Each row in one table is paired with every row in the other table
a.EmpID a.Name a.Salary b.EmpID b.GPA
2 Ethel 300 1 3.4
2 Ethel 300 2 2.8
2 Ethel 300 3 3.7
2 Ethel 300 4 3.5
....
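The effect of a missing linking condition can be measured by counting rows. In the sketch below (Python's sqlite3 module, using the four EMPLOYEE rows and six ATTENDS rows from the Multi-Table SQL slide), the unrestricted product has 4 x 6 = 24 rows, and the query for Ethel without the link returns every college in ATTENDS instead of only hers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)],
)
conn.executemany(
    "INSERT INTO ATTENDS VALUES (?, ?)",
    [(1, "Harvard"), (2, "GMU"), (2, "Yale"), (3, "MIT"), (3, "Stanford"), (3, "GMU")],
)

# No condition at all: every EMPLOYEE row paired with every ATTENDS row.
all_pairs = conn.execute(
    "SELECT COUNT(*) FROM EMPLOYEE a, ATTENDS b"
).fetchone()[0]

# Linking condition omitted: Ethel is paired with every ATTENDS row.
without_link = conn.execute(
    "SELECT b.Name FROM EMPLOYEE a, ATTENDS b WHERE a.Name = 'Ethel'"
).fetchall()

# Linking condition present: only Ethel's own colleges remain.
with_link = conn.execute(
    "SELECT b.Name FROM EMPLOYEE a, ATTENDS b "
    "WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel'"
).fetchall()

print(all_pairs, len(without_link), len(with_link))
```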
68
Nulls
An attribute may be defined as null.
This indicates that the value is unknown and avoids the need for user-defined special indicators.
To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLE statement when setting up the database.
69
Nulls Examples
Statement Purpose: Add an employee whose salary is unknown
INSERT INTO EMPLOYEE VALUES (3, 'Hank', NULL)
Query Purpose: Find all employees whose salary is unknown (or null)
SELECT *
FROM EMPLOYEE
WHERE Salary IS NULL
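Two details about nulls are worth checking by hand: a comparison with = NULL never matches (IS NULL is required), and a NOT NULL column rejects the insert outright. The sketch below demonstrates both with Python's sqlite3 module; the STRICT_EMPLOYEE table name, the employee numbers, and the column types are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary REAL)")
conn.execute("INSERT INTO EMPLOYEE VALUES (1, 'Fred', 200)")
conn.execute("INSERT INTO EMPLOYEE VALUES (5, 'Hank', NULL)")

# "= NULL" evaluates to unknown for every row, so nothing is returned.
equals_null = conn.execute(
    "SELECT Name FROM EMPLOYEE WHERE Salary = NULL"
).fetchall()

# IS NULL is the test that finds unknown salaries.
is_null = conn.execute(
    "SELECT Name FROM EMPLOYEE WHERE Salary IS NULL"
).fetchall()

# A NOT NULL column refuses a null value (hypothetical table for illustration).
conn.execute(
    "CREATE TABLE STRICT_EMPLOYEE (EmpID INTEGER, Name TEXT, Salary REAL NOT NULL)"
)
try:
    conn.execute("INSERT INTO STRICT_EMPLOYEE VALUES (6, 'Ivy', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(equals_null, is_null, rejected)
```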
70
OUTER JOIN
An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables.
A LEFT OUTER JOIN returns all rows from the 'left' table.
Nulls are returned when a row in the 'left' table has no corresponding rows in the right table.
71
LEFT OUTER JOIN Example
Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges
SELECT a.Name, b.GPA
FROM EMPLOYEE a
LEFT OUTER JOIN ATTENDS b
on a.EmpID = b.EmpID
72
LEFT OUTER JOIN Example
Result of the outer join
All employees are listed.
For an equijoin, only those who attended a college would be listed
Here, employee number 4 did not attend college, but is still retrieved by the outer join.
Name GPA
---------- -----
Fred 2.45
Ethel 3.79
Ethel 3.65
Mike 2.85
Mike 2.65
Mike 4.00
David NULL
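The result above can be reproduced with the slide data. This is a sketch using Python's sqlite3 module, where SQL NULL comes back as Python's None; the column types and the ORDER BY clause (added to make the row order predictable) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, College TEXT, GPA REAL)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)],
)
conn.executemany(
    "INSERT INTO ATTENDS VALUES (?, ?, ?)",
    [
        (1, "Harvard", 2.45), (2, "GMU", 3.79), (2, "Nova", 3.65),
        (3, "Yale", 2.85), (3, "Nova", 2.65), (3, "GMU", 4.0),
    ],
)

# Every employee appears; David has no ATTENDS row, so his GPA is NULL (None).
outer_rows = conn.execute(
    "SELECT a.Name, b.GPA FROM EMPLOYEE a "
    "LEFT OUTER JOIN ATTENDS b ON a.EmpID = b.EmpID "
    "ORDER BY a.EmpID, b.GPA"
).fetchall()

# For comparison, the equijoin drops employees with no matching row.
inner_count = conn.execute(
    "SELECT COUNT(*) FROM EMPLOYEE a, ATTENDS b WHERE a.EmpID = b.EmpID"
).fetchone()[0]

print(outer_rows)
print(inner_count)
```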
73
Advanced SQL
(Slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
74
Advanced SQL
1) Finding the nth element in a list
2) Finding the median
3) Correlated subquery
4) Data Definition Language Constructs
75
Find the Nth Element
It is very common to try to find the nth element in a list.
Examples:
Who makes the second highest salary in the marketing department?
What is the fifth best product in sales?
This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving the result set until the threshold is crossed.
We show another way of doing this using standard SQL.
76
Find the Nth Element: Example Table
Consider a table, called TEST, with just one column, x, with the following values:
X
4
5
8
77
Find the Nth Element: Step 1
First join TEST with itself; this yields each element matched with every other element:
left right
4 4
4 5
4 8
5 4
5 5
5 8
8 4
8 5
8 8
78
Find the Nth Element: Step 2
Next keep only those rows where the first column is greater than or equal to the second column.
Notice the pattern that just developed: each number on the list now has a certain number of values that match on the right. This number matches the position of this value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, 8 has three matches.
left right
4 4
5 4
5 5
8 4
8 5
8 8
79
Find the Nth Element: Step 3
Now group by the column on the left and identify the size of each group.
The same ideas can be applied to any SELECT statement output.
left group size
4 1
5 2
8 3
80
Finding the Nth Element: Example
Query Purpose: Find the information about the product with the second highest price.
SELECT a.ProductName, a.ProductType,
a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumber
HAVING COUNT(*) =
(SELECT COUNT(*)-1 FROM TinyProducts)
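The three steps (self-join, keep rows where left >= right, group and count) can be checked on the TEST table with values 4, 5, and 8. The sketch below uses Python's sqlite3 module. It assumes the values are distinct; with duplicate values the group sizes no longer equal the positions, so the technique would need adjusting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TEST (x INTEGER)")
conn.executemany("INSERT INTO TEST VALUES (?)", [(4,), (5,), (8,)])

# Steps 1-3: self-join, keep pairs with a.x >= b.x, count per left value.
# The count is the position of a.x in ascending order.
positions = conn.execute(
    "SELECT a.x, COUNT(*) FROM TEST a, TEST b "
    "WHERE a.x >= b.x GROUP BY a.x ORDER BY a.x"
).fetchall()

# Second highest: the value whose position is (number of rows - 1).
second_highest = conn.execute(
    "SELECT a.x FROM TEST a, TEST b "
    "WHERE a.x >= b.x GROUP BY a.x "
    "HAVING COUNT(*) = (SELECT COUNT(*) - 1 FROM TEST)"
).fetchone()[0]

print(positions)
print(second_highest)
```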
81
Finding the Top N Elements: Example
To ask for the top n values instead of the nth value, specify a range (>=) instead of just an equality (=) in the HAVING.
Query Purpose: Find information about the products with the two highest prices.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumber
HAVING COUNT(*) >=
(SELECT COUNT(*)-1 FROM TinyProducts)
ORDER BY a.Price
82
Finding the Median
The median is defined as the element in the middle of the list.
Query Purpose: Find the median price in TinyProducts.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber
HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)
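The median query selects the row whose position equals COUNT(*)/2 + 1. The sketch below wraps that idea in a hypothetical helper, median_by_self_join, using Python's sqlite3 module and a one-column table T invented for the example. It assumes distinct values and integer division of COUNT(*)/2 (as SQLite performs); for an even number of rows it returns the upper of the two middle values rather than their average.

```python
import sqlite3

def median_by_self_join(values):
    """Return the middle element of distinct values via the self-join technique."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (x INTEGER)")
    conn.executemany("INSERT INTO T VALUES (?)", [(v,) for v in values])
    row = conn.execute(
        "SELECT a.x FROM T a, T b "
        "WHERE a.x >= b.x GROUP BY a.x "
        "HAVING COUNT(*) = (SELECT COUNT(*) / 2 + 1 FROM T)"
    ).fetchone()
    conn.close()
    return row[0]

odd_median = median_by_self_join([4, 5, 8])       # position 3/2 + 1 = 2
even_median = median_by_self_join([4, 5, 8, 9])   # position 4/2 + 1 = 3

print(odd_median, even_median)
```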
83
Using Subqueries
A subquery may be used in the middle of a query.
Query Purpose: Find the information about the highest priced product, using a simple subquery.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
FROM TinyProducts a
WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)
84
Correlated Subquery
If the subquery references a data element from outside of the subquery, it is called a correlated subquery.
For each row in the outer part of the query, the correlated subquery is executed.
The following query will indicate who makes more money than Ethel
SELECT a.Name, a.Salary
FROM Employee a WHERE EXISTS
(SELECT b.Salary
FROM Employee b
WHERE a.Salary > b.Salary
AND b.Name = 'Ethel')
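Running the correlated query against the EMPLOYEE rows from the Multi-Table SQL slide returns only Mike, the one employee whose salary exceeds Ethel's. The sketch below uses Python's sqlite3 module and also shows a non-correlated alternative with a scalar subquery, which gives the same answer here because exactly one row is named Ethel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)],
)

# Correlated form: the subquery refers to a.Salary from the outer query.
correlated = conn.execute(
    "SELECT a.Name, a.Salary FROM Employee a WHERE EXISTS "
    "(SELECT b.Salary FROM Employee b "
    " WHERE a.Salary > b.Salary AND b.Name = 'Ethel')"
).fetchall()

# Non-correlated alternative: compare against a single scalar subquery.
scalar = conn.execute(
    "SELECT Name, Salary FROM Employee "
    "WHERE Salary > (SELECT Salary FROM Employee WHERE Name = 'Ethel')"
).fetchall()

print(correlated, scalar)
```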
85
Other Data Manipulation
INSERT
Add rows to a single table
UPDATE
Modify rows in a single table
DELETE
Remove rows from a single table
86
INSERT Examples
Statement Purpose: Add a record for employee #1, 'Fred', with a salary of 200 to the EMPLOYEE table
INSERT INTO Employee VALUES
(1, 'Fred', 200)
Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE
INSERT INTO New_Employee
SELECT * FROM Employee
87
UPDATE Example
Statement Purpose: Modify Fred's salary to 150
UPDATE Employee
SET Salary = 150.00
WHERE Name = 'Fred'
Statement Purpose: Give all employees a ten percent raise
UPDATE Employee
SET Salary = Salary * 1.10
88
DELETE Examples
Statement Purpose: Remove all employees who have a salary higher than 100.
DELETE FROM Employee
WHERE Salary > 100
To remove all employees:
DELETE FROM Employee
89
CREATE TABLE Example
Statement Purpose: Create a table to store employee information
CREATE TABLE EMPLOYEE
(EmpId SMALLINT,
Name CHAR(10),
Salary DECIMAL(5,2))
To drop the EMPLOYEE table:
DROP TABLE EMPLOYEE
90
Data Warehouse Security
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
91
Data Warehouse Security
1) Key Security Services
2) Views
3) Access Control
4) Roles
5) Encryption
6) Audit Trails
7) Security Holes
8) Intrusion Detection
9) Misuse Detection
92
Introduction
A key feature provided by database systems is good security services.
In a database system with good security, applications do not have to worry about problems that arise with security violations.
A data warehouse also requires good security services because it holds key corporate data.
[Diagram: EIS connected to a Database System that provides Security Services]
93
Key Security Services
Access Control: controls who accesses what data
Administration of Access Control
Used to give access to users, and to track who has which accesses and what kind of accesses are given to a user or group of users
Audit: tracks the usage of the data warehouse
94
Security in a Data Warehouse
A data warehouse consolidates an organization's key data in one place.
This consolidation increases the risk that unauthorized users will try to obtain the data.
Security aspects of EIS applications must be designed and implemented very thoroughly.
Access control and audits are two of the critical components of security.
95
Data Warehouse Security Components
Database system components that can be used to protect a data warehouse include:
Views
Allow users to see only certain rows or columns of data
Access control
Indicates which users have access to what data
Administration
Used to give access to groups of users and to define the accesses given to either an individual or a group
Encryption
Protects data from access outside of the DBMS
Audit
Tracks what users are doing
96
Views in Data Warehouse
A view is a logical view into one or more tables. Users may be given access to the view without access to the base table.
Views provide some security assistance because they can hide data from users.
EMPLOYEE
Name    Address            Salary
Hank    1 South Street     $50,000
Esther  2 North Street     $80,000
Tom     34 Main Street     $90,000
Sue     45 Easy Street     $28,500
Dave    56 5th Avenue      $35,000
Pete    7 Broadway         $60,000
Kathy   89 Western Avenue  $85,000
97
View Example
A view called SAFE_EMPLOYEE may be created as:
CREATE VIEW SAFE_EMPLOYEE AS
(SELECT name, address FROM EMPLOYEE)
Now users of the view SAFE_EMPLOYEE will not even know that salary exists.
VIEW (SAFE_EMPLOYEE): Salary is effectively hidden
Name    Address
Hank    1 South Street
Esther  2 North Street
Tom     34 Main Street
Sue     45 Easy Street
Dave    56 5th Avenue
Pete    7 Broadway
Kathy   89 Western Avenue
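A minimal sketch of the SAFE_EMPLOYEE view, using Python's built-in sqlite3 module as a stand-in DBMS (an assumption; the slides do not name one here). It shows that a query against the view returns no salary column. SQLite has no GRANT command, so restricting users to the view alone is not demonstrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (name TEXT, address TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Hank", "1 South Street", 50000), ("Esther", "2 North Street", 80000)],
)

# The view exposes only name and address from the base table.
conn.execute("CREATE VIEW SAFE_EMPLOYEE AS SELECT name, address FROM EMPLOYEE")

cursor = conn.execute("SELECT * FROM SAFE_EMPLOYEE ORDER BY name")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()
print(view_columns)
print(view_rows)
```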
98
Updating Views
Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.
Example:
INSERT INTO SAFE_EMPLOYEE VALUES ('Hank', '1 South Street')
This will insert a NULL into the SALARY column of the base table EMPLOYEE.
Other restrictions to view updates exist:
Cannot update a view that is defined with an aggregate
Cannot update a view that is defined with a GROUP BY
99
Data Warehouse Access Control
Access control is implemented in a data warehouse with the SQL GRANT and REVOKE commands.
Syntax
GRANT privilege ON object
TO user
Example: GRANT SELECT ON EMPLOYEE TO MARY
Access control is done by DBAs and creators of tables.
To remove access, the REVOKE command is used.
Example: REVOKE SELECT ON EMPLOYEE FROM MARY
100
Database Roles
Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users. As an example, some roles for a company might be:
Administrative assistant
Loan officer
Salesperson
Accesses may be assigned based on roles. This dramatically simplifies administration.
If new tables are created, it is not necessary to add thousands of new accesses.
Examples (role syntax varies by DBMS):
CREATE ROLE loan_officer AS (Hank, John, Mike)
GRANT SELECT ON LOAN TO LOAN_OFFICER
101
Example of Application-based Roles
Consider:
If the database system controls accesses, then it does not matter what the application does; accesses are controlled consistently (same for SALES as MARKETING).
However, more fine-grained access control can be granted in the application.
[Diagram: Users, Applications, Database System, Data]
102
Application Roles
The application can restrict:
Data entry screens
Reports
Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.
103
Role Based Security in a Data Warehouse
Both application and database level security are useful in a data warehouse.
Database level security is needed so that users are only allowed to see data they need to see.
Application level security can be used to control access to certain menus so that users do not even know what reports exist.
104
Encryption
Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data.
Example: A message "sell 500 shares" would appear as "xyzzy" without the key. Once the key is paired with the encrypted string "xyzzy", it can then be decrypted.
The size of the key is a factor in how difficult it is to attack the encryption scheme.
Three places where encryption might be used in a data warehouse:
Network
Data
Tape backups
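The point about key size can be made concrete with a little arithmetic. The sketch below assumes a brute-force attack that tries every possible key, so the work grows with the number of possible keys; the key lengths shown are illustrative choices, not values from the slides.

```python
def keyspace(bits):
    """Number of distinct keys for a key of the given length in bits."""
    return 2 ** bits

# Every extra bit doubles the number of keys an attacker has to try.
for bits in (40, 56, 128):
    print(bits, keyspace(bits))

# How many times larger the 128-bit key space is than the 56-bit one.
ratio = keyspace(128) // keyspace(56)
print(ratio)
```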
105
Network Encryption
In a data warehouse application, data and queries are transmitted through a network.
Attackers might be able to capture network traffic just by breaking into the network medium.
One way to reduce the risk of this threat is to encrypt traffic on the network.
[Diagram: User, Application, Database System, Data Warehouse and Tape Backup, all connected by the Network]
106
Network Encryption
Network encryption is critical because the network connects all of the key components in a data warehouse.
Encrypting network traffic mitigates the risk that an attacker could succeed with a "man in the middle" attack.
Without this, it may be possible for the "man in the middle" to masquerade as another user and circumvent existing application and database security.
107
Data Encryption
Data encryption refers to encrypting the actual data in the data warehouse.
If attackers were to retrieve data from the warehouse, they would have to decrypt it in order to read it.
[Diagram: EIS, Database System, Data Warehouse]
108
Backup Encryption
Periodically, databases are copied to some kind of long-term storage (usually tapes).
If the database is encrypted but the tapes are not, the risk exists of someone walking off with the tapes and reading the data.
[Diagram: EIS, Database System, Data Warehouse, Tape Backup]
109
Audit Trails
Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.
Audit trails are turned on when the DBMS is started, and all activity that uses the data warehouse is tracked in the audit trail.
If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by that user.
110
Details of DW Audit Trails
An audit trail of a database system typically includes the following information:
User ID, Date, Time, Object that has been accessed (table or view), Action that accessed the object (INSERT, UPDATE, DELETE, SELECT)
For UPDATE, the old value and new value are tracked.
For data warehouses, SELECT auditing is often used to track the queries that have been run against the warehouse.
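As a rough illustration of the audit record described above, the sketch below builds a small audit trail with a trigger in SQLite (via Python's sqlite3). This is not how a commercial DBMS implements auditing: the CurrentUser table is a hypothetical stand-in for the logged-in user, because SQLite has no database users, and triggers cannot capture SELECT statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (EmpId INTEGER PRIMARY KEY, Name TEXT, Salary REAL);
    -- Hypothetical one-row table standing in for the logged-in user.
    CREATE TABLE CurrentUser (UserId TEXT);
    CREATE TABLE AuditTrail (
        UserId TEXT, LoggedAt TEXT, ObjectName TEXT, ActionName TEXT,
        OldValue REAL, NewValue REAL
    );
    -- Record who changed a salary, when, and the old and new values.
    CREATE TRIGGER EmployeeAuditUpdate AFTER UPDATE OF Salary ON Employee
    BEGIN
        INSERT INTO AuditTrail
        SELECT UserId, CURRENT_TIMESTAMP, 'Employee', 'UPDATE',
               OLD.Salary, NEW.Salary
        FROM CurrentUser;
    END;
    INSERT INTO Employee VALUES (1, 'Fred', 200.0);
    INSERT INTO CurrentUser VALUES ('mary');
    UPDATE Employee SET Salary = 150.0 WHERE Name = 'Fred';
    """
)
audit_rows = conn.execute(
    "SELECT UserId, ObjectName, ActionName, OldValue, NewValue FROM AuditTrail"
).fetchall()
print(audit_rows)
```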
111
Other Uses for DW Audit Trails
Audit trails can be used to identify the most popular data in the warehouse.
This information can be used to optimize queries
An additional use for audit trails is performance tuning of the data warehouse.
Administrators know where to focus their efforts
Reduces administrative overhead
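Finding the most popular data from an audit trail amounts to counting accesses per object. A minimal sketch, assuming the audit records have already been read into (user, object, action) tuples; the records themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical audit records: (user, object accessed, action)
audit_trail = [
    ("mary", "SALES_FACT", "SELECT"),
    ("hank", "SALES_FACT", "SELECT"),
    ("mary", "CUSTOMER", "SELECT"),
    ("sue", "SALES_FACT", "SELECT"),
    ("hank", "CUSTOMER", "UPDATE"),
]

# Count only queries, since those are what tuning usually targets.
query_counts = Counter(obj for _, obj, action in audit_trail if action == "SELECT")
most_popular = query_counts.most_common(1)[0]
print(most_popular)
```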
112
Dealing with Known Security Holes
Commercial database systems and operating systems often contain holes that allow users to obtain unauthorized access.
To reduce the risk of these known holes, vendors often provide fixes to their products as soon as the holes become public.
It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released.
One of the key risks surrounding a data warehouse is that privileged users have the "keys to the kingdom".
113
The Risk of Privileged Users
"Privileged users" include:
Data warehouse administrators
Operating system programmers
Operators in the computer center
These users can:
Modify, delete and query any data in the warehouse
Modify the audit trail to mask their actions
Give other users unauthorized access
The number of "privileged users" could be anywhere from 20 to 30 in some organizations.
114
Reducing the Risk of Privileged Users
One way to reduce the risk of privileged users is to separate security administration from database administration.
This would separate the task of giving accesses and managing the audit trail from the task of making sure the data in the warehouse is correct and properly optimized.
[Diagram: Security Services (Access Control, Audit) shown separately from Database Services (Database Tuning, Query Optimization, Backups)]
115
Information Security Attacks
Two types of information security attacks on data warehouses are:
Intrusion
An intrusion occurs when an unauthorized user gains access to the data warehouse.
Misuse
Misuse, often referred to as the "insider problem", occurs when a user who has access to the warehouse uses that access for an unauthorized purpose.
Audit trails can be used to identify either type of attack, but identification of misuse is typically MUCH harder than identification of intrusion.
116
Intrusion Detection
An intrusion is defined as an unauthorized access to a system. The assumption is that the user is external to the environment (e.g., a hacker).
To reduce the risk of intrusion, intrusion detection tools are used.
These tools monitor access to the data warehouse and sound an alarm if unauthorized accesses are detected.
[Diagram: an INTRUSION DETECTION SYSTEM monitoring access between the USER and the DATA WAREHOUSE]
117
Misuse Detection
Unwanted access by a user that has the ability to access data is referred to as misuse.
This is also known as the "insider problem".
Some estimates have shown that 80% of computer crime is a result of misuse.
For data warehouses the threat of misuse is high, especially by privileged users.
118
Summary
DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE.
Application Level Security assists EIS that access data warehouses by hiding certain reports from users.
Encryption can be used to further protect against the risk of someone walking off with the data warehouse.
Audit Trails are useful for:
Catching attackers
Identifying usage trends of the data warehouse
119
Moving Data to the Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
120
Moving Data to the Data Warehouse
1) Moving Data into the Data Warehouse
2) Updating the Data Warehouse
3) Full Refresh
4) Copy Only the Changes
5) BCP
6) Simple Transformations
7) Complex Transformations
8) Commercial ETL Tools
121
Moving Data into the Data Warehouse
Data must be moved to the data warehouse from source systems.
Some key issues:
Determine the frequency of data updates: how often should data be moved from source systems to the data warehouse?
Various means of updating data in the warehouse exist:
SQL Commands
Database system load programs (e.g., SQL Server's BCP)
Commercial tools
122
Updating the Data Warehouse
OLTP (On-Line Transaction Processing) systems have to send their updates to the data warehouse.
[Diagram: the Inventory, Finance and Sales OLTP applications each feed the matching subject area (Inventory, Finance, Sales) in the Data Warehouse]
123
Frequency of Updates to the Data Warehouse
Updates may occur daily, weekly, monthly, or in real-time.
[Diagram: the Inventory, Finance and Sales OLTP applications feed their subject areas in the Data Warehouse on different schedules, labeled Daily Update, Weekly Update and Monthly Update]
124
Determining the Frequency of Updates
Requirements should drive update frequency
Range of updates runs from real-time to quarterly.
Real time update
Expensive
Requires update of warehouse while users are querying
Daily update
Somewhat cheaper than real time, but significant maintenance required if the warehouse has lots of tables.
Monthly or weekly update
Much more manageable
125
Updating the Warehouse
Full Refresh vs. Only the Changes
[Diagram: the Inventory, Finance and Sales OLTP applications feed their subject areas in the Data Warehouse using three approaches: "Full refresh", "Changes since last update", and "Full refresh of some tables, changes for other tables"]
126
Full Refresh
Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.
[Diagram: Source Table (Source OLTP) copied in full to Target Table (Target Data Warehouse)]
127
Copy Only the Changes
Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.
[Diagram: Source Table (Source OLTP) to Target Table (Target Data Warehouse). The target holds three layers: modified data since the last update to the warehouse, data from two updates ago, and historical data no longer in the source OLTP]
128
Full Refresh vs. Only the Changes
Full Refresh
Pros
Much easier to implement
Less chance of messing up your database (good data integrity)
Cons
Can take a lot longer to actually do -- may run out of night
Can lose out on warehouse ability to track historical data.
Only the Changes (DELTA)
Pros
Tracks historical data
Cons
Can be very hard to implement
Can require changes in source applications (more on this later)
129
Full Refresh Using INSERT-SELECT
One way to move data from one table to another is via INSERT-SELECT.
Syntax: INSERT INTO the TARGET table, followed by a SELECT on the source table
Example:
INSERT INTO DW_EMPLOYEE
SELECT *
FROM EMPLOYEE
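A runnable sketch of a full refresh with INSERT-SELECT, using Python's sqlite3 as a stand-in for the source and warehouse databases (here both tables live in one database, which is a simplification). It also empties the target first, an added assumption not shown on the slide, so that repeating the refresh does not duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (EmpId INTEGER, Name TEXT, Salary REAL);
    CREATE TABLE DW_EMPLOYEE (EmpId INTEGER, Name TEXT, Salary REAL);
    INSERT INTO EMPLOYEE VALUES (1, 'Fred', 200.0), (2, 'Hank', 300.0);
    """
)

def full_refresh(connection):
    """Replace the whole target table with a fresh copy of the source."""
    with connection:  # one transaction: commit on success, roll back on error
        connection.execute("DELETE FROM DW_EMPLOYEE")
        connection.execute("INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE")

full_refresh(conn)
conn.execute("UPDATE EMPLOYEE SET Salary = 150.0 WHERE Name = 'Fred'")
full_refresh(conn)  # the second refresh picks up the change without duplicates
refreshed = conn.execute("SELECT * FROM DW_EMPLOYEE ORDER BY EmpId").fetchall()
print(refreshed)
```

Note that the earlier salary of 200 is gone from the warehouse after the second refresh, which is the loss of history that a full refresh implies.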
130
Updating Changes Using INSERT-SELECT
Changes may be moved by adding a WHERE clause to the INSERT-SELECT.
Example (copy rows updated in the current month):
INSERT INTO DW_EMPLOYEE
SELECT *
FROM EMPLOYEE
WHERE DATEPART(m, DATE_UPDATED) =
DATEPART(m, CURRENT_TIMESTAMP)
AND DATEPART(yy, DATE_UPDATED) =
DATEPART(yy, CURRENT_TIMESTAMP)
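A variation on the same idea, sketched with Python's sqlite3: instead of filtering on the current month, the load copies every row changed after the previous load. The DateUpdated column, the dates, and the idea of passing in the last load date are assumptions for illustration; the source system has to maintain such a column for this to work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (EmpId INTEGER, Name TEXT, Salary REAL, DateUpdated TEXT);
    CREATE TABLE DW_EMPLOYEE (EmpId INTEGER, Name TEXT, Salary REAL, DateUpdated TEXT);
    INSERT INTO EMPLOYEE VALUES
        (1, 'Fred', 200.0, '2019-07-01'),
        (2, 'Hank', 300.0, '2019-08-02'),
        (3, 'Sue',  400.0, '2019-08-05');
    """
)

def copy_changes(connection, last_load):
    """Copy rows modified after last_load (an ISO date string) to the warehouse."""
    with connection:
        cur = connection.execute(
            "INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE WHERE DateUpdated > ?",
            (last_load,),
        )
    return cur.rowcount

rows_copied = copy_changes(conn, "2019-07-31")
copied_names = [
    r[0] for r in conn.execute("SELECT Name FROM DW_EMPLOYEE ORDER BY EmpId")
]
print(rows_copied, copied_names)
```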
131
Updating Using BCP
BCP is the bulk copy program that comes with MS SQL Server.
Bulk copy (BCP) moves data between a flat file and a SQL table.
Syntax: bcp table_name [in | out] data_file
[Diagram: Source Table (Source OLTP) is unloaded to a Temporary Flat File, which is then loaded into Target Table (Target Data Warehouse)]
132
BCP Example
To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:
bcp pubs..publishers out publishers.txt -c
-Sservername -Usa -Ppassword
To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:
bcp pubs..pub2 in publishers.txt -c
-Sservername -Usa -Ppassword
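The unload/load pattern behind BCP can be imitated without SQL Server. The sketch below is an analogue, not bcp itself: it uses Python's csv and sqlite3 modules to unload a table to a tab-delimited flat file and load that file into a second table. The column names and publisher rows are invented sample data.

```python
import csv
import os
import sqlite3
import tempfile

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE publishers (pub_id TEXT, pub_name TEXT)")
source.executemany(
    "INSERT INTO publishers VALUES (?, ?)",
    [("0001", "Example Press"), ("0002", "Sample Books")],
)
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE pub2 (pub_id TEXT, pub_name TEXT)")

flat_file = os.path.join(tempfile.mkdtemp(), "publishers.txt")

# Unload: source table to a tab-delimited flat file.
with open(flat_file, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(
        source.execute("SELECT pub_id, pub_name FROM publishers ORDER BY pub_id")
    )

# Load: flat file into the target table.
with open(flat_file, newline="") as f:
    target.executemany("INSERT INTO pub2 VALUES (?, ?)", csv.reader(f, delimiter="\t"))
target.commit()

loaded = target.execute("SELECT * FROM pub2 ORDER BY pub_id").fetchall()
print(loaded)
```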
135
Commercial ETL Tools
Key tools in the marketplace
Informatica
Ardent
DecisionBase (Platinum)
Microsoft Data Transformation Services
All provide libraries of common transformations.
All provide the ability to code complex transformations.
136
Data Transformation Services
137
Choose a Source
138
Choose a Destination
139
Choose to Use a Query for Transfer
140
Enter SQL Query
141
Choose Destination Table Name
142
Verify Transformation
143
Decide When to Run Transformation
144
Final Verification
145
Run Transformation
146
Check Results
select *
from orderfact
orderid  orderdate                productid  productname                quantity  unitprice  discount
10248    1996-07-04 00:00:00.000  11         Queso Cabrales             12        14.0000    0.0
10248    1996-07-04 00:00:00.000  42         Singaporean Hokkien Fried  10        9.8000     0.0
10248    1996-07-04 00:00:00.000  72         Mozzarella di Giovanni     5         34.8000    0.0
10249    1996-07-05 00:00:00.000  14         Tofu                       9         18.6000    0.0
10249    1996-07-05 00:00:00.000  51         Manjimup Dried Apples      40        42.4000    0.0
147
Summary
ETL is one of the hard parts of building a data warehouse.
Either full refreshes of data or just the changes may be done.
Doing a full refresh is easy, but historical data is lost and it may take a lot of time.
Tracking changes is a tough business.
Commercial ETL tools are beginning to mature and can lessen the pain of this task.
148
More Ways of Moving Data to the Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
149
More Ways of Moving Data to the Data Warehouse
1) Determining What Data Has Changed
2) Recovery Logs
3) Triggers
4) Insert Triggers
5) Delete Triggers
6) Update Triggers
7) Manual Detection
150
More Ways of Moving Data to the Data Warehouse
There is a need to move data into the data warehouse from OLTP and DSS applications
The problem is detecting what data needs to be moved into the data warehouse
Three methods:
Recovery Logs
Triggers
Manual Techniques
151
Determining What Data Has Changed
Problem: How do updates made to the source get applied to the same information in the data warehouse?
[Diagram: UPDATES arrive at TABLE A in the SOURCE OLTP; a "??" link leads to TABLE B in the DATA WAREHOUSE. Caption: How to get updates from Source Table A to Data Warehouse Table B]
152
Determining What Data Has Changed (cont.)
Problem: How do updates made to multiple sources get applied to the same information in the data warehouse?
SOURCE OLTP, TABLE A (Employee):
NAME  DEPT.  SALARY
Fred  Mktg   35000
Hank  Sales  60000
Sue   IT     71000
Joe   Sales  50000   (ROW X, added by the update)
UPDATES: Insert into Employee Values (Joe, Sales, 50000)
DATA WAREHOUSE, TABLE A (EmployeeCount), where ROW X must change the Sales count from 1 to 2:
DEPT   COUNT
Mktg   1
Sales  1
IT     1
HR     0
DATA WAREHOUSE, TABLE B (SalaryInfo), where ROW X must change the Sales row to 55000 and 110000:
DEPT   AVG SAL  TOT SAL
Mktg   35000    35000
IT     71000    71000
HR     0        0
Sales  60000    60000
153
What is the Recovery Log?
Recovery log is used for transaction processing
Used to handle errors
Contains before and after images
Recovery log can be used to identify the data to be updated in the data warehouse.
Change Data Capture Utility
This scans the database log, identifies all changes that the user is interested in, and either writes them to a file or stores them in another table.
154
Change Data Capture Utility in Action
[Diagram: in the SOURCE OLTP, the DBMS reads and writes the DATA and writes all changes to the RECOVERY LOG. The CHANGE DATA CAPTURE UTILITY reads the log and writes the changes to the DATA WAREHOUSE]
155
Example of Using Recovery Log
Consider an update to the Employee table:
UPDATE EMPLOYEE
SET Salary=Salary*2.0
WHERE SSN=10
The information is recorded in the log:
LOG: TABLE=EMPLOYEE, SSN=10, OldSalary=100, NewSalary=200
The change data capture utility reconstructs the update from the log.
The reconstructed update can then be sent to the data warehouse.
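A toy sketch of the reconstruction step. Real recovery logs are in proprietary binary formats, so the dictionary below is a hypothetical, already-decoded log record that mirrors the example above; the function name and record layout are illustrative assumptions, not any vendor's API.

```python
import sqlite3

# Hypothetical decoded log record mirroring the example above.
log_record = {
    "table": "EMPLOYEE",
    "key": {"SSN": 10},
    "before": {"Salary": 100},
    "after": {"Salary": 200},
}

def reconstruct_update(record):
    """Build a parameterized UPDATE from the after image of a log record.

    Table and column names are interpolated, so the record must come from
    a trusted source such as the DBMS's own log.
    """
    set_clause = ", ".join(f"{col} = ?" for col in record["after"])
    where_clause = " AND ".join(f"{col} = ?" for col in record["key"])
    sql = f"UPDATE {record['table']} SET {set_clause} WHERE {where_clause}"
    params = list(record["after"].values()) + list(record["key"].values())
    return sql, params

# Stand-in warehouse that still holds the old salary.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE EMPLOYEE (SSN INTEGER, Salary REAL)")
warehouse.execute("INSERT INTO EMPLOYEE VALUES (10, 100)")

sql, params = reconstruct_update(log_record)
warehouse.execute(sql, params)
new_salary = warehouse.execute(
    "SELECT Salary FROM EMPLOYEE WHERE SSN = 10"
).fetchone()[0]
print(sql, params, new_salary)
```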
156
Using the Recovery Log
Recovery logs are usually in a proprietary format. Use commercial tools to read the log and identify the changes.
Commercial tools such as CA's log analyzer can place the results of their work in a table.
157
Summary of Change Data Capture
Pro
Log exists anyway, might as well use it to find what has changed
Con
Some difficult scenarios may occur where it is hard to see what the new update should be in the Data Warehouse.
Proprietary format, may not be supported in many DBMS and will always lag behind DBMS development.
Many tables in the source will have nothing to do with the data warehouse, but change data capture will process their changes as well.
158
Triggers
Triggers allow DBAs to specify that when an event such as an INSERT, UPDATE, or DELETE occurs on a table, another event is triggered.
Triggers are used to identify changes that are needed by the warehouse.
A trigger can be added to a source table, and whenever the source table is updated, an update can be placed either directly in the warehouse or in a staging table that tracks all updates.
Triggers can be used to detect the changes and perform data warehouse updates.
A different trigger might be run on key updates so that the data warehouse nightly process would know what data has changed.
159
Example of a Trigger
STEP 1: INSERT into TABLE A VALUES (X, Y)
STEP 2: Values (X, Y) are inserted into TABLE A. When values are inserted, this sets off the TRIGGER
STEP 3: TRIGGER inserts values (X, Y) into a STAGING area
STEP 4: Nightly Process inserts values (X, Y) into TABLE A in the DATA WAREHOUSE
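The four steps can be sketched end to end with SQLite triggers through Python's sqlite3. This is a simplified stand-in: the source table, staging area and warehouse table all live in one in-memory database, and the table names and the nightly_process function are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TableA (x INTEGER, y INTEGER);     -- source OLTP table
    CREATE TABLE Staging (x INTEGER, y INTEGER);    -- staging area
    CREATE TABLE DW_TableA (x INTEGER, y INTEGER);  -- warehouse table
    -- Steps 2 and 3: every insert into TableA is also recorded in Staging.
    CREATE TRIGGER TableAInsert AFTER INSERT ON TableA
    BEGIN
        INSERT INTO Staging VALUES (NEW.x, NEW.y);
    END;
    """
)

def nightly_process(connection):
    """Step 4: move staged rows into the warehouse and empty the staging area."""
    with connection:
        connection.execute("INSERT INTO DW_TableA SELECT x, y FROM Staging")
        connection.execute("DELETE FROM Staging")

conn.execute("INSERT INTO TableA VALUES (1, 2)")  # Step 1
staged_before = conn.execute("SELECT COUNT(*) FROM Staging").fetchone()[0]
nightly_process(conn)
warehouse_rows = conn.execute("SELECT * FROM DW_TableA").fetchall()
staged_after = conn.execute("SELECT COUNT(*) FROM Staging").fetchone()[0]
print(staged_before, warehouse_rows, staged_after)
```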
160
Real-Life Trigger Example
OLTP/DSS Data - Employee table:
Employee (ssn, name, salary)
DW Data - Summary table:
EmployeeStatistics (total number employees, total salary paid, average salary).
When a row is inserted in the Employee table, we need to update the EmployeeStatistics table. The trigger is shown on the next slide.
161
Insert Trigger Example
CREATE TRIGGER EmployeeInsertTrigger
ON Employee
FOR INSERT AS
BEGIN
UPDATE EmployeeStatistics
SET NoEmployee = NoEmployee +
(SELECT COUNT(*) FROM INSERTED)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary +
(SELECT SUM(Salary) FROM INSERTED)
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END
162
Insert Trigger in Action
COMMAND:
INSERT INTO EMPLOYEE
VALUES (1, 'John', 300)
RESULT: (1 ROW(S) AFFECTED)
COMMAND:
INSERT INTO EMPLOYEE
VALUES (2, 'Mike', 400)
RESULT: (1 ROW(S) AFFECTED)
COMMAND:
SELECT * FROM EMPLOYEE
RESULT (Employee):
EmpId  Name  Salary
------ ----- -------
1      John  300.00
2      Mike  400.00
COMMAND:
SELECT * FROM EMPLOYEESTATISTICS
RESULT (EmployeeStatistics):
NoEmployee TotSalary  AvgSalary
---------- ---------- ---------
2          700.00     350.00
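The same behavior can be reproduced in SQLite through Python's sqlite3, as a sketch rather than a port of the SQL Server trigger. SQLite triggers fire once per inserted row and expose that row as NEW, where the slide's trigger reads the INSERTED table, so the counting logic is adapted accordingly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (EmpId INTEGER, Name TEXT, Salary REAL);
    CREATE TABLE EmployeeStatistics (
        NoEmployee INTEGER, TotSalary REAL, AvgSalary REAL
    );
    INSERT INTO EmployeeStatistics VALUES (0, 0.0, 0.0);  -- single summary row
    -- Fires once per inserted row; NEW is the row being inserted.
    CREATE TRIGGER EmployeeInsertTrigger AFTER INSERT ON Employee
    BEGIN
        UPDATE EmployeeStatistics
        SET NoEmployee = NoEmployee + 1,
            TotSalary = TotSalary + NEW.Salary;
        UPDATE EmployeeStatistics
        SET AvgSalary = TotSalary / NoEmployee;
    END;
    INSERT INTO Employee VALUES (1, 'John', 300);
    INSERT INTO Employee VALUES (2, 'Mike', 400);
    """
)
stats = conn.execute("SELECT * FROM EmployeeStatistics").fetchone()
print(stats)
```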
163
Delete Trigger Example
CREATE TRIGGER EmployeeDeleteTrigger
ON Employee
FOR DELETE AS
BEGIN
DECLARE @numberEmployee int
UPDATE EmployeeStatistics
SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)
SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics
IF @numberEmployee > 0
BEGIN
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END
ELSE
UPDATE EmployeeStatistics SET AvgSalary = 0.0
END
164
Update Trigger Example
CREATE TRIGGER EmployeeUpdateTrigger
ON Employee
FOR UPDATE AS
BEGIN
IF UPDATE (Salary)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary -
(SELECT SUM(Salary) FROM DELETED) +
(SELECT SUM(Salary) FROM INSERTED)
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END
165
Summary of Using Triggers
Pro
Only needed for tables whose data is going to go to the DW
Con
Additional work needed to create detailed triggers
Non-trivial to generate a trigger to implement appropriate action
May not be acceptable for commercial software on source system
166
Other Ways to Determine What Has Changed
There are other, manual ways of detecting the change and doing DW updates
Look at each row of OLTP and the data in the warehouse
Compare the two sets of data; if the data is not in the warehouse, add it!
OLTP: Hank, John, Mike, Sam
DATA WAREHOUSE: Hank, John, Mike
COMPARE, then ADD THE DIFFERENCES (here, Sam)
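The compare-and-add step can be expressed as a set difference. A minimal sketch with Python's sqlite3 and SQL's EXCEPT operator, using the four names from the slide; the table names are illustrative. It only finds rows missing from the warehouse, so changed or deleted rows would need a column-by-column comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE OLTP_Employee (Name TEXT PRIMARY KEY);
    CREATE TABLE DW_Employee (Name TEXT PRIMARY KEY);
    INSERT INTO OLTP_Employee VALUES ('Hank'), ('John'), ('Mike'), ('Sam');
    INSERT INTO DW_Employee VALUES ('Hank'), ('John'), ('Mike');
    """
)

# Rows present in the OLTP table but missing from the warehouse.
missing = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM OLTP_Employee EXCEPT SELECT Name FROM DW_Employee"
    )
]

# Add the differences to the warehouse.
with conn:
    conn.executemany(
        "INSERT INTO DW_Employee VALUES (?)", [(name,) for name in missing]
    )

dw_names = [r[0] for r in conn.execute("SELECT Name FROM DW_Employee ORDER BY Name")]
print(missing, dw_names)
```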
167
Manually Identifying What Has Changed
Pro
Flexible
Con
Very expensive
Could take a long time
168
Summary
Recovery Logs
Triggers
Manual Detection
169
Data Warehouse Design
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)
170
Data Warehouse Design
1) Overview
2) Describing a Design - ER Diagrams
3) Design Normalization
4) Star Schema Design
171
Overview
How to describe a design
Entity Relationship (ER) Diagram
Types of Designs
Normalized
Star Schema
Snowflake
172
Describing a Design
Different techniques exist; the most prevalent is the ER (Entity-Relationship) Diagram
Entities
Things that occur in the real world, usually nouns, e.g., employee, part, product, etc.
Relationships
How entities interact, usually verbs. Example: one employee may attend many colleges
Types of relationships
1-1
1-Many
Many-1
Many-Many
173
Examples of Relationships
1-MANY
MANY-1
1-1
MANY-MANY
174
Normalized Design
Methodology
All 1-1 relationships are placed in a single table.
Many-many relationships require two tables that store the single-valued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.
Checking the design
In a Normalized Design, there are many different normal forms. Each normal form (NF) builds on the previous one so that a table in 2NF is, by definition, in 1NF.
1NF
2NF
3NF
175
Dealing With Many-Many Relationships
For Many-Many
Two 1-1 tables (SUPPLIER, PARTS)
One linking table (SP)
Ex: SUPPLIER and PARTS are the 1-1 tables; SP is the linking table that says who sells what parts.
SUPPLIER
S#  SNAME
1   SEARS
2   OFFICE DEPOT
PARTS
P#  PNAME
1   HAMMERS
2   NAILS
SP
S#  P#
1   1
1   2
2   1
2   2
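To see how the linking table answers "who sells what parts", the three tables can be joined. A sketch using Python's sqlite3 with the slide's data; the key columns are named S and P here because S# and P# are not valid unquoted identifiers in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE SUPPLIER (S INTEGER PRIMARY KEY, SNAME TEXT);
    CREATE TABLE PARTS (P INTEGER PRIMARY KEY, PNAME TEXT);
    -- Linking table: one row per (supplier, part) pair.
    CREATE TABLE SP (
        S INTEGER REFERENCES SUPPLIER(S),
        P INTEGER REFERENCES PARTS(P),
        PRIMARY KEY (S, P)
    );
    INSERT INTO SUPPLIER VALUES (1, 'SEARS'), (2, 'OFFICE DEPOT');
    INSERT INTO PARTS VALUES (1, 'HAMMERS'), (2, 'NAILS');
    INSERT INTO SP VALUES (1, 1), (1, 2), (2, 1), (2, 2);
    """
)
who_sells_what = conn.execute(
    """
    SELECT s.SNAME, p.PNAME
    FROM SP
    JOIN SUPPLIER s ON s.S = SP.S
    JOIN PARTS p ON p.P = SP.P
    ORDER BY s.S, p.P
    """
).fetchall()
print(who_sells_what)
```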
176
Normalized Design: Example
A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
Entities: Customer, Product, Store, Vendor
Relationships:
Customer buys Product
Product is located in Store
Product is suppliedBy a Vendor
[Diagram: CUSTOMER linked to PRODUCT by BUYS; PRODUCT linked to STORE by IS-LOCATED-IN; PRODUCT linked to VENDOR]
177
Checking a Normalized Design
Normalization
Used to reduce data insertion, deletion, and update anomalies caused by bad designs.
Enables users to quickly check a design and make sure there are no glaring holes in it.
1NF
All cells are atomic -- i.e., each entry in a column contains only one value.
2NF
The table is in 1NF and all non-key values are functionally dependent upon the entire primary key -- i.e., no non-key column depends on only part of a composite key.
3NF
The table is in 2NF and has no transitive dependencies -- i.e., every non-key column depends directly on the primary key, not on another non-key column.
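
One way to make the 2NF check concrete is to test functional dependencies against sample rows. The sketch below is an illustration only: the fd_holds helper, the column names, and the data are hypothetical. A dependency that holds in sample data is evidence, not proof, that it holds in the design.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if, in these rows, the lhs columns determine the rhs columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # The first value seen for a key is remembered; any later mismatch breaks the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical table whose primary key is the pair (s_no, p_no).
rows = [
    {"s_no": 1, "p_no": 1, "sname": "SEARS", "qty": 10},
    {"s_no": 1, "p_no": 2, "sname": "SEARS", "qty": 5},
    {"s_no": 2, "p_no": 1, "sname": "OFFICE DEPOT", "qty": 7},
]

# sname is determined by s_no alone, i.e. by only part of the key:
# a sign that the table is not in 2NF and sname belongs in its own table.
partial_dependency = fd_holds(rows, ["s_no"], ["sname"])

# qty is not determined by s_no alone, but it is by the whole key.
qty_needs_full_key = (not fd_holds(rows, ["s_no"], ["qty"])
                      and fd_holds(rows, ["s_no", "p_no"], ["qty"]))
```

The same helper can be pointed at a pair of non-key columns to look for the transitive dependencies that 3NF rules out.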
178
Overview of Normalized Design
Pro
Relatively easy to change
Con
Queries can involve numerous joins
The massive number of tables and links between tables makes it
hard for customers to build their own queries
179
Star Schema
Methodology
Single fact table in the middle describing a key event (e.g., sale), surrounded by dimension tables (e.g., location, time, employee)
[Diagram: a central FACT table surrounded by dimension tables D1 through D5 (D = DIMENSIONS)]
180
Star Schema: Methodology
Identify a key fact that occurs.
Usually some event creates a real fact: selling a product in a store on Wednesday, a patient visiting a hospital, etc.
Identify all the dimensions of the data being used.
Think of a dimension as a way to slice the data.
Ex: by time, by product, by customer, etc.
Drill-down operations are very well supported.
181
Star Schema: Example
A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
Fact
CustomerPurchase
Dimensions are
Customer
Product
Time
Vendor
182
Star Schema: Example (cont.)
[Diagram: star with Sale (Price) at the center and Customer, Store, Time, and Product as dimensions]

SALE
SALE ID  CUST. ID  STORE ID  PROD. ID  PRICE  TIME
1        3         7         4         $3.00  4/24/99

CUSTOMER
CUST. ID  NAME  PHONE  BUYS APPLES  HAS BIG CAR
3         FRED  1234   Y            Y

TIME
DAY  MONTH  QTR  YEAR
24   4      2Q   99
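
The SALE fact table and its CUSTOMER and TIME dimensions can be sketched in Python with sqlite3 to show what a drill-down looks like. This is an illustration under stated assumptions: the time_id surrogate key, the table and column names, and every row marked "hypothetical" are not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY,
                           day INTEGER, month INTEGER, qtr TEXT, year INTEGER);
    -- Fact table: one row per sale, pointing at each dimension by key.
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                       cust_id INTEGER REFERENCES customer(cust_id),
                       store_id INTEGER, prod_id INTEGER,
                       time_id INTEGER REFERENCES time_dim(time_id),
                       price REAL);
""")
conn.execute("INSERT INTO customer VALUES (3, 'FRED', '1234')")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", [
    (1, 24, 4, "2Q", 99),   # 4/24/99, the date on the slide
    (2, 3, 5, "2Q", 99),    # hypothetical
    (3, 15, 8, "3Q", 99),   # hypothetical
])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 3, 7, 4, 1, 3.00),  # the sale shown on the slide
    (2, 3, 7, 4, 2, 2.00),  # hypothetical
    (3, 3, 7, 4, 3, 4.00),  # hypothetical
])

def revenue_by(level_columns):
    """Sum price grouped by the given time_dim columns.

    Column names are spliced into the SQL, so pass trusted names only.
    """
    cols = ", ".join("t." + c for c in level_columns)
    sql = (f"SELECT {cols}, SUM(s.price) FROM sale AS s "
           f"JOIN time_dim AS t ON t.time_id = s.time_id "
           f"GROUP BY {cols} ORDER BY {cols}")
    return conn.execute(sql).fetchall()

by_quarter = revenue_by(["year", "qtr"])         # coarse view
by_month = revenue_by(["year", "qtr", "month"])  # drill down one level
```

Drilling down only adds one more dimension column to the GROUP BY; the join between the fact table and the dimension stays the same, which is why a star schema handles this kind of query well.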
183
Star Schema: Overview
Pro
Easy for users to navigate and understand
Con
Performance
Can end up with one monster fact table, millions of rows
Flexibility
Not as easy for customers to change the design
184
Snowflake Schema
Several stars can be connected to form a snowflake
[Diagram: three connected stars. PRODUCT: Make Chips, Parts, Manufacturing, Price, Labor, Cost. SALES: Sale, Price, Revenue, Product, Marketing, Vendor. MARKETING: Direct Mail, Price, Ad, Location, Distribution, Sales.]
185
Summary
Two basic types of design
Star Schema
Normalized
Many Data Warehouse vendors sell products built specifically for the star schema.
Some data warehouse builders insist that normalization is the way to build the data warehouse.
186
Building a Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology
Ph: 410-553-6760 www.carriget.com)
187
Building a Data Warehouse
1) Top Down Approaches
2) Enterprise Data Model Approach
3) "Let Data Users Decide"
4) "Let Data Warehouse Builders Decide"
5) "Let Senior Management Decide"
6) Bottom Up Approach
188
Building the Data Warehouse
How to decide what data goes into the data warehouse?
Methods:
Top Down
Using Enterprise Data Models
"Let data users decide" approach
"Let data warehouse builders decide" approach
"Let senior management decide" approach
Bottom Up
Combine data marts into a data warehouse
189
Using Enterprise Data Models
Use the Enterprise Data Model to decide what data goes into the data warehouse.
Model key processes. This approach says let the business decide.
Identify key data used by these processes in an enterprise data model -- might be a giant Entity-Relationship diagram.
Put data in the warehouse based on the enterprise data model.
190
An Enterprise Data Model Example
[Diagram labels: CHIP RECIPES, MAKE CHIPS, PUT IN BAGS, SELL CHIPS, COUNT $$, BUY MORE, POTATOES, INGREDIENTS, CHIP SUPPLIERS]
191
"Enterprise Data Model" Approach
Pro
All inclusive -- no chance of leaving key data out.
Con
Very difficult to build an EDM.
If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.
Ways of Avoiding the Con
In some cases you can buy an EDM -- if the business is common enough, the packaged EDM might be very close and then you just have to modify it to fit your business.
192
"Let Data Users Decide"
Let the users of the data warehouse choose what data will go into the warehouse.
The data users who decide the data warehouse's data and design will pay for it as well.
Also, you can charge users who query the data.
[Diagram: USERS, SOURCE, DATA WAREHOUSE]
193
"Let Data Users Decide": An Example
[Diagram: MARKETING, HUMAN RESOURCES, and FINANCE each supply DATA to the DATA WAREHOUSE. Marketing: demographics, advertising, trends, ?. Human Resources: education, ethnic group, age, ?. Finance: budget, spending, revenue, ?.]
194
"Let Data Users Decide" Approach
Pro
Reduces budget problems
Users know best!
Con
Requires marketing
Could end up with data in the warehouse that is meaningless to the
people who run the place.
Users may not place important data in the warehouse because their
budget is small.
Users who need the data may not use the DW because of budget
concerns.
Ways of Mitigating the Con
Do not just take money -- try to determine if data is really
corporate.
195
Pay As You Go Warehouse Analogy
[Image: I-495]
196
"Let Data Warehouse Builders Decide"
The technical staff who are building the warehouse decide what data gets put in the warehouse.
[Cartoon: "Let's put information on how to build viruses in the data warehouse."]
197
"Let Data Warehouse Builders Decide" Approach
Pro
Very easy to design
Does not take much time
Do not have to deal with users
Con
Could easily result in a data DUMP, not a data warehouse
Ways to mitigate the con
Talk to lots of users to help you guess what should go in the DW
198
"Let Senior Management Decide"
The senior management decides what data goes into the warehouse.
Asking the senior management is the safest way to build a data warehouse.
Identify the key questions on senior management's mind and get the data to answer these questions.
199
"Let Senior Management Decide" Approach
Pro
Ensures executive support for the project
Con
Senior management does not have much time for this -- you will only be able to get a few questions at a time.
This dramatically increases visibility -- if you do not move quickly, senior management will become very angry with the DW.
Ways to mitigate the con
Do your homework before talking to the senior management -- talk to the aides of senior management to find out what is on their mind.
Allocate resources so you can plan to move very quickly once you hear from the senior management.
200
Bottom-Up Approach
Move data from existing OLTP Applications to data marts.
Combine data marts into a data warehouse.
[Diagram: three OLTP APPs feed three DATA MARTs, labeled 25 YARDS, 50 METERS, and 200 CM, which are combined into the DATA WAREHOUSE]
201
Bottom-Up Approach
Pro
Data marts are much easier to build than a full-fledged DW.
Con
Could end up with a bunch of stovepipe data marts.
Ways to mitigate the con
Develop standards for data when building the data marts so that you can glue data from different data marts together.
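
The point about data standards can be illustrated with the mismatched units from the bottom-up diagram (25 yards, 50 meters, 200 cm). The sketch below is hypothetical: the mart names, the choice of meters as the standard unit, and the to_meters helper are assumptions made for the example.

```python
# Conversion factors to a single standard unit (meters).
TO_METERS = {"yards": 0.9144, "meters": 1.0, "cm": 0.01}

def to_meters(value, unit):
    """Convert a length to meters; reject units the standard does not cover."""
    try:
        return value * TO_METERS[unit]
    except KeyError:
        raise ValueError(f"no conversion defined for unit {unit!r}") from None

# Hypothetical extracts: each data mart reports the same measure in its own unit.
data_marts = {
    "mart_a": (25, "yards"),
    "mart_b": (50, "meters"),
    "mart_c": (200, "cm"),
}

# Values are only comparable once every mart is converted to the same unit.
warehouse = {name: to_meters(value, unit)
             for name, (value, unit) in data_marts.items()}
```

Rejecting unknown units, rather than passing values through unchanged, keeps a data mart that ignores the standard from silently corrupting the combined warehouse.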
202
Recommendations for an Approach
"Let senior management