Data Warehouse Overview Done


  • 8/6/2019 Data Warehouse Overview Done


    Introduction to Data Warehouse

    (Slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    Introduction to Data Warehousing and Data Mining

    1) Data Warehouse Introduction

    2) Engineering Conflicts

    3) OLTP and DSS

    4) Stovepipe vs. Integration

    5) Data Warehouse Solution

    6) Enterprise Information System

    7) Security in a Data Warehouse

    8) Moving Data to a Data Warehouse

    9) Data Marts

    10) Data Mining


    Introduction

    Key topics for this course include:
      Data Warehouse
      Data Mart
      Data Mining
    Background and review of relational database systems
    Main focus on data warehouse and data mining

    Data Warehouse Introduction

    A data warehouse is a single source for key, corporate information needed to enable business decisions
    A database application is a piece of software that provides a user interface for users to add, delete, query, and update data
    Typically, a database management system is used to actually do the work of adding, deleting, querying, or updating data

    [Diagram: Application -> Database System -> Data]


    Engineering Conflicts, Query and Update

    It is often an engineering problem when data is updated and long-running queries occur at the same time
    In some cases, the users who are doing updates must wait for queries to complete
    One way to avoid this is to make a read-only copy of the data

    [Diagram: Application -> Database System -> Data for update, Data for query]

    OLTP and DSS Defined

    An application that updates data is called an on-line transaction processing (OLTP) application
    An application that issues queries to the read-only database is called a decision support system (DSS)

    [Diagram: OLTP Application and DSS Application -> Database System -> OLTP Data, DSS Data]


    Applications in a Typical Enterprise

    Most organizations have several disparate OLTP/DSS applications in several databases

    [Diagram: Inventory, Finance, and Sales each have an OLTP application and a DSS application; each application has its own data store (OLTP data or DSS data) inside the database system]

    Stovepipe vs. Integration

    When systems stand by themselves, they are often referred to as stovepipes
    Systems that easily share data are called well-integrated systems

    [Diagram: Finance DSS Application, Inventory OLTP Application, Finance OLTP Application, Inventory DSS Application]


    Problems with Stovepipe Architecture

    Problems:
      Users who wish to access data must query several different DSS to find it
      Data may have fundamental conflicts between DSS
        a department code table in one DSS may differ from the one in another DSS
        a measurement may be stored in meters in one DSS and in yards in another
    Solution:
      Use a data warehouse, where data is integrated from the several different stovepipe systems
      A data warehouse is really sharing-lite: you don't have to coordinate as much when applications are built, and you still reap the benefits of data sharing

    Data Warehouse Solution

    A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions
    A data warehouse has the key, corporate data in the organization
    A data warehouse tracks historical data


    Data Warehouse - A Success Story

    The largest data warehouse is Wal-Mart's (9 TB)
    Uses for the Wal-Mart data warehouse
      Identifies where a new store should be built based on customer demand
      Identifies how stores are performing across the nation
      Contains every scan from every purchase
    Benefits Wal-Mart gained from their data warehouse
      Provided competitive advantage over K-Mart
      Reduced excess inventory in individual stores
      Avoided wasted funds in building stores which would fail

    Selling the Data Warehouse

    A data warehouse project will fail without corporate sponsorship
      Preferably, the project should be sponsored by the CEO
      The CEO must be sold on the business value of deploying a data warehouse to improve competitive advantage
    If an active, corporate sponsor does not exist, data sources will be very difficult to identify
    Only add data to the warehouse that will answer key, corporate questions asked by the corporate sponsor. Otherwise, you will have a data dump


    Building a Useful Data Warehouse

    You really need:
      strong executive sponsorship
      good knowledge of the data
      sound software engineering
      stability from source systems
      users who want a success
    A 75 percent failure rate is often cited
    It is WORTH the effort!!!

    Enterprise Information System

    An EIS (Enterprise Information System) allows users to query data in a data warehouse
    Users can access key, corporate data in the data warehouse

    [Diagram: Enterprise Information System -> Data Warehouse]


    Users of an Enterprise Information System

    Frequently, multiple EIS are needed to satisfy different types of users
      Some users only want a system that has pre-defined reports so they only need to click one button to see the data they need. These users want the system to be no harder to use than a coffee pot
      Other users want to delve into the data and build their own queries
    Executives want high-level, summary data and a simple tool
      Must be VERY easy to use; users want to click a few buttons and get the data they want
      Results must be graphs
      Users should be able to drill down into key areas

    Users of an Enterprise Information System (continued)

    Analysts want a flexible, more detailed tool
      Often very knowledgeable about the data
      Willing to do more work to learn about the data
      Sometimes even learn SQL to issue their own ad-hoc queries
    General users want a tool that provides detailed data, but is very easy to use
      Want access to the data warehouse to do routine tasks such as "Find me Hank's phone number"
      Simple application, but not so focused on large reports


    Data Warehouse / EIS

    [Diagram: the Finance, Inventory, and Sales OLTP applications each have their own OLTP data; the data warehouse holds Finance, Inventory, and Sales subject areas; the Enterprise Information System sits on the data warehouse]

    Need for Data Warehouses

    Data warehouses provide a single place to store key corporate data
      The idea is that users can go to one place to find this key data using an enterprise information system (EIS)
    A data warehouse is also a place to store and access historical data
      Users measure performance goals for their company over a period of time
      Company statistics are available
    Data not stored in the same place is difficult to locate and compare, and is easily lost
      A single query can be used to access key data


    Security in a Data Warehouse

    Building a data warehouse does increase security risk because key, corporate information is all in one place
    To mitigate that risk, database system components can be used to protect the data warehouse. These include
      Views
      Access control
      Security Administration
      Encryption
      Audit

    Moving Data into the Data Warehouse

    Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing
    Updates to the data warehouse are performed periodically
      weekly
      nightly
      monthly
    Occasionally, real-time data is needed in a data warehouse, but this is not very common


    Using Middleware to Move Data

    Data can be moved to the warehouse via data migration software
    This is often called middleware because it sits between the source OLTP system and the data warehouse

    [Diagram: Source OLTP System -> Data Warehouse Migration Software (Middleware) -> Data Warehouse]

    Need for a Data Mart

    A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data
      Sometimes, users only need a piece of data from the data warehouse
    The data mart is typically fed from the data warehouse

    [Diagram: Data Warehouse (Finance, Inventory, and Sales subject areas) -> New York Data Mart, California Data Mart]


    Data Mart in Action

    [Diagram: the Finance, Inventory, and Sales OLTP applications each have their own OLTP data; the data warehouse holds Finance, Inventory, and Sales subject areas; the New York and California data marts and the Enterprise Information System sit on the data warehouse]

    Data Mining Introduction

    Data mining is done by running software that examines a database and looks for patterns in the data
    A data warehouse by itself will respond to queries from users
      It will not tell users about patterns in data that users may not have thought about
    To find patterns in data, data mining is used to try to mine key information from a data warehouse


    Advantages of Data Mining

    Data mining allows companies to collect information that makes them more productive and helps them beat their competition
    Data mining helps identify
      why customers buy certain products
      ideas for very direct marketing
      ideas for shelf placement
      training of employees vs. employee retention
      employee benefits vs. employee retention

    Implementing Data Mining

    Apply data mining tools to run data mining algorithms against data
    There are two approaches:
      Copy data from the Data Warehouse and mine it
      Mine the data in the Data Warehouse
    Popular tools use a variety of different data mining algorithms:
      association rules
      genetic algorithms
      decision trees
      neural networks


    Data Mining Using Separate Data

    You can move data from the data warehouse to data mining tools
    Advantages
      Data mining tools may organize data so they can run faster
    Disadvantages
      Could be very expensive to move large amounts of data

    [Diagram: Data Warehouse -> copy of data made by the Data Mining Tool -> Data Mining Tool]

    Data Mining Against the Data Warehouse

    Data mining tools can access data directly in the Data Warehouse
    Advantages
      No copy of data is needed for data mining
    Disadvantages
      Data may not be organized in a way that is efficient for the tool

    [Diagram: Data Mining Tool -> Data Warehouse]


    Data Mining: Summary

    Data mining attempts to find patterns in data that we did not know about
    Often data mining is just a new buzzword for statistics
    Data mining differs from statistics in that large volumes of data are used
    Many different data mining algorithms exist and we will discuss them in the course
    Examples
      identify users who are most likely to commit credit card fraud
      identify which attributes of a person most often result in them buying product x

    SQL Review

    (Slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)


    Introduction to SQL

    1) Introduction to SQL
    2) Data Definition Language (DDL)
    3) Data Manipulation Language (DML)
    4) SELECT Construct
    5) SELECT Operators
    6) Wildcard Searches
    7) Aggregate Operators
    8) Calculated Attributes
    9) Sorting Results

    Introduction to Structured Query Language

    Structured Query Language (SQL) is the language used to communicate with a relational database
      Industry standard
      Based on set theory
    SQL is composed of two types of constructs:
      Data Definition Language (DDL)
        Defines the structure of the database
      Data Manipulation Language (DML)
        Provides the constructs to input and retrieve data


    SQL Overview - DDL

    Data Definition Language (DDL) is used to describe the structure of the database
      Create tables, indexes, etc.
    Typical operations are:
      CREATE TABLE defines what columns are in the table and establishes the table
      CREATE INDEX defines an index for the table. Indexes are used to improve database performance

    SQL Overview - DML

    Data Manipulation Language (DML) is used for storing, updating, and retrieving data.
    Typical operations include:
      SELECT is used to retrieve data.
        SELECT * FROM PRODUCTS
      INSERT is used to add new rows to the database.
        INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares')
      UPDATE is used to change rows that already exist in the database.
        UPDATE PRODUCTS SET PRICE = PRICE + 4
      DELETE is used to eliminate rows of data from the database.
        DELETE FROM PRODUCTS
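A minimal sketch of the four DML operations, run through Python's built-in sqlite3 module. The PRODUCTS columns and the sample rows are assumptions made up for illustration; the slides do not define them. It also shows that UPDATE and DELETE touch every row unless a WHERE clause narrows them.

```python
import sqlite3

# Hypothetical PRODUCTS table; the column names are assumed for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCTS (ProductName TEXT, ProductType TEXT, Price REAL)")

# INSERT adds new rows.
conn.executemany(
    "INSERT INTO PRODUCTS VALUES (?, ?, ?)",
    [("Cereal", "Food", 3.50), ("Hammer", "Hardware", 9.00)],
)

# UPDATE without a WHERE clause changes every row.
conn.execute("UPDATE PRODUCTS SET Price = Price + 4")
prices = [row[0] for row in conn.execute("SELECT Price FROM PRODUCTS ORDER BY Price")]

# DELETE with a WHERE clause removes only the matching rows.
conn.execute("DELETE FROM PRODUCTS WHERE ProductType = 'Hardware'")
after_filtered_delete = conn.execute("SELECT COUNT(*) FROM PRODUCTS").fetchone()[0]

# DELETE without a WHERE clause empties the table.
conn.execute("DELETE FROM PRODUCTS")
after_full_delete = conn.execute("SELECT COUNT(*) FROM PRODUCTS").fetchone()[0]

print(prices, after_filtered_delete, after_full_delete)
```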


    SELECT Overview

    SELECT is used to retrieve records from the database.

    Single table SELECT constructs:

    WHERE

    IN

    BETWEEN

    LIKE

    Aggregate Operators

    DISTINCT

    ORDER BY

    SELECT Examples

    Query Purpose: Retrieve names and prices of all products

        SELECT ProductName, Price
        FROM TinyProducts

    Query Purpose: Retrieve all information for all products from the TinyProducts table

        SELECT *
        FROM TinyProducts


    SELECT with WHERE

    The WHERE clause is used to filter which information is returned from a SELECT
    Query Purpose: Retrieve all information only for the product type of food

        SELECT *
        FROM TinyProducts
        WHERE ProductType = 'Food'

    Use of Boolean Operators

    Conditions can be combined with Boolean operators: AND, OR, NOT
    Query Purpose: List all information about food products that are either cereal or fruit

        SELECT *
        FROM TinyProducts
        WHERE (ProductName = 'Cereal')
           OR (ProductName = 'Fruit')


    Boolean Operator Example

    Query Purpose: List all products that are fruit and whose price is less than $2.00

        SELECT ProductType, ProductName
        FROM TinyProducts
        WHERE Price < 2
          AND ProductName = 'Fruit'

    IN Operator

    The IN operator allows a search for records that match one value in a set of unordered values
    Example questions to use IN:
      'Find all products whose type is Food, Hardware, or Housewares'
      'Find all food whose type is Meat, Fish, Vegetables, or Fruit'


    IN Example

    Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes

        SELECT ProductName, ProductType
        FROM TinyProducts
        WHERE ProductName IN
              ('Cookware', 'Linens', 'Dishes')

    instead of:

        SELECT ProductName, ProductType
        FROM TinyProducts
        WHERE (ProductName = 'Cookware')
           OR (ProductName = 'Linens')
           OR (ProductName = 'Dishes')

    BETWEEN Operator

    The BETWEEN operator allows a search for a range of values
    Example Queries:
      'Find all fruit between Bananas and Grapes'
      'Find all cereals whose price is between $1.50 and $4.00 a box'

    [Graphic: a price range from 1.50 to 4.00]


    BETWEEN Example

    Query Purpose: Find all products whose price is between $2.00 and $8.00

        SELECT ProductName, Price
        FROM TinyProducts
        WHERE Price BETWEEN 2.00 AND 8.00

    instead of:

        SELECT ProductName, Price
        FROM TinyProducts
        WHERE (Price >= 2.00) AND (Price <= 8.00)
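BETWEEN is inclusive at both ends, so it matches the same rows as a pair of >= and <= comparisons joined with AND. A small sketch using Python's sqlite3 module; the sample rows are invented for illustration and include prices exactly on both boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, Price REAL)")
# Hypothetical rows; 2.00 and 8.00 sit exactly on the range boundaries.
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("Fruit", 1.50), ("Cereal", 2.00), ("Linens", 8.00), ("Cookware", 25.00)],
)

between_rows = conn.execute(
    "SELECT ProductName FROM TinyProducts "
    "WHERE Price BETWEEN 2.00 AND 8.00 ORDER BY ProductName"
).fetchall()

range_rows = conn.execute(
    "SELECT ProductName FROM TinyProducts "
    "WHERE (Price >= 2.00) AND (Price <= 8.00) ORDER BY ProductName"
).fetchall()

print(between_rows)
print(between_rows == range_rows)
```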


    Wildcard Search Examples

    Query Purpose: List all products whose name starts with a 'C'

        SELECT *
        FROM TinyProducts
        WHERE ProductName LIKE 'C%'

    Query Purpose: List all products that have a SKU number whose last 2 characters are '23' when you don't know the first character

        SELECT *
        FROM TinyProducts
        WHERE SKUNumber LIKE '_23'
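A short sketch of the two wildcards, using Python's sqlite3 module with invented sample rows. The % wildcard matches any run of characters, while _ matches exactly one, so '_23' only matches three-character SKU numbers. Note that SQLite's LIKE ignores case for ASCII letters by default; other database systems may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, SKUNumber TEXT)")
# Hypothetical rows chosen to show what each pattern does and does not match.
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("Cereal", "A23"), ("Cookware", "B23"), ("Fruit", "AB23"), ("Dishes", "C99")],
)

# % matches any run of characters, including none.
starts_with_c = [r[0] for r in conn.execute(
    "SELECT ProductName FROM TinyProducts "
    "WHERE ProductName LIKE 'C%' ORDER BY ProductName")]

# _ matches exactly one character, so 'AB23' (four characters) is excluded.
sku_23 = [r[0] for r in conn.execute(
    "SELECT SKUNumber FROM TinyProducts "
    "WHERE SKUNumber LIKE '_23' ORDER BY SKUNumber")]

print(starts_with_c, sku_23)
```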

    Aggregate Operators

    MIN, MAX, and AVG are used when computing statistics on a range of data
    Query Examples:
      'What is the highest batting average on the team?'
      'What is the average number of hits for all the little league teams in the National League?'
      'What are the names of the players that had the lowest average on the little league team?'


    Aggregate Operators Example

    Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League

        SELECT MIN(Average), MAX(Average),
               AVG(Average)
        FROM PLAYERS
        WHERE League = 'National'

    SUM and COUNT Operators

    Use the SUM operator to total the results of a query
    COUNT will count the total number of occurrences of an item in a search

    [Graphic: 1 + 2 + 3 + 4]


    SUM and COUNT Examples

    Query Purpose: Find the total number of home runs hit by all players in the American League

        SELECT SUM(HomeRuns)
        FROM PLAYERS
        WHERE League = 'American'

    Query Purpose: Count the players that have hit 3 home runs in the National League

        SELECT COUNT(*)
        FROM PLAYERS
        WHERE HomeRuns = 3
          AND League = 'National'

    Calculated Attributes

    A new attribute can be obtained by using arithmetic operators (+, -, *, /) on other numeric attributes
    All operators follow standard precedence:
      Multiplication and division are computed first, left to right
      Addition and subtraction are computed last, left to right
      Use parentheses to override the standard precedence


    Calculated Attributes Example

    Query Purpose: List all players with their hits, at bats, and their batting average

        SELECT Name, Hits, AtBats,
               (Hits / AtBats)
        FROM PLAYERS
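One caveat worth checking on your own database: if Hits and AtBats are stored as integers, some systems (SQLite among them) perform integer division and truncate the batting average to 0. The sketch below, using Python's sqlite3 module with invented players, shows the truncation and one common workaround, multiplying by 1.0 first. Because * and / have equal precedence and evaluate left to right, Hits * 1.0 / AtBats is computed as (Hits * 1.0) / AtBats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption for this sketch: both columns are declared INTEGER.
conn.execute("CREATE TABLE PLAYERS (Name TEXT, Hits INTEGER, AtBats INTEGER)")
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?)",
    [("Ann", 3, 10), ("Bob", 5, 20)],
)

# Integer divided by integer is truncated in SQLite.
truncated = conn.execute(
    "SELECT Name, Hits / AtBats FROM PLAYERS ORDER BY Name").fetchall()

# Multiplying by 1.0 first forces floating-point division.
averages = conn.execute(
    "SELECT Name, Hits * 1.0 / AtBats FROM PLAYERS ORDER BY Name").fetchall()

print(truncated)
print(averages)
```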

    DISTINCT Operator

    DISTINCT is used to exclude duplicate occurrences in the result of a query
    Query Purpose: List all distinct batting averages

        SELECT DISTINCT(Average)
        FROM PLAYERS


    Sorting Query Results

    The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query
    Use DESC on the end of the ORDER BY clause to sort the data in descending order. Otherwise, the result will be in ascending order

    Sorting Example

    Query Purpose: List all players in ascending order of their batting average

        SELECT Name, Average
        FROM PLAYERS
        ORDER BY Average

    For descending order, add the keyword DESC

        SELECT Name, Average
        FROM PLAYERS
        ORDER BY Name DESC


    Sorting Calculated Attributes

    To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT
    Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query)

        SELECT Name, Hits, AtBats,
               Hits / AtBats
        FROM PLAYERS
        ORDER BY 4 DESC
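A quick check of positional ORDER BY using Python's sqlite3 module. The computed batting average is the fourth item in the SELECT list (after Name, Hits, and AtBats), so it is referenced as position 4. The players are invented, and the 1.0 factor is an addition in this sketch to avoid integer division in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAYERS (Name TEXT, Hits INTEGER, AtBats INTEGER)")
# Hypothetical players with averages 0.3, 0.25, and 0.4.
conn.executemany(
    "INSERT INTO PLAYERS VALUES (?, ?, ?)",
    [("Ann", 3, 10), ("Bob", 5, 20), ("Cal", 4, 10)],
)

# Position 4 is the computed column Hits * 1.0 / AtBats.
ordered = conn.execute(
    "SELECT Name, Hits, AtBats, Hits * 1.0 / AtBats "
    "FROM PLAYERS ORDER BY 4 DESC"
).fetchall()
names = [row[0] for row in ordered]

print(names)
```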

    More SQL

    1) GROUP BY Construct
    2) HAVING Filter
    3) Multiple Tables
    4) Joins
    5) Equijoins
    6) Cartesian Product
    7) Nulls
    8) OUTER JOIN


    GROUP BY Clause

    GROUP BY will partition a table into multiple groups of related rows.
    As an example, consider the EMPLOYEE table, where Department partitions the EMPLOYEE set into subsets:

    [Diagram: the EMPLOYEE set divided into subsets labeled Engineering, Marketing, Customer, and Finance]

    GROUP BY Example

    Query Purpose: For each department, list the average salary using the EMPLOYEE table

        SELECT Department, AVG(Salary)
        FROM EMPLOYEE
        GROUP BY Department


    GROUP BY with WHERE

    To filter data further, we can use the WHERE clause with the GROUP BY clause
    Query Purpose: For each department, list the highest salary of their administrative assistants.

        SELECT Department, MAX(Salary)
        FROM EMPLOYEE
        WHERE Title = 'administrative assistant'
        GROUP BY Department

    HAVING Construct

    HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX, and AVG, to only those groups of rows that meet some condition.
    Query Purpose: List the average salary for all departments that have more than three employees.

        SELECT Department, AVG(Salary)
        FROM EMPLOYEE
        GROUP BY Department
        HAVING COUNT(*) > 3
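A sketch of GROUP BY with and without HAVING, using Python's sqlite3 module. The employees and salaries are invented for illustration. WHERE filters individual rows before grouping; HAVING filters whole groups after the aggregates are computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Name TEXT, Department TEXT, Salary INTEGER)")
# Hypothetical data: four employees in Engineering, two in Finance.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Ann", "Engineering", 100), ("Bob", "Engineering", 200),
     ("Cy", "Engineering", 300), ("Di", "Engineering", 400),
     ("Ed", "Finance", 500), ("Flo", "Finance", 700)],
)

# One result row per department.
all_depts = conn.execute(
    "SELECT Department, AVG(Salary) FROM EMPLOYEE "
    "GROUP BY Department ORDER BY Department"
).fetchall()

# HAVING keeps only the groups with more than three rows.
large_depts = conn.execute(
    "SELECT Department, AVG(Salary) FROM EMPLOYEE "
    "GROUP BY Department HAVING COUNT(*) > 3"
).fetchall()

print(all_depts)
print(large_depts)
```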


    Multi-Table SQL

    It is often necessary to combine data from multiple tables.

    EMPLOYEE
    EmpID  Name   Salary
    1      Fred   200
    2      Ethel  300
    3      Mike   400
    4      David  100

    ATTENDS
    EmpID  Name
    1      Harvard
    2      GMU
    2      Yale
    3      MIT
    3      Stanford
    3      GMU

    Joins

    Joins are the means by which multiple tables can be combined.
    A join allows us to combine data from different tables. A join operation is done through the SELECT construct.
    Types of Joins: Equijoin, Outer Join, Inner Join


    Equijoin

    Joins only those rows where a foreign key matches the primary key
    Allows information from multiple tables to be linked together in a single query
    Can be used to link as many tables as needed in a single query

    Equijoin Query Example

    Query Purpose: List the names of all colleges attended by Ethel

        SELECT b.Name
        FROM EMPLOYEE a, ATTENDS b
        WHERE a.EmpID = b.EmpID
          AND a.Name = 'Ethel'


    Equijoin Example

    ATTENDS
    EmpID  College  GPA
    1      Harvard  2.45
    2      GMU      3.79
    2      Nova     3.65
    3      Yale     2.85
    3      Nova     2.65
    3      GMU      4.0

    EMPLOYEE
    EmpID  Name   Salary
    1      Fred   200
    2      Ethel  300
    3      Mike   400

    Warning about Joining Tables

    A join is really just a subset of a Cartesian product. When no fields are 'joined' in the WHERE clause, a Cartesian product is produced
      Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.
    Sample Query:

        SELECT b.Name
        FROM EMPLOYEE a, ATTENDS b
        WHERE a.Name = 'Ethel'
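The effect of dropping the linking condition can be seen by running both versions of the query. This sketch uses Python's sqlite3 module with the EMPLOYEE and ATTENDS rows from the Multi-Table SQL slide: with the join condition, only Ethel's two colleges come back; without it, Ethel is paired with every ATTENDS row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)],
)
conn.executemany(
    "INSERT INTO ATTENDS VALUES (?, ?)",
    [(1, "Harvard"), (2, "GMU"), (2, "Yale"),
     (3, "MIT"), (3, "Stanford"), (3, "GMU")],
)

# Equijoin: the linking condition a.EmpID = b.EmpID is present.
joined = [r[0] for r in conn.execute(
    "SELECT b.Name FROM EMPLOYEE a, ATTENDS b "
    "WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel'")]

# Linking condition omitted: Ethel's row is paired with all six ATTENDS rows.
unjoined = [r[0] for r in conn.execute(
    "SELECT b.Name FROM EMPLOYEE a, ATTENDS b WHERE a.Name = 'Ethel'")]

print(sorted(joined), len(unjoined))
```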


    Cartesian Product

    Each row in one table is paired with every row in the other table

    a.EmpID  a.Name  a.Salary  b.EmpID  b.GPA
    2        Ethel   300       1        3.4
    2        Ethel   300       2        2.8
    2        Ethel   300       3        3.7
    2        Ethel   300       4        3.5
    ....

    Nulls

    An attribute may be defined as null.
    This indicates that the value is unknown and avoids the need for user-defined special indicators.
    To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLE statement when setting up the database.


    Nulls Examples

    Statement Purpose: Add an employee whose salary is unknown

        INSERT INTO EMPLOYEE VALUES (3, 'Hank', NULL)

    Query Purpose: Find all employees whose salary is unknown (or null)

        SELECT *
        FROM EMPLOYEE
        WHERE Salary IS NULL
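A sketch of why the test must be written IS NULL rather than = NULL, plus a NOT NULL column rejecting a null. It uses Python's sqlite3 module; the NOT NULL constraint on Name and the employee IDs are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption for this sketch: Name is declared NOT NULL, Salary may be null.
conn.execute(
    "CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT NOT NULL, Salary INTEGER)")
conn.execute("INSERT INTO EMPLOYEE VALUES (1, 'Fred', 200)")
conn.execute("INSERT INTO EMPLOYEE VALUES (5, 'Hank', NULL)")

# Comparing with = NULL never evaluates to true, so no rows come back.
equals_null = conn.execute(
    "SELECT Name FROM EMPLOYEE WHERE Salary = NULL").fetchall()

# IS NULL is the correct test for an unknown value.
is_null = [r[0] for r in conn.execute(
    "SELECT Name FROM EMPLOYEE WHERE Salary IS NULL")]

# A NOT NULL column refuses a null value.
try:
    conn.execute("INSERT INTO EMPLOYEE VALUES (6, NULL, 100)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(equals_null, is_null, rejected)
```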

    OUTER JOIN

    An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables.
    A LEFT OUTER JOIN returns all rows from the 'left' table.
    Nulls are returned when a row in the 'left' table has no corresponding rows in the 'right' table.


    LEFT OUTER JOIN Example

    Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges

        SELECT a.Name, b.GPA
        FROM EMPLOYEE a
        LEFT OUTER JOIN ATTENDS b
        ON a.EmpID = b.EmpID

    LEFT OUTER JOIN Example

    Result of the outer join
      All employees are listed.
      For an equijoin, only those who attended a college would be listed
      Here, employee number 4 did not attend college, but is still retrieved by the outer join.

    Name       GPA
    ---------- -----
    Fred       2.45
    Ethel      3.79
    Ethel      3.65
    Mike       2.85
    Mike       2.65
    Mike       4.00
    David      NULL
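The result table above can be reproduced with Python's sqlite3 module. This sketch loads the ATTENDS rows from the Equijoin Example slide and the four-row EMPLOYEE table (including David, employee 4), then compares the outer join with the equijoin. A SQL NULL is returned to Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.execute("CREATE TABLE ATTENDS (EmpID INTEGER, College TEXT, GPA REAL)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Fred", 200), (2, "Ethel", 300), (3, "Mike", 400), (4, "David", 100)],
)
conn.executemany(
    "INSERT INTO ATTENDS VALUES (?, ?, ?)",
    [(1, "Harvard", 2.45), (2, "GMU", 3.79), (2, "Nova", 3.65),
     (3, "Yale", 2.85), (3, "Nova", 2.65), (3, "GMU", 4.0)],
)

# Outer join: every employee appears, with None where no college matches.
outer_rows = conn.execute(
    "SELECT a.Name, b.GPA FROM EMPLOYEE a "
    "LEFT OUTER JOIN ATTENDS b ON a.EmpID = b.EmpID "
    "ORDER BY a.EmpID"
).fetchall()

# Equijoin: employees without a matching ATTENDS row are dropped.
inner_rows = conn.execute(
    "SELECT a.Name, b.GPA FROM EMPLOYEE a, ATTENDS b "
    "WHERE a.EmpID = b.EmpID"
).fetchall()

print(outer_rows)
print(len(inner_rows))
```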


    Advanced SQL

    (Slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    Advanced SQL

    1) Finding the nth element in a list

    2) Finding the median

    3) Correlated subquery

    4) Data Definition Language Constructs


    Find the Nth Element

    It is very common to try to find the nth element in a list.
    Examples:
      Who makes the second highest salary in the marketing department?
      What is the fifth best product in sales?
    This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving rows from the result set until the threshold is crossed.
    We show another way of doing this using standard SQL.

    Find the Nth Element: Example Table

    Consider a table, called TEST, with just one column, x, with the following values:

    X
    4
    5
    8


    Find the Nth Element: Step 1

    First join TEST with itself. This yields each element matched with every element, including itself:

    Left  Right
    4     4
    4     5
    4     8
    5     4
    5     5
    5     8
    8     4
    8     5
    8     8

    Find the Nth Element: Step 2

    Next keep only those rows where the first column is greater than or equal to the second column.

    Notice the pattern that just developed: each number on the list now has a certain number of values that match on the right. This number matches the position of the value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, and 8 has three matches.

    a.x  b.x
    ---  ---
    4    4
    5    4
    5    5
    8    4
    8    5
    8    8

    Find the Nth Element: Step 3

    Now group by the column on the left and identify the size of each group.

    The same ideas can be applied to any SELECT statement output.

    x    COUNT(*)
    ---  --------
    4    1
    5    2
    8    3
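
    The three steps can be run end to end as a minimal sketch with Python's built-in sqlite3 module. The aliases a and b are assumptions borrowed from the queries on the following slides; the TEST table and its values come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TEST (x INTEGER)")
conn.executemany("INSERT INTO TEST VALUES (?)", [(4,), (5,), (8,)])

# Step 1: the self join pairs every element with every element (3 x 3 = 9 rows).
step1 = conn.execute(
    "SELECT a.x, b.x FROM TEST a, TEST b ORDER BY a.x, b.x"
).fetchall()

# Step 2: keep only the rows where the left value is >= the right value.
step2 = conn.execute(
    "SELECT a.x, b.x FROM TEST a, TEST b WHERE a.x >= b.x ORDER BY a.x, b.x"
).fetchall()

# Step 3: group by the left column; the group size is the value's position.
step3 = conn.execute(
    "SELECT a.x, COUNT(*) FROM TEST a, TEST b "
    "WHERE a.x >= b.x GROUP BY a.x ORDER BY a.x"
).fetchall()

print(step3)
```

    The positions only line up this cleanly when the values are distinct; duplicate values would share the same count.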

    Finding the Nth Element: Example

    Query Purpose: Find the information about the product with the second-highest price.

    SELECT a.ProductName, a.ProductType,

    a.Price, a.SKUNumber

    FROM TinyProducts a, TinyProducts b

    WHERE a.Price >= b.Price

    GROUP BY a.ProductName,a.ProductType,

    a.Price, a.SKUNumber

    HAVING COUNT(*) =

    (SELECT COUNT(*)-1 FROM TinyProducts)

    Finding the Top N Elements: Example

    To ask for the top n values instead of the nth value, specify a range (>=) instead of just an equality (=) in the HAVING.

    Query Purpose: Find information about the products with the two highest prices.

    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber

    FROM TinyProducts a, TinyProducts b

    WHERE a.Price >= b.Price

    GROUP BY a.ProductName,a.ProductType,

    a.Price, a.SKUNumber

    HAVING COUNT(*) >=

    (SELECT COUNT(*)-1 FROM TinyProducts)

    ORDER BY a.Price
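
    Both the second-highest query and the top-two query can be tried against a small table. This is a sketch only: the four TinyProducts rows are hypothetical, and the query is parameterized on the HAVING operator purely for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TinyProducts "
    "(ProductName TEXT, ProductType TEXT, Price REAL, SKUNumber INTEGER)"
)
# Hypothetical rows with distinct prices.
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?, ?, ?)",
    [("Widget", "Tool", 10.0, 1), ("Gadget", "Tool", 25.0, 2),
     ("Gizmo", "Toy", 40.0, 3), ("Doohickey", "Toy", 55.0, 4)],
)

# Each product is grouped with every product priced at or below it, so the
# group size is the product's rank counted from the cheapest.
RANKED = """
    SELECT a.ProductName, a.Price
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) {op} (SELECT COUNT(*) - 1 FROM TinyProducts)
    ORDER BY a.Price
"""

second_highest = conn.execute(RANKED.format(op="=")).fetchall()
top_two = conn.execute(RANKED.format(op=">=")).fetchall()

print(second_highest)
print(top_two)
```

    With four rows, the second-highest product has a group of size 3 (COUNT(*) - 1). The technique assumes distinct prices: two products tied on price would both receive the larger count.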

    Finding the Median

    The median is defined as the element in the middle of the sorted list.

    Query Purpose: Find the median price in TinyProducts.

    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber

    FROM TinyProducts a, TinyProducts b

    WHERE a.Price >= b.Price

    GROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber

    HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)
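
    A short sketch of the median query, again with hypothetical rows and only two of the TinyProducts columns. It also shows what happens with an even row count, assuming the DBMS performs integer division on COUNT(*)/2 as SQLite does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TinyProducts (ProductName TEXT, Price REAL)")
# Hypothetical rows: five distinct prices.
conn.executemany(
    "INSERT INTO TinyProducts VALUES (?, ?)",
    [("A", 10.0), ("B", 25.0), ("C", 40.0), ("D", 55.0), ("E", 70.0)],
)

MEDIAN = """
    SELECT a.ProductName, a.Price
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.Price
    HAVING COUNT(*) = (SELECT (COUNT(*) / 2) + 1 FROM TinyProducts)
"""

# Odd row count: 5 / 2 + 1 = 3, so the third-cheapest row is returned.
median_odd = conn.execute(MEDIAN).fetchall()

# Even row count: 6 / 2 + 1 = 4, so the upper of the two middle rows is returned.
conn.execute("INSERT INTO TinyProducts VALUES ('F', 85.0)")
median_even = conn.execute(MEDIAN).fetchall()

print(median_odd, median_even)
```

    For an even number of rows the query picks the upper middle element rather than averaging the two middle values, which is worth knowing before relying on it for reporting.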

    Using Subqueries

    A subquery may be used in the middle of a query.

    Query Purpose: Find the information about the highest-priced product, using a simple subquery.

    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber

    FROM TinyProducts a

    WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)

    Correlated Subquery

    If the subquery references a data element from outside of the subquery, it is called a correlated subquery.

    For each row in the outer part of the query, the correlated subquery is executed.

    The following query will indicate who makes more money than Ethel:

    SELECT a.Name, a.Salary

    FROM Employee a WHERE EXISTS

    (SELECT b.Salary

    FROM Employee b

    WHERE a.Salary > b.Salary

    AND b.Name = 'Ethel')
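
    The correlated subquery can be exercised with a small sketch. The Employee rows and salaries below are hypothetical; only the query text follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary REAL)")
# Hypothetical salaries.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Fred", 200.0), ("Ethel", 300.0), ("Mike", 450.0), ("David", 150.0)],
)

# The subquery refers to a.Salary from the outer query, so it is evaluated
# once per outer row: "is there an Ethel row with a lower salary than mine?"
earns_more = conn.execute("""
    SELECT a.Name, a.Salary
    FROM Employee a
    WHERE EXISTS (SELECT b.Salary
                  FROM Employee b
                  WHERE a.Salary > b.Salary
                    AND b.Name = 'Ethel')
    ORDER BY a.Name
""").fetchall()

print(earns_more)
```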

    Other Data Manipulation

    INSERT

    Add rows to a single table

    UPDATE

    Modify rows in a single table

    DELETE

    Remove rows from a single table

    INSERT Examples

    Statement Purpose: Add a record for employee #1, 'Fred', with a salary of 200 to the EMPLOYEE table

    INSERT INTO Employee VALUES

    (1, 'Fred', 200)

    Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE

    INSERT INTO New_Employee

    SELECT * FROM Employee

    UPDATE Example

    Statement Purpose: Modify Fred's salary to 150

    UPDATE Employee

    SET Salary = 150.00

    WHERE Name = 'Fred'

    Statement Purpose: Give all employees a ten percent raise

    UPDATE Employee

    SET Salary = Salary * 1.10

    DELETE Examples

    Statement Purpose: Remove all employees who have a salary higher than 100.

    DELETE FROM Employee

    WHERE Salary > 100

    To remove all employees:

    DELETE FROM Employee

    CREATE TABLE Example

    Statement Purpose: Create a table to store employee information

    CREATE TABLE EMPLOYEE

    (EmpId SMALLINT,

    Name CHAR(10),

    Salary DECIMAL(5,2))

    To drop the EMPLOYEE table

    DROP TABLE EMPLOYEE
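
    The data definition and data manipulation statements from the last few slides can be run in sequence as a sketch. The second employee row is hypothetical, added so that the DELETE leaves something behind. SQLite accepts the SMALLINT, CHAR(10), and DECIMAL(5,2) declarations but does not enforce their sizes, so this only illustrates statement behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpId SMALLINT, Name CHAR(10), Salary DECIMAL(5,2))")
conn.execute("CREATE TABLE New_Employee (EmpId SMALLINT, Name CHAR(10), Salary DECIMAL(5,2))")

# INSERT: one row from the slide, plus one hypothetical row.
conn.execute("INSERT INTO Employee VALUES (1, 'Fred', 200)")
conn.execute("INSERT INTO Employee VALUES (2, 'Ethel', 80)")

# INSERT ... SELECT copies every row into the second table.
conn.execute("INSERT INTO New_Employee SELECT * FROM Employee")
copied = conn.execute("SELECT COUNT(*) FROM New_Employee").fetchone()[0]

# UPDATE with a WHERE clause touches one row; without it, every row.
conn.execute("UPDATE Employee SET Salary = 150.00 WHERE Name = 'Fred'")
conn.execute("UPDATE Employee SET Salary = Salary * 1.10")
salaries = {
    name: round(salary, 2)
    for name, salary in conn.execute("SELECT Name, Salary FROM Employee")
}

# DELETE removes only the rows that satisfy the predicate.
conn.execute("DELETE FROM Employee WHERE Salary > 100")
remaining = [name for (name,) in conn.execute("SELECT Name FROM Employee")]

# DROP TABLE removes the table definition itself.
conn.execute("DROP TABLE Employee")
tables = [
    name for (name,) in
    conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

print(copied, salaries, remaining, tables)
```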

    Data Warehouse Security

    (slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    1) Key Security Services

    2) Views

    3) Access Control

    4) Roles

    5) Encryption

    6) Audit Trails

    7) Security Holes

    8) Intrusion Detection

    9) Misuse Detection

    Introduction

    A key feature provided by database systems is good security services.

    In a database system with good security, applications do not have to worry about problems that arise with security violations.

    A data warehouse also requires good security services because it holds key corporate data.

    [Figure: EIS, Database System, Security Services]

    Key Security Services

    Access Control

    Controls who accesses what data

    Administration of Access Control

    Used to give access to users as well as track who has various accesses and what kind of accesses are given to a user or group of users

    Audit

    Tracks the usage of the data warehouse

    Security in a Data Warehouse

    A data warehouse consolidates an organization's key data in one place.

    A data warehouse increases the security risk that unauthorized users will try to obtain this data

    Security aspects of EIS applications must be designed and implemented very thoroughly.

    Access control and audits are two of the critical components of security.

    Data Warehouse Security Components

    Database system components that can be used to protect a data warehouse include:

    Views

    Allow users to only see certain rows or columns of data

    Access control

    Indicate which users have access to what data

    Administration

    This component is used to actually give access to groups of users and to define the accesses given to either an individual or a group.

    Encryption

    Protect data from access outside of the DBMS

    Audit

    Track what users are doing

    Views in Data Warehouse

    A view is a logical view into one or more tables. Users may be given access to the view without access to the base table.

    Views provide some security assistance because they can hide data from users.

    EMPLOYEE

    Name    Address            Salary
    ------  -----------------  -------
    Hank    1 South Street     $50,000
    Esther  2 North Street     $80,000
    Tom     34 Main Street     $90,000
    Sue     45 Easy Street     $28,500
    Dave    56 5th Avenue      $35,000
    Pete    7 Broadway         $60,000
    Kathy   89 Western Avenue  $85,000

    View Example

    A view called SAFE_EMPLOYEE may be created as:

    CREATE VIEW SAFE_EMPLOYEE AS

    (SELECT name, address FROM EMPLOYEE)

    Now users of the view SAFE_EMPLOYEE will not even know that salary exists.

    VIEW (SAFE_EMPLOYEE): Salary is effectively hidden

    Name    Address
    ------  -----------------
    Hank    1 South Street
    Esther  2 North Street
    Tom     34 Main Street
    Sue     45 Easy Street
    Dave    56 5th Avenue
    Pete    7 Broadway
    Kathy   89 Western Avenue
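
    A minimal sketch of the view hiding a column, using Python's built-in sqlite3 module and two of the rows from the EMPLOYEE figure. SQLite has no user accounts, so the step of granting access to the view but not the base table is not shown here; this only demonstrates that the salary column is unreachable through the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (name TEXT, address TEXT, salary INTEGER);
    INSERT INTO EMPLOYEE VALUES ('Hank', '1 South Street', 50000),
                                ('Esther', '2 North Street', 80000);
    CREATE VIEW SAFE_EMPLOYEE AS SELECT name, address FROM EMPLOYEE;
""")

# The view exposes only the columns named in its definition.
cursor = conn.execute("SELECT * FROM SAFE_EMPLOYEE ORDER BY name")
view_columns = [column[0] for column in cursor.description]
view_rows = cursor.fetchall()

# Asking the view for salary fails: the column does not exist there.
try:
    conn.execute("SELECT salary FROM SAFE_EMPLOYEE")
    salary_visible = True
except sqlite3.OperationalError:
    salary_visible = False

print(view_columns, view_rows, salary_visible)
```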

    Updating Views

    Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.

    Example:

    INSERT INTO SAFE_EMPLOYEE VALUES ('Hank', '300')

    Because the view does not include salary, this will insert a NULL into the SALARY column of the base table EMPLOYEE.

    Other restrictions to view updates exist:

    Cannot update a view that is defined with an aggregate

    Cannot update a view that is defined with a GROUP BY

    Data Warehouse Access Control

    Access control is implemented in a data warehouse with the SQL GRANT and REVOKE commands.

    Syntax:

    GRANT <privilege> ON <object> TO <user>

    Example: GRANT SELECT ON EMPLOYEE TO MARY

    Access control is done by DBAs and creators of tables.

    To remove access, the REVOKE command is used.

    Example: REVOKE SELECT ON EMPLOYEE FROM MARY

    Database Roles

    Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users. As an example, some roles for a company might be:

    Administrative assistant

    Loan officer

    Salesperson

    Accesses may be assigned based on roles. This dramatically simplifies administration.

    If new tables are created, it is not necessary to add thousands of new accesses.

    Examples (the statement that adds members to a role varies by DBMS):

    CREATE ROLE loan_officer

    GRANT loan_officer TO Hank, John, Mike

    GRANT SELECT ON LOAN TO loan_officer
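
    Why roles simplify administration can be seen in a toy model. This is not DBMS code: the dictionaries and the grant, revoke, and can functions are hypothetical stand-ins for the catalog tables and checks a real database system maintains.

```python
from collections import defaultdict

# role -> users in that role (membership from the slide's example)
role_members = {"loan_officer": {"Hank", "John", "Mike"}}
# role -> set of (privilege, table) pairs granted to the role
role_grants = defaultdict(set)


def grant(privilege, table, role):
    """Record one grant for a role; every member inherits it."""
    role_grants[role].add((privilege, table))


def revoke(privilege, table, role):
    """Remove one grant from a role; every member loses it."""
    role_grants[role].discard((privilege, table))


def can(user, privilege, table):
    """A user holds a privilege if any role they belong to holds it."""
    return any(
        user in role_members.get(role, ()) and (privilege, table) in grants
        for role, grants in role_grants.items()
    )


# One statement covers all three loan officers.
grant("SELECT", "LOAN", "loan_officer")
print(can("Hank", "SELECT", "LOAN"), can("Mary", "SELECT", "LOAN"))
```

    When a new table is created, a single grant per role is enough, regardless of how many users belong to the role.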

    Example of Application-based Roles

    Consider:

    If the database system controls accesses, then it does not matter what the application does: accesses are controlled consistently (same for SALES as MARKETING)

    However, more fine-grained access control can be granted in the application.

    [Figure: Users, Applications, Database System, Data]

    Application Roles

    The application can restrict:

    Data entry screens

    Reports

    Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.

    Role Based Security in a Data Warehouse

    Both application and database level security are useful in a data warehouse.

    Database level security is needed so that users are only allowed to see data they need to see.

    Application level security can be used to control access to certain menus so that users do not even know what reports exist.

    Encryption

    Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data.

    Example: A message "sell 500 shares" would appear as "xyzzy" without the key. Once the key is paired with the encrypted string "xyzzy", it can then be decrypted.

    The size of the key is a factor in how difficult it is to attack the encryption scheme.

    Three places where encryption might be used in a data warehouse:

    Network

    Data

    Tape backups

    Network Encryption

    In a data warehouse application, data and queries are transmitted through a network.

    Attackers might be able to steal network traffic just by breaking into the network medium.

    One way to reduce the risk of this threat is to encrypt traffic on the network.

    [Figure: the network connects the user, application, database system, data warehouse, and tape backup.]

    Network encryption is critical because the network connects all of the key components in a data warehouse.

    Encrypting network traffic mitigates the risk that an attacker could succeed with a man-in-the-middle attack.

    Without this, it may be possible for the man in the middle to masquerade as another user and circumvent existing application and database security.

    Data Encryption

    Data encryption refers to encrypting the actual data in the data warehouse.

    If the attackers were to retrieve data from the warehouse, they would have to decrypt it in order to read it.

    [Figure: EIS, Database System, Data Warehouse]

    Backup Encryption

    Periodically, databases are copied to some kind of long-term storage (usually tapes).

    If the database is encrypted, but the tapes are not encrypted, the risk exists of someone walking off with the tapes.

    [Figure: EIS, Database System, Data Warehouse, Tape Backup]

    Audit Trails

    Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.

    Audit trails are turned on when the DBMS is started and all activity that uses the data warehouse is tracked in the audit trail.

    If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by that user.

    Details of DW Audit Trails

    An audit trail of a database system typically includes the following information:

    User ID, Date, Time, Object that has been accessed (table or view), Action that accessed the object (INSERT, UPDATE, DELETE, SELECT)

    For UPDATE, the old value and new value are tracked.

    For data warehouses, the SELECT is often used to track the queries that have been run against the warehouse.
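
    The fields listed above can be illustrated with a trigger-based sketch in SQLite. This is an approximation, not how a commercial DBMS implements auditing: the audit_trail and session_info tables and the trigger name are hypothetical, session_info stands in for the login identity that SQLite does not have, and triggers cannot capture SELECT statements at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, Salary INTEGER);
    CREATE TABLE session_info (user_id TEXT);      -- stand-in for the DBMS login
    CREATE TABLE audit_trail (
        user_id TEXT, logged_at TEXT, object_name TEXT,
        action TEXT, old_value INTEGER, new_value INTEGER
    );

    -- Record who changed a salary, when, and the old and new values.
    CREATE TRIGGER audit_salary_update AFTER UPDATE OF Salary ON Employee
    BEGIN
        INSERT INTO audit_trail
        SELECT user_id, CURRENT_TIMESTAMP, 'EMPLOYEE', 'UPDATE',
               OLD.Salary, NEW.Salary
        FROM session_info;
    END;

    INSERT INTO Employee VALUES ('Fred', 200);
    INSERT INTO session_info VALUES ('mary');
""")

conn.execute("UPDATE Employee SET Salary = 150 WHERE Name = 'Fred'")

audit_rows = conn.execute(
    "SELECT user_id, object_name, action, old_value, new_value FROM audit_trail"
).fetchall()
print(audit_rows)
```

    Auditing SELECT activity, which the slide singles out for data warehouses, generally requires the database system's built-in audit facility rather than triggers.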

    Other Uses for DW Audit Trails

    Audit trails can be used to identify the most popular data in the warehouse.

    This information can be used to optimize queries

    An additional use for audit trails is performance tuning of the data warehouse.

    Administrators know where to focus their efforts

    Reduces administrative overhead

    Dealing with Known Security Holes

    Commercial database systems and operating systems are often filled with holes that allow users to obtain unauthorized access.

    To reduce the risk of these known holes, vendors often provide fixes to their products as soon as these holes become public.

    It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released.

    One of the key risks surrounding a data warehouse is that privileged users have the keys to the kingdom.

    The Risk of Privileged Users

    "Privileged users" include:

    Data warehouse administrators

    Operating system programmers

    Operators in the computer center

    These users can:

    Modify, delete and query any data in the warehouse

    Modify the audit trail to mask their actions

    Give other users unauthorized access

    The number of "privileged users" could be anywhere from 20 to 30 in some organizations.

    Reducing the Risk of Privileged Users

    One way to reduce the risk of privileged users is to separate security administration from database administration.

    This would separate the task of giving accesses and managing the audit trail from the task of making sure the data in the warehouse was correct and properly optimized.

    Security Services: Access Control, Audit

    Database Services: Database Tuning, Query Optimization, Backups

    Information Security Attacks

    Two types of information security attacks on data warehouses are:

    Intrusion

    An intrusion occurs when an unauthorized user gains access to the data warehouse.

    Misuse

    Misuse, often referred to as the insider problem, occurs when a user who has access to the warehouse uses that access for an unauthorized purpose

    Audit trails can be used to identify either type of attack, but identification of misuse is typically MUCH harder to do than intrusion.

    Intrusion Detection

    An intrusion is defined as an unauthorized access to a system. The assumption is the user is external to the environment (e.g., a hacker).

    To reduce the risk of intrusion, intrusion detection tools are used.

    These tools monitor access to the data warehouse and sound an alarm if unauthorized accesses are detected.

    [Figure: an intrusion detection system sits between the user and the data warehouse.]

    Misuse Detection

    Unwanted access by a user that has the ability to access data is referred to as misuse.

    This is also known as the insider problem.

    Some estimates have shown that 80% of computer crime is a result of misuse.

    For data warehouses the threat of misuse is high, especially by privileged users.

    Summary

    DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE.

    Application Level Security assists EIS that access data warehouses by hiding certain reports from users.

    Encryption can be used to further protect against the risk of someone walking off with the data warehouse.

    Audit Trails are useful for:

    Catching attackers

    Identifying usage trends of the data warehouse

    Moving Data to the Data Warehouse

    (slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    1) Moving Data into the Data Warehouse

    2) Updating the Data Warehouse

    3) Full Refresh

    4) Copy Only the Changes

    5) BCP

    6) Simple Transformations

    7) Complex Transformations

    8) Commercial ETL Tools

    Moving Data into the Data Warehouse

    Data must be moved to the data warehouse from source systems.

    Some key issues:

    Determine the frequency of data updates: how often should data be moved from source systems to the data warehouse?

    Various means of updating data in the warehouse exist:

    SQL Commands

    Database system load programs (e.g., SQL Server's BCP)

    Commercial tools

    Updating the Data Warehouse

    OLTP (On-Line Transaction Processing) systems have to send their updates to the data warehouse.

    [Figure: the Inventory, Finance, and Sales OLTP applications send updates to the Finance, Inventory, and Sales subject areas of the data warehouse.]

    Frequency of Updates to the Data Warehouse

    Updates may occur daily, weekly, monthly, or in real-time.

    [Figure: the Inventory, Finance, and Sales OLTP applications feed the Finance, Inventory, and Sales subject areas of the data warehouse through daily, weekly, and monthly updates.]

    Determining the Frequency of Updates

    Requirements should drive update frequency

    Range of updates runs from real-time to quarterly.

    Real-time update

    Expensive

    Requires update of warehouse while users are querying

    Daily update

    Somewhat cheaper than real time, but significant maintenance required if the warehouse has lots of tables.

    Monthly or weekly update

    Much more manageable

    Updating the Warehouse

    Full Refresh vs. Only the Changes

    [Figure: the Inventory, Finance, and Sales OLTP applications load the Finance, Inventory, and Sales subject areas of the data warehouse. The three load paths are labeled "Full refresh", "Changes since last update", and "Full refresh of some tables, changes for other tables".]

    Full Refresh

    Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.

    [Figure: the source table in the source OLTP system is copied to the target table in the target data warehouse.]

    Copy Only the Changes

    Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.

    [Figure: only the data modified since the last update to the warehouse is copied from the source table in the source OLTP system. The target table in the target data warehouse also holds data from two updates ago and historical data that is no longer in the source OLTP system.]

    Full Refresh vs. Only the Changes

    Full Refresh

    Pros

    Much easier to implement

    Less chance of messing up your database (good data integrity)

    Cons

    Can take a lot longer to actually do -- may run out of night

    Can lose out on warehouse ability to track historical data.

    Only the Changes (DELTA)

    Pros

    Tracks historical data

    Cons

    Can be very hard to implement

    Can require changes in source applications (more on this later)

    Full Refresh Using INSERT-SELECT

    One way to move data from one table to another is via the INSERT-SELECT.

    Syntax: INSERT INTO <target table> SELECT <columns> FROM <source table>

    Example:

    INSERT INTO DW_EMPLOYEE

    SELECT *

    FROM EMPLOYEE
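
    A full refresh with INSERT-SELECT can be sketched as follows. For simplicity the source and target tables live in one in-memory SQLite database; in practice they would sit in separate systems. The full_refresh helper, the column layout, and the rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE    (EmpId INTEGER, Name TEXT, Salary INTEGER);
    CREATE TABLE DW_EMPLOYEE (EmpId INTEGER, Name TEXT, Salary INTEGER);
    INSERT INTO EMPLOYEE VALUES (1, 'Fred', 200), (2, 'Ethel', 300);
    INSERT INTO DW_EMPLOYEE VALUES (1, 'Fred', 100);   -- copy from an earlier load
""")


def full_refresh(connection):
    """Empty the target and recopy the whole source in one transaction."""
    with connection:
        connection.execute("DELETE FROM DW_EMPLOYEE")
        connection.execute("INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE")


full_refresh(conn)
dw_rows = conn.execute("SELECT * FROM DW_EMPLOYEE ORDER BY EmpId").fetchall()
print(dw_rows)
```

    Fred's earlier salary of 100 is gone after the refresh, which is the loss of historical data that the full-refresh approach trades for simplicity.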

    Updating Changes Using INSERT-SELECT

    Changes may be moved by adding a WHERE clause to the INSERT-SELECT.

    Example (copies the rows whose DATE_UPDATED falls in the current month):

    INSERT INTO DW_EMPLOYEE

    SELECT *

    FROM EMPLOYEE

    WHERE DATEPART(m, DATE_UPDATED) = DATEPART(m, CURRENT_TIMESTAMP)

    AND DATEPART(yyyy, DATE_UPDATED) = DATEPART(yyyy, CURRENT_TIMESTAMP)
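
    A runnable sketch of copying only the changes. It deliberately uses a different predicate from the slide: instead of matching the current month with DATEPART (a SQL Server function), it compares an update-date column against the time of the previous load, which is passed in as a parameter. The DATE_UPDATED column (written with an underscore so it is a valid identifier), the load_changes helper, and the rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE    (EmpId INTEGER, Name TEXT, Salary INTEGER, DATE_UPDATED TEXT);
    CREATE TABLE DW_EMPLOYEE (EmpId INTEGER, Name TEXT, Salary INTEGER, DATE_UPDATED TEXT);
    INSERT INTO EMPLOYEE VALUES (1, 'Fred',  200, '2019-07-15 09:00:00'),
                                (2, 'Ethel', 300, '2019-08-02 14:30:00'),
                                (3, 'Mike',  450, '2019-08-05 08:15:00');
""")


def load_changes(connection, last_load):
    """Copy only the rows modified after the previous warehouse load."""
    with connection:
        connection.execute(
            "INSERT INTO DW_EMPLOYEE "
            "SELECT * FROM EMPLOYEE WHERE DATE_UPDATED > ?",
            (last_load,),
        )


# ISO-8601 text timestamps sort chronologically, so string comparison works.
load_changes(conn, "2019-08-01 00:00:00")
loaded = [name for (name,) in
          conn.execute("SELECT Name FROM DW_EMPLOYEE ORDER BY EmpId")]
print(loaded)
```

    This approach depends on the source application maintaining the update-date column, which is one way a delta load can require changes in source applications.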

    Updating Using BCP

    BCP is the bulk copy program that comes with MS SQL Server.

    Bulk copy (BCP) moves data between a flat file and a SQL table, in either direction.

    Syntax: bcp <table> [in | out] <data file>

    [Figure: the source table in the source OLTP system is unloaded to a temporary flat file, which is then loaded into the target table in the target data warehouse.]

    BCP Example

    To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:

    bcp pubs..publishers out publishers.txt -c -Sservername -Usa -Ppassword

    To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:

    bcp pubs..pub2 in publishers.txt -c -Sservername -Usa -Ppassword
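
    The unload and load pattern behind these two commands can be imitated without SQL Server. The sketch below is an analogue, not bcp itself: it uses Python's csv and sqlite3 modules to unload a table to a tab-delimited flat file and load that file into a second table. The two-column publishers layout and the row values are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

source = sqlite3.connect(":memory:")   # stands in for the source OLTP database
target = sqlite3.connect(":memory:")   # stands in for the data warehouse

source.execute("CREATE TABLE publishers (pub_id TEXT, pub_name TEXT)")
source.executemany(
    "INSERT INTO publishers VALUES (?, ?)",
    [("0001", "Hypothetical Press"), ("0002", "Example Books")],
)
target.execute("CREATE TABLE pub2 (pub_id TEXT, pub_name TEXT)")

path = os.path.join(tempfile.mkdtemp(), "publishers.txt")

# Unload ("out"): table -> tab-delimited character file.
with open(path, "w", newline="") as flat_file:
    rows = source.execute("SELECT pub_id, pub_name FROM publishers ORDER BY pub_id")
    csv.writer(flat_file, delimiter="\t").writerows(rows)

# Load ("in"): flat file -> table, committed as one transaction.
with open(path, newline="") as flat_file:
    with target:
        target.executemany(
            "INSERT INTO pub2 VALUES (?, ?)",
            csv.reader(flat_file, delimiter="\t"),
        )

loaded = target.execute("SELECT * FROM pub2 ORDER BY pub_id").fetchall()
print(loaded)
```

    The intermediate flat file decouples the two systems: the unload and the load can run at different times and on different machines.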

    Commercial ETL Tools

    Key tools in the marketplace:

    Informatica

    Ardent

    DecisionBase (Platinum)

    Microsoft Data Transformation Services

    All provide libraries of common transformations.

    All provide the ability to code complex transformations.

    Data Transformation Services

    Choose a Source

    Choose a Destination

    Choose to use a Query for Transfer

    Enter SQL Query

    Choose Destination Table Name

    Verify Transformation

    Decide When to Run Transformation

    Final Verification

    Run Transformation

    Check Results

    orderid orderdate productid productname quantity unitprice discount

    10248 1996-07-04 00:00:00.000 11 Queso Cabrales 12 14.0000 0.0

    10248 1996-07-04 00:00:00.000 42 Singaporean Hokkien Fried 10 9.8000 0.0

    10248 1996-07-04 00:00:00.000 72 Mozzarella di Giovanni 5 34.8000 0.0

    10249 1996-07-05 00:00:00.000 14 Tofu 9 18.6000 0.0

    10249 1996-07-05 00:00:00.000 51 Manjimup Dried Apples 40 42.4000 0.0

    select *

    from orderfact

    Summary

    ETL is one of the hard parts of building a data warehouse.

    Either full refreshes of data or just the changes may be done.

    Doing full refresh is easy, but historical data is lost and it may take a lot of time.

    Tracking changes is a tough business.

    ETL commercial tools are beginning to mature and can lessen the pain of this task.

    More Ways of Moving Data to the Data Warehouse

    (slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    1) Determining What Data Has Changed

    2) Recovery Logs

    3) Triggers

    4) Insert Triggers

    5) Delete Triggers

    6) Update Triggers

    7) Manual Detection

    More Ways of Moving Data to the Data Warehouse

    There is a need to move data into the data warehouse from OLTP and DSS applications

    The problem is detecting what data needs to be moved into the data warehouse

    Three methods:

    Recovery Logs

    Triggers

    Manual Techniques

    Determining What Data Has Changed

    Problem: How to get updates made to the source to the same information in the data warehouse?

    [Figure: Table A in the source OLTP system receives updates, and Table B in the data warehouse must reflect them.]

    How to get updates from Source Table A to Data Warehouse Table B

    Determining What Data Has Changed (cont.)

    Problem: How to get updates made to multiple sources to the same information in the data warehouse?

    Source OLTP, Table A: Employee

    NAME  DEPT.  SALARY
    ----  -----  ------
    Fred  Mktg   35000
    Hank  Sales  60000
    Sue   IT     71000
    Joe   Sales  50000   (Row X, the new row)

    Update: Insert into Employee Values ('Joe', 'Sales', 50000)

    Data Warehouse, Table A: EmployeeCount

    DEPT   COUNT
    -----  -----
    Mktg   1
    Sales  1 (becomes 2 after the update)
    IT     1
    HR     0

    Data Warehouse, Table B: SalaryInfo

    DEPT   AVG SAL                TOT SAL
    -----  ---------------------  ----------------------
    Mktg   35000                  35000
    IT     71000                  71000
    HR     0                      0
    Sales  60000 (becomes 55000)  60000 (becomes 110000)
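
    The figure's point, that one inserted source row changes several derived warehouse rows, can be checked with a short sketch. The Dept lookup table is an assumption, added only so that HR appears with zero employees, and the count and salary summaries are computed in a single query here rather than stored as two separate warehouse tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, Dept TEXT, Salary INTEGER);
    INSERT INTO Employee VALUES ('Fred', 'Mktg', 35000),
                                ('Hank', 'Sales', 60000),
                                ('Sue', 'IT', 71000);
    CREATE TABLE Dept (Dept TEXT);                      -- assumed lookup table
    INSERT INTO Dept VALUES ('Mktg'), ('Sales'), ('IT'), ('HR');
""")

# Per-department count, average salary, and total salary; departments with
# no employees report zeros instead of NULL.
SUMMARY = """
    SELECT d.Dept, COUNT(e.Name),
           COALESCE(AVG(e.Salary), 0), COALESCE(SUM(e.Salary), 0)
    FROM Dept d LEFT OUTER JOIN Employee e ON e.Dept = d.Dept
    GROUP BY d.Dept
"""


def summarize(connection):
    """Return {dept: (count, average salary, total salary)}."""
    return {dept: (count, avg, total)
            for dept, count, avg, total in connection.execute(SUMMARY)}


before = summarize(conn)
conn.execute("INSERT INTO Employee VALUES ('Joe', 'Sales', 50000)")  # Row X
after = summarize(conn)

print(before["Sales"], after["Sales"])
```

    Recomputing the summaries from scratch, as done here, is the full-refresh answer to the problem. Applying only the change requires knowing that Joe's row affects exactly the Sales count, average, and total.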

    What is the Recovery Log?

    Recovery log is used for transaction processing

    Used to handle errors

    Does contain before and after image.

    Recovery log can be used to identify the data to be updated in the data warehouse.

    Change Data Capture Utility

    This scans the database log and identifies all changes that the user is interested in and either writes them to a file or stores them in another table.

    154

    C h a n g e D a t a C a p t u r e U t il it y i n A ct i onC h a n g e D a t a C a p t u r e U t il it y i n Ac t i on

    [Diagram: all changes to the source OLTP DBMS are written to its data and to its recovery log. The change data capture utility reads the recovery log and writes the changes to the data warehouse.]

    155

    Example of Using Recovery Log

    UPDATE EMPLOYEE
    SET Salary = Salary * 2.0
    WHERE SSN = 10

    LOG: TABLE=EMPLOYEE, SSN=10, OldSalary=100, NewSalary=200

    [Diagram: change data capture reads the log record, reconstructs the update, and sends it to the data warehouse.]

    Consider an update to the Employee table

    The information is recorded in the log

    Change data capture reconstructs the update

    The update can then be sent to the data warehouse
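The reconstruction step can be sketched in a few lines. This is only an illustration under stated assumptions: real recovery logs are proprietary binary formats, so the `LogRecord` class and the `reconstruct_update` function below are hypothetical names, and the record layout simply mirrors the fields shown on the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LogRecord:
    """Hypothetical, simplified log entry (real log formats are proprietary)."""
    table: str
    key_column: str
    key_value: int
    column: str
    old_value: float   # before image
    new_value: float   # after image


def reconstruct_update(rec: LogRecord) -> Tuple[str, tuple]:
    """Build a parameterized UPDATE that carries the after image to the warehouse."""
    sql = f"UPDATE {rec.table} SET {rec.column} = ? WHERE {rec.key_column} = ?"
    return sql, (rec.new_value, rec.key_value)


# The log record from the slide: Salary for SSN 10 went from 100 to 200
record = LogRecord("EMPLOYEE", "SSN", 10, "Salary", 100, 200)
statement, params = reconstruct_update(record)
print(statement, params)
```

The sketch sends the after image (200) rather than replaying `Salary * 2.0`, so applying the same log record twice leaves the warehouse row unchanged.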

    156

    Using the Recovery Log

    Recovery logs are usually in a proprietary format.

    Use commercial tools to read the log and identify the changes.

    Commercial tools such as CA's log analyzer can place the results of their work in a table.

    157

    Summary of Change Data Capture

    Pro

    Log exists anyway, might as well use it to find what has changed

    Con

    Some difficult scenarios may occur where it is hard to see what the new update should be in the data warehouse.

    Proprietary format; reading it may not be supported for many DBMSs and will always lag behind DBMS development.

    Many tables in the source have nothing to do with the data warehouse, but change data capture will process their changes as well.

    158

    Triggers

    Triggers allow DBAs to specify that when an event such as an INSERT, UPDATE, or DELETE occurs on a table, another event is triggered.

    Triggers are used to identify changes that are needed by the warehouse.

    A trigger can be added to a source table so that whenever the source table is updated, an update is placed either directly in the warehouse or in a staging table that tracks all updates.

    Triggers can be used to detect the changes and perform data warehouse updates.

    A different trigger might be run on key updates so that the data warehouse nightly process knows what data has changed.

    159

    Example of a Trigger

    [Diagram: source TABLE A, a STAGING area, and TABLE A in the DATA WAREHOUSE, connected by the four steps below.]

    STEP 1: INSERT into TABLE A VALUES (X, Y)

    STEP 2: Values (X, Y) are inserted; when values are inserted, this sets off the TRIGGER

    STEP 3: The TRIGGER inserts values (X, Y) into a STAGING area

    STEP 4: The Nightly Process inserts values (X, Y) into the Data Warehouse
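The four steps can be tried end to end with an in-memory SQLite database. This is a minimal sketch, not the deck's own implementation: the table names `table_a`, `staging`, and `dw_table_a`, the trigger name, and the `nightly_process` function are assumptions made for illustration, and SQLite trigger syntax differs from the SQL Server syntax used later in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a    (x TEXT, y TEXT);   -- source OLTP table
    CREATE TABLE staging    (x TEXT, y TEXT);   -- staging area
    CREATE TABLE dw_table_a (x TEXT, y TEXT);   -- data warehouse copy

    -- Step 3: the trigger copies every inserted row into staging
    CREATE TRIGGER table_a_insert AFTER INSERT ON table_a
    BEGIN
        INSERT INTO staging (x, y) VALUES (NEW.x, NEW.y);
    END;
""")


def nightly_process(conn):
    """Step 4: move staged rows into the warehouse, then clear staging."""
    with conn:  # commit both statements together
        conn.execute("INSERT INTO dw_table_a SELECT x, y FROM staging")
        conn.execute("DELETE FROM staging")


# Steps 1 and 2: the insert fires the trigger
conn.execute("INSERT INTO table_a VALUES ('X', 'Y')")
staged_before = conn.execute("SELECT * FROM staging").fetchall()

nightly_process(conn)
warehouse_rows = conn.execute("SELECT * FROM dw_table_a").fetchall()
staged_after = conn.execute("SELECT * FROM staging").fetchall()
print(staged_before, warehouse_rows, staged_after)
```

Staging keeps the trigger cheap on the OLTP side: the source transaction only pays for one extra insert, and the heavier warehouse load waits for the nightly run.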

    160

    Real-Life Trigger Example

    OLTP/DSS Data - Employee table:

    Employee (ssn, name, salary)

    DW Data - Summary table:

    EmployeeStatistics (total number of employees, total salary paid, average salary).

    When a row is inserted into the Employee table, we need to update the EmployeeStatistics table. The trigger is shown on the next slide.

    161

    Insert Trigger Example

    CREATE TRIGGER EmployeeInsertTrigger
    ON Employee
    FOR INSERT AS
    BEGIN
        -- INSERTED holds the new rows; one statement may insert several
        UPDATE EmployeeStatistics
        SET NoEmployee = NoEmployee + (SELECT COUNT(*) FROM INSERTED)

        UPDATE EmployeeStatistics
        SET TotSalary = TotSalary + (SELECT SUM(Salary) FROM INSERTED)

        -- Recompute the average from the new totals
        UPDATE EmployeeStatistics
        SET AvgSalary = TotSalary / NoEmployee
    END

    162

    Insert Trigger in Action

    COMMANDS and RESULTS

    INSERT INTO EMPLOYEE VALUES (1, 'John', 300)
    (1 ROW(S) AFFECTED)

    INSERT INTO EMPLOYEE VALUES (2, 'Mike', 400)
    (1 ROW(S) AFFECTED)

    SELECT * FROM EMPLOYEE

    Employee
    EmpId  Name  Salary
    -----  ----  ------
    1      John  300.00
    2      Mike  400.00

    SELECT * FROM EMPLOYEESTATISTICS

    EmployeeStatistics
    NoEmployee  TotSalary  AvgSalary
    ----------  ---------  ---------
    2           700.00     350.00

    163

    Delete Trigger Example

    CREATE TRIGGER EmployeeDeleteTrigger
    ON Employee
    FOR DELETE AS
    BEGIN
        DECLARE @numberEmployee int

        -- DELETED holds the rows removed by the statement
        UPDATE EmployeeStatistics
        SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)

        UPDATE EmployeeStatistics
        SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)

        SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics

        -- Avoid dividing by zero when the last employee is deleted
        IF @numberEmployee > 0
        BEGIN
            UPDATE EmployeeStatistics
            SET AvgSalary = TotSalary / NoEmployee
        END
        ELSE
            UPDATE EmployeeStatistics SET AvgSalary = 0.0
    END

    164

    Update Trigger Example

    CREATE TRIGGER EmployeeUpdateTrigger
    ON Employee
    FOR UPDATE AS
    BEGIN
        -- Only salary changes affect the statistics
        IF UPDATE (Salary)
        BEGIN
            -- Swap the old salaries (DELETED) for the new ones (INSERTED)
            UPDATE EmployeeStatistics
            SET TotSalary = TotSalary -
                (SELECT SUM(Salary) FROM DELETED) +
                (SELECT SUM(Salary) FROM INSERTED)

            UPDATE EmployeeStatistics
            SET AvgSalary = TotSalary / NoEmployee
        END
    END

    165

    Summary of Using Triggers

    Pro

    Only needed for tables whose data is going to go to the DW

    Con

    Additional work needed to create detailed triggers

    Non-trivial to generate a trigger to implement appropriate action

    May not be acceptable for commercial software on source system

    166

    Other Ways to Determine What Has Changed

    There are other, manual ways of detecting the changes and doing DW updates

    Look at each row of the OLTP data and the data in the warehouse

    Compare the two; if the data is not in the warehouse, add it!

    OLTP   DATA WAREHOUSE
    Hank   Hank
    John   John
    Mike   Mike
    Sam    (missing)

    COMPARE, then ADD THE DIFFERENCES
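A minimal sketch of the comparison above, using the four names from the diagram. The function name `find_missing` is an assumption for illustration, and the sketch only detects rows that are missing from the warehouse; updated or deleted rows would need a fuller row-by-row comparison.

```python
def find_missing(oltp_rows, warehouse_rows):
    """Return OLTP rows that are absent from the warehouse, in OLTP order."""
    present = set(warehouse_rows)
    return [row for row in oltp_rows if row not in present]


oltp = ["Hank", "John", "Mike", "Sam"]
warehouse = ["Hank", "John", "Mike"]

missing = find_missing(oltp, warehouse)   # compare
warehouse.extend(missing)                 # add the differences
print(missing, warehouse)
```

Every run reads both data sets in full, which is where the cost of the manual approach comes from.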

    167

    Manually Identifying What Has Changed

    Pro

    Flexible

    Con

    Very expensive

    Could take a long time

    168

    Summary

    Recovery Logs

    Triggers

    Manual Detection

    169

    Data Warehouse Design

    (slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    170

    Data Warehouse Design

    1) Overview

    2) Describing a Design - ER Diagrams

    3) Design Normalization

    4) Star Schema Design

    171

    Overview

    How to describe a design

    Entity Relationship (ER) Diagram

    Types of Designs

    Normalized

    Star Schema

    Snowflake

    172

    Describing a Design

    Different techniques exist; the most prevalent is the ER (Entity-Relationship) Diagram

    Entities

    Things that occur in the real world, usually nouns, e.g. employee, part, product, etc.

    Relationships

    How entities interact, usually verbs. Example: one employee may attend many colleges.

    Types of relationships

    1-1

    1-Many

    Many-1

    Many-Many

    173

    Examples of Relationships

    1-MANY

    MANY-1

    1-1

    MANY-MANY

    174

    Normalized Design

    Methodology

    All 1-1 relationships are placed in a single table.

    Many-many relationships require two tables that store the single-valued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.

    Checking the design

    In a Normalized Design, there are many different normal forms. Each normal form (NF) builds on the previous one so that a table in 2NF is, by definition, in 1NF.

    1NF

    2NF

    3NF

    175

    Dealing With Many-Many Relationships

    For Many-Many

    Two 1-1 Tables (SUPPLIER, PARTS)

    One linking table (SP)

    Ex: SUPPLIER and PARTS are the 1-1 tables; SP is the linking table that says who sells what parts.

    SUPPLIER
    S#  SNAME
    1   SEARS
    2   OFFICE DEPOT

    PARTS
    P#  PNAME
    1   HAMMERS
    2   NAILS

    SP
    S#  P#
    1   1
    1   2
    2   1
    2   2
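To see how the linking table answers "who sells what parts", the three tables above can be loaded into an in-memory SQLite database and joined. A minimal sketch: the column names `s_no` and `p_no` are assumptions standing in for S# and P#, which are not valid unquoted SQL identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (s_no INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE parts    (p_no INTEGER PRIMARY KEY, pname TEXT);

    -- Linking table: one row per (supplier, part) pair
    CREATE TABLE sp (
        s_no INTEGER REFERENCES supplier(s_no),
        p_no INTEGER REFERENCES parts(p_no),
        PRIMARY KEY (s_no, p_no)
    );

    INSERT INTO supplier VALUES (1, 'SEARS'), (2, 'OFFICE DEPOT');
    INSERT INTO parts    VALUES (1, 'HAMMERS'), (2, 'NAILS');
    INSERT INTO sp       VALUES (1, 1), (1, 2), (2, 1), (2, 2);
""")

# Resolve both keys in the linking table back to names
who_sells_what = conn.execute("""
    SELECT s.sname, p.pname
    FROM sp
    JOIN supplier AS s ON s.s_no = sp.s_no
    JOIN parts    AS p ON p.p_no = sp.p_no
    ORDER BY s.s_no, p.p_no
""").fetchall()

for sname, pname in who_sells_what:
    print(sname, "sells", pname)
```

Even this small question needs two joins, which is the cost that the later slide on normalized designs lists as a con.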

    176

    Normalized Design: Example

    A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.

    Entities: Customer, Product, Store, Vendor

    Relationships: Customer buys Product

    Product is located in Store

    Product is supplied by a Vendor

    [ER diagram: CUSTOMER BUYS PRODUCT; PRODUCT IS-LOCATED-IN STORE; VENDOR is linked to PRODUCT]

    177

    Checking a Normalized Design

    Normalization

    Used to reduce insertion, deletion, and update anomalies caused by bad designs.

    Enables users to quickly check a design and make sure there are no glaring holes in it.

    1NF

    All cells are atomic, i.e. each entry in a column contains only one value

    2NF

    All non-key values are functionally dependent upon the entire primary key, i.e. the whole key, not just part of it, determines every other column.

    3NF

    No transitive dependencies, i.e. every non-key column depends directly on the primary key and not on another non-key column.
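The normal-form checks above come down to asking which columns determine which. The sketch below tests a functional dependency against sample rows; the `dependency_holds` function and the order-line table are hypothetical and not from the slides. Sample data can only refute a dependency, never prove it holds in general.

```python
def dependency_holds(rows, determinant, dependent):
    """True if equal determinant values always come with equal dependent values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # Remember the first value seen for this key; any other value breaks it
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical order-line table whose primary key is (order_id, part_id)
order_lines = [
    {"order_id": 1, "part_id": 1, "qty": 5, "part_name": "HAMMERS"},
    {"order_id": 1, "part_id": 2, "qty": 9, "part_name": "NAILS"},
    {"order_id": 2, "part_id": 1, "qty": 3, "part_name": "HAMMERS"},
]

# part_name is determined by part_id alone, only part of the key: a 2NF violation
partial = dependency_holds(order_lines, ["part_id"], ["part_name"])

# qty is not determined by part_id alone; it needs the whole key
qty_by_part = dependency_holds(order_lines, ["part_id"], ["qty"])

print(partial, qty_by_part)
```

In this example, moving `part_name` into a separate parts table keyed by `part_id` would remove the partial dependency.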

    178

    Overview of Normalized Design

    Pro

    Relatively easy to change

    Con

    Queries can involve numerous joins

    The massive number of tables and links between tables makes it

    hard for customers to build their own queries

    179

    Star Schema

    Methodology

    Single fact table in the middle describing a key event (e.g. sale), surrounded by dimension tables (e.g. location, time, employee)

    [Diagram: FACT table in the center surrounded by five dimension tables, D1 to D5 (D = DIMENSIONS)]

    180

    Star Schema: Methodology

    Identify a key fact that occurs.

    Usually some event creates a real fact. Selling a product in a store

    on Wednesday, patient visiting a hospital, etc.

    Identify all the dimensions of the data being used.

    Think of a dimension as a way to slice the data.

    Ex: by time, by product, by customer, etc.

    Drill down operations are very well supported

    181

    Star Schema: Example

    A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.

    Fact

    CustomerPurchase

    Dimensions are

    Customer

    Product

    Time

    Vendor

    182

    Star Schema: Example (cont.)

    [Diagram: the Sale fact table in the center, surrounded by the Customer, Store, Time, and Product dimensions.]

    SALE
    SALE ID  CUST. ID  STORE ID  PROD. ID  PRICE  TIME
    1        3         7         4         $3.00  4/24/99

    CUSTOMER
    CUST. ID  NAME  PHONE  Buys Apples  Has Big Car
    3         FRED  1234   Y            Y

    TIME
    DAY  MONTH  QTR  YEAR
    24   4      2Q   99
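A typical star-schema query joins the fact table to whichever dimensions the question needs and then aggregates. The sketch below loads the single example row from the slide into SQLite; the table and column names (`sale`, `customer`, `time_dim`, `time_id`) are assumptions, and the Store and Product dimensions are left out because the slide shows no rows for them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE time_dim (time_id TEXT PRIMARY KEY, day INTEGER,
                           month INTEGER, qtr TEXT, year INTEGER);

    -- Fact table: one row per sale, with a key into each dimension
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, cust_id INTEGER,
                       store_id INTEGER, prod_id INTEGER,
                       price REAL, time_id TEXT);

    INSERT INTO customer VALUES (3, 'FRED', '1234');
    INSERT INTO time_dim VALUES ('4/24/99', 24, 4, '2Q', 99);
    INSERT INTO sale     VALUES (1, 3, 7, 4, 3.00, '4/24/99');
""")

# Slice the fact by customer and by quarter
revenue_by_quarter = conn.execute("""
    SELECT c.name, t.year, t.qtr, SUM(s.price) AS revenue
    FROM sale AS s
    JOIN customer AS c ON c.cust_id = s.cust_id
    JOIN time_dim AS t ON t.time_id = s.time_id
    GROUP BY c.name, t.year, t.qtr
""").fetchall()

print(revenue_by_quarter)
```

Grouping by `t.month` or `t.day` instead of `t.qtr` drills down along the time dimension without changing the joins.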

    183

    Star Schema: Overview

    Pro

    Easy for users to navigate and understand

    Con

    Performance

    Can end up with one monster fact table, millions of rows

    Flexibility

    Not as easy for customers to change the design

    184

    Snowflake Schema

    Several stars can be connected to form a snowflake

    [Diagram: three connected stars. SALES: Sale, Price, Revenue, Product, Marketing, Vendor. MARKETING: Direct Mail, Price, Ad, Location, Distribution, Sales. PRODUCT: Make, Chips, Parts, Manufacturing, Price, Labor, Cost.]

    185

    Summary

    Two basic types of design

    Star Schema

    Normalized

    Many Data Warehouse vendors sell products built specifically for the star schema

    Some data warehouse practitioners insist that normalization is the way to build the data warehouse.

    186

    Building a Data Warehouse

    (slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

    187

    Building a Data Warehouse

    1) Top Down Approaches

    2) Enterprise Data Model Approach

    3) "Let Data Users Decide"

    4) "Let Data Warehouse Builders Decide"

    5) "Let Senior Management Decide"

    6) Bottom Up Approach

    188

    Building the Data Warehouse

    How to decide what data goes into the data warehouse?

    Methods:

    Top Down

    Using Enterprise Data Models

    "Let data users decide" approach

    "Let data warehouse builders decide" approach

    "Let senior management decide" approach

    Bottom Up

    Combine data marts into a data warehouse

    189

    Using Enterprise Data Models

    Use the Enterprise Data Model to decide what data goes into the data warehouse.

    Model key processes. This approach says let the business decide.

    Identify key data used by these processes in an enterprise data model, which might be a giant Entity-Relationship diagram.

    Put data in the warehouse based on the enterprise data model.

    190

    An Enterprise Data Model Example

    [Diagram: the business processes MAKE CHIPS, PUT IN BAGS, SELL CHIPS, COUNT $$, and BUY MORE POTATOES, together with the data they use: CHIP RECIPES, INGREDIENTS, CHIP SUPPLIERS]

    191

    "Enterprise Data Model" Approach

    Pro

    All inclusive -- no chance of leaving key data out.

    Con

    Very difficult to build an EDM.

    If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.

    Ways of Avoiding the Con

    In some cases you can buy an EDM. If the business is common enough, the packaged EDM might be very close, and then you just have to modify it to fit your business.

    192

    "Let Data Users Decide"

    Let the users of the data warehouse choose what data will go into the warehouse.

    The data users who decide the data warehouse data and design will pay for it as well.

    You can also charge users who query the data.

    [Diagram: USERS, SOURCE, DATA WAREHOUSE]

    193

    "Let Data Users Decide": An Example

    [Diagram: each department decides what DATA it puts in the DATA WAREHOUSE]

    MARKETING     HUMAN RESOURCES  FINANCE
    demographics  education        budget
    advertising   ethnic group     spending
    trends        age              revenue
    ?             ?                ?

    194

    "Let Data Users Decide" Approach

    Pro

    Reduces budget problems

    Users know best!

    Con

    Requires marketing

    Could end up with data in the warehouse that is meaningless to the

    people who run the place.

    Users may not place important data in the warehouse because their

    budget is small.

    Users who need the data may not use the DW because of budget

    concerns.

    Ways of Mitigating the Con

    Do not just take money -- try to determine if data is really

    corporate.

    195

    Pay As You Go Warehouse Analogy

    I-495

    196

    "Let Data Warehouse Builders Decide"

    The technical staff who are building the warehouse decide what data gets put in the warehouse.

    [Cartoon: a builder next to the DATA WAREHOUSE says, "Let's put information on how to build viruses in the data warehouse."]

    197

    "Let Data Warehouse Builders Decide" Approach

    Pro

    Very easy to design

    Does not take much time

    Do not have to deal with users

    Con

    Could easily result in data DUMP not data warehouse

    Ways to mitigate the con

    Talk to lots of users to help you guess what should go in the DW

    198

    "Let Senior Management Decide"

    Senior management decides what data goes into the warehouse.

    Asking senior management is the safest way to build a data warehouse.

    Identify the key questions on senior management's mind and get the data to answer these questions.

    199

    "Let Senior Management Decide" Approach

    Pro

    Ensures executive support for the project

    Con

    Senior management does not have much time for this, so you will only get a few questions at a time

    This dramatically increases visibility: if you do not move quickly, senior management will become very angry with the DW.

    Ways to mitigate the con

    Do your homework before talking to senior management: talk to the aides of senior management to find out what is on their mind.

    Allocate resources so you can move very quickly once you hear from senior management.

    200

    Bottom-Up Approach

    Move data from existing OLTP applications to data marts.

    Combine data marts into a data warehouse.

    [Diagram: three OLTP applications each feed a data mart, and the three data marts are combined into the data warehouse. The data marts are labeled 25 YARDS, 50 METERS, and 200 CM.]

    201

    Bottom-Up Approach

    Pro

    Data marts are much easier to build than a full-fledged DW.

    Con

    Could end up with a bunch of stovepipe data marts.

    Ways to mitigate the con

    Develop standards for data when building the data marts so that you can glue data from different data marts together.
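One reading of the yards, meters, and centimeters labels in the bottom-up diagram is that each data mart records lengths in its own unit. Under that assumption, a data standard can be as small as one agreed unit plus a conversion table. The names `TO_METERS` and `to_standard`, and the choice of meters as the standard, are illustrative only.

```python
# Conversion factors to the agreed standard unit (meters)
TO_METERS = {"yards": 0.9144, "meters": 1.0, "cm": 0.01}


def to_standard(value, unit):
    """Convert a length reported by a data mart into meters."""
    return value * TO_METERS[unit]


# One value per data mart, as labeled in the diagram
mart_values = [(25, "yards"), (50, "meters"), (200, "cm")]

in_meters = [round(to_standard(v, u), 2) for v, u in mart_values]
print(in_meters)
```

Once every mart reports in the same unit, values from different marts can be compared or summed in the warehouse.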

    202

    Recommendations for an Approach

    "Let senior management