d Wh Concepts


Transcript of d Wh Concepts

  • 7/30/2019 d Wh Concepts

    1/79

    Data Warehousing Concepts


    Course Overview

    What is Data Warehouse

    OLTP Vs. Data Warehousing

    Data Warehousing Architecture

    Data Warehousing Schemas & Objects

    Physical Design in Data Warehouse

    Definition of Data Warehousing

    Data Warehousing Basic Design Approaches

    Data Warehousing Operational Processes

    Technical Problems in Data Warehousing

    Representative DSS Tools

    Business Intelligence


    What is a Data Warehouse?

    A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data.

    A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

    It is a series of processes, procedures and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers and the market it serves.

    Facts!

    It is NOT possible to purchase a data warehouse, but it is possible to build one.

    A data warehouse is NOT a specific technology.

    Why Data Warehousing?

    Need for intelligent information in a competitive market:

    Who are the potential customers?

    Which products are sold the most?

    What are the region-wise preferences?

    What are the competitor products?

    What are the projected sales?

    What if you sell a larger quantity of a particular product?

    What will be the impact on revenue?

    What are the results of the promotion schemes introduced?

    Defining a Data Warehouse

    William Inmon

    Subject Oriented

    The data in a data warehouse is organized around the major subjects of the enterprise (i.e., the high-level entities).

    The orientation around the major subject areas causes the data warehouse design to be data driven.

    Operational systems are designed around applications and functions, e.g. loans, savings and credit cards in the case of a bank, whereas a data warehouse is designed around subjects such as Customer, Product, Vendor, etc.

    [Figure: Operational systems are organized by processes or tasks; the data warehouse is organized by subject (Customer, Supplier, Product).]

    Time Variant

    Data is stored as a series of snapshots or views which record how it is collected across time.

    It helps in business trend analysis.

    In contrast to the OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.

    [Figure: Data Warehouse Data. Labels: Time, Data, Key.]

    Integrated

    Data is stored once in a single integrated location.

    It is closely related to subject orientation.

    Data from disparate sources needs to be put in a consistent format.

    This means resolving problems such as naming conflicts and inconsistencies.

    [Figure: Customer data stored in several databases (Auto Policy Processing System; Fire Policy Processing System; FACTS, LIFE; Commercial, Accounting Applications) is brought together in the data warehouse database under Subject = Customer.]

    Non-Volatile

    Existing data in the warehouse is not overwritten or updated.

    This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.

    [Figure: Production applications update, insert and delete data in the production databases. Data from the production databases and external sources is loaded into the read-only data warehouse database within the data warehouse environment.]

    So, what's different between OLTP and a Data Warehouse?

    OLTP vs. Data Warehouse

    OLTP systems are tuned for known transactions and workloads, while the workload is not known in advance in a data warehouse.

    Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December.
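    The example query above filters along three dimensions (location, time of day, month) before aggregating a measure. The sketch below illustrates that idea in plain Python. The call records, the field layout and the function name average_spend are all hypothetical and only stand in for a real warehouse table.

```python
from datetime import datetime
from statistics import mean

# Hypothetical call records: (city, call start time, amount spent).
calls = [
    ("Pune", datetime(2018, 12, 3, 10, 15), 12.0),
    ("Pune", datetime(2018, 12, 3, 18, 40), 30.0),   # outside 9 AM - 5 PM
    ("Pune", datetime(2018, 11, 20, 11, 0), 50.0),   # not December
    ("Mumbai", datetime(2018, 12, 5, 9, 30), 8.0),   # not Pune
    ("Pune", datetime(2018, 12, 9, 16, 59), 18.0),
]


def average_spend(records, city, month, start_hour, end_hour):
    """Average amount for calls in `city` during `month` that start
    at or after `start_hour` and before `end_hour`."""
    amounts = [
        amount
        for rec_city, started, amount in records
        if rec_city == city
        and started.month == month
        and start_hour <= started.hour < end_hour
    ]
    return mean(amounts) if amounts else None


pune_december_daytime = average_spend(calls, "Pune", 12, 9, 17)
print(pune_december_daytime)  # 15.0
```

    Each filter condition corresponds to one dimension of the query; a real warehouse would push these conditions into indexed or partitioned storage rather than scanning a list.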

    OLTP vs. Data Warehouse

    OLTP                                              Data Warehouse (DSS)
    Application oriented                              Subject oriented
    Used to run the business                          Used to analyze the business
    Detailed data                                     Summarized and refined
    Current, up to date                               Snapshot data
    Isolated data                                     Integrated data
    Repetitive access                                 Ad hoc access
    Clerical user                                     Knowledge user (manager)
    Performance sensitive                             Performance relaxed
    Few records accessed at a time (tens)             Large volumes accessed at a time (millions)
    Read/update access                                Mostly read (batch update)
    No data redundancy                                Redundancy present
    Database size 100 MB - 100 GB                     Database size 100 GB - a few terabytes
    Transaction throughput is the performance metric  Query throughput is the performance metric
    Thousands of users                                Hundreds of users
    Managed in entirety                               Managed by subsets

    To summarize ...

    OLTP systems are used to run a business.

    The data warehouse helps to optimize the business.

    Data Warehouse Architectures

    Centralized

    In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. As already shown in the previous section, the disadvantage is the loss of performance compared with distributed approaches.

    [Figure: Central Architecture]

    Data Warehouse Architectures (contd.)

    Federated

    In a federated architecture the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the information relevant to a department. The amount of data is reduced in contrast to a central data warehouse. The level of detail is enhanced.

    [Figure: Federated Architecture]

    Data Warehouse Architectures (contd.)

    Tiered

    A tiered architecture is a distributed data approach. The process cannot be done in one step because many sources have to be integrated into the warehouse. On the first level, the data of all branches in one region is collected; on the second level, the data from the regions is integrated into one data warehouse.

    Advantages:

    Faster response time, because the data is located closer to the client applications.

    Reduced volume of data to be searched.

    [Figure: Tiered Architecture]

    Complete Warehouse Solution Architecture

    [Figure: Three stages labelled Data Sources, Data Management and Access (Data, Information, Knowledge; Asset Assembly (and Management), Asset Exploitation), with Metadata across the stages. Operational data, legacy data and external data sources (The Post, VISA) pass through Extract, Transform and Load into the organizationally structured Enterprise Data Warehouse, which feeds departmentally structured data marts (Sales, Inventory, Purchase).]

    Data Warehouse Architecture Components

    Data Sources (disparate data sources):

    Legacy data

    Operational data

    External data resources

    Data Management:

    Metadata - At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.

    Data Mart - A data mart is a subject-oriented data warehouse.

    Introduction to Data Marts

    What is a Data Mart?

    From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. The departments can also use the analytical software they find convenient. The cost of processing becomes very low.

    Data Mart Overview

    Data marts satisfy 80% of the local end users' requests.

    [Figure: The data warehouse feeds the data marts DM Sales, DM HR, DM Marketing and DM Finance, which serve sales representatives and analysts, human resources, and financial analysts, strategic planners and executives.]

    From the Data Warehouse to Data Marts

    [Figure: Data warehouse (organizationally structured), followed by departmentally structured and individually structured data marts. Axis labels: Data to Information; More to Less (History, Normalized, Detailed).]

    Operational Data Store (ODS)

    What is an ODS?

    An Operational Data Store (ODS) integrates data from multiple business operation sources to address operational problems that span one or more business functions.

    An ODS has the following features:

    Subject-oriented - Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).

    Integrated - Presents an integrated image of subject-oriented data which is pulled from fragmented operational source systems.

    Current - Contains a snapshot of the current content of legacy source systems. History is not kept, and might be moved to the data warehouse for analysis.

    Volatile - Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.

    Detailed - ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.

    Operational Data Store (ODS) (contd.)

    The ODS provides an integrated view of data in operational systems.

    As the figure below indicates, there is a clear separation between the ODS and the data warehouse.

    [Figure: Source systems A, B and C feed the Operational Data Store (current or near-current data; detailed data; updates allowed), shown separately from the Data Warehouse (historical data; summary and detail; non-volatile snapshots only). Consumers shown: EIS, DSS, Apps, PC.]

    Benefits of ODS

    Supports operational reporting needs of the organization

    Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This data can include data from an organization's internal systems, as well as external data from third-party vendors.

    Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.

    Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.

    Provides more current data than a data warehouse and more integrated data than an OLTP system

    Feeds other operational systems in addition to the data warehouse

    Data Warehousing Schemas & Objects

    A schema is a collection of database objects, including tables, views, indexes, and synonyms.

    There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:

    Star Schema

    Snowflake Schema

    Galaxy Schema

    Star Schema: It consists of a fact table connected to a set of dimension tables. Data in the dimension tables is de-normalized.

    Snowflake Schema: It is a refinement of the star schema in which some dimensional hierarchies are normalized into a set of dimension tables.

    Galaxy Schema: Multiple fact tables share dimension tables. It can be viewed as a collection of stars, hence the name galaxy schema.

    Star Schema

    A star schema is a highly de-normalized, query-centric model where information is broken into two groups: facts and dimensions.

    Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, Required Data (Business Metrics) or (Measures), ...

    Time_Dim: TimeKey, TheDate, ...

    Employee_Dim: EmployeeKey, EmployeeID, ...

    Branch_Dim: BranchID, Branchno, ...

    Customer_Dim: CustomerKey, CustomerID, ...

    Shipper_Dim: ShipperKey, ShipperID, ...
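    A minimal sketch of how a star schema is queried, using Python's built-in sqlite3 module. The table and key names follow the slide, but only two of the dimensions are created, and the SalesAmount measure and all row values are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time_Dim (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
CREATE TABLE Sales_Fact (
    TimeKey INTEGER REFERENCES Time_Dim(TimeKey),
    CustomerKey INTEGER REFERENCES Customer_Dim(CustomerKey),
    SalesAmount REAL  -- hypothetical measure, not named on the slide
);
INSERT INTO Time_Dim VALUES (1, '2019-07-01'), (2, '2019-07-02');
INSERT INTO Customer_Dim VALUES (10, 'C-10'), (11, 'C-11');
INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 40.0), (2, 10, 60.0);
""")

# The fact table is joined to each dimension through its foreign key,
# and the measure is aggregated per dimension attribute.
rows = conn.execute("""
    SELECT c.CustomerID, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Customer_Dim AS c ON c.CustomerKey = f.CustomerKey
    JOIN Time_Dim AS t ON t.TimeKey = f.TimeKey
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
""").fetchall()
print(rows)  # [('C-10', 160.0), ('C-11', 40.0)]
```

    Every join goes directly from the fact table to a dimension table, which is what gives the schema its star shape.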

    Snowflake Schema (Figure 32.2)

    Fact table:

    Sales_fact: timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}, staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue

    Dimension tables:

    Branch_Dim: branchID {PK}, branchNo, branchType, city {FK}

    City: city {PK}, region {FK}

    Region: region {PK}, country

    Galaxy Schema

    Multiple groups of facts are linked by a few common dimensions.

    [Figure: Fact1, Fact2 and Fact3 share dimensions among Dimension1 to Dimension7.]


    Data Warehousing Objects

    All the three types of Schemas are described in the Data Modeling section

    Various Objects used in Data Warehousing are:

    Fact Tables

    Dimension Tables

    Hierarchies

    Unique Identifiers

    Relationships

    Data Warehousing Objects

    Fact Tables:

    Represent a business process, i.e., they model the business process as an artifact in the data model

    Contain the measurements, metrics or facts of business processes, e.g. "monthly sales number" in the Sales business process

    Most facts are additive (sales this month), some are semi-additive (balance as of), and some are not additive (unit price)

    The level of detail is called the grain of the table

    Contain foreign keys for the dimension tables

    Fact Types:

    Additive facts:

    Additive facts are facts that can be summed up through all of the dimensions in the fact table.

    Semi-additive facts:

    Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not for the others.

    Non-additive facts:

    Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

    Examples to illustrate additive, semi-additive and non-additive facts:

    Fact table: Date, Store, Product, Sales_Amount

    The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact.

    In this case, Sales_Amount is an additive fact, because we can sum up this fact along any of the three dimensions present in the fact table: date, store, and product.

    Example for semi-additive and non-additive facts:

    Fact table: Date, Account, Current_Balance, Profit_Margin

    The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day.

    Current_Balance and Profit_Margin are the facts.

    Current_Balance is a semi-additive fact: it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time.

    Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.
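    The semi-additive behaviour of Current_Balance can be sketched as follows. The snapshot rows and account names are hypothetical; the point is only that summing across accounts is meaningful, while across time one would take, for example, the latest value instead of a sum.

```python
from collections import defaultdict

# Hypothetical daily snapshot rows: (date, account, current_balance).
snapshot = [
    ("2019-07-01", "A", 100.0),
    ("2019-07-01", "B", 250.0),
    ("2019-07-02", "A", 120.0),
    ("2019-07-02", "B", 250.0),
]

# Meaningful: total balance across all accounts, per date.
total_by_date = defaultdict(float)
for date, account, balance in snapshot:
    total_by_date[date] += balance

# Across time a sum is not meaningful, so keep the latest balance
# per account instead (rows are processed in date order).
latest_by_account = {}
for date, account, balance in sorted(snapshot):
    latest_by_account[account] = balance

print(dict(total_by_date))  # {'2019-07-01': 350.0, '2019-07-02': 370.0}
print(latest_by_account)    # {'A': 120.0, 'B': 250.0}
```

    Adding account A's two daily balances would give 220.0, a figure that corresponds to no real amount of money, which is why the time dimension is excluded from the sum.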

    Types of fact tables:

    Based on the above classifications, there are two types of fact tables: cumulative and snapshot.

    Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive. The first example is a cumulative fact table.

    Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented is a snapshot fact table.

    Data Warehousing Objects (contd.)

    Dimension Tables:

    Define business in terms already familiar to users

    Wide rows with lots of descriptive text

    Small tables (about a million rows)

    Joined to fact table by a foreign key

    Heavily indexed

    Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

    Dimension Table Types

    Slowly Changing Dimensions

    Junk Dimensions

    Conformed Dimensions

    Degenerate Dimensions

    Slowly Changing Dimensions (SCD):

    Various data elements in a dimension undergo changes (e.g. changes in attributes or hierarchical structures) which need to be captured for analysis.

    The SCD problem is a common one, particular to data warehousing.

    In a nutshell, this applies to cases where the attribute for a record varies over time.

    For example:

    Customer key    Name         State
    1001            Christina    Illinois

    Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change?

    This is a Slowly Changing Dimension problem.

    Types of SCD

    There are in general three ways to solve this type of problem, categorized as follows:

    Type 1: The new record replaces the original record. No trace of the old record exists.

    Type 2: A new record is added to the customer dimension table.

    Type 3: The original record is modified to reflect the change.

    Type 1:

    The new record replaces the original record. No trace of the old record exists.

    Example:

    Customer key    Name         State
    1001            Christina    Illinois

    After Christina moved from Illinois to California, the new information replaces the original record and we have the following table:

    Customer key    Name         State
    1001            Christina    California

    Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

    Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in the above case, the company would not be able to know that Christina lived in Illinois before.
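    A minimal sketch of a Type 1 change, assuming the dimension is held as a Python dictionary keyed by customer key. The structure and the function name apply_type1 are illustrative only.

```python
# Hypothetical in-memory customer dimension, keyed by customer key.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    dim[customer_key]["state"] = new_state


apply_type1(customer_dim, 1001, "California")
print(customer_dim)  # {1001: {'name': 'Christina', 'state': 'California'}}
```

    After the update nothing in the table records that the state was ever Illinois, which is exactly the loss of history described above.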

    Type 2:

    In Type 2 SCD a new record is added to the table to represent the new information. Therefore both the original and the new record will be present.

    Example:

    After Christina moved from Illinois to California, we add the new information as a new row in the table:

    Customer key    Name         State
    1001            Christina    Illinois
    1005            Christina    California

    Advantages: This allows us to accurately keep all historical information.

    Disadvantages: This will cause the size of the table to grow fast. Where the number of rows in the table is very high to start with, storage and performance can become a concern.
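    A minimal sketch of a Type 2 change. The customer keys 1001 and 1005 come from the example; the customer_id natural key and the is_current flag are assumptions added here so that the two rows can be tied to the same customer, and they are not part of the slide's table.

```python
# Hypothetical dimension rows; customer_id and is_current are assumed columns.
customer_dim = [
    {"customer_key": 1001, "customer_id": "CUST-1", "name": "Christina",
     "state": "Illinois", "is_current": True},
]


def apply_type2(rows, customer_id, new_key, new_state):
    """Type 2: add a new row for the changed attribute and keep the old row."""
    current = next(r for r in rows
                   if r["customer_id"] == customer_id and r["is_current"])
    current["is_current"] = False
    # Copy the unchanged attributes, then override key, state and flag.
    rows.append({**current, "customer_key": new_key,
                 "state": new_state, "is_current": True})


apply_type2(customer_dim, "CUST-1", 1005, "California")
for row in customer_dim:
    print(row["customer_key"], row["name"], row["state"], row["is_current"])
```

    Both rows remain in the table, so facts recorded before the move still point at the Illinois row while new facts use the California row.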

    Type 3:

    In Type 3 SCD there are two columns for the particular attribute of interest, one indicating the original value and one indicating the current value. There is also a column that indicates when the current value becomes active.

    Example:

    After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

    Customer key    Name         Original State    Current State    Effective Date
    1001            Christina    Illinois          California       15-Jan-03

    Advantages: This does not increase the size of the table, since the new information is updated in place. This allows us to keep some part of history.

    Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
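    A minimal sketch of a Type 3 change, using the column layout from the table above. The dictionary representation and the function name apply_type3 are illustrative assumptions.

```python
from datetime import date

# One dimension row with an original-value and a current-value column.
row = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def apply_type3(record, new_state, effective):
    """Type 3: overwrite only the current value and its effective date."""
    record["current_state"] = new_state
    record["effective_date"] = effective


apply_type3(row, "California", date(2003, 1, 15))
apply_type3(row, "Texas", date(2003, 12, 15))
print(row["original_state"], row["current_state"], row["effective_date"])
```

    After the second change the row holds Illinois and Texas only; the intermediate California value is gone, as the disadvantage above describes.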

    Degenerate Dimension:

    A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.

    Degenerate dimensions are often used when a fact table's grain represents transactional-level data and one wishes to maintain system-specific identifiers such as order numbers, invoice numbers and the like without forcing their inclusion in their own dimension.

    Conformed Dimensions:

    A dimension which is fixed and reusable: it is defined once and shared, with the same meaning, by the fact tables and data marts that use it.

    It is also called a fixed dimension. It is a dimension which does not change with respect to time.

    Ex: If the name of a city is changed from Bombay to Mumbai, the name will not keep changing from time to time; once the change is done, it is permanent. Dimensions of this type are called conformed or fixed dimensions.

    Junk Dimensions:

    A dimension where one can store random transactional codes, flags and text attributes that are not related to other dimensions, and which provides a simple way for users to easily find those unrelated attributes.

    Ex: Marital Status: (Yes or No)

    Gender: (M or F), etc.

    Data Warehousing Objects (contd.)

    Hierarchies:

    Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A level represents a position in a hierarchy.

    Unique Identifiers:

    Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing.

    Relationships:

    Relationships guarantee business integrity. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.
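    The time hierarchy mentioned under Hierarchies (month to quarter to year) can be sketched as a two-step roll-up. The monthly figures and the helper quarter_of are hypothetical and only illustrate how each level aggregates the level below it.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, month).
monthly_sales = {
    (2019, 1): 100, (2019, 2): 120, (2019, 3): 80,
    (2019, 4): 90, (2019, 5): 110,
}


def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


# Month level -> quarter level.
quarterly_sales = defaultdict(int)
for (year, month), amount in monthly_sales.items():
    quarterly_sales[(year, quarter_of(month))] += amount

# Quarter level -> year level.
yearly_sales = defaultdict(int)
for (year, _quarter), amount in quarterly_sales.items():
    yearly_sales[year] += amount

print(dict(quarterly_sales))  # {(2019, 1): 300, (2019, 2): 200}
print(dict(yearly_sales))     # {2019: 500}
```

    Each level is computed from the one directly beneath it, which mirrors the ordered levels of the hierarchy.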

    Physical Design in Data Warehouse

    Physical design is the creation of the database with SQL statements. During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure.

    Physical Design Structures:

    Tablespaces: A tablespace consists of one or more data files, which are physical structures within the operating system you are using. A data file is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.

    Tables and Partitioned Tables: Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. Using partitioned tables instead of non-partitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces.

    Physical Design in Data Warehouse (contd.)

    Views:

    A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.

    Integrity Constraints:

    Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed.

    Indexes:

    Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments.

    Definition of Data Warehouse

    Ralph Kimball's paradigm:

    The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

    Bill Inmon's paradigm:

    The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

    Basic Design Approaches of Data Warehouse

    There are two major approaches to building or designing the data warehouse:

    The Top-Down Approach

    The Bottom-Up Approach

    The Top-Down Approach: the Dependent Data Mart Structure, or Hub & Spoke

    Inmon advocated a dependent data mart structure.

    The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS).

    Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and then extracted and loaded into the data warehouse.

    Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This helps organize the data in the particular structures required by the data marts. The data marts can then be loaded with the data, and the OLAP environment becomes available to the users.

    The Top-Down Approach (contd.)

    Inmon Approach

    The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of the particular department for which it is created.

    The Bottom-Up Approach: the Data Warehouse Bus Structure

    Ralph Kimball designed the data warehouse with the data marts connected to it by a bus structure.

    The bus structure contains all the common elements that are used by the data marts, such as conformed dimensions, measures, etc., defined for the enterprise as a whole.

    This architecture makes the data warehouse more of a virtual reality than a physical reality.

    All data marts could be located on one server or on different servers across the enterprise, while the data warehouse would be a virtual entity, being nothing more than the sum total of all the data marts.

    In this context even the cubes constructed by using OLAP tools could be considered data marts.

    The Bottom-Up Approach (contd.)

    Kimball Approach

    The bottom-up approach reverses the positions of the data warehouse and the data marts. Data marts are directly loaded with the data from the operational systems through the staging area.

    The data flow in the bottom-up approach starts with extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.

    The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit into the data mart structure. The data from the data mart is then extracted to the staging area, aggregated, summarized and so on, loaded into the data warehouse, and made available to the end user for analysis.

    DW Operational Processes (Overview of Extraction, Transformation & Loading)

    Operational/Source Data (sequential, legacy, relational, external):

    Typically host-based legacy applications: customized applications, COBOL, 3GL, 4GL

    Point of contact devices: POS, ATM, call switches

    External sources: Nielsen's, Acxiom, CMIE, vendors, partners

    ETL tools try to automate or support tasks such as:

    Data Extraction (accessing different source databases)

    Data Cleansing (finding and resolving inconsistencies in the source data)

    Data Transformation (between different data formats, languages, etc.)

    Data Loading

    Replication (replicating source databases into the data warehouse)

    Analyzing & Checking of Data Quality (for correctness and completeness)

    Building derived data & views
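    The first four tasks in the list above (extraction, cleansing, transformation, loading) can be sketched as a chain of small Python functions. This is only an illustration under stated assumptions: the source rows, the STATE_NAMES lookup and all function names are hypothetical, and a real ETL tool would read from and write to actual databases.

```python
# Hypothetical source rows with inconsistent formats.
source_rows = [
    {"cust": " christina ", "state": "IL", "amount": "100.50"},
    {"cust": "Bob", "state": "illinois", "amount": "20"},
    {"cust": "", "state": "CA", "amount": "5"},  # fails the completeness check
]

# Assumed lookup used to resolve naming inconsistencies between sources.
STATE_NAMES = {"il": "Illinois", "illinois": "Illinois", "ca": "California"}


def extract(rows):
    """Extraction: here simply iterate over the in-memory source."""
    yield from rows


def cleanse(rows):
    """Cleansing: drop rows that have no customer name."""
    for row in rows:
        if row["cust"].strip():
            yield row


def transform(rows):
    """Transformation: normalize names, state values and numeric types."""
    for row in rows:
        yield {
            "customer": row["cust"].strip().title(),
            "state": STATE_NAMES[row["state"].lower()],
            "amount": float(row["amount"]),
        }


def load(rows, warehouse):
    """Loading: append the prepared rows to the target table."""
    warehouse.extend(rows)
    return warehouse


warehouse_table = load(transform(cleanse(extract(source_rows))), [])
print(warehouse_table)
```

    Keeping each task as its own step makes it possible to check data quality between stages, for example by counting the rows rejected during cleansing.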

    [Figure: Elements of a Data Warehouse]

    Loading the Warehouse

    Cleaning the data before it is loaded

    DW Operational Processes (Overview ofExtraction, Transformation & Loading) Contd


    These processes have been discussed in detail in the ETL section.

    Some important definitions:

    Data Scrubbing: http://www.wisegeek.com/what-is-data-scrubbing.htm

    Data Cleansing: http://www.wisegeek.com/what-is-data-cleansing.htm

    Row level security: http://www.securityfocus.com/infocus/1743

    Staging Types: http://esj.com/Columns/article.aspx?EditorialsID=55

    Technical Problems in Data Warehouse


    Managing large amounts of data:

    The explosion of data volume came about because the data warehouse required that both detail and history be mixed in the same environment. Large amounts of data need to be managed in many ways: through flexible addressability of data stored inside the processor and on disk storage, through indexing, through extensions of data, through the efficient management of overflow, and so forth. To be effective, the technology used must satisfy the requirements for both volume and efficiency.

    Index/Monitor Data:

    If data in the warehouse cannot be easily and efficiently indexed, the data warehouse will not be a success. Monitoring data warehouse data determines such factors as the following:

    If a reorganization needs to be done

    If an index is poorly structured

    If too much or not enough data is in overflow

    The statistical composition of the access of the data

    Available remaining space
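
One of the monitoring questions above, whether a query can use an index at all, can be checked by asking the database for its query plan. The sketch below uses SQLite (through Python's built-in `sqlite3` module) purely as a stand-in; the `sales` table and the `idx_sales_region` index are hypothetical, and the exact plan wording differs between SQLite versions and between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("NORTH" if i % 2 else "SOUTH", float(i)) for i in range(1000)],
)


def query_plan(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


QUERY = "SELECT amount FROM sales WHERE region = 'NORTH'"

# Without an index on region, the plan is a scan of the whole table.
plan_before = query_plan(conn, QUERY)

# After the index is created, the plan should name it.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(conn, QUERY)

for line in plan_before + plan_after:
    print(line)
```

Comparing the plan before and after a change is a cheap way to notice a missing or unused index before users notice slow queries.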

    Technical Problems in Data Warehouse Contd


    Interfaces to many technologies:

    Data passes into the data warehouse from the operational environment and the ODS, and from the data warehouse into data marts, DSS applications, exploration and data mining warehouses, and alternate storage.

    This passage must be smooth and easy.

    The interface to different technologies requires several considerations:

    Does the data pass from one DBMS to another easily?

    Does it pass from one operating system to another easily?

    Does it change its basic format in passage (EBCDIC, ASCII, etc.)?
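
The last question can be made concrete with a small sketch. It assumes the source host uses the EBCDIC code page `cp037`, one of the EBCDIC codecs that ships with Python; a real system may use a different code page, and the sample text is made up.

```python
# "cp037" is an assumption: real mainframe sources may use another EBCDIC code page.
text = "CUSTOMER 42"

ebcdic_bytes = text.encode("cp037")
ascii_bytes = text.encode("ascii")

# The characters are the same, but the byte values differ between the encodings.
same_bytes = ebcdic_bytes == ascii_bytes

# Converting a record that arrived from an EBCDIC host into ASCII bytes.
converted = ebcdic_bytes.decode("cp037").encode("ascii")

print(ebcdic_bytes.hex(), ascii_bytes.hex(), same_bytes)
```

If the bytes are copied between platforms without this conversion step, the text arrives unreadable even though no data was lost.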

    Technical Problems in Data Warehouse Contd


    Meta Data Management:

    The data warehouse operates under a heuristic, iterative development life cycle. To be effective, the user of the data warehouse must have access to meta data that is accurate and up-to-date.

    Several types of meta data need to be managed in the data warehouse: distributed meta data, central meta data, technical meta data, and business meta data.

    Technical Problems in Data Warehouse Contd


    Efficient Loading of Data

    Data is loaded into a data warehouse in two fundamental ways: a record at a time through a language interface, or en masse with a utility.

    Indexes must be efficiently loaded at the same time the data is loaded. As the burden of the volume of loading becomes an issue, the load is often parallelized.

    Another related approach to the efficient loading of very large amounts of data is staging the data prior to loading.

    As a rule, large amounts of data are gathered into a buffer area before being processed by extract/transform/load (ETL) software. The staged data is merged, perhaps edited, summarized, and so forth, before it passes into the ETL layer.
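
The two loading styles can be contrasted in a short sketch. It is illustrative only: `sqlite3` stands in for the warehouse database, `executemany` stands in for a bulk-load utility, and the `product` table, the staging list and the batch size are hypothetical.

```python
import sqlite3

# Hypothetical staged data: rows gathered in a buffer before loading.
staged_rows = [(i, f"product-{i}", i * 1.5) for i in range(10_000)]


def new_warehouse():
    """Create an in-memory database with an indexed target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (id INTEGER, name TEXT, price REAL)")
    # The index exists before the load, so it is maintained as rows arrive.
    conn.execute("CREATE INDEX idx_product_id ON product (id)")
    return conn


def batches(rows, size):
    """Yield the staged rows in fixed-size chunks."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Way 1: a record at a time through the language interface.
conn_single = new_warehouse()
for row in staged_rows:
    conn_single.execute("INSERT INTO product VALUES (?, ?, ?)", row)
conn_single.commit()

# Way 2: en masse, one batch from the staging buffer per call.
conn_bulk = new_warehouse()
for chunk in batches(staged_rows, 1_000):
    conn_bulk.executemany("INSERT INTO product VALUES (?, ?, ?)", chunk)
conn_bulk.commit()

count_single = conn_single.execute("SELECT COUNT(*) FROM product").fetchone()[0]
count_bulk = conn_bulk.execute("SELECT COUNT(*) FROM product").fetchone()[0]
```

Both approaches end with the same rows in the table. The batched form makes fewer calls into the database, which is the usual reason for preferring a bulk path once volumes grow.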

    Technical Problems in Data Warehouse Contd


    Lock Management:

    The lock manager ensures that two or more people are not updating the same record at the same time. But update is not done in the data warehouse; instead, data is stored in a series of snapshot records. When a change occurs, a new snapshot record is added, rather than an update being done.
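
The snapshot approach can be sketched as an append-only table. The `customer_snapshot` table, its columns and the sample addresses are hypothetical, and `sqlite3` is used only as a convenient stand-in database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_snapshot ("
    " customer_id INTEGER, address TEXT, snapshot_date TEXT)"
)


def record_change(conn, customer_id, address, snapshot_date):
    """Append a new snapshot row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        (customer_id, address, snapshot_date),
    )


# The customer moves: a second snapshot is added instead of an UPDATE.
record_change(conn, 1, "12 Elm St", "2019-01-10")
record_change(conn, 1, "99 Oak Ave", "2019-06-02")
conn.commit()

# Full history is preserved.
history = conn.execute(
    "SELECT address, snapshot_date FROM customer_snapshot "
    "WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()

# The current state is simply the latest snapshot.
current_address = conn.execute(
    "SELECT address FROM customer_snapshot "
    "WHERE customer_id = 1 ORDER BY snapshot_date DESC LIMIT 1"
).fetchone()[0]
```

Because rows are only ever inserted, two writers never compete for the same existing record, which is why record-level update locking matters far less here than in an OLTP system.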

    Steps in Building a Data Warehouse:


    Identify key business drivers, sponsorship, risks, ROI

    Survey information needs, identify desired functionality, and define functional requirements for the initial subject area

    Architect the long-term data warehousing architecture

    Evaluate and Finalize DW tool & technology

    Conduct Proof-of-Concept

    Design target database schema

    Build data mapping, extract, transformation, cleansing and aggregation/summarization rules

    Build the initial data mart, using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases

    Maintain and administer data warehouse

    Representative DSS Tools


    Tool Category: Products

    ETL Tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder

    OLAP Server: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB

    OLAP Tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube

    Data Warehouse: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick

    Data Mining & Analysis: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools

    Business Intelligence


    How intelligent can you make your business processes?

    What insight can you gain into your business?

    How integrated can your business processes be?

    How much more interactive can your business be with customers, partners, employees and managers?

    What is Business Intelligence (BI)?


    Business Intelligence is a generalized term applied to a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help enterprise users make better business decisions.

    Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.

    An alternative way of describing BI is: the technology required to turn raw data into information to support decision-making within corporations and business processes.

    Why BI?


    BI technologies help bring decision-makers the data in a form they can quickly digest and apply to their decision making.

    BI turns data into information for managers, executives and, in general, anyone making decisions in a company.

    Companies want to use technology tactically to make their operations more effective and more efficient. Business intelligence can be the catalyst for that efficiency and effectiveness.

    Benefits


    The benefits of a well-planned BI implementation are going to be closely tied to the business objectives driving the project.

    Identify trends and anomalies in business operations more quickly, allowing for more accurate and timelier decisions.

    Deliver actionable insight and information to the right place with less effort.

    Identify and operate based on a single version of the truth, allowing all analysis to be completed on a core foundation with confidence.

    Business Intelligence Platform Requirements


    Data Warehouse Databases

    OLAP

    Data Mining

    Interfaces

    Build and Manage Capabilities

    The business intelligence platform should provide good integration across these technologies. It should be a coherent platform, not a set of diverse and heterogeneous technologies.

    Business Intelligence Components


    [Diagram: Operational Data -> Extract -> Transform -> Load -> Data Warehouse -> OLAP, Data Mining]

    Business Intelligence Architecture


    Business Intelligence Technologies


    Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

    Data Warehouses / Data Marts

    Data Exploration: OLAP, DSS, EIS, Querying and Reporting

    Data Mining: Information discovery

    Data Presentation: Visualization Techniques

    Decision Making

    Increasing potential to support business decisions (from Data Sources toward Decision Making)

    Roles: End User, Business Analyst, Data Analyst, DB Admin