39932886 Conf Dwh Concepts


  • 7/28/2019 39932886 Conf Dwh Concepts


    Course Overview

    What is Data Warehouse

    OLTP Vs. Data Warehousing

    Data Warehousing Architecture

    Data Warehousing Schemas & Objects

    Physical Design in Data Warehouse

    Definition of Data Warehousing


    Course Overview

    Data Warehousing Basic Design Approaches

    Data Warehousing Operational Processes

    Technical Problems in Data Warehousing

    Representative DSS Tools

    Business Intelligence


    What is a Data Warehouse?

    A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data.

    A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

    It is a series of processes, procedures and tools (h/w & s/w) that help the enterprise understand more about itself, its products, its customers and the market it services.


    Facts!

    It is NOT possible to purchase a data warehouse, but it is possible to build one.

    A data warehouse is NOT a specific technology.


    Why Data Warehousing?

    Need for Intelligent Information in a Competitive Market

    Who are the potential customers?

    Which products are sold the most?

    What are the region-wise preferences?

    What are the competitor products?

    What are the projected sales?

    What if you sell more quantity of a particular product? What will be the impact on revenue?

    What are the results of the promotion schemes introduced?


    Defining Data Warehouse

    A data warehouse is a subject-oriented, integrated, nonvolatile and time-variant collection of data in support of management's decisions.

    William Inmon


    Subject Oriented

    The data in the data warehouse is organized around the major subjects of the enterprise (i.e. the high-level entities).

    The orientation around the major subject areas causes the data warehouse design to be data driven.

    The operational systems are designed around applications and functions, e.g. loans, savings and credit cards in the case of a bank, whereas the data warehouse is designed around subjects such as Customer, Product, Vendor etc.

    [Figure: Operational Systems, organized by processes or tasks, compared with the Data Warehouse, organized by subject (Customer, Supplier, Product)]


    Time Variant

    [Figure: Data Warehouse Data; the key includes Time Data]

    Data is stored as a series of snapshots or views which record how it is collected across time.

    It helps in business trend analysis.

    In contrast to the OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.


    Integrated

    Data is stored once in a single integrated location

    [Figure: customer data stored in several databases (Auto Policy Processing System, Fire Policy Processing System, FACTS, LIFE, Commercial and Accounting applications) is integrated into the data warehouse database under Subject = Customer]

    It is closely related to subject orientation.

    Data from disparate sources needs to be put in a consistent format.

    This involves resolving problems such as naming conflicts and inconsistencies.
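    As a minimal sketch of what resolving naming conflicts can look like, the Python below maps records from two source systems onto one customer format. The two source names follow the figure; every field name and code value is a hypothetical stand-in, not something the slides specify.

```python
# Hypothetical field names for two source systems; only the idea of mapping
# them onto one consistent format comes from the slide.
FIELD_MAPS = {
    "auto_policy": {"cust_no": "customer_id", "cust_nm": "name", "sex": "gender"},
    "fire_policy": {"CustomerID": "customer_id", "CustomerName": "name", "Gender": "gender"},
}
# Hypothetical code table that resolves inconsistent gender encodings.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def integrate(source, record):
    """Rename fields and normalize coded values for one source record."""
    mapping = FIELD_MAPS[source]
    unified = {mapping[key]: value for key, value in record.items() if key in mapping}
    unified["gender"] = GENDER_CODES[str(unified["gender"]).lower()]
    return unified


auto = integrate("auto_policy", {"cust_no": 1001, "cust_nm": "Christina", "sex": "f"})
fire = integrate("fire_policy", {"CustomerID": 1001, "CustomerName": "Christina", "Gender": "Female"})
```

    Both records end up with the same keys and the same gender code, so they can be stored once under the Customer subject.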


    Non-Volatile

    Existing data in the warehouse is not overwritten or updated.

    [Figure: production applications insert, update and delete data in the production databases; data from the production databases and external sources is loaded into the read-only data warehouse database of the data warehouse environment]

    This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.


    So, what's different between OLTP and Data Warehouse?


    OLTP vs. Data Warehouse

    OLTP systems are tuned for known transactions and workloads, while the workload is not known in advance in a data warehouse.

    Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., the average amount spent on phone calls between 9AM-5PM in Pune during the month of December.
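    The example query above can be sketched with Python's built-in sqlite3 module. The table name `calls`, its columns and the sample rows are assumptions made for illustration; a real warehouse would answer this from a fact table joined to time and geography dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical call records: city, start timestamp, amount spent.
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2019-12-02 10:15:00", 4.0),
        ("Pune", "2019-12-05 16:40:00", 6.0),
        ("Pune", "2019-12-05 20:05:00", 9.0),    # outside 9AM-5PM
        ("Pune", "2019-11-28 11:00:00", 7.0),    # not December
        ("Mumbai", "2019-12-03 12:30:00", 5.0),  # not Pune
    ],
)

# Three dimensions are constrained at once: place, month and time of day.
# Hours 9 to 16 cover calls that start from 09:00 up to 16:59.
(avg_amount,) = conn.execute(
    """
    SELECT AVG(amount) FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND CAST(strftime('%H', started_at) AS INTEGER) BETWEEN 9 AND 16
    """
).fetchone()
```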


    OLTP vs. Data Warehouse

    OLTP                     WAREHOUSE (DSS)
    Application oriented     Subject oriented
    Used to run business     Used to analyze business
    Detailed data            Summarized and refined
    Current, up to date      Snapshot data
    Isolated data            Integrated data
    Repetitive access        Ad-hoc access
    Clerical user            Knowledge user (manager)


    OLTP vs Data Warehouse

    OLTP                                      DATA WAREHOUSE
    Performance sensitive                     Performance relaxed
    Few records accessed at a time (tens)     Large volumes accessed at a time (millions)
    Read/update access                        Mostly read (batch update)
    No data redundancy                        Redundancy present
    Database size 100 MB - 100 GB             Database size 100 GB - few terabytes


    OLTP vs Data Warehouse

    OLTP                                                 Data Warehouse
    Transaction throughput is the performance metric     Query throughput is the performance metric
    Thousands of users                                   Hundreds of users
    Managed in entirety                                  Managed by subsets


    To summarize ...

    OLTP systems are used to run a business.

    The data warehouse helps to optimize the business.


    Data Warehouse Architectures

    Centralized

    In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. As already shown in the previous section, the disadvantage is the loss of performance compared with distributed approaches.

    Central Architecture


    Data Warehouse Architectures Contd.

    Federated

    In a federated architecture the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the relevant information for a department.

    The amount of data is reduced in contrast to a central data warehouse. The level of detail is enhanced.

    [Figure: Federated Architecture]


    Data Warehouse Architectures Contd.

    Tiered

    A tiered architecture is a distributed data approach. This process cannot be done in one step because many sources have to be integrated into the warehouse.

    On the first level, the data of all branches in one region is collected; on the second level, the data from the regions is integrated into one data warehouse.

    Advantages:

    Faster response time, because the data is located closer to the client applications

    Reduced volume of data to be searched

    [Figure: Tiered Architecture]
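    The two-level flow of the tiered architecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region and branch names, the amounts and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical branch-level sales: (region, branch, amount).
branch_sales = [
    ("West", "Pune", 120.0),
    ("West", "Mumbai", 200.0),
    ("South", "Chennai", 80.0),
]


def collect_regions(rows):
    """First level: collect the data of all branches in each region."""
    regional = defaultdict(list)
    for region, branch, amount in rows:
        regional[region].append((branch, amount))
    return dict(regional)


def integrate_warehouse(regional):
    """Second level: integrate the regional stores into one warehouse."""
    return [
        {"region": region, "branch": branch, "amount": amount}
        for region, rows in sorted(regional.items())
        for branch, amount in rows
    ]


regional_stores = collect_regions(branch_sales)
warehouse = integrate_warehouse(regional_stores)
```

    A client that only needs one region can query `regional_stores["West"]` and search two rows instead of the whole warehouse, which is the reduced search volume listed among the advantages.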


    Complete Warehouse Solution Architecture

    [Figure: three stages, Data Sources, Data Management and Access, with Metadata spanning all of them. Operational data, legacy data and external data sources (The Post, VISA) pass through Extract, Transform, Load into the organizationally structured Enterprise Data Warehouse, which feeds departmentally structured data marts (Sales, Inventory, Purchase). Data becomes Information and then Knowledge, moving from Asset Assembly (and Management) to Asset Exploitation.]


    Data Warehouse Architecture Components

    Data Sources (disparate data sources):

    Legacy data

    Operational data

    External data resources

    Data Management:

    Metadata - At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.

    Data Mart - A data mart is a subject-oriented data warehouse.


    Introduction To Data Marts

    What is a Data Mart

    From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. The departments can also use the analytical software they find convenient. The cost of processing becomes very low.


    Data Mart Overview

    Data marts satisfy 80% of the local end-users' requests.

    [Figure: the Data Warehouse feeds the data marts DM Marketing, DM Finance, DM Sales and DM HR, which serve Sales Representatives and Analysts; Human Resources; and Financial Analysts, Strategic Planners, and Executives]


    From The Data Warehouse To Data Marts

    [Figure: from the organizationally structured data warehouse, through departmentally structured data marts, to individually structured data; labels: Less/More, History, Normalized, Detailed, Data, Information]


    Operational Data Store (ODS)

    What is an ODS

    An Operational Data Store (ODS) integrates data from multiple business operation sources to address operational problems that span one or more business functions.

    An ODS has the following features:

    Subject-oriented - Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).

    Integrated - Presents an integrated image of subject-oriented data which is pulled from fragmented operational source systems.

    Current - Contains a snapshot of the current content of legacy source systems. History is not kept, and might be moved to the data warehouse for analysis.

    Volatile - Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.

    Detailed - ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.


    Operational Data Store (ODS) Contd

    The ODS provides an integrated view of data in operational systems.

    As the figure below indicates, there is a clear separation between the ODS and the data warehouse.

    [Figure: source systems A, B and C feed the Operational Data Store (current or near current data, detailed data, updates allowed), which feeds the Data Warehouse (historical data, summary and detail, non-volatile snapshots only); EIS, DSS, Apps and PC are the access points]


    Benefits Of ODS

    Supports the operational reporting needs of the organization.

    Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This data can include data from an organization's internal systems, as well as external data from third-party vendors.

    Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.

    Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.

    Provides more current data than a data warehouse and more integrated data than an OLTP system.

    Feeds other operational systems in addition to the data warehouse.


    Data Warehousing SCHEMAS & OBJECTS

    A schema is a collection of database objects, including tables, views, indexes, and synonyms.

    There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:

    Star Schema

    Snowflake Schema

    Galaxy Schema


    Star Schema: It consists of a fact table connected to a set of dimension tables. Data in the dimension tables is de-normalized.

    Snowflake Schema: It is a refinement of the star schema where some dimensional hierarchy is normalized into a set of dimension tables.

    Galaxy Schema: Multiple fact tables share dimension tables. It can be viewed as a collection of stars, and is therefore called a galaxy schema.


    Star Schema

    A star schema is a highly de-normalized, query-centric model where information is broken into two groups: facts and dimensions.

    [Figure: star schema. Sales_Fact (TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics or measures), ...) sits at the centre, surrounded by Time_Dim (TimeKey, TheDate, ...), Employee_Dim (EmployeeKey, EmployeeID, ...), Branch_Dim (BranchID, Branchno, ...), Customer_Dim (CustomerKey, CustomerID, ...) and Shipper_Dim (ShipperKey, ShipperID, ...)]
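    A reduced version of this star schema can be tried out with Python's built-in sqlite3 module. The sketch keeps only two of the dimension tables from the figure; the `SalesAmount` measure and all sample rows are assumptions, since the slide only says "business metrics or measures".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Time_Dim (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
    CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
    CREATE TABLE Sales_Fact (
        TimeKey INTEGER REFERENCES Time_Dim (TimeKey),
        CustomerKey INTEGER REFERENCES Customer_Dim (CustomerKey),
        SalesAmount REAL  -- hypothetical measure
    );
    INSERT INTO Time_Dim VALUES (1, '2019-07-01'), (2, '2019-07-02');
    INSERT INTO Customer_Dim VALUES (10, 'C-10'), (11, 'C-11');
    INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 25.0);
    """
)

# A typical star query: join the fact table to a dimension, then aggregate.
sales_by_date = conn.execute(
    """
    SELECT t.TheDate, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Time_Dim AS t ON t.TimeKey = f.TimeKey
    GROUP BY t.TheDate
    ORDER BY t.TheDate
    """
).fetchall()
```

    Each dimension is one join away from the fact table, which is what keeps star queries simple.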


    Snowflake Schema

    Fact Table

    Sales_fact: timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}, staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue

    Dimension Tables

    Branch_Dim: branchID {PK}, branchNo, branchType, city {FK}

    City: city {PK}, region {FK}

    Region: region {PK}, country

    Figure 32.2
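    The normalized Branch_Dim, City and Region chain can be sketched with sqlite3 as follows. Table and column names follow the figure; the fact table is cut down to two columns and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Region (region TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE City (city TEXT PRIMARY KEY, region TEXT REFERENCES Region (region));
    CREATE TABLE Branch_Dim (
        branchID INTEGER PRIMARY KEY,
        branchNo TEXT,
        branchType TEXT,
        city TEXT REFERENCES City (city)
    );
    -- Reduced fact table: one foreign key and one measure from the figure.
    CREATE TABLE Sales_fact (
        branchID INTEGER REFERENCES Branch_Dim (branchID),
        saleRevenue REAL
    );
    INSERT INTO Region VALUES ('West', 'India'), ('South', 'India');
    INSERT INTO City VALUES ('Pune', 'West'), ('Mumbai', 'West'), ('Chennai', 'South');
    INSERT INTO Branch_Dim VALUES
        (1, 'B001', 'Main', 'Pune'), (2, 'B002', 'Sub', 'Mumbai'), (3, 'B003', 'Main', 'Chennai');
    INSERT INTO Sales_fact VALUES (1, 100.0), (2, 40.0), (3, 70.0);
    """
)

# Rolling revenue up to region walks the normalized hierarchy table by table.
revenue_by_region = conn.execute(
    """
    SELECT r.region, SUM(f.saleRevenue)
    FROM Sales_fact AS f
    JOIN Branch_Dim AS b ON b.branchID = f.branchID
    JOIN City AS c ON c.city = b.city
    JOIN Region AS r ON r.region = c.region
    GROUP BY r.region
    ORDER BY r.region
    """
).fetchall()
```

    Compared with a star schema, the same roll-up needs two extra joins, which is the usual price of normalizing the dimension hierarchy.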


    Galaxy Schema

    Multiple groups of facts linked by a few common dimensions

    [Figure: Fact1, Fact2 and Fact3 sharing Dimension1 to Dimension7]


    Data Warehousing Objects

    All three types of schemas are described in the Data Modeling section.

    The various objects used in data warehousing are:

    Fact Tables

    Dimension Tables

    Hierarchies

    Unique Identifiers

    Relationships


    Data Warehousing Objects

    Fact Tables:

    Represent a business process, i.e., model the business process as an artifact in the data model

    Contain the measurements or metrics or facts of business processes, e.g. "monthly sales number" in the Sales business process

    Most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price)

    The level of detail is called the grain of the table

    Contain foreign keys for the dimension tables


    Fact Types:

    Additive facts:

    Additive facts are facts that can be summed up through all of the dimensions in the fact table.

    Semi-additive facts:

    Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

    Non-additive facts:

    Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

    Examples to illustrate additive, semi-additive and non-additive facts:


    Fact table:

    Date
    Store
    Product
    Sales_Amount

    The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact.

    In this case, Sales_Amount is an additive fact, because we can sum up this fact along any of the three dimensions present in the fact table: date, store, and product.


    Example for semi-additive and non-additive facts:

    Fact table:

    Date
    Account
    Current_Balance
    Profit_Margin

    The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day.

    Current_Balance and Profit_Margin are the facts.

    Current_Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time.

    Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.
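    The semi-additive behaviour of Current_Balance can be checked with a few lines of Python. The rows below are hypothetical sample data in the shape of the fact table above.

```python
# Hypothetical daily snapshot rows: (date, account, current_balance, profit_margin).
rows = [
    ("2019-07-01", "A", 100.0, 0.10),
    ("2019-07-01", "B", 300.0, 0.20),
    ("2019-07-02", "A", 150.0, 0.10),
    ("2019-07-02", "B", 250.0, 0.30),
]

# Meaningful: total balance across all accounts on one day.
total_on_day_2 = sum(bal for day, _, bal, _ in rows if day == "2019-07-02")

# Not meaningful: adding one account's balance through time.
balance_a_summed = sum(bal for _, acct, bal, _ in rows if acct == "A")

# A common alternative over a period is the latest snapshot; ISO date
# strings sort chronologically, so max() picks the most recent day.
balance_a_latest = max((day, bal) for day, acct, bal, _ in rows if acct == "A")[1]
```

    Account A never held the summed figure; its balance at the end of the period is the latest snapshot. Profit_Margin is a ratio, so summing it is not meaningful along either dimension.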


    Types of fact tables:

    Based on the above classifications, there are two types of fact tables:

    Cumulative

    Snapshot

    Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive. The first example is a cumulative fact table.

    Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented is a snapshot fact table.


    Data Warehousing Objects Contd.

    Dimension Tables:

    Dimension tables

    Define business in terms already familiar to users

    Wide rows with lots of descriptive text

    Small tables (about a million rows)

    Joined to fact table by a foreign key

    Heavily indexed

    Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.


    Dimension Table Types

    Slowly Changing Dimensions

    Junk Dimensions

    Conformed Dimensions

    Degenerate Dimensions


    Slowly Changing Dimensions (SCD):

    Various data elements in the dimension undergo changes (e.g. changes in attributes, hierarchical structures) which need to be captured for analysis.

    The SCD problem is a common one particular to data warehousing. In a nutshell, it applies to cases where the attribute for a record varies over time.

    For example:

    Customer key    Name         State
    1001            Christina    Illinois

    Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change?

    This is a Slowly Changing Dimension problem.

    Types of SCD


    There are in general three ways to solve this type of problem, categorized as follows:

    Type 1: The new record replaces the original record. No trace of the old record exists.

    Type 2: A new record is added to the customer dimension table.

    Type 3: The original record is modified to reflect the change.

    TYPE 1:

    The new record replaces the original record. No trace of the old record exists.

    Example:

    Customer key    Name         State
    1001            Christina    Illinois

    After Christina moved from Illinois to California, the new information replaces the original record and we have the following table:

    Customer key    Name         State
    1001            Christina    California

    Advantages:

    This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

    Disadvantages:

    All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in the above case, the company would not be able to know that Christina lived in Illinois before.
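    A Type 1 change amounts to an in-place overwrite. The following Python is a minimal sketch using a dictionary as a stand-in for the dimension table; the function name `scd_type1_update` is hypothetical.

```python
# Dimension table stand-in: customer key -> attributes.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def scd_type1_update(dim, customer_key, **changes):
    """Overwrite attributes in place; the old values are not kept anywhere."""
    dim[customer_key].update(changes)


scd_type1_update(customer_dim, 1001, state="California")
```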

    TYPE 2:

    In a Type 2 SCD, a new record is added to the table to represent the new information. Therefore both the original and the new record will be present.

    Example:

    After Christina moved from Illinois to California, we add the new information as a new row into the table:

    Customer key    Name         State
    1001            Christina    Illinois
    1005            Christina    California

    Advantages:

    This allows us to accurately keep all historical information.

    Disadvantages:

    This will cause the size of the table to grow fast. Where the number of rows for the table is very high to start with, storage and performance can become a concern.
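    A Type 2 change can be sketched as an append. The surrogate keys 1001 and 1005 come from the slide; `customer_id` is a hypothetical natural key added here so the two versions of the same customer can be tied together, and the function name is likewise an assumption.

```python
# customer_key is the surrogate key shown on the slide; customer_id is a
# hypothetical natural key that links the versions of one customer.
customer_dim = [
    {"customer_key": 1001, "customer_id": "CHR-1", "name": "Christina", "state": "Illinois"},
]


def scd_type2_update(dim, customer_id, new_key, **changes):
    """Append a new version of the row; earlier versions stay untouched."""
    latest = [row for row in dim if row["customer_id"] == customer_id][-1]
    dim.append({**latest, **changes, "customer_key": new_key})


scd_type2_update(customer_dim, "CHR-1", new_key=1005, state="California")
history = [row["state"] for row in customer_dim if row["customer_id"] == "CHR-1"]
```

    In practice, Type 2 tables often also carry effective dates or a current-row flag so that the active version can be found quickly; the slide's table does not show these.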

    TYPE 3:

    In a Type 3 SCD, there are two columns for the particular attribute of interest, one indicating the original value and one indicating the current value. There is also a column that indicates when the current value became active.

    Example:

    After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

    Customer key    Name         Original State    Current State    Effective Date
    1001            Christina    Illinois          California       15-Jan-03

    Advantages:

    This does not increase the size of the table, since new information is updated in place.

    This allows us to keep some part of history.

    Disadvantages:

    Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
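    The Type 3 layout, including its limitation, can be sketched as follows. The column names mirror the table above; the function name `scd_type3_update` is hypothetical.

```python
from datetime import date

customer = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def scd_type3_update(row, new_state, effective):
    """Overwrite the current value; only the original value is preserved."""
    row["current_state"] = new_state
    row["effective_date"] = effective


scd_type3_update(customer, "California", date(2003, 1, 15))
scd_type3_update(customer, "Texas", date(2003, 12, 15))  # California is now lost
```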

    Degenerated Dimension:

    A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.

    Degenerate dimensions are often used when a fact table's grain represents transactional level data and one wishes to maintain system-specific identifiers such as order numbers, invoice numbers and the like without forcing their inclusion in their own dimension.

    Conformed Dimensions:

    A conformed dimension is a dimension that is defined once and is reusable, with the same meaning, across fact tables and data marts.

    It is also described as a fixed dimension: a dimension that does not change with respect to time.

    Ex: if the name of a city is changed from Bombay to Mumbai, the name will not keep changing from time to time; once the change is done, it is permanent.

    Junk dimensions:

    A junk dimension is a dimension that stores miscellaneous transactional codes, flags and text attributes that are not related to other dimensions, and which provides a simple way for users to easily find those unrelated attributes.

    Ex: Marital Status (Yes or No), Gender (M or F), etc.
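    One common way to build a junk dimension is to enumerate every combination of the low-cardinality flags and give each combination one key. The sketch below uses the two flags from the example; the `junk_key` column and the `lookup` helper are hypothetical names.

```python
from itertools import product

# Low-cardinality flags taken from the example above.
marital_status = ["Yes", "No"]
gender = ["M", "F"]

# One row per combination, each with its own surrogate key.
junk_dim = [
    {"junk_key": key, "marital_status": m, "gender": g}
    for key, (m, g) in enumerate(product(marital_status, gender), start=1)
]

# A fact row then stores a single junk_key instead of two separate flags.
lookup = {(row["marital_status"], row["gender"]): row["junk_key"] for row in junk_dim}
```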


    Data Warehousing Objects Contd.

    Hierarchies:

    Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A level represents a position in a hierarchy.
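    The month to quarter to year roll-up can be sketched in Python as below. The sample figures and the `roll_up` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical monthly sales: (year, month, amount).
monthly = [(2019, 1, 10.0), (2019, 2, 20.0), (2019, 4, 5.0), (2020, 1, 7.0)]


def roll_up(rows, level):
    """Aggregate month-level rows to the 'quarter' or 'year' level."""
    totals = defaultdict(float)
    for year, month, amount in rows:
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        key = (year, (month - 1) // 3 + 1) if level == "quarter" else year
        totals[key] += amount
    return dict(totals)


by_quarter = roll_up(monthly, "quarter")
by_year = roll_up(monthly, "year")
```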

    Unique Identifiers:

    Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing.

    Relationships:

    Relationships guarantee business integrity. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.


    Physical Design In Datawarehouse

    Physical design is the creation of the database with SQL statements. During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure.

    Physical Design Structures:

    Tablespaces: A tablespace consists of one or more data files, which are physical structures within the operating system you are using. A data file is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.

    Tables and Partitioned Tables: Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. Using partitioned tables instead of non-partitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces.

    Physical Design In Data Warehouse Contd.


    Views:

    A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any storage space for data in the database; only the view definition is stored.
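    A short sqlite3 sketch shows a view being queried like a table. The `sales` table, the `sales_by_store` view and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (store TEXT, amount REAL);
    INSERT INTO sales VALUES ('Pune', 10.0), ('Pune', 15.0), ('Mumbai', 7.0);
    CREATE VIEW sales_by_store AS
        SELECT store, SUM(amount) AS total FROM sales GROUP BY store;
    """
)

# The view stores no rows of its own, so it reflects later changes to `sales`.
conn.execute("INSERT INTO sales VALUES ('Mumbai', 3.0)")
totals = dict(conn.execute("SELECT store, total FROM sales_by_store"))
```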

    Integrity Constraints:

    Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed.

    Indexes:

    Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments.
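    The idea behind a bitmap index can be sketched in pure Python: each distinct column value gets a bitmap with one bit per row, and combined predicates become bitwise operations. This is a toy illustration of the concept, not how any particular database implements it; the rows and the helper name are hypothetical.

```python
# Rows of a hypothetical table with two low-cardinality columns.
rows = [
    {"gender": "M", "region": "West"},
    {"gender": "F", "region": "West"},
    {"gender": "F", "region": "South"},
    {"gender": "M", "region": "South"},
    {"gender": "F", "region": "West"},
]


def build_bitmap_index(rows, column):
    """Map each distinct value to an integer whose bit i is set if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


gender_idx = build_bitmap_index(rows, "gender")
region_idx = build_bitmap_index(rows, "region")

# WHERE gender = 'F' AND region = 'West' becomes a single bitwise AND.
match_bits = gender_idx["F"] & region_idx["West"]
matching_rows = [i for i in range(len(rows)) if (match_bits >> i) & 1]
```

    Bitmaps stay small when a column has few distinct values, which is one reason they suit the low-cardinality attributes common in warehouse queries.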

    Definition Of Data Warehouse


    Ralph Kimball's paradigm:

    The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

    Bill Inmon's paradigm:

    The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

    Basic Design Approaches of Data Warehouse


    There are two major approaches to building or designing the data warehouse:

    The Top-Down Approach

    The Bottom-Up Approach

    The Top Down Approach


    The Dependent Data Mart structure or Hub & Spoke: The Top-Down Approach

    Inmon advocated a dependent data mart structure

    The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS).

    Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and then extracted and loaded into the data warehouse.

    Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This helps organize the data in the particular structures required by the data marts. Then the data marts can be loaded with the data and the OLAP environment becomes available to the users.

    The Top Down Approach Contd


    Inmon Approach

    The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of the particular department for which it is created.

    The Bottom-Up Approach


    1. The Data warehouse Bus Structure: The Bottom-Up Approach

    Ralph Kimball designed the data warehouse with the data marts connected to it with a bus structure.

    The bus structure contains all the common elements that are used by data marts, such as conformed dimensions, measures etc., defined for the enterprise as a whole.

    This architecture makes the data warehouse more of a virtual reality than a physical reality.

    All data marts could be located on one server or on different servers across the enterprise, while the data warehouse would be a virtual entity, being nothing more than the sum total of all the data marts.

    In this context even the cubes constructed by using OLAP tools could be considered as data marts.

    The Bottom-Up Approach Contd


    Kimball Approach

    The bottom-up approach reverses the positions of the data warehouse and the data marts. Data marts are directly loaded with the data from the operational systems through the staging area.

    The data flow in the bottom-up approach starts with extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.

    The Bottom-Up Approach Contd


    The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit the data mart structure. The data from the data mart is then extracted to the staging area, aggregated, summarized and so on, loaded into the data warehouse, and made available to the end user for analysis.
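
    The bottom-up flow described above (operational data, staging, ODS, data mart, data warehouse) can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout (order_id, region, amount), the cleaning rules and the function names are all hypothetical.

```python
from collections import defaultdict

def stage(rows):
    """Staging step: drop incomplete rows and normalize the region label."""
    return [
        {**r, "region": r["region"].strip().upper()}
        for r in rows
        if r.get("amount") is not None
    ]

def refresh_ods(ods, staged_rows):
    """Append fresh rows to the ODS, replacing rows that share an order_id."""
    for row in staged_rows:
        ods[row["order_id"]] = row
    return ods

def load_data_mart(ods):
    """Reshape current ODS rows into a hypothetical sales data mart structure."""
    return [(r["region"], r["amount"]) for r in ods.values()]

def load_warehouse(mart_rows):
    """Aggregate data mart rows by region before loading the warehouse."""
    totals = defaultdict(float)
    for region, amount in mart_rows:
        totals[region] += amount
    return dict(totals)

def run_flow(operational_rows):
    """Run the whole bottom-up flow on one batch of operational rows."""
    ods = refresh_ods({}, stage(operational_rows))
    return load_warehouse(load_data_mart(ods))
```

    For example, passing two valid orders for the same region and one order with a missing amount yields a single regional total, with the incomplete row filtered out in staging.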

    DW Operational Processes (Overview of Extraction, Transformation & Loading)


    Typically host based, legacy applications

    Customized applications, COBOL, 3GL, 4GL

    Point of Contact Devices

    POS, ATM, Call switches

    External Sources

    Nielsens, Acxiom, CMIE, Vendors, Partners

    [Figure: Operational/Source Data (Sequential, Legacy, Relational, External)]

    DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd


    These tools try to automate or support tasks such as:

    Data Extraction (accessing different source databases)

    Data Cleansing (finding and resolving inconsistencies in the source data)

    Data Transformation (between different data formats, languages, etc.)

    Data Loading

    Replication (replicating source databases into the data warehouse)

    Analyzing & Checking of Data Quality (for correctness and completeness)

    Building derived data & views
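
    A minimal sketch of the first four tasks in the list above (extraction, cleansing, transformation, loading). The sample CSV, the country-code rule and the two date formats are assumptions made for illustration; a real ETL tool would read from source databases rather than an inline string.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical source extract; in practice this would come from a source database.
SOURCE_CSV = """customer,country,signup
alice,US,2019-07-28
bob,usa,28/07/2019
carol,,2019-07-29
"""

COUNTRY_FIXES = {"usa": "US", "us": "US"}  # assumed cleansing rule

def extract(text):
    """Data extraction: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Data cleansing: resolve inconsistent country codes, reject blanks."""
    clean, rejected = [], []
    for row in rows:
        code = row["country"].strip()
        if not code:
            rejected.append(row)
            continue
        fixed = COUNTRY_FIXES.get(code.lower(), code.upper())
        clean.append({**row, "country": fixed})
    return clean, rejected

def transform(rows):
    """Data transformation: convert both assumed date formats to ISO 8601."""
    out = []
    for row in rows:
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
            try:
                parsed = datetime.strptime(row["signup"], fmt)
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"unrecognized date: {row['signup']}")
        out.append((row["customer"], row["country"], parsed.date().isoformat()))
    return out

def load(conn, records):
    """Data loading: insert the prepared records into the target table."""
    conn.execute("CREATE TABLE customer (name TEXT, country TEXT, signup TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", records)
    conn.commit()
```

    Keeping the rejected rows separate, rather than silently dropping them, leaves something to inspect when checking data quality for completeness.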

    DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd


    Elements of a Data Warehouse

    DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd


    Loading the Warehouse

    Cleaning the data before it is loaded

    DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd


    These processes have been discussed in detail in the ETL section.

    Some important definitions:

    Data Scrubbing: http://www.wisegeek.com/what-is-data-scrubbing.htm

    Data Cleansing: http://www.wisegeek.com/what-is-data-cleansing.htm

    Row level security: http://www.securityfocus.com/infocus/1743

    Staging Types: http://esj.com/Columns/article.aspx?EditorialsID=55

    Technical Problems in Data Warehouse


    Managing large amounts of data:

    The explosion of data volume came about because the data warehouse required that both detail and history be mixed in the same environment.

    Large amounts of data need to be managed in many ways: through flexible addressability of data stored inside the processor and on disk storage, through indexing, through extensions of data, through the efficient management of overflow, and so forth. To be effective, the technology used must satisfy the requirements for both volume and efficiency.

    Index/Monitor Data:

    If data in the warehouse cannot be easily and efficiently indexed, the data warehouse will not be a success. Monitoring data warehouse data determines such factors as the following:

    If a reorganization needs to be done

    If an index is poorly structured

    If too much or not enough data is in overflow

    The statistical composition of the access of the data

    Available remaining space
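
    One way to monitor whether an index is actually serving a query is to ask the database for its query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module purely as an example; the table, index and helper names are hypothetical, and other DBMSs expose plans through their own commands.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON fact_sales (region)")

# Filter on the indexed column: the plan is expected to mention the index.
indexed_plan = query_plan(
    conn, "SELECT amount FROM fact_sales WHERE region = ?", ("NORTH",))

# Filter on a column with no index: SQLite has to scan the table instead.
unindexed_plan = query_plan(
    conn, "SELECT region FROM fact_sales WHERE amount > ?", (100.0,))
```

    Comparing plans for the most frequent queries gives a rough view of the access pattern and of which indexes are pulling their weight.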

    Technical Problems in Data Warehouse Contd


    Interfaces to many technologies:

    Data passes into the data warehouse from the operational environment and the ODS, and from the data warehouse into data marts, DSS applications, exploration and data mining warehouses, and alternate storage.

    This passage must be smooth and easy.

    The interface to different technologies requires several considerations:

    Does the data pass from one DBMS to another easily?

    Does it pass from one operating system to another easily?

    Does it change its basic format in passage (EBCDIC, ASCII, etc.)?
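
    The format question can be made concrete with a short sketch of an EBCDIC-to-ASCII conversion. It assumes the source system uses EBCDIC code page 037, which Python exposes as the "cp037" codec; other EBCDIC variants would need a different codec name.

```python
def ebcdic_to_ascii(raw: bytes) -> bytes:
    """Re-encode EBCDIC (code page 037, an assumption) bytes as ASCII bytes."""
    return raw.decode("cp037").encode("ascii")

# The same text has different byte values in the two encodings.
ebcdic_bytes = "HELLO 123".encode("cp037")
ascii_bytes = ebcdic_to_ascii(ebcdic_bytes)
```

    Byte-for-byte copying between such systems would corrupt character fields, so the conversion has to happen somewhere along the passage.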

    Technical Problems in Data Warehouse Contd


    Meta Data Management:

    The data warehouse operates under a heuristic, iterative development life cycle.

    To be effective, the user of the data warehouse must have access to meta data that is accurate and up-to-date.

    Several types of meta data need to be managed in the data warehouse: distributed meta data, central meta data, technical meta data, and business meta data.

    Technical Problems in Data Warehouse Contd


    Efficient Loading of Data

    Data is loaded into a data warehouse in two fundamental ways: a record at a time through a language interface, or en masse with a utility.

    Indexes must be efficiently loaded at the same time the data is loaded. As the burden of the volume of loading becomes an issue, the load is often parallelized.

    Another related approach to the efficient loading of very large amounts of data is staging the data prior to loading.

    As a rule, large amounts of data are gathered into a buffer area before being processed by extract/transform/load (ETL) software. The staged data is merged, perhaps edited, summarized, and so forth, before it passes into the ETL layer.
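
    The two loading styles can be contrasted in a minimal sketch. Here sqlite3's executemany inside a single transaction stands in for a bulk load utility, which is an assumption for illustration only; real warehouses typically ship dedicated loaders. Table and function names are hypothetical.

```python
import sqlite3

def make_target():
    """Create an in-memory target table with an index maintained during load."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact (id INTEGER, amount REAL)")
    conn.execute("CREATE INDEX idx_fact_id ON fact (id)")
    return conn

def load_record_at_a_time(conn, rows):
    """Language-interface style: one INSERT and one commit per record."""
    for row in rows:
        conn.execute("INSERT INTO fact VALUES (?, ?)", row)
        conn.commit()

def load_en_masse(conn, rows):
    """Bulk style: all records in one batch and one transaction."""
    with conn:  # commits once when the block exits without error
        conn.executemany("INSERT INTO fact VALUES (?, ?)", rows)

def row_count(conn):
    """Count the loaded rows."""
    return conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
```

    Both functions produce the same table contents; the difference lies in how many round trips and commits the load costs as volume grows.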

    Technical Problems in Data Warehouse Contd


    Lock Management:

    The lock manager ensures that two or more people are not updating the same record at the same time. But update is not done in the data warehouse; instead, data is stored in a series of snapshot records. When a change occurs, a new snapshot record is added, rather than an update being done.
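
    A minimal sketch of the snapshot idea: changes are recorded as new rows and existing rows are never updated. The table layout (customer_id, city, snapshot_date) and the function names are hypothetical, and dates are assumed to be ISO 8601 strings so that they sort correctly as text.

```python
import sqlite3

def open_snapshot_store():
    """Create an in-memory table that only ever receives inserts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE customer_snapshot (
        customer_id INTEGER, city TEXT, snapshot_date TEXT)""")
    return conn

def record_change(conn, customer_id, city, snapshot_date):
    """Add a new snapshot row; existing rows are never updated."""
    with conn:
        conn.execute("INSERT INTO customer_snapshot VALUES (?, ?, ?)",
                     (customer_id, city, snapshot_date))

def city_as_of(conn, customer_id, as_of_date):
    """Return the city from the latest snapshot on or before as_of_date."""
    row = conn.execute(
        """SELECT city FROM customer_snapshot
           WHERE customer_id = ? AND snapshot_date <= ?
           ORDER BY snapshot_date DESC LIMIT 1""",
        (customer_id, as_of_date)).fetchone()
    return row[0] if row else None
```

    Since no row is rewritten, readers never contend with writers over the same record, and history stays queryable for any past date.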

    Steps in Building a Data Warehouse:


    Identify key business drivers, sponsorship, risks, ROI

    Survey information needs, identify desired functionality, and define functional requirements for the initial subject area.

    Architect the long-term data warehousing architecture

    Evaluate and finalize DW tools and technology

    Conduct Proof-of-Concept

    Design target database schema

    Build data mapping, extract, transformation, cleansing and aggregation/summarization rules

    Build the initial data mart, using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases

    Maintain and administer data warehouse

    Representative DSS Tools


    Tool Category: Products

    ETL Tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder

    OLAP Server: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB

    OLAP Tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube

    Data Warehouse: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick

    Data Mining & Analysis: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools

    Business Intelligence


    How intelligent can you make your business processes?

    What insight can you gain into your business?

    How integrated can your business processes be?

    How much more interactive can your business be with customers, partners,

    employees and managers?

    What is Business Intelligence (BI)?


    Business Intelligence is a generalized term applied to a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help enterprise users make better business decisions.

    Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.

    An alternative way of describing BI is: the technology required to turn raw data into information to support decision-making within corporations and business processes.

    Why BI?


    BI technologies help bring decision-makers the data in a form they can quickly digest and apply to their decision making.

    BI turns data into information for managers, executives and, in general, anyone making decisions in a company.

    Companies want to use technology tactically to make their operations more effective and more efficient. Business intelligence can be the catalyst for that efficiency and effectiveness.

    Benefits


    The benefits of a well-planned BI implementation are going to be closely tied to the business objectives driving the project.

    Identify trends and anomalies in business operations more quickly, allowing for more accurate and timelier decisions.

    Deliver actionable insight and information to the right place with less effort.

    Identify and operate based on a single version of the truth, allowing all analysis to be completed on a core foundation with confidence.

    Business Intelligence Platform Requirements


    Data Warehouse Databases

    OLAP

    Data Mining

    Interfaces

    Build and Manage Capabilities

    The business intelligence platform should provide good integration across these technologies. It should be a coherent platform, not a set of diverse and heterogeneous technologies.

    Business Intelligence Components


    [Figure: Operational Data; Extract, Transform, Load; Data Warehouse; OLAP; Data Mining]

    Business Intelligence Architecture


    Business Intelligence Technologies


    Data Sources

    Paper, Files, Information Providers, Database Systems, OLTP

    Data Warehouses / Data Marts

    Data Exploration

    OLAP, DSS, EIS, Querying and Reporting

    Data Mining

    Information discovery

    Data Presentation

    Visualization Techniques

    Decision Making

    Increasing potential to support business decisions

    End User

    Business Analyst

    Data Analyst

    DB Admin