DWH Concentrated


    DWH Material

    Version 1.0

    REVISION HISTORY

    The following table reflects all changes to this document.

    Date Author / Contributor Version Reason for Change

    01-Nov-2004 1.0 Initial Document

    14-Sep-2010 1.1 Updated Document


    Detailed Design Document

    1 Introduction

1.1 Purpose

The purpose of this document is to provide detailed information about DWH concepts and Informatica based on real-time training.

    2 ORACLE

2.1 DEFINITIONS

Organizations can store data on various media and in different formats, such as a hard-copy document in a filing cabinet or data stored in electronic spreadsheets or in databases.

    A database is an organized collection of information.

To manage databases, you need a database management system (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the database on request. There are four main types of databases: hierarchical, network, relational, and, more recently, object-relational (ORDBMS).

    NORMALIZATION:

Some Oracle databases are modeled according to the rules of normalization, which are intended to eliminate redundancy. Applying the rules of normalization requires understanding your relationships and functional dependencies.

    First Normal Form:

    A row is in first normal form (1NF) if all underlying domains contain atomic values only.

    Eliminate duplicative columns from the same table.

Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

    Second Normal Form:

    An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and

    additionally:

Has no partial dependencies: if the primary key is composite, no non-key column depends on only part of it.

All the non-key columns are functionally dependent on the entire primary key.

    A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully

    dependent on the key.

    2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those

that are dependent on the whole key. An example is resolving many-to-many relationships using an intersecting entity.

    Third Normal Form:

    An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and

    additionally:

    Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level,

    all non-key fields are dependent on the primary key.


A row is in third normal form if and only if it is in second normal form and attributes that do not contribute to a description of the primary key are moved into a separate table. An example is creating look-up tables.

    Boyce-Codd Normal Form:

    Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A

    row is in Boyce Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are

    already in BCNF.

    Fourth Normal Form:

    An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and

    additionally:

    Has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a

    single one-to-many relationship.
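A small sketch of the idea, using an illustrative flat ORDERS design that is split so that customer attributes depend only on the customer key (all table and column names here are assumptions, not from the original text):

-- Unnormalized: customer details repeat on every order row
CREATE TABLE orders_flat (
  order_no    NUMBER,
  order_date  DATE,
  cust_name   VARCHAR2(60),
  cust_city   VARCHAR2(60)
);

-- Normalized (3NF): customer attributes move to their own table,
-- referenced from ORDERS by a foreign key
CREATE TABLE customers (
  cust_id    NUMBER PRIMARY KEY,
  cust_name  VARCHAR2(60),
  cust_city  VARCHAR2(60)
);

CREATE TABLE orders (
  order_no    NUMBER PRIMARY KEY,
  order_date  DATE,
  cust_id     NUMBER REFERENCES customers (cust_id)
);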

    ORACLE SET OF STATEMENTS:

Data Definition Language (DDL)

    Create

    Alter

    Drop

    Truncate

    Data Manipulation Language (DML)

    Insert

    Update

    Delete

    Data Querying Language (DQL)

    Select

    Data Control Language (DCL)

    Grant

    Revoke

    Transactional Control Language (TCL)

    Commit

    Rollback

    Save point
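One illustrative statement from each category, using the EMP table referred to throughout this document (the literal values are assumptions):

CREATE TABLE emp_bkp AS SELECT * FROM emp;                           -- DDL
INSERT INTO emp (empno, ename, sal) VALUES (1001, 'SMITH', 3000);    -- DML
SELECT ename, sal FROM emp WHERE deptno = 10;                        -- DQL
GRANT SELECT ON emp TO scott;                                        -- DCL
SAVEPOINT before_raise;                                              -- TCL
UPDATE emp SET sal = sal * 1.1 WHERE deptno = 10;
ROLLBACK TO before_raise;
COMMIT;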

    Syntaxes:

CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES;

CREATE DATABASE LINK CAASEDW CONNECT TO ITO_ASA IDENTIFIED BY exact123 USING 'CAASEDW';


    Materialized View syntax:

    CREATE MATERIALIZED VIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW

    REFRESH COMPLETE

    START WITH sysdate

    NEXT TRUNC(SYSDATE+1)+ 4/24

    WITH PRIMARY KEY

    AS

    select * from HWMD_MTH_ALL_METRICS_CURR_VW;

    Another Method to refresh:

    DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');

    Case Statement:

Select NAME,
       (CASE WHEN CLASS_CODE = 'Subscription' THEN ATTRIBUTE_CATEGORY
             ELSE TASK_TYPE
        END) TASK_TYPE,
       CURRENCY_CODE
From EMP;

    Decode()

Select empname, Decode(address, 'HYD', 'Hyderabad', 'Bang', 'Bangalore', address) as address from emp;

    Procedure:

CREATE OR REPLACE PROCEDURE Update_bal (
  cust_id_IN IN NUMBER,
  amount_IN  IN NUMBER DEFAULT 1) AS
BEGIN
  UPDATE account_tbl SET amount = amount_IN WHERE cust_id = cust_id_IN;
END;
/

    Trigger:

    CREATE OR REPLACE TRIGGER EMP_AUR

    AFTER/BEFORE UPDATE ON EMP

    REFERENCING

    NEW AS NEW

    OLD AS OLD

    FOR EACH ROW
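A minimal, runnable version of the skeleton above; the audit table EMP_AUDIT and its columns are assumptions added for illustration:

CREATE OR REPLACE TRIGGER EMP_AUR
AFTER UPDATE ON EMP
REFERENCING NEW AS NEW OLD AS OLD
FOR EACH ROW
BEGIN
  -- record the old and new salary whenever a row of EMP is updated
  INSERT INTO emp_audit (empno, old_sal, new_sal, changed_on)
  VALUES (:OLD.empno, :OLD.sal, :NEW.sal, SYSDATE);
END;
/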


Self Join

A self join joins a table to itself, for example matching each employee with his or her manager:

Ex: SQL> select worker.last_name, manager.last_name
from employees worker, employees manager
WHERE worker.manager_id = manager.employee_id ;

    Natural Join

    Natural join compares all the common columns.

    Ex: SQL> select empno,ename,job,dname,loc from emp natural join dept;

    Cross Join

This gives the cross product (Cartesian product) of the two tables.

    Ex: SQL> select empno,ename,job,dname,loc from emp cross join dept;

    Outer Join

    Outer join gives the non-matching records along with matching records.

    Left Outer Join

This displays all matching records, plus the records from the left-hand table that have no match in the right-hand table.

    Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept d on(e.deptno=d.deptno);

    Or

    SQL> select empno,ename,job,dname,loc from emp e,dept d where

    e.deptno=d.deptno(+);

    Right Outer Join

This displays all matching records, plus the records from the right-hand table that have no match in the left-hand table.

    Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);

    Or

    SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) = d.deptno;

    Full Outer Join

This displays all matching records along with the non-matching records from both tables.

    Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);

    OR

SQL> select p.part_id, s.supplier_name
from part p, supplier s
where p.supplier_id = s.supplier_id (+)
union
select p.part_id, s.supplier_name
from part p, supplier s
where p.supplier_id (+) = s.supplier_id;


What's the difference between a View and a Materialized View?

    View:

    Why Use Views?

    To restrict data access

    To make complex queries easy

    To provide data independence

    A simple view is one that:

    Derives data from only one table

    Contains no functions or groups of data

    Can perform DML operations through the view.

    A complex view is one that:

    Derives data from many tables

    Contains functions or groups of data

    Does not always allow DML operations through the view
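An illustrative pair of views over the EMP and DEPT tables used elsewhere in this document (a sketch, not part of the original examples):

-- Simple view: one table, no functions or groups; DML through it is possible
CREATE OR REPLACE VIEW emp_dept10_v AS
  SELECT empno, ename, sal FROM emp WHERE deptno = 10;

-- Complex view: join plus aggregation; DML through it is generally not possible
CREATE OR REPLACE VIEW dept_sal_v AS
  SELECT d.dname, COUNT(*) emp_cnt, SUM(e.sal) total_sal
  FROM emp e JOIN dept d ON e.deptno = d.deptno
  GROUP BY d.dname;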

A view has a logical existence, but a materialized view has a physical existence. Moreover, a materialized view can be indexed, analyzed and so on; everything that can be done with a table can also be done with a materialized view.

We can keep aggregated data in a materialized view, and we can schedule the MV to refresh, which a plain table cannot do. An MV can be created based on multiple tables.

    Materialized View:

In DWH, materialized views are very important because if we do aggregate calculations on the reporting side as per the business requirement, report performance is degraded. So to improve report performance, rather than doing the calculations and joins at the reporting side, we can put the same logic in the MV and then select the data directly from the MV without any joins or aggregations. We can also schedule the MV (materialized view) to refresh.
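A sketch of such an aggregate MV, assuming a SALES_FACT table with product_id and sale_amount columns (these names are assumptions):

CREATE MATERIALIZED VIEW sales_by_product_mv
REFRESH COMPLETE
START WITH SYSDATE
NEXT TRUNC(SYSDATE + 1) + 4/24
AS
SELECT product_id, SUM(sale_amount) total_amount, COUNT(*) sale_cnt
FROM sales_fact
GROUP BY product_id;

-- Reports can now read the pre-aggregated rows directly:
SELECT product_id, total_amount FROM sales_by_product_mv;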

    Inline view:

A select statement written in the FROM clause is an inline view.

Ex: Get the department-wise maximum salary along with empname and empno.

Select a.empname, a.empno, b.sal, b.deptno
From EMP a, (Select max(sal) sal, deptno from EMP group by deptno) b
Where a.sal = b.sal and a.deptno = b.deptno;


    What is the difference between view and materialized view?

A view has a logical existence and does not contain data, whereas a materialized view has a physical existence and is stored as a database object.

When we select from a view, the data is fetched from the base tables; when we select from a materialized view, the data is fetched from the materialized view itself.

A view cannot be scheduled to refresh, whereas a materialized view can be.

We cannot perform DML operations on a view, whereas we can perform DML operations on a materialized view.

We can keep aggregated data in a materialized view, and a materialized view can be created based on multiple tables.

    What is the Difference between Delete, Truncate and Drop?

    DELETE

    The DELETE command is used to remove rows from a table. A WHERE clause can be used to only remove some

    rows. If no WHERE condition is specified, all rows will be removed. After performing a DELETE operation you need to

    COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.

    TRUNCATE

TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUNCATE is faster and doesn't use as much undo space as a DELETE.

    DROP

The DROP command removes a table from the database. All the table's rows, indexes and privileges will also be

    removed. The operation cannot be rolled back.
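Illustrative usage of the three commands on an assumed EMP_BKP copy of EMP:

DELETE FROM emp_bkp WHERE deptno = 10;   -- removes some rows; can be rolled back
ROLLBACK;

TRUNCATE TABLE emp_bkp;                  -- removes all rows; cannot be rolled back

DROP TABLE emp_bkp;                      -- removes the table itself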

    Difference between Rowid and Rownum?

    ROWID

A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when the row is removed from the table. Its format is 'BBBBBBBB.RRRR.FFFF', where BBBBBBBB is the block number, RRRR is the slot (row) number, and FFFF is a file number.

    ROWNUM

    For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in

    which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1,

    the second has 2, and so on.

    You can use ROWNUM to limit the number of rows returned by a query, as in this example:

    SELECT * FROM employees WHERE ROWNUM < 10;


Rowid vs. Row-num:

Rowid is an Oracle internal ID that is allocated every time a new record is inserted in a table; it is unique and cannot be changed by the user. Row-num is a row number returned by a select statement.

Rowid is permanent, whereas row-num is temporary.

Rowid is a globally unique identifier for a row in a database, created when the row is inserted into the table and destroyed when it is removed. The row-num pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows.

    Order of where and having:

    SELECT column, group_function

    FROM table

    [WHERE condition]

    [GROUP BY group_by_expression]

    [HAVING group_condition]

    [ORDER BY column];

The WHERE clause cannot be used to restrict groups; you use the HAVING clause to restrict groups.

    Differences between where clause and having clause

Both the WHERE clause and the HAVING clause can be used to filter data.

The WHERE clause does not require a GROUP BY, whereas the HAVING clause must be used together with a GROUP BY.

The WHERE clause applies to individual rows, whereas the HAVING clause tests a condition on the group rather than on individual rows.

The WHERE clause is used to restrict rows, whereas the HAVING clause is used to restrict groups: a normal query is restricted with WHERE, and a group function is restricted with HAVING.

With the WHERE clause every record is filtered individually, whereas the HAVING clause works on aggregated records (group by functions).
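A short example on EMP showing both clauses together: WHERE filters individual rows before grouping, HAVING filters the resulting groups (the filter values are illustrative):

SELECT deptno, AVG(sal) avg_sal
FROM emp
WHERE job <> 'PRESIDENT'        -- row-level filter
GROUP BY deptno
HAVING AVG(sal) > 2000          -- group-level filter
ORDER BY deptno;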


    MERGE Statement

    You can use merge command to perform insert and update in a single command.

    Ex: Merge into student1 s1

    Using (select * from student2) s2

    On (s1.no=s2.no)

    When matched then

    Update set marks = s2.marks

    When not matched then

    Insert (s1.no, s1.name, s1.marks) Values (s2.no, s2.name, s2.marks);

    What is the difference between sub-query & co-related sub query?

    A sub query is executed once for the parent statement

    whereas the correlated sub query is executed once for each

    row of the parent query.

    Sub Query:

    Example:

Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade = 'A' or sal_grade = 'B');

Co-related Sub-query:

    Example:

    Find all employees who earn more than the average salary in their department.

SELECT last_name, salary, department_id FROM employees A
WHERE salary > (SELECT AVG(salary)
FROM employees B
WHERE B.department_id = A.department_id
GROUP BY B.department_id);

    EXISTS:

    The EXISTS operator tests for existence of rows in

    the results set of the subquery.

    Select dname from dept where exists

    (select 1 from EMP

    where dept.deptno= emp.deptno);


Sub-query vs. co-related sub-query:

A sub-query is executed once for the parent query, whereas a co-related sub-query is executed once for each row of the parent query.

Sub-query example:
Select * from emp where deptno in (select deptno from dept);

Co-related sub-query example:
Select e.* from emp e where sal >= (select avg(sal) from emp a where a.deptno = e.deptno group by a.deptno);

    Indexes:

1. Bitmap indexes are most appropriate for columns having low distinct values, such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however; in reality, a bitmap index is advisable for systems in which data is not frequently updated by many concurrent sessions. In fact, a bitmap index on a column with 100-percent unique values (a candidate column for the primary key) can be as efficient as a B-tree index.

2. When to create an index. You should create an index if:

A column contains a wide range of values.

A column contains a large number of null values.

One or more columns are frequently used together in a WHERE clause or a join condition.

The table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows.

3. By default, the index you create is a B-tree index (see the example below).
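A brief illustration of both kinds of index on EMP; the column and index names here are assumptions:

-- Default B-tree index on a selective joining column
CREATE INDEX emp_deptno_idx ON emp (deptno);

-- Bitmap index on a low-cardinality column such as GENDER
CREATE BITMAP INDEX emp_gender_bix ON emp (gender);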

Why are hints required?

It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all. Sometimes, however, the characteristics of the data in the database change rapidly, so that the optimizer (or, more accurately, its statistics) is out of date. In this case, a hint could help.

You should first get the explain plan of your SQL and determine what changes can be done to make the code operate without using hints if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance.

Analyzing tables: the ANALYZE statement

    The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be

    computed exactly, or estimated based on a specific number of rows, or a percentage of rows:

    ANALYZE TABLE employees COMPUTE STATISTICS;

    ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 15 PERCENT;

    EXEC DBMS_STATS.gather_table_stats('SCOTT', 'EMPLOYEES');


    Automatic Optimizer Statistics Collection

    By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB.

By default this job runs within maintenance windows between 10 P.M. and 6 A.M. on week nights and all day on

    weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure which gathers

    statistics for tables with either empty or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS

    procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that

    tables most urgently requiring statistics updates are processed first.

    Hint categories:

    Hints can be categorized as follows:

    ALL_ROWS

    One of the hints that 'invokes' the Cost based optimizer

    ALL_ROWS is usually used for batch processing or data warehousing systems.

    (/*+ ALL_ROWS */)

    FIRST_ROWS

    One of the hints that 'invokes' the Cost based optimizer

FIRST_ROWS is usually used for OLTP systems.

    (/*+ FIRST_ROWS */)

    CHOOSE

    One of the hints that 'invokes' the Cost based optimizer

This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

    Hints for Join Orders,

    Hints for Join Operations,

    Hints for Parallel Execution, (/*+ parallel(a,4) */) specify degree either 2 or 4 or 16

    Additional Hints

    HASH

    Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash

    index to find corresponding records. Therefore not suitable for < or > join conditions.

    /*+ use_hash */

    Use Hint to force using index

    SELECT /*+INDEX (TABLE_NAME INDEX_NAME) */ COL1,COL2 FROM TABLE_NAME

Select /*+ use_hash */ empno from ...

ORDERED - This hint forces tables to be joined in the order specified. If you know table X has fewer rows, then ordering it first may speed execution in a join.

PARALLEL (table, instances) - This specifies that the operation is to be done in parallel.

If an index cannot be created, we can go for /*+ parallel(table, 8) */ on SELECT and UPDATE statements, for example when the WHERE clause uses LIKE, NOT IN, > or <.

    Explain Plan:

The explain plan tells us whether the query is using indexes properly, what the cost of each table access is and whether it is doing a full table scan; based on these statistics we can tune the query.

    The explain plan process stores data in the PLAN_TABLE. This table can be located in the current schema or a shared


schema and is created in SQL*Plus as follows:

SQL> CONN sys/password AS SYSDBA
Connected
SQL> @$ORACLE_HOME/rdbms/admin/utlxplan.sql
SQL> GRANT ALL ON sys.plan_table TO public;

    SQL> CREATE PUBLIC SYNONYM plan_table FOR sys.plan_table;
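Once PLAN_TABLE is available, a typical way to display a statement's plan is EXPLAIN PLAN together with DBMS_XPLAN (a sketch; the query shown is illustrative):

EXPLAIN PLAN FOR
SELECT e.ename, d.dname
FROM emp e, dept d
WHERE e.deptno = d.deptno;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);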

What is your tuning approach if a SQL query is taking a long time? Or how do you tune a SQL query?

If a query is taking a long time, first run the query through Explain Plan; the explain plan process stores data in the PLAN_TABLE.

It gives us the execution plan of the query, for example whether the query is using the relevant indexes on the joining columns or whether indexes to support the query are missing.

If the joining columns don't have an index, the query will do a full table scan, and with a full table scan the cost will be higher. In that case we create indexes on the joining columns and run the query again; it should give better performance. We also need to analyze the tables if they were last analyzed long ago. The ANALYZE statement can be used to gather statistics for a specific table, index or cluster using

ANALYZE TABLE employees COMPUTE STATISTICS;

If we still have a performance issue then we use HINTS; a hint is nothing but a clue. We can use hints like

    ALL_ROWS

    One of the hints that 'invokes' the Cost based optimizer

    ALL_ROWS is usually used for batch processing or data warehousing systems.

    (/*+ ALL_ROWS */)

    FIRST_ROWS

    One of the hints that 'invokes' the Cost based optimizer

FIRST_ROWS is usually used for OLTP systems.

    (/*+ FIRST_ROWS */)

    CHOOSE

    One of the hints that 'invokes' the Cost based optimizer

This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

    HASH

    Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash

    index to find corresponding records. Therefore not suitable for < or > join conditions.

    /*+ use_hash */

    Hints are most useful to optimize the query performance.

    Store Procedure:

    What are the differences between stored procedures and triggers?

A stored procedure is normally used for performing tasks, but a trigger is normally used for tracing and auditing logs.

A stored procedure has to be called explicitly by the user in order to execute, but a trigger is invoked implicitly based on the events defined on the table.

A stored procedure can run independently, but a trigger has to be part of a DML event on the table.


A stored procedure can be executed from a trigger, but a trigger cannot be executed from a stored procedure.

Stored procedures can have parameters, but a trigger cannot have any parameters.

Stored procedures are compiled collections of programs or SQL statements in the database. Using a stored procedure we can access and modify data present in many tables, and a stored procedure is not associated with any particular database object. Triggers, on the other hand, are event-driven special procedures which are attached to a specific database object, say a table.

Stored procedures are not run automatically; they have to be called explicitly by the user. Triggers get executed when the particular event associated with them gets fired.

Packages:

A package is a group of related procedures and functions, together with the cursors and variables they use. Packages provide a method of encapsulating related procedures, functions, and associated cursors and variables together as a unit in the database; for example, a package might contain several procedures and functions that process related transactions.
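A minimal sketch of a package specification and body that groups two related routines (all names here are illustrative, not from the original text):

CREATE OR REPLACE PACKAGE emp_pkg AS
  PROCEDURE give_raise (p_empno NUMBER, p_pct NUMBER);
  FUNCTION  get_sal   (p_empno NUMBER) RETURN NUMBER;
END emp_pkg;
/

CREATE OR REPLACE PACKAGE BODY emp_pkg AS
  -- raise an employee's salary by the given percentage
  PROCEDURE give_raise (p_empno NUMBER, p_pct NUMBER) IS
  BEGIN
    UPDATE emp SET sal = sal * (1 + p_pct / 100) WHERE empno = p_empno;
  END give_raise;

  -- return the current salary of an employee
  FUNCTION get_sal (p_empno NUMBER) RETURN NUMBER IS
    v_sal emp.sal%TYPE;
  BEGIN
    SELECT sal INTO v_sal FROM emp WHERE empno = p_empno;
    RETURN v_sal;
  END get_sal;
END emp_pkg;
/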

    Triggers:

    Oracle lets you define procedures called triggers that run implicitly when an INSERT, UPDATE, or DELETE statement

    is issued against the associated table

    Triggers are similar to stored procedures. A trigger stored in the database can include SQL and PL/SQL

    Types of Triggers

    This section describes the different types of triggers:

    Row Triggers and Statement Triggers

    BEFORE and AFTER Triggers

    INSTEAD OF Triggers

    Triggers on System Events and User Events

    Row Triggers

    A row trigger is fired each time the table is affected by the triggering statement. For example, if an UPDATE

    statement updates multiple rows of a table, a row trigger is fired once for each row affected by the UPDATE

    statement. If a triggering statement affects no rows, a row trigger is not run.

    BEFORE and AFTER Triggers

When defining a trigger, you can specify the trigger timing: whether the trigger action is to be run before or after the triggering statement. BEFORE and AFTER apply to both statement and row triggers.

    BEFORE and AFTER triggers fired by DML statements can be defined only on tables, not on views.


    Difference between Trigger and Procedure

A trigger does not need to be executed manually; triggers are fired automatically, running implicitly when an INSERT, UPDATE, or DELETE statement is issued against the associated table. A stored procedure, on the other hand, has to be executed manually.

Differences between stored procedures and functions

A stored procedure may or may not return values, and it can return more than one value using OUT arguments; a function must return at least one value, and it returns a single value.

A stored procedure is used to implement business logic and to process tasks, whereas a function is mainly used to compute values.

A stored procedure is a pre-compiled statement, whereas a function is not a pre-compiled statement.

A stored procedure cannot be invoked from SQL statements (e.g. SELECT), whereas a function can be invoked from SQL statements (e.g. SELECT).

A stored procedure can affect the state of the database using COMMIT, whereas a function cannot affect the state of the database.

A stored procedure is stored as pseudo-code (compiled form) in the database, whereas a function is parsed and compiled at runtime. See the example below.
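A minimal function of the kind described, callable from a SELECT statement (names are illustrative):

CREATE OR REPLACE FUNCTION annual_sal (p_sal NUMBER, p_comm NUMBER DEFAULT 0)
RETURN NUMBER IS
BEGIN
  RETURN (p_sal * 12) + NVL(p_comm, 0);
END;
/

SELECT ename, annual_sal(sal, comm) yearly_pay FROM emp;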

    Data files Overview:

A tablespace in an Oracle database consists of one or more physical datafiles. A datafile can be associated with only one tablespace and only one database.

    Table Space:

    Oracle stores data logically in tablespaces and physically in datafiles associated with the corresponding tablespace.

A database is divided into one or more logical storage units called tablespaces. Tablespaces are divided into logical units of storage called segments.

    Control File:

A control file contains information about the associated database that is required for access by an instance, both at startup and during normal operation. Control file information can be modified only by Oracle; no database administrator or user can edit a control file.

2.2 IMPORTANT QUERIES

1. Get duplicate rows from the table:

    Select empno, count (*) from EMP group by empno having count (*)>1;

    2. Remove duplicates in the table:


    Delete from EMP where rowid not in (select max (rowid) from EMP group by empno);

3. The query below transposes columns into rows. Sample data:

Name No Add1 Add2

    abc 100 hyd bang

    xyz 200 Mysore pune

    Select name, no, add1 from A

    UNION

    Select name, no, add2 from A;

4. The query below transposes rows into columns:

select

    emp_id,

    max(decode(row_id,0,address))as address1,

    max(decode(row_id,1,address)) as address2,

    max(decode(row_id,2,address)) as address3

    from (select emp_id,address,mod(rownum,3) row_id from temp order by emp_id )

    group by emp_id

    Other query:

    select

    emp_id,

    max(decode(rank_id,1,address)) as add1,

    max(decode(rank_id,2,address)) as add2,

    max(decode(rank_id,3,address))as add3

    from

    (select emp_id,address,rank() over (partition by emp_id order by emp_id,address )rank_id from temp )

    group by

    emp_id

5. Rank query:

Select empno, ename, sal, r from (select empno, ename, sal, rank() over (order by sal desc) r from EMP);

6. Dense rank query: The DENSE_RANK function works like the RANK function except that it assigns consecutive ranks:


Select empno, ename, sal, r from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp);

7. Top 5 salaries by using rank:

Select empno, ename, sal, r from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp) where r <= 5;


In general a Data Warehouse is used on an enterprise level and a Data Mart is used on a business division/department level.

    Subject Oriented:

    Data that gives information about a particular subject instead of about a company's ongoing operations.

    Integrated:

    Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

    Time-variant:

    All data in the data warehouse is identified with a particular time period.

    Non-volatile:

    Data is stable in a data warehouse. More data is added but data is never removed.

    What is a DataMart?

A data mart is usually sponsored at the department level and developed with a specific detail or subject in mind; a data mart is a subset of the data warehouse with a focused objective.

    What is the difference between a data warehouse and a data mart?

    In terms of design data warehouse and data mart are almost the same.

    In general a Data Warehouse is used on an enterprise level and a Data Marts is used on a business

    division/department level.

A data mart only contains data specific to a particular subject area.

    Difference between data mart and data warehouse

A data mart is usually sponsored at the department level and developed with a specific issue or subject in mind; it is a data warehouse with a focused objective. A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of decision making.

A data mart is used on a business division/department level, whereas a data warehouse is used on an enterprise level.

A data mart is a subset of data from a data warehouse, built for specific user groups; a data warehouse is an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.

By providing decision makers with only a subset of data from the data warehouse, privacy, performance and clarity objectives can be attained. The main objective of a data warehouse is to provide an integrated environment and a coherent picture of the business at a point in time.

What is a factless fact table?

A fact table that contains only primary keys from the dimension tables and does not contain any measures is called a factless fact table.


    What is a Schema?

Graphical representation of the data structure.

First phase in the implementation of a Universe.

    What are the most important features of a data warehouse?

DRILL DOWN, DRILL ACROSS, graphs, pie charts, dashboards and TIME HANDLING.

To be able to drill down/drill across is the most basic requirement of an end user in a data warehouse. Drilling down most directly addresses the natural end-user need to see more detail in a result. Drill down should be as generic as possible because there is absolutely no good way to predict a user's drill-down path.

    What does it mean by grain of the star schema?

    In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail

    provided by a star schema.

    It is usually given as the number of records per key within the table. In general, the grain of the fact table is the

    grain of the star schema.

    What is a star schema?

    Star schema is a data warehouse schema where there is only one "fact table" and many denormalized dimension

    tables.

The fact table contains primary keys from all the dimension tables and other numeric columns of additive, numeric facts.
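A minimal star-schema sketch with one fact table and two denormalized dimensions (all table and column names are illustrative):

CREATE TABLE date_dim (
  date_key       NUMBER PRIMARY KEY,
  calendar_date  DATE,
  month_name     VARCHAR2(20),
  year_no        NUMBER
);

CREATE TABLE product_dim (
  product_key     NUMBER PRIMARY KEY,
  product_name    VARCHAR2(60),
  product_family  VARCHAR2(60)
);

CREATE TABLE sales_fact (
  date_key     NUMBER REFERENCES date_dim (date_key),
  product_key  NUMBER REFERENCES product_dim (product_key),
  qty_sold     NUMBER,          -- additive fact
  sale_amount  NUMBER(12,2)     -- additive fact
);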

    What is a snowflake schema?

Unlike a star schema, a snowflake schema contains normalized dimension tables in a tree-like structure with many

    nesting levels.

    Snowflake schema is easier to maintain but queries require more joins.


What is the difference between a snowflake and a star schema?

The star schema is the simplest data warehouse schema, whereas the snowflake schema is a more complex data warehouse model than a star schema.

In a star schema each of the dimensions is represented in a single table and there are no hierarchies between dimension tables; in a snowflake schema at least one hierarchy exists between dimension tables.

Both contain a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design; if a dimension is normalized, we say it is a snowflaked design.

In a star schema only one join establishes the relationship between the fact table and any one of the dimension tables; in a snowflake schema, since there are relationships between the dimension tables, many joins are needed to fetch the data.

A star schema optimizes performance by keeping queries simple and providing fast response time: all the information about each level is stored in one row. Snowflake schemas normalize dimensions to eliminate redundancy; the result is more complex queries and reduced query performance.

It is called a star schema because the diagram resembles a star; it is called a snowflake schema because the diagram resembles a snowflake.

    What is Fact and Dimension?

    A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point forgetting at the facts. Dimensions are things of interest to the business.

    A set of level properties that describe a specific aspect of a business, used for analyzing the factual measures.

    What is Fact Table?

    A Fact Table in a dimensional model consists of one or more numeric facts of importance to a business. Examples of

    facts are as follows:

    the number of products sold


    the value of products sold

    the number of products produced

    the number of service calls received

    What is Factless Fact Table?

    Factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual

    facts. They are often used to record events or coverage information.

    Common examples of factless fact tables include:

Identifying product promotion events (to determine promoted products that didn't sell)

    Tracking student attendance or registration events

    Tracking insurance-related accident events

    Types of facts?

    There are three types of facts:

    Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

    Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the

    fact table, but not the others.

    Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in

    the fact table.

    What is Granularity?

    Principle: create fact tables with the most granular data possible to support analysis of the business process.

    In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail

    provided by a star schema.

    It is usually given as the number of records per key within the table. In general, the grain of the fact table is the

    grain of the star schema.

Facts: facts must be consistent with the grain; all facts are at a uniform grain.

Watch for facts of mixed granularity, e.g. total sales for a day versus a monthly total.

    Dimensions: each dimension associated with fact table must take on a single value for each fact row.

    Each dimension attribute must take on one value.

    Outriggers are the exception, not the rule.


    Dimensional Model

    What is slowly Changing Dimension?

    Slowly changing dimensions refers to the change in dimensional attributes over time.

    An example of slowly changing dimension is a Resource dimension where attributes of a particular employee change

    over time like their designation changes or dept changes etc.

    What is Conformed Dimension?

    Conformed Dimensions (CD): these dimensions are something that is built once in your model and can be reused

    multiple times with different fact tables. For example, consider a model containing multiple fact tables, representing

different data marts. Now look for a dimension that is common to these fact tables. In this example let's consider that the product dimension is common and hence can be reused by creating shortcuts and joining the different fact tables. Some examples are the time dimension, customer dimension, and product dimension.

    What is Junk Dimension?

    A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to

    any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk

    attributes. A good example would be a trade fact in a company that brokers equity trades.

When you consolidate lots of small dimensions, instead of having hundreds of small dimension tables that each hold only a few records and clutter your database with mini identifier tables, all records from all these small dimension tables are loaded into ONE dimension table, and we call this table a junk dimension table (since we are storing all the junk in this one table). For example, a company might have a handful of manufacturing plants, a handful of order types, and so on, and we can consolidate them in one dimension table called a junk dimension table.

It is a dimension table which is used to keep junk attributes.

    What is De Generated Dimension?

An item that is in the fact table but is stripped of its description, because the description belongs in a dimension table, is referred to as a degenerated dimension. Since it looks like a dimension but really sits in the fact table and has been degenerated of its description, it is called a degenerated dimension.

Degenerated dimension: a dimension which is located in the fact table is known as a degenerated dimension.

    Dimensional Model:

    A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables:

    dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table

    records all the "fact", or measures.

Data modeling: There are three levels of data modeling: conceptual, logical, and physical. This section explains the difference among the three, the order in which each one is created, and how to go from one level to the other.

    Conceptual Data Model

    Features of conceptual data model include:

    Includes the important entities and the relationships among them.

    No attribute is specified.

    No primary key is specified.

    At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model

Features of a logical data model include:

    Includes all entities and relationships among them.

    All attributes for each entity are specified.

The primary key for each entity is specified.

    Foreign keys (keys identifying the relationship between different entities) are specified.

    Normalization occurs at this level.

    At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how

    they will be physically implemented in the database.

    In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a

    single step (deliverable).

    The steps for designing the logical data model are as follows:

1. Identify all entities.

2. Specify primary keys for all entities.

3. Find the relationships between different entities.

4. Find all attributes for each entity.


    5. Resolve many-to-many relationships.

    6. Normalization.

    Physical Data Model

    Features of physical data model include:

Specification of all tables and columns.

    Foreign keys are used to identify relationships between tables.

Denormalization may occur based on user requirements.

    Physical considerations may cause the physical data model to be quite different from the logical data model.

    At this level, the data modeler will specify how the logical data model will be realized in the database schema.

    The steps for physical data model design are as follows:

    1. Convert entities into tables.

    2. Convert relationships into foreign keys.

    3. Convert attributes into columns.

http://www.learndatamodeling.com/dm_standard.htm

Modeling is an efficient and effective way to represent the organization's needs; it provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling.

The differences between a logical data model and a physical data model are shown below.

Logical vs Physical Data Modeling

A logical data model represents business information and defines business rules; a physical data model represents the physical implementation of the model in a database.

Logical object -> Physical object:

Entity -> Table

Attribute -> Column

Primary Key -> Primary Key Constraint

Alternate Key -> Unique Constraint or Unique Index

Inversion Key Entry -> Non Unique Index

Rule -> Check Constraint, Default Value

Relationship -> Foreign Key

Definition -> Comment


    Below is the simple data model

    Below is the sq for project dim


EDIII Logical Design

[Entity-relationship diagram. Fact tables: ACW_PCBA_APPROVAL_F, ACW_DF_APPROVAL_F, ACW_DF_FEES_F (part, supply channel, product, organization, buyer and approval attributes, plus D_CREATED/D_LAST_UPDATED audit columns). Dimension tables: ACW_USERS_D, ACW_SUPPLY_CHANNEL_D, ACW_PRODUCTS_D, ACW_PART_TO_PID_D, ACW_ORGANIZATION_D. Staging tables: ACW_PCBA_APPROVAL_STG, ACW_DF_APPROVAL_STG, ACW_DF_FEES_STG. Related objects: EDW_TIME_HIERARCHY, PID for DF Fees, Users.]


EDII Physical Design

[Physical data model diagram showing the same tables with column data types: ACW_PCBA_APPROVAL_F, ACW_DF_FEES_F, ACW_DF_APPROVAL_F, ACW_USERS_D, ACW_SUPPLY_CHANNEL_D, ACW_PRODUCTS_D, ACW_PART_TO_PID_D, ACW_ORGANIZATION_D, ACW_PCBA_APPROVAL_STG, ACW_DF_APPROVAL_STG, ACW_DF_FEES_STG, EDW_TIME_HIERARCHY, PID_for_DF_Fees, Users. Surrogate keys are NUMBER(10); flags and codes are CHAR; names and descriptions are VARCHAR2; amounts are FLOAT(12); dates and audit columns are DATE.]

    Types of SCD Implementation:

    Type 1 Slowly Changing Dimension

    In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other

    words, no history is kept.

    In our example, recall we originally have the following table:

    Customer Key Name State

    1001 Christina Illinois


After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

    Customer Key Name State

    1001 Christina California

    Advantages:

    - This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of

    the old information.

    Disadvantages:

    - All history is lost. By applying this methodology, it is not possible to trace back in history. For

    example, in this case, the company would not be able to know that Christina lived in Illinois before.

    - Usage:

    About 50% of the time.

    When to use Type 1:

    Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of

    historical changes.

    Type 2 Slowly Changing Dimension

    In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information.

Therefore, both the original and the new record will be present. The new record gets its own primary key.

    In our example, recall we originally have the following table:

    Customer Key Name State

    1001 Christina Illinois

    After Christina moved from Illinois to California, we add the new information as a new row into the table:

    Customer Key Name State

    1001 Christina Illinois

    1005 Christina California

    Advantages:

    - This allows us to accurately keep all historical information.

    Disadvantages:

    - This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to

    start with, storage and performance can become a concern.

    - This necessarily complicates the ETL process.


    Usage:

    About 50% of the time.

    When to use Type 2:

    Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical

    changes.
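A sketch of a Type 2 load in SQL, assuming a CUSTOMER_DIM with effective/expiry dates and a current-row flag, a CUSTOMER_STG staging table, and a CUSTOMER_DIM_SEQ sequence (all of these names are assumptions):

-- Step 1: expire the current row when a tracked attribute (here, STATE) has changed
UPDATE customer_dim d
SET    d.expiry_date = SYSDATE,
       d.current_flag = 'N'
WHERE  d.current_flag = 'Y'
AND    EXISTS (SELECT 1
               FROM   customer_stg s
               WHERE  s.customer_id = d.customer_id
               AND    s.state <> d.state);

-- Step 2: insert a new current version with a fresh surrogate key
INSERT INTO customer_dim
       (customer_key, customer_id, name, state, effective_date, expiry_date, current_flag)
SELECT customer_dim_seq.NEXTVAL, s.customer_id, s.name, s.state, SYSDATE, NULL, 'Y'
FROM   customer_stg s
WHERE  NOT EXISTS (SELECT 1
                   FROM   customer_dim d
                   WHERE  d.customer_id = s.customer_id
                   AND    d.current_flag = 'Y'
                   AND    d.state = s.state);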

    Type 3 Slowly Changing Dimension

    In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one

    indicating the original value, and one indicating the current value. There will also be a column that indicates when

    the current value becomes active.

    In our example, recall we originally have the following table:

    Customer Key Name State

    1001 Christina Illinois

    To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

    Customer Key

    Name

    Original State

    Current State

    Effective Date

    After Christina moved from Illinois to California, the original information gets updated, and we have the following

    table (assuming the effective date of change is January 15, 2003):

    Customer Key Name Original State Current State Effective Date

    1001 Christina Illinois California 15-JAN-2003

    Advantages:

    - This does not increase the size of the table, since new information is updated.

    - This allows us to keep some part of history.

    Disadvantages:

    - Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina

    later moves to Texas on December 15, 2003, the California information will be lost.

    Usage:

    Type 3 is rarely used in actual practice.

    When to use Type 3:

    Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track

    historical changes, and when such changes will only occur for a finite number of time.


    What is Staging area why we need it in DWH?

If the target and source databases are different and the target table volume is high (it contains some millions of records), then without a staging table we would need to design the Informatica mapping with a lookup to find out whether the record exists or not in the target table; since the target has huge volumes, it is costly to build the cache and it will hurt performance.

If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert/update; this approach gives good performance.

It avoids a full table scan to determine inserts/updates on the target. We can also create indexes on the staging tables; since these tables are designed for a specific application, this will not impact any other schemas/users.

While processing flat files into the data warehouse we can also perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.

Since it is a one-to-one mapping from ODS to staging, we do truncate and reload.

We can create indexes in the staging area so that the source qualifier performs best.

If we have the staging area there is no need to rely on Informatica transformations to know whether the record exists or not; the outer-join approach is sketched below.
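A sketch of that outer-join check, assuming staging table CUSTOMER_STG and target table CUSTOMER_DIM in the same database (names are illustrative):

SELECT s.customer_id,
       s.name,
       CASE WHEN t.customer_id IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS load_action
FROM   customer_stg s
LEFT OUTER JOIN customer_dim t
       ON s.customer_id = t.customer_id;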

    Data cleansing

Weeding out unnecessary or unwanted things (characters, spaces, etc.) from incoming data to make it more meaningful and informative.

    Data merging

    Data can be gathered from heterogeneous systems and put together

    Data scrubbing

Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to the end user.

Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules.

    ODS (Operational Data Store):

    My understanding of an ODS is that it is a replica of the OLTP system; the need for it is to reduce the burden on

    the production system (OLTP) while fetching data for loading targets. Hence it is a mandatory requirement for every

    warehouse.

    So every day do we transfer data to ODS from OLTP to keep it up to date?

    OLTP is a sensitive database; it should not be hit with many heavy SELECT statements, as that may impact its performance, and

    if something goes wrong while fetching data from OLTP into the data warehouse it will directly impact the business.

    ODS is a replication of OLTP.

    The ODS is usually refreshed through Oracle jobs.

    It enables management to gain a consistent picture of the business.

    What is a surrogate key?

    A surrogate key is a substitution for the natural primary key. It is a unique identifier or number ( normally created

    by a database sequence generator ) for each record of a dimension table that can be used for the primary key to the

    table.

    A surrogate key is useful because natural keys may change.


    What is the difference between a primary key and a surrogate key?

    A primary key is a special constraint on a column or set of columns. A primary key constraint ensures that the

    column(s) so designated have no NULL values, and that every value is unique. Physically, a primary key is

    implemented by the database system using a unique index, and all the columns in the primary key must have been

    declared NOT NULL. A table may have only one primary key, but it may be composite (consist of more than one

    column).

    A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or

    natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are

    all called candidate keys. So a surrogate is a candidate key. A table could actually have more than one surrogate

    key, although this would be unusual. The most common type of surrogate key is an incrementing integer, such as an

    auto increment column in MySQL, or a sequence in Oracle, or an identity column in SQL Server.
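In Oracle, for example, the surrogate key is typically populated from a sequence (a minimal sketch; the table and sequence names are assumed for this example):

    CREATE SEQUENCE customer_sk_seq START WITH 1 INCREMENT BY 1;

    INSERT INTO customer_dim (customer_key, customer_id, name, state)
    VALUES (customer_sk_seq.NEXTVAL, 'C-1001', 'Christina', 'Illinois');

Here customer_key is the surrogate key, while customer_id remains the natural (business) key.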

    4 ETL-INFORMATICA

    4.1 Informatica Overview

    Informatica is a powerful Extraction, Transformation, and Loading tool and has been deployed at GE Medical Systems for data warehouse development in the Business Intelligence team. Informatica comes with the following clients to perform various tasks.

    Designer - used to develop transformations/mappings.

    Workflow Manager / Workflow Monitor - replace the Server Manager; used to create sessions/workflows/worklets and to run, schedule, and monitor mappings for data movement.

    Repository Manager - used to maintain folders, users, permissions, locks, and repositories.

    Integration Service - the workhorse of the domain. The Informatica Server is the component responsible for the actual work of moving data according to the mappings developed and placed into operation. It contains several distinct parts such as the Load Manager, Data Transformation Manager, Reader, and Writer.

    Repository Service - Informatica client tools and the Informatica Server connect to the repository database over the network through the Repository Server.

    Informatica Transformations:

    Mapping: A mapping is the Informatica object which contains a set of transformations, including source and target. It

    looks like a pipeline.

    Mapplet:

    Mapplet is a set of reusable transformations. We can use this mapplet in any mapping within the Folder.

    A mapplet can be active or passive depending on the transformations in the mapplet. Active mapplets contain one or

    more active transformations. Passive mapplets contain only passive transformations.

    When you add transformations to a mapplet, keep the following restrictions in mind:

    If you use a Sequence Generator transformation, you must use a reusable Sequence Generator

    transformation.

    If you use a Stored Procedure transformation, you must configure the Stored Procedure Type to be Normal.

    You cannot include the following objects in a mapplet:

    o Normalizer transformations

    o COBOL sources

    o XML Source Qualifier transformations

    o XML sources

    o Target definitions

    o Other mapplets


    A mapplet must contain Input transformations and/or source definitions with at least one port connected to a

    transformation in the mapplet, and at least one Output transformation with at least one port connected to a

    transformation in the mapplet.

    Input Transformation: Input transformations are used to create a logical interface to a mapplet in order to allow

    data to pass into the mapplet.

    Output Transformation: Output transformations are used to create a logical interface from a mapplet in order to

    allow data to pass out of a mapplet.

    System Variables

    $$$SessStartTime returns the initial system date value on the machine hosting the Integration Service when the

    server initializes a session. $$$SessStartTime returns the session start time as a string value. The format of the

    string depends on the database you are using.

    Session: A session is a set of instructions that tells the Informatica Server how to move data from sources to targets.

    Workflow: A workflow is a set of instructions that tells the Informatica Server how to execute tasks such as sessions,

    email notifications, and commands. In a workflow, multiple sessions can be included to run in a parallel or sequential manner.

    Source Definition: The Source Definition is used to logically represent database table or Flat files.

    Target Definition: The Target Definition is used to logically represent a database table or file in the Data

    Warehouse / Data Mart.

    Aggregator: The Aggregator transformation is used to perform aggregate calculations on a group basis.

    Expression: The Expression transformation is used to perform arithmetic calculations on a row-by-row basis; it is

    also used for conversions (for example, string to integer) and to concatenate two columns.

    Filter: The Filter transformation is used to filter data based on a single condition and pass the remaining rows to the next

    transformation.

    Router: The Router transformation is used to route data based on multiple conditions and pass it to the next

    transformations.

    It has three groups

    1) Input group

    2) User defined group

    3) Default group

    Joiner: The Joiner transformation is used to join two sources residing in different databases or different locations

    like flat file and oracle sources or two relational tables existing in different databases.

    Source Qualifier: The Source Qualifier transformation is used to describe in SQL the method by which data is to be

    retrieved from a source application system; it is also

    used to join two relational sources residing in the same database.

    What is Incremental Aggregation?

    A. Whenever a session is created for a mapping that contains an Aggregator transformation, the session option for Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data

    through the mapping and uses historical cache data to perform the new aggregation calculations incrementally.


    Lookup: Lookup transformation is used in a mapping to look up data in a flat file or a relational table, view, or

    synonym.

    Two types of lookups:

    1) Connected

    2) Unconnected

    Differences between connected lookup and unconnected lookup

    1. Connected Lookup: connected to the pipeline and receives its input values directly from the pipeline.
       Unconnected Lookup: not connected to the pipeline; it receives input values from the result of a :LKP expression in another transformation, passed as arguments.

    2. Connected Lookup: the same lookup cannot be used more than once in a mapping.
       Unconnected Lookup: can be called more than once within the mapping.

    3. Connected Lookup: can return multiple columns from the same row.
       Unconnected Lookup: designate one return port (R); it returns one column from each row.

    4. Connected Lookup: can be configured to use a dynamic or static cache.
       Unconnected Lookup: cannot be configured to use a dynamic cache; it uses a static cache only.

    5. Connected Lookup: passes multiple output values to another transformation; link the lookup/output ports to another transformation.
       Unconnected Lookup: passes one output value to another transformation; the lookup/output/return port passes the value to the transformation calling the :LKP expression.

    6. Connected Lookup: supports user-defined default values.
       Unconnected Lookup: does not support user-defined default values.

    7. Connected Lookup: the cache includes the lookup source columns in the lookup condition and the lookup source columns that are output ports.
       Unconnected Lookup: the cache includes all lookup/output ports in the lookup condition and the lookup/return port.
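A minimal sketch of how an unconnected lookup is called from an Expression transformation, assuming an unconnected Lookup named LKP_GET_CUST_KEY with CUST_ID as its input port and the surrogate key as its return port (the names are illustrative):

    -- output port CUST_KEY in an Expression transformation
    IIF(ISNULL(:LKP.LKP_GET_CUST_KEY(CUST_ID)), -1, :LKP.LKP_GET_CUST_KEY(CUST_ID))

The lookup is invoked only where the expression calls it, which is why the same unconnected lookup can be reused from several transformations.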

    Lookup Caches:

    When configuring a lookup cache, you can specify any of the following options:

    Persistent cache

    Recache from lookup source

    Static cache

    Dynamic cache

    Shared cache

    Dynamic cache: When you use a dynamic cache, the PowerCenter Server updates the lookup cache as it passes

    rows to the target.

    If you configure a Lookup transformation to use a dynamic cache, you can only use the equality operator (=) in the

    lookup condition.


    NewLookupRow Port will enable automatically.

    NewLookupRow Value Description

    0 The PowerCenter Server does not update or insert the row in the cache.

    1 The PowerCenter Server inserts the row into the cache.

    2 The PowerCenter Server updates the row in the cache.

    Static cache: This is the default cache; the PowerCenter Server doesn't update the lookup cache as it passes rows to

    the target.

    Persistent cache: If the lookup table does not change between sessions, configure the Lookup transformation to

    use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session,

    eliminating the time required to read the lookup table.

    Differences between dynamic lookup and static lookup

    1. Dynamic lookup cache: the cache memory is refreshed as soon as a record is inserted, updated, or deleted through the lookup, so later rows in the same run see the change.
       Static lookup cache: the cache memory is not refreshed even if a record is inserted or updated in the lookup table; it is refreshed only in the next session run.

    2. Dynamic lookup cache: when we configure a Lookup transformation to use a dynamic lookup cache, only the equality operator (=) can be used in the lookup condition, and the NewLookupRow port is enabled automatically.
       Static lookup cache: this is the default cache.

    3. The best example of where we need a dynamic cache: suppose the first record and the last record in the source are for the same customer, but there is a change in the address. The mapping has to insert the first record and update the target table with the last record.
       If we use a static lookup, the first record goes to the lookup, finds no match in the cache based on the condition, and returns a null value, so the Router sends that record to the insert flow. But this record is still not available in the cache memory, so when the last record comes to the lookup it again finds no match, returns a null value, and goes to the insert flow through the Router, although it is supposed to go to the update flow, because the cache did not get refreshed when the first record was inserted into the target table.
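When a dynamic cache is used, the NewLookupRow port described earlier can drive the insert and update flows directly. A minimal sketch of the Router group filter conditions (group names are illustrative):

    -- Router group "INSERT": the row was added to the dynamic cache
    NewLookupRow = 1

    -- Router group "UPDATE": the row already existed and was updated in the cache
    NewLookupRow = 2

Rows with NewLookupRow = 0 need no change and can be left to the default group.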

    Normalizer: The Normalizer transformation is used to generate multiple records from a single record based on

    columns (it transposes column data into rows).

    We can use the Normalizer transformation, instead of a Source Qualifier, to process COBOL sources.


    Rank: The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank

    transformation to return the largest or smallest numeric value in a port or group.

    The Designer automatically creates a RANKINDEX port for each Rank transformation.

    Sequence Generator: The Sequence Generator transformation is used to generate numeric key values in

    sequential order.

    Stored Procedure: The Stored Procedure transformation is used to execute externally stored database procedures

    and functions. It is used to perform the database level operations.

    Sorter: The Sorter transformation is used to sort data in ascending or descending order according to a specified sort

    key. You can also configure the Sorter transformation for case-sensitive sorting, and specify whether the output rows

    should be distinct. The Sorter transformation is an active transformation. It must be connected to the data flow.

    Union Transformation:

    The Union transformation is a multiple input group transformation that you can use to merge data from multiple

    pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION

    ALL SQL statement to combine the results from two or more SQL statements. Similar to the UNION ALL statement,

    the Union transformation does not remove duplicate rows. All input groups should have a similar structure (matching ports).

    Update Strategy: The Update Strategy transformation is used to flag rows for insert, update, delete, or reject, i.e. to indicate the DML operation.

    We can implement update strategy in two levels:

    1) Mapping level

    2) Session level.

    Session level properties will override the mapping level properties.
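At the mapping level the row is flagged with an expression inside the Update Strategy transformation. A minimal sketch, assuming a LOAD_FLAG port that carries 'I' for new rows (DD_INSERT and DD_UPDATE are the standard Informatica constants):

    -- Update Strategy expression
    IIF(LOAD_FLAG = 'I', DD_INSERT, DD_UPDATE)

At the session level the equivalent setting is the "Treat source rows as" property (Insert, Update, Delete, or Data driven).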

    Aggregator Transformation:

    Transformation type:

    Active

    Connected

    The Aggregator transformation performs aggregate calculations, such as averages and sums. The Aggregator

    transformation is unlike the Expression transformation, in that you use the Aggregator transformation to perform

    calculations on groups. The Expression transformation permits you to perform calculations on a row-by-row basis

    only.

    Components of the Aggregator Transformation:

    The Aggregator is an active transformation, changing the number of rows in the pipeline. The Aggregator

    transformation has the following components and options:

    Aggregate cache: The Integration Service stores data in the aggregate cache until it completes aggregate

    calculations. It stores group values in an index cache and row data in the data cache.

    Group by port: Indicate how to create groups. The port can be any input, input/output, output, or variable port.

    When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.

    Sorted input: Select this option to improve session performance. To use sorted input, you must pass data to the

    Aggregator transformation sorted by group by port, in ascending or descending order.

    Aggregate Expressions:

    The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can

    include conditional clauses and non-aggregate functions. It can also include one aggregate function nested within


    another aggregate function, such as:

    MAX (COUNT (ITEM))

    The result of an aggregate expression varies depending on the group by ports used in the transformation

    Aggregate Functions

    Use the following aggregate functions within an Aggregator transformation. You can nest one aggregate function

    within another aggregate function.

    The transformation language includes the following aggregate functions:

    (AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, SUM, VARIANCE, and STDDEV)

    When you use any of these functions, you must use them in an expression within an Aggregator transformation.
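For example, an aggregate function can take a conditional clause as a filter argument (a minimal sketch; SALES and QUANTITY are illustrative port names):

    -- output port in an Aggregator grouped by STORE_ID
    SUM(SALES, QUANTITY > 0)

Only rows where QUANTITY > 0 contribute to the sum for each group.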

    Performance Tips in Aggregator

    Use sorted input to increase the mapping performance, but the data must be sorted before it is sent to the Aggregator

    transformation.

    Filter the data before aggregating it.

    If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to

    reduce unnecessary aggregation.

    SQL Transformation

    Transformation type:

    Active/Passive

    Connected

    The SQL transformation processes SQL queries midstream in a pipeline. You can insert, delete, update, and retrieve

    rows from a database. You can pass the database connection information to the SQL transformation as input data at

    run time. The transformation processes external SQL scripts or SQL queries that you create in an SQL editor. The

    SQL transformation processes the query and returns rows and database errors.

    For example, you might need to create database tables before adding new transactions. You can create an SQL

    transformation to create the tables in a workflow. The SQL transformation returns database errors in an output port.

    You can configure another workflow to run if the SQL transformation returns no errors.

    When you create an SQL transformation, you configure the following options:

    Mode. The SQL transformation runs in one of the following modes:

    Script mode. The SQL transformation runs ANSI SQL scripts that are externally located. You pass a script name to

    the transformation with each input row. The SQL transformation outputs one row for each input row.

    Query mode. The SQL transformation executes a query that you define in a query editor. You can pass strings or

    parameters to the query to define dynamic queries or change the selection parameters. You can output multiple rows

    when the query has a SELECT statement.

    Database type. The type of database the SQL transformation connects to.

    Connection type. Pass database connection information to the SQL transformation or use a connection object.
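A minimal sketch of a query-mode query, assuming an input port CUST_ID on the SQL transformation; in query mode, input ports can be bound into the statement as parameters, written here as ?CUST_ID? (the table and port names are illustrative):

    SELECT cust_name, cust_state
    FROM   customer
    WHERE  customer_id = ?CUST_ID?

One output row is produced for each row the SELECT returns.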

    Script Mode


    Transaction control expression

    Enter the transaction control expression in the Transaction Control Condition field. The transaction control expression

    uses the IIF function to test each row against the condition. Use the following syntax for the expression:

    IIF (condition, value1, value2)

    The expression contains values that represent actions the Integration Service performs based on the return value of

    the condition. The Integration Service evaluates the condition on a row-by-row basis. The return value determines

    whether the Integration Service commits, rolls back, or makes no transaction changes to the row. When the

    Integration Service issues a commit or roll back based on the return value of the expression, it begins a new

    transaction. Use the following built-in variables in the Expression Editor when you create a transaction control

    expression:

    TC_CONTINUE_TRANSACTION. The Integration Service does not perform any transaction change for this row.

    This is the default value of the expression.

    TC_COMMIT_BEFORE. The Integration Service commits the transaction, begins a new transaction, and writes the current row to the target. The current row is in the new transaction.

    TC_COMMIT_AFTER. The Integration Service writes the current row to the target, commits the transaction, and

    begins a new transaction. The current row is in the committed transaction.

    TC_ROLLBACK_BEFORE. The Integration Service rolls back the current transaction, begins a new transaction,

    and writes the current row to the target. The current row is in the new transaction.

    TC_ROLLBACK_AFTER. The Integration Service writes the current row to the target, rolls back the transaction,

    and begins a new transaction. The current row is in the rolled back transaction.

    For example, in a Transaction Control transformation, create the following transaction control expression to commit data when the

    Integration Service encounters a new order entry date:

    IIF(NEW_DATE = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)

    What is the difference between joiner and lookup?

    1. Joiner: on multiple matches it returns all matching records.
       Lookup: it returns either the first record, the last record, any value, or an error value, depending on the multiple-match setting.

    2. Joiner: we cannot configure it to use a persistent cache, shared cache, uncached mode, or dynamic cache.
       Lookup: we can configure it to use a persistent cache, shared cache, uncached mode, or dynamic cache.

    3. Joiner: we cannot override the query.
       Lookup: we can override the lookup query, for example to fetch data from multiple tables.

    4. Joiner: we can perform an outer join in the Joiner transformation.
       Lookup: we cannot perform an outer join in the Lookup transformation.

    5. Joiner: we cannot use relational operators (i.e. <, >, <=, >=) in the join condition; only the equality operator is supported.
       Lookup: relational operators can be used in the lookup condition.


    What is the difference between source qualifier and lookup?

    1. Source Qualifier: it pushes all the matching records into the pipeline.
       Lookup: we can restrict whether to return the first value, the last value, or any value on multiple matches.

    2. Source Qualifier: there is no concept of a cache.
       Lookup: we rely on the cache concept.

    3. Source Qualifier: when both the source and the lookup table are in the same database, we can use the source qualifier (for example, with a join in the SQL override).
       Lookup: when the source and the lookup table exist in different databases, we need to use a lookup.

    Have you done any Performance tuning in informatica?

    1) Yes. One of my mappings was taking 3-4 hours to process 40 million rows into a staging table. There were no

    transformations inside the mapping; it was a 1-to-1 mapping, so there was nothing to optimize in the mapping itself.

    So I created session partitions using key range partitioning on the effective date column. It improved performance a lot:

    rather than 4 hours, it ran in 30 minutes for the entire 40 million rows. With partitions, the DTM creates

    multiple reader and writer threads.

    2) There was one more scenario where I got very good performance at the mapping level. Rather than using a

    Lookup transformation, if we can do an outer join in the source qualifier query override, this gives

    good performance when both the lookup table and the source are in the same database. If the lookup table has huge

    volumes, then creating the cache is costly.

    3) Also, optimizing the mapping to use fewer transformations always gives good

    performance.

    4) If any mapping takes a long time to execute, first look into the source and target statistics in

    the Workflow Monitor for the throughput, and then find out where exactly the bottleneck is by looking at the busy percentage

    in the session log; this tells you which transformation is taking more time. If the source query is the

    bottleneck, it shows up at the end of the session log as the query issued to the database, which means there

    is a performance issue in the source query and we need to tune it.

    Informatica Session Log shows busy percentage

    If we look into the session log, it shows the busy percentage; based on that we need to find out where the bottleneck is.

    ***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] ****

    Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG] has

    completed: Total Run Time = [7.193083] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000]

    Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG]

    has completed. The total run time was insufficient for any meaningful statistics.

    Thread [WRITER_1_*_1] created for [the write stage] of partition point [ACW_PCBA_APPROVAL_F1,

    ACW_PCBA_APPROVAL_F] has completed: Total Run Time = [0.806521] secs, Total Idle Time = [0.000000] secs,

    Busy Percentage = [100.000000]

    Suppose I have to load 40 lakh (4 million) records into the target table and the workflow

    is taking about 10-11 hours to finish. I have already increased

    the cache size to 128 MB.

    There are no Joiners, just Lookups

    and Expression transformations.


    Ans:

    (1) If the lookups have many records, try creating indexes

    on the columns used in the lookup condition, and try

    increasing the lookup cache. If this doesn't improve

    the performance and the target has any indexes, disable

    them in the target pre-load and re-enable them in the

    target post-load.

    (2) Three things you can do with respect to it:

    1. Increase the commit interval (by default it is 10000).

    2. Use bulk mode instead of normal mode in case your target doesn't have

    primary keys, or use pre- and post-session SQL to

    implement the same (depending on the business requirement).

    3. Use key partitioning to load the data faster.

    (3) If your target contains key constraints and indexes, they slow

    the loading of data. To improve the session performance in

    this case, drop the constraints and indexes before you run the

    session and rebuild them after completion of the session.
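A minimal sketch of the drop/rebuild approach using pre- and post-session SQL against an Oracle target, assuming an index named TGT_SALES_IDX (the name is illustrative):

    -- Pre-session SQL: take the index out of play so the load is not slowed down
    ALTER INDEX tgt_sales_idx UNUSABLE;

    -- Post-session SQL: rebuild it once the load has completed
    ALTER INDEX tgt_sales_idx REBUILD;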

    What is Constraint based loading in Informatica?

    By setting the Constraint Based Loading property at the session level in the Config Object tab, we can load data into parent

    and child relational tables (primary key - foreign key).

    Generally what it does is load the data first into the parent table and then into the child table.

    What is the use of Shortcuts in Informatica?

    If we copy source definitions, target definitions, or mapplets from a Shared folder to any other folder, they

    become shortcuts.

    Let's assume we have imported some source and target definitions into a shared folder, and after that we use those

    source and target definitions in other folders as shortcuts in some mappings.

    If any modification occurs in the backend (database) structure, like adding new columns or dropping existing columns

    in either the source or the target, and we re-import it into the shared folder, those new changes are automatically reflected in all

    folders/mappings wherever those source or target definitions are used.

    Target Update Override

    If we don't have a primary key on the target table, we can still perform updates using the Target Update Override option. By

    default, the Integration Service updates target tables based on key values. However, you can override the default

    UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.

    Overriding the WHERE Clause

    You can override the WHERE clause to include non-key columns. For example, you might want to update records for

    employees named Mike Smith only. To do this, you edit the WHERE clause as follows:

    UPDATE T_SALES SET DATE_SHIPPED =:TU.DATE_SHIPPED,

    TOTAL_SALES = :TU.TOTAL_SALES WHERE EMP_NAME = :TU.EMP_NAME and

    EMP_NAME = 'MIKE SMITH'

    If you modify the UPDATE portion of the statement, be sure to use :TU to specify ports.


    How do you perform incremental logic or Delta or CDC?

    Incremental means: suppose today we processed 100 records; for tomorrow's run you need to extract whatever

    records were newly inserted or updated after the previous run, based on the last-updated timestamp of yesterday's run. This

    process is called incremental or delta extraction.

    Approach_1: Using SETMAXVARIABLE()

    1) First create a mapping variable ($$Pre_sess_max_upd) and assign an initial value of an old date (01/01/1940).

    2) Then override the source qualifier query to fetch only LAT_UPD_DATE >= $$Pre_sess_max_upd (the mapping variable); see the sketch below.

    3) In an Expression transformation, assign the maximum last_upd_date value to $$Pre_sess_max_upd (the mapping variable) using

    SETMAXVARIABLE().

    4) Because it is a variable, the repository stores the maximum last_upd_date value, so in the next run the source

    qualifier query fetches only the records updated or inserted after the previous run.
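A minimal sketch of the two pieces above, assuming a source table SRC_ORDERS and a variable stored in MM/DD/YYYY HH24:MI:SS format (both are assumptions for this example):

    -- Source Qualifier SQL override: pull only rows changed since the previous successful run
    SELECT *
    FROM   src_orders
    WHERE  lat_upd_date >= TO_DATE('$$Pre_sess_max_upd', 'MM/DD/YYYY HH24:MI:SS')

    -- Expression transformation (variable port): remember the newest timestamp seen in this run
    SETMAXVARIABLE($$Pre_sess_max_upd, LAT_UPD_DATE)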

    Approach_2: Using a parameter file

    1 First create a mapping parameter ($$Pre_sess_start_tmst) and assign an initial value of an old date

    (01/01/1940) in the parameter file.

    2 Then override the source qualifier query to fetch only LAT_UPD_DATE >= $$Pre_sess_start_tmst

    (the mapping parameter).

    3 Update the mapping parameter ($$Pre_sess_start_tmst) value in the parameter file using a shell script or

    another mapping after the first session completes successfully.

    4 Because it is a mapping parameter, we need to update the value in the parameter file every time after completion of the main session.

    Approach_3: Using Oracle control tables

    1 First create two control tables, cont_tbl_1 and cont_tbl_2, each with the columns

    session_st_time and wf_name.

    2 Then insert one record in each table with session_st_time = 1/1/1940 and the workflow name.

    3 Create two stored procedures: the first updates cont_tbl_1 with the session start time; set the Stored

    Procedure Type property to Source Pre-load.

    4 For the second stored procedure, set the Stored Procedure Type property to Target Post-load; this procedure

    updates session_st_time in cont_tbl_2 from cont_tbl_1.

    5 Then override the source qualifier query to fetch only LAT_UPD_DATE >= (SELECT session_st_time FROM

    cont_tbl_2 WHERE wf_name = 'actual workflow name').

    SCD Type-II Effective-Date Approach

    We have a dimension in the current project called the resource dimension. Here we are maintaining

    history to keep track of SCD changes.

    To maintain history in this slowly changing dimension (the resource dimension), we followed the SCD Type-II

    effective-date approach.

    My resource dimension structure would be eff-start-date, eff-end-date, the surrogate key, and the source columns.

    Whenever I do an insert into the dimension, I populate eff-start-date with sysdate, eff-end-date with a future date, and the surrogate key from a sequence.

    If the record is already present in my dimension but there is a change in the source data, in that case what I

    need to do is:


    Update the previous record's eff-end-date with sysdate and insert the new version as a new record with the source data, as sketched below.
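A minimal SQL sketch of that expire-and-insert step, assuming a RESOURCE_DIM table keyed by the business key RESOURCE_ID, a far-future end date, and a sequence RESOURCE_SK_SEQ (all names are illustrative):

    -- 1) Expire the current version of the changed row
    UPDATE resource_dim
    SET    eff_end_date = SYSDATE
    WHERE  resource_id  = :resource_id
    AND    eff_end_date = TO_DATE('31-DEC-9999', 'DD-MON-YYYY');

    -- 2) Insert the new version with a fresh surrogate key
    INSERT INTO resource_dim (resource_sk, resource_id, resource_name, eff_start_date, eff_end_date)
    VALUES (resource_sk_seq.NEXTVAL, :resource_id, :resource_name, SYSDATE,
            TO_DATE('31-DEC-9999', 'DD-MON-YYYY'));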

    Informatica design to implement the SCD Type-II effective-date approach

    Once you fetch the record from source qualifier. We will send it to lookup to find out whether t