2 Printout Dwh Q A


  • 8/6/2019 2 Printout Dwh Q A

    1/40

    Q1. What are data marts?

Ans: Data marts are smaller data warehouses.

A data mart is a subset of the data warehouse fact and summary data that provides users with information specific to their requirements.

Data marts can be considered of three types: 1. Dependent, 2. Independent, 3. Hybrid.

The categorization is based primarily on the data source that feeds the data mart.

1. Dependent Data Mart:
a. Source is the data warehouse. Dependent data marts rely on the data warehouse for context.
b. The extraction, transformation, and transportation (ETT) process is easy. Dependent data marts draw data from a central data warehouse that has already been created. Thus the main effort in building a mart, the data cleansing and extraction, has already been performed. The dependent data mart simply requires data to be moved from one database to another.
c. The data mart is part of the enterprise plan. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication cost resulting from local access to data relevant to a specific department.

2. Independent Data Mart: Independent data marts are stand-alone systems built from scratch that draw data directly from operational and/or external sources of data.
a. Sources are operational systems and external sources.
b. The ETT process is difficult. Because independent data marts draw data from unclean or inconsistent data sources, efforts are directed towards error processing and integration of data.
c. The data marts are built to satisfy analytical needs.

The creation of independent data marts is often driven by the need for a quick solution to analytical demands.

3. Hybrid Data Marts: Hybrid data marts are a combination of dependent and independent data marts. They contain data from the data warehouse as well as data from external sources.


    Q2. What are Slowly Changing Dimensions?

Ans: A type of CDC (change data capture) process, which applies to cases where an attribute of a record changes over time. Here the data is not only growing, but also changing.

a. SCD Type 1: If any update happens at source, it should be updated at target. If any insert happens at source, it should be inserted at target (Update + Insert + RWO). If beyond the time frame, no update; if within the time frame, Update + Insert.

b. SCD Type 2: In Type 2, we are creating history. Any update at source, we treat as an insert at target. The previous data will still be present, and the updated data will also be inserted. In this situation, primary keys will be repeated, so to maintain uniqueness we need a surrogate key. With versioning, when the first record is reinserted, the first record will be version 0, the second record version 1, the third version 2, and so on. With a flag, only two values, 0 and 1, are present: the recent record carries 1 and the recent previous record carries 0, and this keeps changing as the record updates. With time modification, only a start value and an end value are present.

c. SCD Type 3: Vertical versioning. SCD Type 2 is horizontal versioning; SCD Type 3 is column-level maintenance. Suppose 1 lakh products have their price increased by Rs 50. If we do SCD Type 1 or SCD Type 2 (row-wise maintenance), then 1 lakh rows will become 2 lakh. That creates a lot of problems, so it is better to add columns like old_value and new_value. By this, we can save a lot of space. But it is not always advisable, as it needs a change in the database table structure.

SCD Type 1: The new record replaces the original record. Only one record exists in the database: current data.

SCD Type 2: A new record is added into the dimension table. Two records exist in the database: current data and previous history data.

SCD Type 3: The original data is modified to include new data. One record exists in the database: new information is attached to old information in the same row.

    The "Slowly Changing Dimension" problem is a common one


particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key  Name       State
1001          Christina  Illinois

At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and what the data model and the data look like for each of them. Finally, we compare and contrast the three alternatives.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key  Name       State
1001          Christina  California

Advantages:
- This is the easiest way to handle the Slowly Changing


Dimension problem, since there is no need to keep track of the old information.

Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage: About 50% of the time.

When to use Type 1: Type 1 Slowly Changing Dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
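As a rough sketch of the Type 1 overwrite described above (in Python, with an invented in-memory dimension; column names follow the Christina example):

```python
# Hypothetical in-memory customer dimension keyed by Customer Key.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}

def scd_type1_update(dim, key, **changes):
    """SCD Type 1: overwrite the attributes in place -- no history survives."""
    dim[key].update(changes)

scd_type1_update(customer_dim, 1001, state="California")
print(customer_dim[1001]["state"])  # California; "Illinois" is gone for good
```

A real implementation would issue an UPDATE against the dimension table, but the effect is the same: the old attribute value is unrecoverable.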

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key  Name       State
1001          Christina  Illinois
1005          Christina  California

Advantages:
- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

    Usage:


    About 50% of the time.

When to use Type 2: Type 2 Slowly Changing Dimension should be used when it is necessary for the data warehouse to track historical changes.
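The Type 2 mechanics described above (a new row per change, a surrogate key, a current flag) can be sketched as follows; the table is a plain Python list and all names are illustrative:

```python
from itertools import count

surrogate = count(1001)  # invented surrogate-key sequence
customer_dim = []        # each dict is one row of the dimension table

def scd_type2_upsert(dim, customer_id, name, state):
    """Expire the current row for this customer, then insert the new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current_flag"] == 1:
            row["current_flag"] = 0  # the recent previous version keeps flag 0
    dim.append({"surrogate_key": next(surrogate), "customer_id": customer_id,
                "name": name, "state": state, "current_flag": 1})

scd_type2_upsert(customer_dim, "C-1001", "Christina", "Illinois")
scd_type2_upsert(customer_dim, "C-1001", "Christina", "California")
# Two rows now exist; only the California row carries current_flag = 1.
```

Note how the natural key ("C-1001") repeats while each row still gets a unique surrogate key, which is exactly why Type 2 needs surrogate keys.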

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

    Customer Key Name Original State Current State Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-JAN-2003

Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.

Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.


Usage: Type 3 is rarely used in actual practice.

When to use Type 3: Type 3 Slowly Changing Dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
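A minimal sketch of the Type 3 columns above (original value, current value, effective date), with invented Python names:

```python
from datetime import date

# One Type 3 row: history lives in extra columns, not extra rows.
row = {"customer_key": 1001, "name": "Christina",
       "original_state": "Illinois", "current_state": "Illinois",
       "effective_date": None}

def scd_type3_update(r, new_state, effective):
    """Overwrite only the current value; the original column is left alone."""
    r["current_state"] = new_state
    r["effective_date"] = effective

scd_type3_update(row, "California", date(2003, 1, 15))
# original_state is still "Illinois", current_state is now "California".
scd_type3_update(row, "Texas", date(2003, 12, 15))
# The California value is now lost -- the Type 3 limitation noted above.
```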

    Q3. What is Snowflake Schema?

Ans: Any star schema with a flake in it; flake means extension. In a snowflake schema, each dimension has a primary dimension table to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table. In a snowflake schema, dimensions may be interlinked or may have one-to-many relationships with other tables.

The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

    The snowflake schema is an extension of the star schema,where each point of the star explodes into more points.

    Q4. What is a Star schema?

Ans: A join between a central fact table and dimension tables is known as a star schema design. In a star schema design, all dimensions are linked directly to a fact table. A simple star schema consists of one fact table; a complex star schema has more than one fact table.

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star.

    Q5. What is the difference between OLTP and OLAP?


Ans: Difference between data warehouse and OLTP:

1. Workload: Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.

OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

2. Data Modifications: A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end-user of a data warehouse does not directly update the data warehouse.

In OLTP systems, end-users routinely issue individual data modification statements to the database. The OLTP database is always up to date and reflects the current state of each business transaction.

3. Schema design: Data warehouses often use denormalized or partially denormalized schemas (such as the star schema) to optimize query performance.

OLTP systems often use fully normalized schemas to optimize update/insert/delete performance and to guarantee

    data consistency.

4. Typical Operations: A typical data warehouse query scans thousands or millions of records, for example: "Find the total sales for all customers last month."

A typical OLTP operation accesses only a handful of records: "Retrieve the current order for this customer."

5. Historical Data: A data warehouse usually stores many months or years of data. This is to support historical analysis.

OLTP systems usually store data from only a few weeks or months. An OLTP system stores only as much historical data as needed to successfully meet the requirements of the current transaction.

    Q6. What is a dimension table?


Ans: Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive, and you can use them as the headers of the result set. Examples are customers or products.

    Q7. What is a fact table?

Ans: Fact tables are the large tables in your warehouse schemas that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit. A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are called summary tables. A fact table usually contains facts with the same level of aggregation. Facts can be additive, semi-additive, or non-additive. The primary key of the fact table is usually a composite key made up of all of its foreign keys. There are three types of facts:

1. Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

2. Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

3. Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date
Store
Product
Sales_Amount


The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

    Say we are a bank with the following fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add balances up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.
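The additive/semi-additive distinction can be checked on a toy version of the bank fact table above (all figures invented):

```python
# Toy fact rows: (date, account, current_balance)
facts = [
    ("2003-01-01", "A", 100),
    ("2003-01-01", "B", 50),
    ("2003-01-02", "A", 120),
    ("2003-01-02", "B", 50),
]

# Summing across ACCOUNTS for one day is meaningful (semi-additive fact):
bank_total_jan1 = sum(bal for d, acct, bal in facts if d == "2003-01-01")
print(bank_total_jan1)  # 150 -- the bank's total balance on Jan 1

# Summing the SAME account THROUGH TIME is not meaningful:
account_a_sum = sum(bal for d, acct, bal in facts if acct == "A")
print(account_a_sum)  # 220 -- arithmetically valid, but not a useful business number
```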

    Q8. What is ETL?

Ans: ETL stands for extract, transform, and load: the processes that enable companies to move data from multiple sources, reformat and cleanse it, and load it into another database, a data mart, or a data warehouse for analysis, or onto another operational system to support a business process.
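A toy extract-transform-load pipeline; the source rows, cleansing rules, and target list are all illustrative stand-ins for real systems:

```python
# Extract: raw comma-separated rows from a hypothetical source file.
source_rows = [" alice ,il", "BOB,ca", "carol,IL"]

def extract(rows):
    return [r.split(",") for r in rows]

def transform(records):
    # Cleanse and reformat: trim whitespace, normalise case.
    return [{"name": name.strip().title(), "state": state.strip().upper()}
            for name, state in records]

def load(records, target):
    target.extend(records)  # stand-in for a bulk insert into the warehouse

warehouse = []
load(transform(extract(source_rows)), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'state': 'IL'}
```

Real ETL tools (Q16 below lists several) wrap exactly these three stages with scheduling, error handling, and bulk loading.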

    Q9. What are conformed dimensions?

Ans: Conformed dimensions are dimensions that are common to more than one data mart or business process across the organization. Common examples are time, organization structure, and product. Conformed dimensions should be modeled at the lowest level of granularity used; this way they can service any fact table that needs to use them, either at their natural granularity or at a higher level. They provide a consistent view of the business, including


attributes and keys, to all the data marts in the organization. Conformed dimensions can either be implemented as a single physical table or may be a replicated table used by each data mart.

    Q10. What is ODS?

Ans: An ODS stores tactical data from production systems that is subject-oriented and integrated to address operational needs. The detailed current information in the ODS is transactional in nature, updated frequently (at least daily), and is only held for a short period of time.

This is the database used to capture daily business activities and thus is a normalized database. The ODS captures day-to-day transactions, and you can generate reports on the ODS.

An Operational Data Store is the operational system whose function is to capture the transactions of the business. The source system should be thought of as being outside the data warehouse, since the data warehouse system has no control over the context and format of the data. The data in these systems can be in many formats, from flat files to hierarchical and relational databases.

Objectives of an Operational Data Store are to:
1. Integrate information from the production systems.
2. Relieve the production systems of reporting and analytical demands.
3. Provide access to current data.

Q11. What is a degenerate dimension?

Ans: A degenerate dimension is data that is dimensional in nature, but stored in a fact table. For example, if you have a dimension that only has order number and order line number, you would have a 1:1 relationship with the fact table. Do you want to have two tables with a billion rows, or one table with a billion rows? Therefore, this would be a degenerate dimension, and order number and order line number would be stored in the fact table.

When a fact table has a dimensional value stored in it, that value is called a degenerate dimension.

When the cardinality of a column value is high, instead of maintaining a separate dimension and having the overhead of making a join with the fact table, degenerate dimensions can


be built. For example, in a sales fact table, invoice number is a degenerate dimension. Since the invoice number is not tied to an order header table, there is no need for the invoice number to join to a dimension table; hence it is referred to as a degenerate dimension.

A degenerate dimension is a dimension which has only a single attribute. This dimension is typically represented as a single field in a fact table. Degenerate dimensions are the fastest way to group similar transactions. Degenerate dimensions are used when fact tables represent transactional data. They can be used as the primary key for the fact table, but they cannot act as foreign keys.

    Q12. What is dimensional modeling?

Ans: Warehouse design is known as dimensional modeling. The aim is to optimize query performance.

Dimensional modeling is the name of a logical design technique often used for data warehouses. ER is a logical design technique that seeks to remove the redundancy in data.

DM is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional, and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. This characteristic star-like structure is often called a star join. A fact table, because it has a multipart primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or facts, that occur for the combination of keys that define each record. The most useful facts in a fact table are numeric and additive. Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.

    Q13. What is a lookup table?


Ans: Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

Dimension tables are also known as lookup tables or reference tables.

    Q14. What is conformed fact?

Ans: Two facts are conformed if they have the same name, units, and definition. If two facts do not represent the same thing to the business, then they must be given different names.

Q15. What is a junk dimension? What is the difference between a junk dimension and a degenerate dimension?

Ans: A junk dimension is a collection of random transactional codes, flags, and/or text attributes that are unrelated to any particular dimension.

A number of very small dimensions might be lumped together to form a single dimension, a junk dimension, where the attributes are not closely related. Grouping random flags and text attributes in a dimension and moving them to a separate sub-dimension is known as a junk dimension.

A junk dimension is a dimension that is created and stored in a separate location for future use. While developing a dimensional model, there are lots of miscellaneous flags and indicators that don't logically belong to the core dimension tables. You can neither put them into a fact table, as they would unnecessarily increase its size, nor create a dimension table for them, as they would


dramatically increase the size of the dimension tables. The third option is to create a junk dimension and put all these flags and indicators into it, where they can be referred to for future use rather than being deleted, even though they have little significance individually in the dimensional model.

    Q16. What are the various ETL tools in the market?

Ans:
1. Informatica PowerCenter
2. Ascential DataStage
3. Essbase Hyperion
4. Ab Initio
5. BO Data Integrator
6. SAS ETL
7. MS DTS
8. Oracle OWB
9. Pervasive Data Junction
10. Cognos DecisionStream
11. Hummingbird
12. Sunopsis

Q17. What is the main functional difference between ROLAP, MOLAP, and HOLAP?

Ans: In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

    Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built,


it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.


HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

The functional difference between these is how the information is stored. In all cases, the users see the data as a cube of dimensions and facts.

ROLAP: Detailed data is stored in a relational database in 3NF, star, or snowflake form. Queries must summarize data on the fly.

MOLAP: Data is stored in multidimensional form, with dimensions and facts stored together. You can think of this as a persistent cube. The level of detail is determined by the intersection of the dimension hierarchies.

HOLAP: Data is stored using a combination of relational and multidimensional storage. Summary data might persist as a cube, while detail data is stored relationally, but transitioning between the two is invisible to the end-user.

    Q18. What is a level of granularity of a fact table?

Ans: Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design, you can decide to store the sales data for each transaction. Level of granularity would then mean how much detail you are willing to put into each transactional fact: product sales with respect to each minute, or data aggregated up to a coarser level.

Q19. Which columns go to the fact table and which columns go to the dimension table?

Ans: Foreign key elements along with business measures go to the fact table, and detailed information goes to the dimension table.

    Q20. Which type of indexing mechanism do we need to use for

    a typical data warehouse?

    Ans: Bitmap index.

    Q21. What is data mining?

Ans: Data mining is the process of extracting hidden trends within a data warehouse. For example, an insurance data


warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information.

    Q22. What are the modeling tools available in the market?

Ans:
1. Oracle Designer
2. ERwin (Entity Relationship for Windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. Sybase PowerDesigner

    Q23. What is a general purpose scheduling tool?

Ans: A scheduling tool is a tool used to schedule data warehouse jobs. All the jobs that do some processing are scheduled using this tool, which eliminates manual intervention. The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.

    Q24. What are the various reporting tools in the market?

Ans:
1. MS Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, PowerPlay, ReportNet)
4. MicroStrategy
5. MS Reporting Services
6. Informatica PowerAnalyzer
7. Actuate
8. Hyperion (Brio)
9. Oracle Express OLAP
10. ProClarity
11. SAS

Q25. What is ER diagram?

Ans: ER is a logical design technique that seeks to remove the redundancy in data. The ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used


    to visually represent data objects.

Q26. What is the difference between BO, MicroStrategy, and Cognos?

Ans: BO is a ROLAP tool, Cognos is a MOLAP tool, and MicroStrategy is a HOLAP tool.

Q27. What is the difference between a star schema and a snowflake schema, and when do we use those schemas?

Ans: A star schema is a schema design used in a data warehouse where a fact table is in the center and all dimension tables are connected directly to it. In a snowflake schema, the dimension tables are further normalized into different tables: one primary dimension table is joined to the fact table, and other dimension tables may be joined to that dimension table. The star schema is denormalized and results in simple joins and less complex queries as well as faster results.

    The use depends on the requirement.

A snowflake schema is a way to handle problems that do not fit within the star schema. It consists of outrigger tables which relate to dimensions rather than to the fact table. This schema is normalized and results in complex joins and very complex queries as well as slower results.

The amount of space taken up by dimensions is so small compared to the space required for a fact table as to be insignificant. Therefore, table space or disk space is not considered a reason to create a snowflake schema.

The main reason for creating a snowflake is to make it simpler and faster for a report writer to create drop-down boxes. Rather than having to write a SELECT DISTINCT statement, they can simply SELECT * FROM the code table.

Junk dimensions and mini dimensions are another reason to add outriggers. The junk dimensions contain data from a normal dimension that you wish to separate out, such as fields that change quickly. Updates are so slow that they can add hours to the load process. With a junk dimension, it is possible to drop and add records rather than update them.


Mini dimensions contain data that is so dissimilar between two or more source systems that it would cause a very sparse main dimension. The conformed data that can be obtained from all source systems is contained in the parent dimension, and the data from each source system that does not match is contained in the child dimension.

Finally, if you are unlucky enough to have end users actually adding or updating data in the data warehouse rather than just batch loads, it may be necessary to add these outriggers to maintain referential integrity in the data being loaded.

    Star schema is good for simple queries and logic. Butsnowflake schema is good for complex queries and logic.Snowflake schema is nothing but an extension of the starschema in which the dimension tables are further normalizedto reduce redundancy.

    In Star Schema when we try to access many attributes or fewattributes from a single dimension table the performance ofthe query falls. So we denormalize this dimension tableinto two or sub dimensions. Now the same star schema istransformed into snow Flake schema. By doing so theperformance improves.

    If the table space is not enough to maintain a star schema, then we should go for a snowflake schema instead; i.e., the table should be split into multiple tables. For example, if you maintain time data such as year, month, and day in one table in a star schema, you would split this data into three tables (year, month, and day) in a snowflake schema.
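    The year/month/day split just described can be sketched in miniature. This is an illustrative sketch only: the table and column names are invented, and SQLite stands in for the warehouse database to show the shape of the two designs.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Star schema: one denormalized date dimension holding day, month, and year.
    cur.execute("""CREATE TABLE dim_date_star (
                       date_key INTEGER PRIMARY KEY,
                       day INTEGER, month INTEGER, year INTEGER)""")

    # Snowflake schema: the same attributes normalized into a chain of tables.
    cur.execute("CREATE TABLE dim_year (year_key INTEGER PRIMARY KEY, year INTEGER)")
    cur.execute("""CREATE TABLE dim_month (
                       month_key INTEGER PRIMARY KEY, month INTEGER,
                       year_key INTEGER REFERENCES dim_year(year_key))""")
    cur.execute("""CREATE TABLE dim_day (
                       date_key INTEGER PRIMARY KEY, day INTEGER,
                       month_key INTEGER REFERENCES dim_month(month_key))""")

    # One date (15 March 2002) stored both ways.
    cur.execute("INSERT INTO dim_date_star VALUES (1, 15, 3, 2002)")
    cur.execute("INSERT INTO dim_year VALUES (1, 2002)")
    cur.execute("INSERT INTO dim_month VALUES (1, 3, 1)")
    cur.execute("INSERT INTO dim_day VALUES (1, 15, 1)")

    # The star needs no joins; the snowflake needs two joins to recover the row.
    star = cur.execute("SELECT day, month, year FROM dim_date_star").fetchone()
    snow = cur.execute("""SELECT d.day, m.month, y.year
                          FROM dim_day d
                          JOIN dim_month m ON d.month_key = m.month_key
                          JOIN dim_year y ON m.year_key = y.year_key""").fetchone()
    ```

    Both queries return the same row; the snowflake simply pays two extra joins for the normalized layout.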

    Q28. What is slicing and dicing? Explain with real-time usage and business reasons for its use.

    Ans: Slicing and dicing is a feature that helps us see more detailed information about a particular thing. For example, you have a report which shows the quarterly performance of a particular product, but you want to see it month-wise; you can use the slicing and dicing technique to drill down to the monthly level.
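    As a minimal sketch of the idea (the rows, product names, and amounts below are invented): slicing fixes one dimension to a single value, and dicing re-aggregates the slice along a different, finer dimension, which is how a quarterly figure drills down to months.

    ```python
    # Quarterly sales rows for two products; grain is product/quarter/month.
    sales = [
        {"product": "P1", "quarter": "Q1", "month": "Jan", "amount": 100},
        {"product": "P1", "quarter": "Q1", "month": "Feb", "amount": 150},
        {"product": "P1", "quarter": "Q1", "month": "Mar", "amount": 250},
        {"product": "P2", "quarter": "Q1", "month": "Jan", "amount": 300},
    ]

    def slice_by(rows, dim, value):
        """Slice: keep only rows where one dimension equals a single value."""
        return [r for r in rows if r[dim] == value]

    def dice(rows, dims):
        """Dice: re-aggregate the rows along a chosen set of dimensions."""
        out = {}
        for r in rows:
            key = tuple(r[d] for d in dims)
            out[key] = out.get(key, 0) + r["amount"]
        return out

    p1_q1 = slice_by(slice_by(sales, "product", "P1"), "quarter", "Q1")
    quarterly = dice(p1_q1, ["quarter"])   # one Q1 total
    monthly = dice(p1_q1, ["month"])       # the same total, broken down by month
    ```

    The quarterly total and the sum of the monthly figures are the same number seen at two grains, which is the point of the drill-down.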

    Q29. What are the various attributes in a time dimension?

    (see this in orator cd or dilip book)


    Ans: Date and time (if a datetime data type is not supported, then you have to keep hour/min/sec in separate columns)
    Week
    Month
    Quarter
    Year

    Q30. What is the role of surrogate keys in a data warehouse and how will you generate them?

    Ans: A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. It is useful because the natural primary key can change, and this makes updates more difficult. Surrogate keys are always integer or numeric.

    A surrogate key is a simple primary key which maps one to one with a natural compound primary key. The reasons for using them are to alleviate the need for the query writer to know the full compound key and also to speed query processing by removing the need for the RDBMS to process the full compound key when considering a join.

    For example, a shipment could have a natural key of ORDER + ITEM + SHIPMENT_SEQ. By giving it a unique SHIPMENT_ID, subordinate tables can access it with a single attribute rather than three. However, it is important to create a unique index on the natural key as well.

    We tend to use our own primary keys (surrogate keys) rather than depend on the primary key that is available in the source system. When integrating the data, trying to work with the source system primary keys will be a little hard to handle. That's the reason why a surrogate key is useful even though it serves the same purpose as a primary key. Another important need for it is that the natural primary key (e.g., Customer Number in the Customer table) can change, and this makes updates more difficult.
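    One common way to generate surrogate keys is a simple sequence that hands out the next integer for each new natural key; in a real warehouse this job is usually done by a database sequence, an identity column, or the ETL tool. The sketch below is illustrative, with invented names.

    ```python
    import itertools

    class SurrogateKeyGenerator:
        """Map each natural key to a small sequential integer (illustrative)."""

        def __init__(self):
            self._next = itertools.count(1)   # sequence starting at 1
            self._map = {}                    # natural key -> surrogate key

        def key_for(self, natural_key):
            # Reuse the surrogate if this natural key was seen before.
            if natural_key not in self._map:
                self._map[natural_key] = next(self._next)
            return self._map[natural_key]

    gen = SurrogateKeyGenerator()
    # A compound natural key (order, item, shipment seq) collapses to one integer.
    sid1 = gen.key_for(("ORD-9", "ITEM-2", 1))
    sid2 = gen.key_for(("ORD-9", "ITEM-2", 2))
    sid1_again = gen.key_for(("ORD-9", "ITEM-2", 1))   # same key, same surrogate
    ```

    Joins on the single integer replace joins on the three-column compound key, which is the speed-up described above.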

    Q31. What is source qualifier?

    Ans: Source qualifier is a transformation which extracts


    data from the source. The source qualifier acts as a SQL query when the source is a relational database, and it acts as a data interpreter if the source is a flat file.

    The source qualifier is a transformation which every mapping should have; it defines the source metadata to the Informatica Repository. The source qualifier has different properties like filtering records, joining of two sources, and so on.

    Q32. Summarize the difference between OLTP, ODS, and DATA WAREHOUSE?

    Ans: OLTP, online transaction processing, is a database that handles real-time transactions, which have some special requirements.

    An ODS is the database used to capture daily business activities and thus is a normalized database. The ODS captures day-to-day transactions, and you can generate reports on the ODS.

    A data warehouse is a relational database that is designed for analysis and query rather than for transaction processing.

    Q33. Do you need separate space for the data warehouse and data marts?

    Ans: We don't require any separate space for the data mart and data warehouse unless the marts are too big or the client requires it. We can maintain both in the same schema.

    Q34. What is data cleansing? How is it done?

    Ans: Data cleansing is the process of standardizing and formatting the data before we store it in the data warehouse. It is done by several methods implemented in Informatica, like using LTRIM and RTRIM to delete extra spaces.

    Q35. What is the difference between a data warehouse and data warehousing?

    Ans: A data warehouse is the relational database that is designed for query and analysis, whereas the process is called data warehousing. Data warehousing is a concept.

    Q36. What is the need of surrogate key; why primary key not


    used as surrogate key?

    Ans: Refer to Question No. 30.

    If a column is made a primary key and later the data type or length of that column needs to change, then all the foreign keys that depend on that primary key must also change, making the database unstable. Surrogate keys make the database more stable because they insulate the primary and foreign key relationships from changes in data types and lengths.

    As data is extracted from disparate sources, each source might have primary keys with data types or formats inherent to the underlying database. If the same primary keys were used in the DW, there would be inconsistencies in the representation of data, which would make querying the database difficult, so surrogate keys are implemented to circumvent these kinds of situations.

    Q37. For an 80 GB data warehouse, how many records are there in the fact table? There are 25 dimensions and 12 fact tables.

    Ans: The estimation process is as follows:
    1. Estimate the size table by table. For example, one fact table has, say, 50 columns, and one of the columns is VARCHAR(200). Take all the column sizes and find the maximum length of each record; take the required column sizes and find the minimum length of each record. Find the average and multiply it by the number of records available in the fact table.
    2. We can also estimate future growth, table by table. Take 3 to 4 years of data and find the quarterly increment; this will help you estimate the future growth table-wise.
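    The arithmetic in step 1 can be sketched as follows. Every figure here is an assumed example for illustration, not derived from the 80 GB question itself.

    ```python
    # Rough fact-table sizing sketch; all figures are illustrative assumptions.
    max_row_bytes = 400                 # sum of maximum column sizes for one row
    min_row_bytes = 120                 # sum of minimum column sizes for one row
    avg_row_bytes = (max_row_bytes + min_row_bytes) / 2

    total_bytes = 80 * 1024**3          # the 80 GB warehouse
    fact_share = 0.9                    # assume facts dominate; dimensions are small
    fact_bytes = total_bytes * fact_share

    # Estimated row count = space available for facts / average record length.
    est_rows = int(fact_bytes / avg_row_bytes)
    ```

    With these assumptions the estimate comes out to roughly 300 million fact rows; the real answer depends entirely on the actual column sizes.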

    Q38. Can a dimension table contain numeric values?

    Ans: A dimension table can contain anything you want; it is the business requirements around each data element that drive the data model.

    Example: your company makes bicycles, and frame size is part of the product. It's not a fact, because it makes no sense to add, average, or sum the frame size. It is something you group, sort, or filter on: "How many bikes with 19-inch frames were sold by region?"


    WHERE ProductDimension.FrameSize = 19

    A dimension table can contain numeric values. For example, a perishable product in a grocery store might have SHELF_LIFE (in days) as part of the product dimension. This value may, for example, be used to calculate optimum inventory levels for the product: too much inventory and an excess of product expires on the shelf, while too little results in lost sales.

    Q39. What is a hybrid slowly changing dimension?
    Ans: Hybrid SCDs are a combination of both SCD 1 and SCD 2. It may happen that in a table, some columns are important and we need to track changes for them, i.e., capture their historical data, whereas for some other columns, even if the data changes, we don't care. For such tables we implement hybrid SCDs, wherein some columns are Type 1 and some are Type 2.
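    A hybrid SCD can be sketched in a few lines. This is an illustrative Python sketch, not a tool-specific implementation: the phone column is treated as Type 1 (overwrite in place) and the city column as Type 2 (expire the old row and add a new version). All names and dates are invented.

    ```python
    from copy import deepcopy

    TYPE1_COLS = {"phone"}   # overwrite in place, no history kept
    TYPE2_COLS = {"city"}    # a change closes the old row and adds a new version

    def apply_change(rows, natural_key, changes, today):
        """Hybrid SCD sketch: rows is the dimension table, newest version last."""
        current = next(r for r in rows if r["cust"] == natural_key and r["current"])
        type2 = any(current[c] != v for c, v in changes.items() if c in TYPE2_COLS)
        if type2:
            current["current"] = False          # expire the old version (Type 2)
            current["end_date"] = today
            new = deepcopy(current)
            new.update(changes)
            new["sid"] = max(r["sid"] for r in rows) + 1   # new surrogate key
            new["current"], new["start_date"], new["end_date"] = True, today, None
            rows.append(new)
        else:                                   # Type 1: just overwrite
            current.update({c: v for c, v in changes.items() if c in TYPE1_COLS})
        return rows

    dim = [{"sid": 1, "cust": "C1", "phone": "111", "city": "Pune",
            "current": True, "start_date": "2002-01-01", "end_date": None}]
    apply_change(dim, "C1", {"phone": "222"}, "2002-06-02")   # Type 1: still 1 row
    apply_change(dim, "C1", {"city": "Delhi"}, "2002-07-01")  # Type 2: now 2 rows
    ```

    The phone change leaves a single overwritten row, while the city change produces a second versioned row, which is exactly the Type 1 / Type 2 mix described above.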

    Q40. What is VLDB?

    Ans: VLDB stands for Very Large Database; any database too large (normally more than 1 terabyte) is considered a VLDB. The term is sometimes used to describe databases occupying magnetic storage in the terabyte range and containing billions of table rows.

    Q41. How do you load the time dimension?

    Ans: Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. 100 years may be represented in a time dimension, with one row per day.

    The time dimension in a data warehouse must be loaded manually. We load data into the time dimension using PL/SQL scripts.

    Create a procedure to load data into the time dimension. The procedure needs to run only once to populate all the data. For example, the code below fills up dates until the end of 2015; you can modify the code to suit the fields in your table.

    create or replace procedure QISODS.Insert_W_DAY_D_PR as
      LastSeqID number default 0;
      loaddate  date default to_date('12/31/1979', 'mm/dd/yyyy');
    begin
      loop
        LastSeqID := LastSeqID + 1;
        loaddate := loaddate + 1;
        insert into QISODS.W_DAY_D values (
          LastSeqID,
          trunc(loaddate),
          decode(to_char(loaddate, 'Q'), '1', 1,
                 decode(to_char(loaddate, 'Q'), '2', 1, 2)),
          to_number(to_char(loaddate, 'MM')),
          to_number(to_char(loaddate, 'Q')),
          trunc((round(to_number(to_char(loaddate, 'DDD'))) +
                 round(to_number(to_char(trunc(loaddate, 'YYYY'), 'D'))) + 5) / 7),
          to_number(to_char(loaddate, 'YYYY')),
          to_number(to_char(loaddate, 'DD')),
          to_number(to_char(loaddate, 'D')),
          to_number(to_char(loaddate, 'DDD')),
          1, 1, 1, 1, 1,
          to_number(to_char(loaddate, 'J')),
          ((to_number(to_char(loaddate, 'YYYY')) + 4713) * 12) +
            to_number(to_char(loaddate, 'MM')),
          ((to_number(to_char(loaddate, 'YYYY')) + 4713) * 4) +
            to_number(to_char(loaddate, 'Q')),
          to_number(to_char(loaddate, 'J')) / 7,
          to_number(to_char(loaddate, 'YYYY')) + 4713,
          to_char(loaddate, 'Day'),
          to_char(loaddate, 'Month'),
          decode(to_char(loaddate, 'D'), '7', 'weekend',
                                         '6', 'weekend', 'weekday'),
          trunc(loaddate, 'DAY') + 1,
          decode(last_day(loaddate), loaddate, 'y', 'n'),
          to_char(loaddate, 'YYYYMM'),
          to_char(loaddate, 'YYYY') || ' Half' ||
            decode(to_char(loaddate, 'Q'), '1', 1,
                   decode(to_char(loaddate, 'Q'), '2', 1, 2)),
          to_char(loaddate, 'YYYY / MM'),
          to_char(loaddate, 'YYYY') || ' Q ' ||
            trunc(to_number(to_char(loaddate, 'Q'))),
          to_char(loaddate, 'YYYY') || ' Week' ||
            trunc(to_number(to_char(loaddate, 'WW'))),
          to_char(loaddate, 'YYYY'));
        if loaddate = to_date('12/31/2015', 'mm/dd/yyyy') then
          exit;
        end if;
      end loop;
      commit;
    end Insert_W_DAY_D_PR;

    Q42. Why are OLTP database designs not generally a good idea for a data warehouse?

    Ans: Because OLTP databases are transactional databases: they are used in real time to insert, update, and delete data. To accomplish these tasks in real time, the model used in OLTP databases is highly normalized. The problem with using this model in data warehousing is that we have to join multiple tables to get a single piece of data. With the amount of historical data we deal with in a data warehouse, it is highly desirable not to have a highly normalized data model like OLTP.

    OLTP cannot store historical information about the organization. It is used for storing the details of daily transactions, while a data warehouse is a huge store of historical information obtained from different data marts for making intelligent decisions about the organization.

    OLTP database tables are normalized, which adds time for queries to return results. Additionally, an OLTP database is smaller and does not contain the long period (many years) of data that needs to be analyzed. An OLTP system is basically an ER model, not a dimensional model. If a complex query is executed on an OLTP system, it may cause heavy overhead on the OLTP server and affect normal business processes.

    Q43. Explain the advantages of RAID 1, 1/0, and 5. On what type of RAID setup would you put your Tx logs?

    Ans: RAID 0 - Makes several physical hard drives look like one hard drive. No redundancy but very fast. May be used for temporary spaces where loss of the files will not result in loss of committed data.


    RAID 1 - Mirroring. Each hard drive in the drive array has a twin, and each twin has an exact copy of the other's data, so if one hard drive fails, the other is used to pull the data. RAID 1 is half the speed of RAID 0, and the read and write performance are good.

    RAID 1/0 - Striped as RAID 0, then mirrored as RAID 1. Similar to RAID 1, and sometimes faster; it depends on the vendor implementation.

    RAID 5 - Great for read-only systems. Write performance is one third that of RAID 1, but read performance is the same as RAID 1. RAID 5 is great for DW but not good for OLTP.

    Hard drives are cheap now, so I always recommend RAID 1.

    Q44. Is it correct/feasible to develop a data mart using an

    ODS?

    Ans: Yes, it is correct to develop a data mart using an ODS. An ODS stores transaction data for only a few days (less historic data), which is what a data mart requires, so it is correct to develop a data mart using an ODS.

    You can build a data mart directly with the ODS as the source; such marts are called independent data marts.

    Q45. What is a CUBE in the data warehouse concept?

    Ans: Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members, and the body of the cube contains data values. The linking in the cube ensures that the data in the cube remains consistent.

    A CUBE is used in a data warehouse for representing multidimensional data logically. Using the cube, it is easy to carry out certain activities, e.g., roll up, drill down/drill up, slice and dice, etc., which enable business users to understand the trend of the business. It is good to have the cube designed on the star schema so as to facilitate its effective use.

    Q46. What is the difference between the snowflake and star schemas? In what situations is the snowflake schema better than the star schema, and when is the opposite true?


    Ans: A star schema means a centralized fact table surrounded by different dimensions. Snowflake means that in the same star schema, dimensions are split into further dimensions. A star schema contains highly denormalized data; a snowflake contains partially normalized data.

    A star schema cannot have parent tables, but a snowflake does contain parent tables.

    Why go for a star schema:
    1) Fewer joins
    2) A simpler database
    3) Support for drill-up options

    Why go for a snowflake schema:
    Sometimes we need to provide separate dimensions derived from existing dimensions; in that case we go for a snowflake.
    Disadvantage of the snowflake:
    Query performance is low because more joins are involved.

    Q47. What is the main difference between a schema in an RDBMS and schemas in a data warehouse?

    Ans:
    RDBMS schema:
    Used for OLTP systems.
    Traditional and old schema.
    Normalized.
    Difficult to understand and navigate.
    Cannot solve extract and complex problems.
    Poorly modeled.
    More transactions.
    Less time for query execution.
    More users.
    Has insert, update, and delete transactions.

    DWH schema:
    Used for OLAP systems.
    New-generation schema.
    Denormalized.
    Easy to understand and navigate.
    Extract and complex problems can be easily solved.
    Very good model.
    Fewer transactions.
    Fewer users.
    More time for query execution.
    Will not have many insert, delete, and update operations.


    Q48. What are possible data marts in retail sales?

    Ans: Product, sales, location, store, time.

    Q49. What is meant by metadata in the context of a data warehouse, and how is it important?

    Ans: Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. Metadata synchronization is the process of consolidating, relating, and synchronizing data elements with the same or similar meaning from different systems; it joins these different elements together in the data warehouse to allow for easier access.

    Q50. What is a surrogate key? Where do we use it? Explain with examples.

    Ans: For the definition, refer to Q30.

    It is useful because the natural primary key (e.g., customer number in the customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business uses), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key, called say AIRPORT_ID. This should be internal to the system; as far as the client is concerned, you may display only the AIRPORT_NAME.

    Another benefit you can get from surrogate keys (SID) is:

    Tracking the SCD - Slowly Changing Dimension

    Let me give you a simple, classical example:

    On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee


    Dimension). This employee has a turnover allocated to him on Business Unit 'BU1'. But on the 2nd of June, Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should still belong to Business Unit 'BU1'.

    If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.

    If you use surrogate keys, you could create, on the 2nd of June, a new record for Employee 'E1' in your Employee Dimension with a new surrogate key.

    This way, in your fact table, you have your old data (before the 2nd of June) with the SID of Employee 'E1' + 'BU1'. All new data (after the 2nd of June) takes the SID of Employee 'E1' + 'BU2'.

    You could consider the slowly changing dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit, i.e., 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from the natural-key enlargement process is that you might not have every part of your new key within your fact table, so you might not be able to do the join on the new enlarged key -> so you need another ID.
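    The E1/BU1 example above can be shown in miniature (surrogate key values and turnover amounts are invented): because each fact row carries the surrogate key of the employee version that was current when it was loaded, turnover splits correctly between the two business units.

    ```python
    # Two versions of employee E1 in the dimension, each with its own surrogate key.
    employee_dim = [
        {"sid": 1, "emp": "E1", "bu": "BU1"},   # version valid before 2 June
        {"sid": 2, "emp": "E1", "bu": "BU2"},   # version added on 2 June
    ]

    # Fact rows reference the surrogate key current at load time, not 'E1'.
    fact = [
        {"emp_sid": 1, "turnover": 500},   # booked before the move
        {"emp_sid": 2, "turnover": 300},   # booked after the move
    ]

    def turnover_by_bu(fact, dim):
        """Sum turnover per business unit by joining fact to dimension on SID."""
        lookup = {d["sid"]: d["bu"] for d in dim}
        totals = {}
        for f in fact:
            bu = lookup[f["emp_sid"]]
            totals[bu] = totals.get(bu, 0) + f["turnover"]
        return totals

    totals = turnover_by_bu(fact, employee_dim)   # {'BU1': 500, 'BU2': 300}
    ```

    Had the fact rows carried the natural key 'E1' instead, both amounts would have collapsed onto whichever business unit was current at query time.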

    Q51. What is the main difference between the Inmon and Kimball philosophies of data warehousing?

    Ans: The two differ in the concept of building the data warehouse.

    According to Kimball, the data warehouse is the conglomerate of all data marts within the enterprise, and information is always stored in dimensional models. Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from dimensional modeling at a local, departmental level.

    Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data


    warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as the need arises. Point-of-sale (POS) data can be added later if management decides it is necessary.

    According to Inmon, the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form. That is:
    Kimball -- first data marts, combined later into a data warehouse.
    Inmon -- first the data warehouse, later the data marts.

    Q52. What is the difference between a view and a materialized view?

    Ans: A view stores the SQL statement in the database and lets you use it as a table; every time you access the view, the SQL statement executes. A materialized view stores the results of the SQL in table form in the database; the SQL statement executes only once, and after that, every time you run the query, the stored result set is used. Pros include quick query results. Views do not take any space, but materialized views do take space.
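    The difference can be demonstrated in miniature. SQLite has no materialized views, so the sketch below emulates one with a snapshot table built once from the query result; the point is only that the view re-executes its SQL while the stored result set does not.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (amount INTEGER)")
    con.execute("INSERT INTO sales VALUES (100), (200)")

    # A view stores only the SQL text; it re-executes on every access.
    con.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS t FROM sales")

    # Emulated materialized view: run the query once and store its result rows.
    con.execute("CREATE TABLE mv_total AS SELECT SUM(amount) AS t FROM sales")

    con.execute("INSERT INTO sales VALUES (700)")   # base table changes afterwards

    view_total = con.execute("SELECT t FROM v_total").fetchone()[0]   # fresh
    mv_total = con.execute("SELECT t FROM mv_total").fetchone()[0]    # stale
    ```

    The view reflects the new row immediately, while the snapshot still holds the result from when it was built; a real materialized view would need a refresh to catch up.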

    Q53. What are the advantages of data mining over

    traditional approaches?

    Ans: Data mining is used for estimating the future. For example, if we take a company or business organization, by using data mining we can predict the future of the business in terms of revenue, employees, customers, orders, etc.

    Traditional approaches use simple algorithms for estimating the future, but they do not give accurate results compared to data mining.

    Q54. What are the different architectures of a data warehouse?

    Ans: There are three types of architectures.

    Data warehouse basic architecture:
    In this architecture, end users access data that is derived from several sources through the data warehouse.


    Architecture: Source --> Warehouse --> End Users

    Data warehouse with staging area architecture:
    Whenever the data derived from the sources needs to be cleaned and processed before putting it into the warehouse, a staging area is used.
    Architecture: Source --> Staging Area --> Warehouse --> End Users

    Data warehouse with staging area and data marts architecture:
    When the warehouse architecture is customized for different groups in the organization, data marts are added and used.
    Architecture: Source --> Staging Area --> Warehouse --> Data Marts --> End Users

    Q55. What are the steps to build the data warehouse?

    Ans: Gathering business requirements
    Identifying sources
    Identifying facts
    Defining dimensions
    Defining attributes
    Redefining dimensions and attributes
    Organizing the attribute hierarchy and defining relationships
    Assigning unique identifiers
    Additional conventions: cardinality / adding ratios

    1. Understand the business requirements.
    2. Once the business requirements are clear, identify the grains (levels).
    3. Once the grains are defined, design the dimension tables with the lower-level grains.
    4. Once the dimensions are designed, design the fact table with the key performance indicators (facts).
    5. Once the dimension and fact tables are designed, define the relationships between the tables using primary and foreign keys. In the logical phase, the database design looks like a star schema, so it is named the star schema design.

    Q56. Give an example of a degenerate dimension?

    Ans: A degenerate dimension is a dimension key without a corresponding dimension table. Example:

    In the PointOfSale transaction fact table, we have: Date Key (FK), Product Key (FK), Store Key (FK),


    Promotion Key (FK), and POS Transaction Number. The Date dimension corresponds to the Date Key, and the Product dimension corresponds to the Product Key. In a traditional parent-child database, the POS Transaction Number would be the key to the transaction header record that contains all the info valid for the transaction as a whole, such as the transaction date and store identifier. But in this dimensional model, we have already extracted that info into other dimensions. Therefore, the POS Transaction Number looks like a dimension key in the fact table but does not have a corresponding dimension table; hence, the POS Transaction Number is a degenerate dimension.

    Q57. What is the data type of the surrogate key?

    Ans: The data type of a surrogate key is either numeric or integer.

    Q58. What is real-time data warehousing?

    Ans: Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now; the activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it.

    Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

    A real-time data warehouse provides live data for DSS (it may not be 100% up to the moment; some latency will be there).

    The data warehouse has access to the OLTP sources; data is loaded from source to target not daily or weekly, but perhaps every 10 minutes, through replication or log shipping or something like that. SAP BW provides a real-time DW with the help of the extended star schema, where source data is shared.

    In real-time data warehousing, your warehouse contains


    completely up-to-date data and is synchronized with the source systems that provide the source data. In near-real-time data warehousing, there is a minimal delay between source data being generated and being available in the data warehouse. Therefore, if you want to achieve real-time or near-real-time updates to your data warehouse, you'll need to do three things:

    1. Reduce or eliminate the time taken to get new and changed data out of your source systems.

    2. Eliminate, or reduce as much as possible, the time required to cleanse, transform, and load your data.

    3. Reduce as much as possible the time required to update your aggregates.

    Starting with version 9i, and continuing with the latest 10g release, Oracle has gradually introduced features into the database to support real-time and near-real-time data warehousing. These features include:

    Change data capture
    External tables, table functions, pipelining, and the MERGE command
    Fast-refresh materialized views

    Q59. What is normalization, first normal form, second normal form, and third normal form?

    Ans: Normalization can be defined as segregating a table into different tables so as to avoid duplication of values. Normalization is a step-by-step process of removing redundancies and dependencies of attributes in a data structure. The condition of the data at the completion of each step is described as a normal form.

    The needs for normalization: it improves database design, ensures minimum redundancy of data, reduces the need to reorganize data when the design is modified or enhanced, and removes anomalies from database activities.

    First normal form: A table is in first normal form when it contains no repeating groups. The repeating columns or fields in an unnormalized table are removed from the table and put into tables of their


    own. Such a table becomes dependent on the parent table from which it is derived. The key to this table is called a concatenated key, with the key of the parent table forming part of it.

    Second normal form: A table is in second normal form if all its non-key fields are fully dependent on the whole key. This means that each field in the table must depend on the entire key. Fields that do not depend upon the combination key are moved to another table, on whose key they depend. Structures which do not contain combination keys are automatically in second normal form.

    Third normal form: A table is said to be in third normal form if all the non-key fields of the table are independent of all other non-key fields of the same table.
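    The move to third normal form can be sketched with a transitive dependency (the data and names below are invented for the example): dept_name depends on dept rather than on the employee key, so it moves to its own table and is stored once per department.

    ```python
    # An un-normalized table: dept_name repeats for every employee in the dept.
    unnormalized = [
        {"emp": "E1", "dept": "D1", "dept_name": "Sales"},
        {"emp": "E2", "dept": "D1", "dept_name": "Sales"},
        {"emp": "E3", "dept": "D2", "dept_name": "HR"},
    ]

    # Decomposition: employees keep only the foreign key to the department...
    employees = [{"emp": r["emp"], "dept": r["dept"]} for r in unnormalized]

    # ...and the dependent attribute lives once per department in its own table.
    departments = {r["dept"]: r["dept_name"] for r in unnormalized}
    ```

    After the split, renaming a department is a single update in `departments` instead of one update per employee row, which is the anomaly removal the answer describes.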

    Q60. What is the difference between static and dynamic caches?

    Ans: A static cache stores the lookup values in memory, and it won't change throughout the running of the session, whereas a dynamic cache stores the values in memory and changes dynamically during the running of the session. Dynamic caches are used in SCD types, where the target table changes and the cache changes dynamically along with it.

    Q61. What is meant by Aggregate fact table?

    Ans: An aggregate fact table stores information that has been aggregated, or summarized, from a detail fact table. Aggregate fact tables are useful in improving query performance.

    Often an aggregate fact table can be maintained through the use of materialized views which, under certain databases, can automatically be substituted for the detailed fact table if appropriate in resolving a query.
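    A minimal sketch of an aggregate fact table (names invented, SQLite standing in for the warehouse): the detail fact is at day grain, and the aggregate summarizes it to month grain with a GROUP BY.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fact_sales (day TEXT, month TEXT, amount INTEGER)")
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                    [("2002-01-01", "2002-01", 10),
                     ("2002-01-02", "2002-01", 20),
                     ("2002-02-01", "2002-02", 30)])

    # The aggregate fact table summarizes the detail grain (day) up to month.
    con.execute("""CREATE TABLE agg_sales_month AS
                   SELECT month, SUM(amount) AS amount
                   FROM fact_sales
                   GROUP BY month""")

    rows = con.execute(
        "SELECT month, amount FROM agg_sales_month ORDER BY month").fetchall()
    ```

    A monthly report can now read the small aggregate table instead of scanning every daily row, which is the query-performance benefit described above.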

    Q62. What is the life cycle of data warehouse projects?

    Ans: STRATEGY & PROJECT PLANNING
    Definition of scope, goals, objectives & purpose, and expectations


    Establishment of implementation strategy
    Preliminary identification of project resources
    Assembling of the project team
    Estimation of the project schedule
    REQUIREMENTS DEFINITION
    Definition of the requirements-gathering strategy
    Conducting of interviews and group sessions with users
    Review of existing documents
    Study of source operational systems
    Derivation of business metrics and dimensions needed for analysis
    ANALYSIS & DESIGN
    Design of the logical data model
    Definition of data extraction, transformation, and loading functions
    Design of the information delivery framework
    Establishment of storage requirements
    Definition of the overall architecture and supporting infrastructure
    CONSTRUCTION
    Selection and installation of infrastructure hardware and software
    Selection and installation of the DBMS
    Selection and installation of ETL and information delivery tools
    Completion of the design of the physical data model
    Completion of the metadata component
    DEPLOYMENT
    Completion of user acceptance tests
    Performance of initial data loads
    Making user desktops ready for the data warehouse
    Training and support for the initial set of users
    Provision for data warehouse security, backup, and recovery
    MAINTENANCE
    Ongoing monitoring of the data warehouse
    Continuous performance tuning
    Ongoing user training
    Provision of ongoing support to users
    Ongoing data warehouse management

    Q63. What is a CUBE and why are we creating a cube? What is the difference between ETL and OLAP cubes?

    Ans: Any schema, table, or report which gives you meaningful information about one attribute with respect to more than one attribute is called a cube. For example: in a product table with Product ID and Sales


    columns, we can analyze Sales with respect to Product Name; but if you analyze Sales with respect to Product as well as Region (Region being an attribute in the Location table), the report, resultant table, or schema would be a cube.

    ETL cubes: built in the staging area to load frequently accessed reports to the target.
    Reporting cubes: built after the actual load of all the tables to the target, depending on the customer's requirements for business analysis.

    Q64. Explain the flow of data starting with OLTP to OLAP, including staging, summary tables, facts, and dimensions?

    Ans: OLTP(1) ----> ODS(2) ----> DWH(3) ----> OLAP(4) ----> Reports(5) ----> Decision(6)

    1-2 (extraction)

    2-3 (transformation; here the ODS itself is the staging area)

    3-4-5 (use of a reporting tool to generate reports)

    6 - a decision is taken, i.e., the purpose of the data warehouse is served

    Q65. What is the definition of normalized and denormalizedview and what are the differences between them?

    Ans: Normalized View-> Process of eliminating theredundancies.Denormalized View-> process the data where duplicationtakes place. Which means it is not stop the replication.

One more point: since OLTP data is in normalized form, more tables are scanned or referred to for a single query, because data must be fetched from the respective master tables through primary and foreign keys. In OLAP, as the data is in denormalized form, fewer tables are queried for a single query.
For example, consider a banking application. In an OLTP environment we would have separate tables for customer personal details, address details, transaction details, and so on. In an OLAP environment all these details can be stored in one single table, reducing the number of tables scanned for a single record of a customer's details.
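The banking example can be made concrete with a minimal sketch. The table names and customer data are invented; dictionaries stand in for tables, and key lookups stand in for joins.

```python
# Normalized (OLTP): customer data split across tables, joined by key.
customers = {101: {"name": "Asha"}}
addresses = {101: {"city": "Pune"}}
transactions = [{"cust_id": 101, "amount": 500.0}]

# "Where does the customer of this transaction live?" needs two extra
# lookups -- the equivalent of joining three tables via foreign keys.
txn = transactions[0]
city_oltp = addresses[txn["cust_id"]]["city"]

# Denormalized (OLAP): the same details stored together in one wide row,
# so a single scan of one table answers the question.
olap_row = {"cust_id": 101, "name": "Asha", "city": "Pune", "amount": 500.0}
city_olap = olap_row["city"]

print(city_oltp, city_olap)  # Pune Pune
```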


Q66. What is a BUS schema?

Ans: A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

Bus schema: consider the layout on x and y axes.

Dimension tables: A, B, C, D, E, F
Fact tables: R, S

Relations between the fact and dimension tables:
======================================

R ->> A, B, E, F and S ->> D, C, A

A conformed dimension must be identified across the different subjects: any dimension found in two fact tables (here A, shared by R and S) must be conformed. Taking the fact tables as one axis and the dimension tables as the other, this matrix construction is called the bus matrix. It is constructed initially, before the universe is created, and can be seen as the initial layout in designing a schema. Every schema starts as a star schema and then expands to a multi-star, snowflake, or constellation schema.
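Using the fact tables R, S and dimensions A-F above, the bus matrix can be sketched as a small grid. This is just an illustration of the matrix idea, not any tool's actual representation.

```python
# Bus matrix sketch: rows are fact tables (subjects), columns are dimensions.
# An "X" marks which dimensions each fact table uses.
dimensions = ["A", "B", "C", "D", "E", "F"]
facts = {"R": {"A", "B", "E", "F"}, "S": {"D", "C", "A"}}

print("    " + " ".join(dimensions))
for name, dims in facts.items():
    row = " ".join("X" if d in dims else "." for d in dimensions)
    print(name + " | " + row)

# Dimensions appearing in two or more fact tables must be conformed.
conformed = [d for d in dimensions
             if sum(d in dims for dims in facts.values()) > 1]
print(conformed)  # ['A']
```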

Q67. Which automation tool is used in data warehouse testing?

Ans: Generally no automated tool is used for data warehouse testing; only manual testing is done.

    Q68. What is data warehousing hierarchy?

Ans: Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.
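The month-to-quarter-to-year aggregation can be sketched as follows, with made-up sales figures:

```python
# Time hierarchy rollup: month -> quarter -> year (hypothetical data).
monthly_sales = {"2023-01": 100, "2023-02": 120, "2023-04": 90}

def quarter_of(month_key):
    """Map a 'YYYY-MM' key to its quarter, e.g. '2023-04' -> '2023-Q2'."""
    year, month = month_key.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"

# Aggregate from the month level up to the quarter level...
quarterly = {}
for month, amount in monthly_sales.items():
    q = quarter_of(month)
    quarterly[q] = quarterly.get(q, 0) + amount

# ...and from the quarter level up to the year level.
yearly = {}
for q, amount in quarterly.items():
    year = q.split("-Q")[0]
    yearly[year] = yearly.get(year, 0) + amount

print(quarterly)  # {'2023-Q1': 220, '2023-Q2': 90}
print(yearly)     # {'2023': 310}
```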

Q69. What are the data validation strategies for data mart validation after the loading process?

Ans: Data validation makes sure that the loaded data is accurate and meets the business requirements. Strategies are the different methods followed to meet those validation requirements.

Q70. Why should you put your data warehouse on a different system than your OLTP system?

Ans: A data warehouse is part of OLAP (On-Line Analytical Processing). It is the source from which BI tools fetch data for analytical, reporting, or data mining purposes. It generally contains data spanning the whole life cycle of the company/product. A DWH contains historical, integrated, denormalized, subject-oriented data.
The OLTP system, on the other hand, contains data that is generally limited to the last couple of months, or a year at most. The nature of OLTP data is current, volatile, and highly normalized. Since the two systems are different in nature and functionality, we should always keep them on different systems.

A DW is typically subjected to intensive querying. Since the primary responsibility of an OLTP system is to faithfully record ongoing transactions (inserts/updates/deletes), these operations would be considerably slowed down by the heavy querying the DW is subjected to.

Q71. What are semi-additive and factless facts, and in which scenarios would you use such fact tables?

Ans: Semi-additive facts are facts that can be summed up over some of the dimensions in the fact table, but not over others. For example, suppose Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add balances across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up the current balances of a given account for each day of the month does not give us any useful information).
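The Current_Balance example can be made concrete with a small sketch (the account names and amounts are invented):

```python
# Daily balance snapshots for two accounts: a semi-additive fact.
balances = {
    ("acct_1", "2023-06-01"): 100.0,
    ("acct_1", "2023-06-02"): 150.0,
    ("acct_2", "2023-06-01"): 200.0,
    ("acct_2", "2023-06-02"): 250.0,
}

# Summing ACROSS ACCOUNTS on one day is meaningful:
total_on_day2 = sum(v for (acct, day), v in balances.items()
                    if day == "2023-06-02")
print(total_on_day2)  # 400.0 -- the bank's total balance that day

# Summing ACROSS TIME for one account is NOT meaningful:
nonsense = sum(v for (acct, day), v in balances.items() if acct == "acct_1")
print(nonsense)  # 250.0 -- not a real quantity

# Instead, take the last snapshot (or an average) over time:
last_balance = balances[("acct_1", "2023-06-02")]
print(last_balance)  # 150.0
```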

A factless fact table is a fact table that does not contain any fact; it may consist of nothing but keys.

The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless.


A second kind of factless fact table is called a coverage table. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse.

A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that didn't sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university
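The student-attendance case can be sketched as follows; the rows hold only dimension keys (student, course, date are invented names), and analysis is done purely by counting rows:

```python
# Factless fact table: each row is just foreign keys to dimensions --
# no numeric measure column at all. The existence of a row IS the event.
attendance = [
    ("s1", "math", "2023-09-01"),
    ("s2", "math", "2023-09-01"),
    ("s1", "math", "2023-09-02"),
]

# Questions are answered by counting rows, e.g. attendance per course/day:
counts = {}
for (_student, course, day) in attendance:
    counts[(course, day)] = counts.get((course, day), 0) + 1

print(counts)  # {('math', '2023-09-01'): 2, ('math', '2023-09-02'): 1}
```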

    Q72. What are aggregate tables?

Ans: Aggregate tables contain summaries of existing warehouse data, grouped to certain levels of the dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table with millions of records. Aggregate tables reduce the load on the database server, improve query performance, and return results quickly.

These are the tables that contain aggregated / summarized data, e.g. yearly or monthly sales information. They are used to reduce query execution time.
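The idea can be sketched by pre-aggregating a detail table once at load time and then answering queries from the much smaller summary (hypothetical data):

```python
# Detail fact table: one row per sale (in a real DWH, millions of rows).
sales = [
    {"month": "2023-01", "amount": 10.0},
    {"month": "2023-01", "amount": 5.0},
    {"month": "2023-02", "amount": 8.0},
]

# Build the aggregate (summary) table once, at load time.
monthly_agg = {}
for row in sales:
    monthly_agg[row["month"]] = monthly_agg.get(row["month"], 0.0) + row["amount"]

# Queries now read the small aggregate instead of scanning every detail row.
print(monthly_agg["2023-01"])  # 15.0
```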

Q73. What are the different methods of loading dimension tables?

    Ans: There are two ways to load data in dimension tables.

Conventional (slow): all the constraints and keys are validated against the data before it is loaded; this way data integrity is maintained.

Direct (fast): all the constraints and keys are disabled before the data is loaded. Once the data is loaded, it is validated against all the constraints and keys. If data is found invalid or dirty, it is not included in the index and all future processes skip this data.
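The two loading styles can be contrasted in a small sketch. The validation rule (amounts must be non-negative) is a made-up stand-in for real constraint and key checks:

```python
# Incoming rows for a dimension load (hypothetical data).
rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}]

def is_valid(row):
    return row["amount"] >= 0  # stand-in for constraint/key validation

# Conventional (slow): validate each row BEFORE it is loaded,
# so only clean rows ever enter the table.
conventional = [r for r in rows if is_valid(r)]

# Direct (fast): load everything first, validate AFTERWARDS,
# and set aside rows that fail, excluding them from further processing.
direct_loaded = list(rows)
rejected = [r for r in direct_loaded if not is_valid(r)]
direct_clean = [r for r in direct_loaded if is_valid(r)]

print(conventional == direct_clean)  # True -- same clean data either way
print(rejected)  # [{'id': 2, 'amount': -5.0}]
```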
