
1. With a necessary diagram, explain the Data Warehouse Development Life Cycle.

ANS: Bill Inmon advocates his own approach to the data warehousing development cycle. In contrast to the classic Software Development Life Cycle (SDLC), which starts from requirements and proceeds through analysis, design, programming, testing, and integration to implementation, the data warehousing life cycle starts from implementation and ends with an understanding of the requirements:

implement data warehouse

integrate data

test for bias

program against data

design DSS system

analyze results of programs

understand requirements

make adjustments to the design

start the cycle all over again for a different set of data

A kind of justification for this is the following: "The classical system development life cycle (SDLC) does not work in the world of the DSS analyst. The SDLC assumes that requirements are known at the start of the design (or at least can be discovered). However, in the world of the DSS analyst, requirements are usually the last thing to be discovered in the DSS development life cycle." "The users of the data warehouse environment have a completely different approach to using the system. Unlike operational users, who have a straightforward approach to defining their requirements, the data warehouse user operates in a mindset of discovery. The end user of the data warehouse says, 'Give me what I say I want, then I can tell you what I really want.'" How much data should be available in the data warehouse in order to allow the DSS analyst to finally understand the requirements? Are short iterations of the CLDS possible? The DSS analyst needs to have some solid piece of the data warehouse to start using it.

We have seen how operational data is usually application-oriented and, as a consequence, unintegrated, whereas data warehouse data must be integrated. Other major differences also exist between the operational level of data and processing and the data warehouse level of data and processing. The underlying development life cycles of these systems differ profoundly, as shown in Figure 1.13. The system development life cycle for the data warehouse environment is almost exactly the opposite of the classical SDLC.


2. What is Metadata? What is its use in Data Warehouse Architecture?

ANS: Meta Data Contents

The Entity Relationship Diagram in Figure 1 shows the contents of a meta data repository for a data warehouse. There are three broad categories of meta data:

1. Meta data for the business users. Meta data is like a complete itinerary from the AAA (American Automobile Association), showing business users where they can find what information, how they can access it, how long it will take to access it, and what quality they can expect when they finally get it. In Figure 1, entities marked with a "U" are of great importance to the business user.

2. Meta data for the data warehouse administrator. The data warehouse administrator (responsible for populating, maintaining, and ensuring availability of the data warehouse) can make these tedious tasks simpler through his own special view of meta data, which includes profile and growth metrics in addition to the other entities marked with an "A" in Figure 1.

3. Meta data for the data warehouse developer. Meta data for developers affects their ability to maintain and enhance data marts. Without up-to-date meta data, developers will not be able to maintain and enhance these data marts, which can easily grow into conflicting islands of information. Meta data entities marked with a "D" in Figure 1 are of special interest to developers.

FIGURE 1: Entity Relationship Diagram

Architecting for Meta Data

Whether you are building a single data mart or a giant enterprise-wide data warehouse, architecture should be an integral part of the planning and design process. Developing a long-term architecture early in the project helps set the vision for the future, guiding the data warehousing team through the phases. Meta data architecture should have the following characteristics:

Mandatory. Meta data has always been important, even for OLTP systems. However, for OLTP systems most of the meta data was required by the IT community of programmers and analysts. It never acquired the importance it has with the onset of data warehousing, where most of the meta data is required by end users. They would be totally lost if they did not know what is available in the warehouse, what it means, and how to access it. Data warehouse users should be provided not just accurate meta data but accurate contextual meta data. As described earlier, without accurate and up-to-date contextual meta data, they can get misleading and ambiguous information, which can lead to wrong decisions. Thus, meta data architecture for a data warehousing project should not be an afterthought, but a mandatory and well-planned part of the overall architecture.
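To make the three audience views more concrete, the following sketch models a single repository entry in Python. All field names, the audience flags, and the example values are assumptions for illustration; the actual repository schema is whatever Figure 1 defines.

```python
from dataclasses import dataclass, field

@dataclass
class MetaDataEntry:
    """One hypothetical record in a data warehouse meta data repository."""
    name: str                    # warehouse object, e.g. a table or column
    description: str             # business meaning, in end-user terms
    source_system: str           # where the data originates
    refresh_schedule: str        # how often the data is loaded
    expected_quality: str        # what quality the user can expect
    audiences: set[str] = field(default_factory=set)  # "U" user, "A" admin, "D" developer

# An entry a business user could browse before querying the warehouse
sales_fact = MetaDataEntry(
    name="SALES_FACT",
    description="Daily sales amounts by product, store and date",
    source_system="Point-of-sale OLTP system",
    refresh_schedule="Nightly load at 02:00",
    expected_quality="Validated; complete up to the previous business day",
    audiences={"U", "A", "D"},
)
print(sales_fact.description)
```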


3. Write briefly about any four ETL tools. What is transformation? Briefly explain the basic transformation types.

ANS: ETL (Extract-Transform-Load)

ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system into the data warehouse. Currently, ETL also encompasses a cleaning step as a separate step; the sequence is then Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.

Process

Extract

The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.
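As one way of keeping the load on the source system low, the sketch below extracts only the rows changed since the previous run, using a last-updated timestamp. The table and column names (orders, updated_at) and the SQLite connection are assumptions for illustration.

```python
import sqlite3
from datetime import datetime

def extract_changed_rows(source_db: str, last_extract: datetime) -> list[tuple]:
    """Incremental extract: read only rows modified since the previous run,
    touching the source system as lightly as possible."""
    conn = sqlite3.connect(source_db)
    try:
        cur = conn.execute(
            "SELECT order_id, customer_id, amount, updated_at "
            "FROM orders WHERE updated_at > ?",
            (last_extract.isoformat(),),
        )
        return cur.fetchall()
    finally:
        conn.close()

# Example call (assumes source.db exists with an orders table):
# rows = extract_changed_rows("source.db", datetime(2024, 1, 1))
```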

Clean

The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules, such as making identifiers and code values unique, converting null values into a standardized placeholder, and bringing fields such as phone numbers and ZIP codes into a standard format.
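A minimal sketch of such unification rules follows. The gender code mapping, the "N/A" placeholder, and the field names are assumed examples, not rules prescribed above.

```python
GENDER_MAP = {"M": "Male", "F": "Female", "MALE": "Male", "FEMALE": "Female"}

def clean_record(record: dict) -> dict:
    """Apply basic unification rules: standardize codes, fill missing values, fix formats."""
    cleaned = dict(record)
    # Unify identifiers: map the many source encodings of gender to one standard value
    cleaned["gender"] = GENDER_MAP.get(str(record.get("gender", "")).upper(), "Unknown")
    # Convert null/empty values into a single standardized placeholder
    for key, value in cleaned.items():
        if value in (None, ""):
            cleaned[key] = "N/A"
    # Standardize formats, e.g. strip stray whitespace from ZIP codes
    if isinstance(cleaned.get("zip_code"), str):
        cleaned["zip_code"] = cleaned["zip_code"].strip()
    return cleaned

print(clean_record({"gender": "f", "zip_code": " 44600 ", "phone": None}))
```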

Transform

The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that the data can later be joined. The transformation step also covers joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
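The sketch below illustrates a few of these rules on in-memory rows: converting a measure to a common unit, deriving a calculated value, and assigning a surrogate key. The field names, the pound-to-kilogram conversion, and the simple key counter are illustrative assumptions.

```python
from itertools import count

_surrogate_keys = count(start=1)   # simple surrogate-key generator for the sketch

def transform(row: dict) -> dict:
    """Apply transformation rules: unit conversion, derived value, surrogate key."""
    out = dict(row)
    # Convert all weights to kilograms so facts from different sources can be joined
    if out.get("weight_unit") == "lb":
        out["weight"] = round(out["weight"] * 0.45359237, 3)
        out["weight_unit"] = "kg"
    # Derive a new calculated value
    out["total_price"] = out["quantity"] * out["unit_price"]
    # Attach a warehouse surrogate key alongside the natural key
    out["product_sk"] = next(_surrogate_keys)
    return out

print(transform({"product_code": "A1", "weight": 2.0, "weight_unit": "lb",
                 "quantity": 3, "unit_price": 9.5}))
```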

Load

During the load step, it is necessary to ensure that the load is performed correctly and using as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them again only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency.
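A hedged sketch of that advice is below: an index is dropped before a bulk insert and recreated afterwards. The table, index, and use of SQLite are assumptions for illustration; on a production database the equivalent would be disabling constraints and indexes with that engine's own commands.

```python
import sqlite3

def load_rows(warehouse_db: str, rows: list[tuple]) -> None:
    """Bulk load into the target table with the index disabled during the load."""
    conn = sqlite3.connect(warehouse_db)
    try:
        conn.execute("DROP INDEX IF EXISTS idx_sales_product")        # disable index
        conn.executemany(
            "INSERT INTO sales_fact (product_sk, quantity, total_price) VALUES (?, ?, ?)",
            rows,
        )
        conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_sk)")  # re-enable
        conn.commit()
    finally:
        conn.close()
```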

Transformation Flow

From an architectural perspective, you can transform your data in two ways:

Multistage Data Transformation

Pipelined Data Transformation

Multistage Data Transformation

The data transformation logic for most data warehouses consists of multiple steps. For example, in transforming new records to be inserted into a sales table, there may be separate logical transformation steps to validate each dimension key.


Figure 14-1 offers a graphical way of looking at the transformation logic.

Figure 14-1 Multistage Data Transformation
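As an illustrative, database-agnostic rendering of the multistage flow (not Oracle's specific mechanism), the sketch below runs the logic as separate staged steps, validating a dimension key before the converted rows reach the fact table. The table names and lookup data are assumptions.

```python
def validate_keys(rows, product_dim):
    """Stage 1: keep only rows whose product code exists in the dimension."""
    return [r for r in rows if r["product_code"] in product_dim]

def convert_keys(rows, product_dim):
    """Stage 2: replace natural keys with the dimension's surrogate keys."""
    return [{**r, "product_sk": product_dim[r["product_code"]]} for r in rows]

def insert_into_fact(rows, fact_table):
    """Stage 3: append the fully prepared rows to the (in-memory) fact table."""
    fact_table.extend(rows)

product_dim = {"A1": 101, "B2": 102}          # natural key -> surrogate key
staging = [{"product_code": "A1", "qty": 3}, {"product_code": "X9", "qty": 1}]
fact_table = []
insert_into_fact(convert_keys(validate_keys(staging, product_dim), product_dim), fact_table)
print(fact_table)   # only the valid row reaches the fact table
```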

Pipelined Data Transformation

The ETL process flow can be changed dramatically and the database becomes an integral part of the ETL solution.

The new functionality renders some of the formerly necessary process steps obsolete, while others can be remodeled to enhance the data flow and make the data transformation more scalable and non-interruptive. The task shifts from a serial transform-then-load process (with most of the work done outside the database) or a load-then-transform process to an enhanced transform-while-loading approach.

Oracle offers a wide variety of new capabilities to address all the issues and tasks relevant in an ETL scenario. It is important to understand that the database offers toolkit functionality rather than trying to address a one-size-fits-all solution. The underlying database has to enable the most appropriate ETL process flow for a specific customer need, and not dictate or constrain it from a technical perspective. Figure 14-2 illustrates the new functionality, which is discussed throughout later sections.

Figure 14-2 Pipelined Data Transformation
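To contrast with the staged flow above, the following sketch pushes the same validation and key substitution into a single set-based statement executed inside the database, approximating transform-while-loading. The SQL and table names are illustrative and not Oracle's specific toolkit features.

```python
import sqlite3

def pipelined_load(conn: sqlite3.Connection) -> None:
    """Transform while loading: one INSERT ... SELECT performs the join, key
    substitution, and filtering inside the database, without staging tables."""
    conn.execute(
        """
        INSERT INTO sales_fact (product_sk, quantity)
        SELECT d.product_sk, s.qty
        FROM new_sales s
        JOIN product_dim d ON d.product_code = s.product_code
        """
    )
    conn.commit()
```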


4. What are ROLAP, MOLAP and HOLAP? What is Multidimensional Analysis? How do we achieve it?

ANS: In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
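A toy sketch of the pre-computation idea follows: every combination of two dimensions, including "ALL" rollups, is aggregated when the cube is built, so later lookups are simple dictionary reads. The dimensions and figures are invented for illustration and stand in for a proprietary cube store.

```python
from itertools import product as cartesian

sales = [("Laptops", "North", 120), ("Laptops", "South", 80),
         ("Phones",  "North", 200), ("Phones",  "South", 150)]

products = {"Laptops", "Phones", "ALL"}
regions  = {"North", "South", "ALL"}

# Build the cube up front: one pre-aggregated total per cell, including "ALL" rollups
cube = {}
for p, r in cartesian(products, regions):
    cube[(p, r)] = sum(v for prod, reg, v in sales
                       if p in (prod, "ALL") and r in (reg, "ALL"))

print(cube[("Phones", "ALL")])   # fast retrieval: the answer was computed at build time
```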

ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
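The sketch below shows that equivalence: each slice or dice the user selects simply appends one more condition to the WHERE clause of the generated SQL. The schema and column names are assumptions for illustration, and the query is parameterized rather than built from raw strings.

```python
def build_rolap_query(filters: dict) -> tuple[str, list]:
    """Each slice/dice selection becomes one more condition in the WHERE clause."""
    sql = ("SELECT region, product, SUM(revenue) AS revenue "
           "FROM sales_fact JOIN product_dim USING (product_sk)")
    conditions, params = [], []
    for column, value in filters.items():        # one entry per user selection
        conditions.append(f"{column} = ?")
        params.append(value)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    sql += " GROUP BY region, product"
    return sql, params

# Slicing on year, then dicing further by region, just grows the WHERE clause
print(build_rolap_query({"year": 2024}))
print(build_rolap_query({"year": 2024, "region": "North"}))
```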

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.


Multidimensional analysis uses dimensions and measures to analyse data. To do this using Business Intelligence tools such as Cognos 8 you usually need to first model your data dimensionally.

Dimensions are hierarchies and have one or more levels. What dimensions you define depends on your data and your business model. The user can look at the data at any level – for example, Team level will show totals for all the members of that team, while at Area level you will see the total for all teams within that area.

Measures are usually quantities such as quantity sold, total revenue, and so on. Once a measure is selected in the analysis, it is aggregated to the level you are analysing at. So if you are analysing at Area level and have selected total revenue as your measure, you would see the total revenue aggregated to Area level.
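A small sketch of this idea, assuming a flat table of revenue rows: aggregating the same measure at Team level and then at Area level is just a change of grouping key. The column names and figures are invented.

```python
from collections import defaultdict

rows = [
    {"area": "East", "team": "Alpha", "revenue": 100},
    {"area": "East", "team": "Beta",  "revenue": 150},
    {"area": "West", "team": "Gamma", "revenue": 200},
]

def aggregate(rows, level):
    """Aggregate the revenue measure to the requested level of the hierarchy."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[level]] += r["revenue"]
    return dict(totals)

print(aggregate(rows, "team"))   # Team level: one total per team
print(aggregate(rows, "area"))   # Area level: teams rolled up into their areas
```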

Often the data you require will be organised into a data warehouse consisting of dimensional and fact tables and produced by a skilled data warehouse team.

The data is then modelled as metadata and published to make it available for analysis. In Cognos 8 this is achieved using the Framework Manager application. Hierarchies for each dimension are stored in the model so do not need to be defined by the user.

5. Explain the testing process for a Data Warehouse with a necessary diagram.

ANS: Data Warehouse Testing

Increasingly, businesses are focusing on the collection and organization of data for strategic decision making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage. SQA Solution provides practical recommendations for testing extract, transform, and load (ETL) applications based on years of experience testing data warehouses in the financial services and consumer retailing areas.


A conceptual diagram for ETL and Data Warehouse Testing.

The cost of discovering software defects escalates significantly the later they are found in the development lifecycle. In data warehousing, this is worsened by the added expense of using incorrect data to make important business decisions. Given the importance of early detection of software defects, here are some general goals of testing an ETL application:

Data completeness. Ensures that all expected data is loaded (a small completeness check is sketched after this list).

Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.

Data quality. Makes sure that the ETL software correctly rejects, substitutes default values for, fixes or disregards, and reports incorrect data.

Scalability and performance. Makes sure that data loads and queries execute within anticipated time frames and that the technical design is scalable.

Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.

User-acceptance testing. Makes sure that the solution satisfies your current expectations and anticipates your future expectations.

Regression testing. Makes sure that existing functionality stays intact whenever new code is released.
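For the data completeness goal, a hedged sketch follows: it compares source and target row counts and checks that no key was dropped during the load. The connection objects and table names are assumed for illustration.

```python
import sqlite3

def test_data_completeness(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    """Data completeness check: every source row and key must arrive in the warehouse."""
    src_count = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    tgt_count = target.execute("SELECT COUNT(*) FROM orders_fact").fetchone()[0]
    assert src_count == tgt_count, f"row count mismatch: {src_count} vs {tgt_count}"

    src_keys = {k for (k,) in source.execute("SELECT order_id FROM orders")}
    tgt_keys = {k for (k,) in target.execute("SELECT order_id FROM orders_fact")}
    missing = src_keys - tgt_keys
    assert not missing, f"keys lost during ETL: {sorted(missing)[:10]}"
```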

6. What is testing? Differentiate between Data Warehouse testing and traditional software testing.

ANS: Testing the data warehouse and business intelligence system is critical to success. Without testing, the data warehouse could produce incorrect answers and quickly lose the faith of the business intelligence users.


Effective testing requires putting together the right processes, people, and technology and deploying them in productive ways.

Data Warehouse Testing Responsibilities

Who should be involved with testing? The right team is essential to success:

Business Analysts gather and document requirements

QA Testers develop and execute test plans and test scripts

Infrastructure people set up test environments

Developers perform unit tests of their deliverables

DBAs test for performance and stress

Business Users perform functional tests including User Acceptance Tests (UAT)

Differences

Data warehouse database: Designed for analysis of business measures by categories and attributes.
OLTP database: Designed for real-time business operations.

Data warehouse database: Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
OLTP database: Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.

Data warehouse database: Loaded with consistent, valid data; requires no real-time validation.
OLTP database: Optimized for validation of incoming data during transactions; uses validation data tables.

Data warehouse database: Supports few concurrent users relative to OLTP.
OLTP database: Supports thousands of concurrent users.