
Data Warehousing - Quick Guide

Data Warehousing - Overview

The term "Data Warehouse" was first coined by Bill Inmon in 1990. He described a data warehouse as a subject-oriented, integrated, time-variant, and nonvolatile collection of data. This data helps analysts in an organization support the decision-making process.

An operational database undergoes day-to-day transactions, which cause frequent changes to the data on a daily basis. If a business executive later wants to analyse previous feedback on data such as a product, a supplier, or consumer data, the analyst will have no data available to analyse, because the previous data has been overwritten by those transactions.

Data warehouses provide generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, data warehouses also provide Online Analytical Processing (OLAP) tools. These tools help in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions like association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has become an important platform for data analysis and online analytical processing.

Understanding Data Warehouse

A data warehouse is a database that is kept separate from the organization's operational database.

A data warehouse is not updated frequently.

A data warehouse possesses consolidated historical data, which helps the organization analyse its business.

A data warehouse helps executives organize, understand, and use their data to make strategic decisions.

Data warehouse systems help in integrating a diversity of application systems.

A data warehouse system allows analysis of consolidated historical data.

Definition

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process.

Why a Data Warehouse Is Kept Separate from Operational Databases

The following are the reasons why data warehouses are kept separate from operational databases:

An operational database is constructed for well-known tasks and workloads, such as searching particular records and indexing, whereas data warehouse queries are often complex and present a general form of data.


Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

Operational database queries allow read and modify operations, while OLAP queries need only read-only access to the stored data.

An operational database maintains current data; a data warehouse, on the other hand, maintains historical data.

Data Warehouse Features

The key features of a data warehouse (subject-oriented, integrated, nonvolatile, and time-variant) are discussed below:

Subject Oriented - The data warehouse is subject-oriented because it provides information around a subject rather than around the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Note: A data warehouse does not require transaction processing, recovery, or concurrency control, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives organize, analyse, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

Financial services

Banking services

Consumer goods

Retail sectors

Controlled manufacturing

Data Warehouse Types

Information processing, analytical processing, and data mining are the three types of data warehouse applications, discussed below:

Information processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.


Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analysed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.

Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.

SN | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Used to analyse the business. | Used to run the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on the Star, Snowflake, and Fact Constellation schemas. | Based on the Entity-Relationship model.
6 | Subject oriented. | Application oriented.
7 | Contains historical data. | Contains current data.
8 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
9 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
10 | The number of users is in the hundreds. | The number of users is in the thousands.
11 | The number of records accessed is in the millions. | The number of records accessed is in the tens.
12 | The database size ranges from 100 GB to TB. | The database size ranges from 100 MB to GB.
13 | Highly flexible. | Provides high performance.

Data Warehousing - Concepts

What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information

There are decision support technologies available that help utilize a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather the data, analyse it, and make decisions based on the information in the warehouse. The information gathered from the warehouse can be used in any of the following domains:

Tuning production strategies - Product strategies can be well tuned by repositioning products and managing product portfolios by comparing sales quarterly or yearly.


Customer Analysis - Customer analysis is done by analysing the customer's buying preferences, buying time, budget cycles, etc.

Operations Analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyse business operations.

Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have the following two approaches:

Query Driven Approach

Update Driven Approach

Query Driven Approach

This is the traditional approach to integrating heterogeneous databases. It is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of the Query Driven Approach:

When a query is issued on the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.

These queries are then mapped and sent to the local query processors.

The results from the heterogeneous sites are integrated into a global answer set.

Disadvantages

The query-driven approach needs complex integration and filtering processes.

This approach is very inefficient.

This approach is very expensive for frequent queries.

This approach is also very expensive for queries that require aggregations.

Update Driven Approach

This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages:

This approach provides high performance.

The data is copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.

Query processing does not require interfacing with the processing at local sources.


Data Warehouse Tools and Utilities Functions

The following are the functions of data warehouse tools and utilities:

Data Extraction - Data extraction involves gathering data from multiple heterogeneous sources.

Data Cleaning - Data cleaning involves finding and correcting errors in the data.

Data Transformation - Data transformation involves converting data from the legacy format to the warehouse format.

Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

Refreshing - Refreshing involves propagating updates from the data sources to the warehouse.

Note: Data cleaning and data transformation are important steps in improving the quality of data and of data mining results.
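As a rough illustration of how these functions come together, the following is a minimal SQL sketch of a transform-and-load step from a staging area into a warehouse fact table. The table and column names (stg_sales, fact_sales, and so on) are illustrative assumptions, not names used elsewhere in this guide.

-- Load cleaned and transformed rows from a staging table into the fact table.
-- Rows with missing dimension keys are rejected (cleaning), and the source
-- timestamp is converted to the warehouse date format (transformation).
INSERT INTO fact_sales (time_key, item_key, branch_key, location_key, dollars_sold, units_sold)
SELECT CAST(s.txn_timestamp AS DATE),
       s.item_key,
       s.branch_key,
       s.location_key,
       s.amount,
       s.quantity
FROM stg_sales s
WHERE s.item_key IS NOT NULL
  AND s.location_key IS NOT NULL;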

Data Warehousing - Terminologies

In this chapter, we will discuss some of the commonly used terms in data warehousing.

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process. Let's explore this definition of a data warehouse.

Subject Oriented - The data warehouse is subject-oriented because it provides information around a subject rather than around the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Metadata - Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

In terms of a data warehouse, we can define metadata as follows:

Metadata is a road map to the data warehouse.

Metadata in a data warehouse defines the warehouse objects.


Metadata Repository

The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

Business Metadata - This metadata includes data ownership information, business definitions, and changing policies.

Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the migrated data and the transformations applied to it.

Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.

The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Data cube

A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

Illustration of a Data Cube

Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location. These dimensions allow it to keep track of monthly sales and of the branches at which the items were sold. There is a table associated with each dimension, known as a dimension table. The dimension table further describes the dimension. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.

The following table represents a 2-D view of the sales data for a company with respect to the time, item, and location dimensions.

But in this 2-D table we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of item sold. If we want to view the sales data with one more dimension, say the location dimension, then the 3-D view of the sales data with respect to time, item, and location is shown in the table below:
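In SQL terms, the 2-D and 3-D views are simply the same fact data aggregated over different sets of dimensions. The following is a minimal sketch; the fact table and its columns are illustrative assumptions.

-- 2-D view: sales by time and item for a single location (New Delhi).
SELECT time_key, item_type, SUM(dollars_sold) AS total_sales
FROM fact_sales
WHERE location = 'New Delhi'
GROUP BY time_key, item_type;

-- 3-D view: the same data with the location dimension added.
SELECT time_key, item_type, location, SUM(dollars_sold) AS total_sales
FROM fact_sales
GROUP BY time_key, item_type, location;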


The above 3-D table can be represented as a 3-D data cube, as shown in the following figure:

Data mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group within an organization. In other words, we can say that a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to remember about data marts:

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.


The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.

Data marts are small in size.

Data marts are customized by department.

The source of a data mart is a departmentally structured data warehouse.

Data marts are flexible.

Graphical representation of a data mart:

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
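As a minimal sketch of the idea (the operational table and column names are assumptions for illustration), a virtual warehouse can be implemented as a set of views defined directly over the operational schema:

-- A virtual warehouse: a summary view defined over an operational orders table.
CREATE VIEW vw_monthly_sales AS
SELECT EXTRACT(YEAR FROM order_date)  AS sale_year,
       EXTRACT(MONTH FROM order_date) AS sale_month,
       product_id,
       SUM(amount) AS total_amount
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date), product_id;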

Data Warehousing - Delivery Process

Introduction

A data warehouse is never static; it evolves as the business grows. Today's needs may be different from future needs. We must design the data warehouse to change constantly. The real problem is that the business itself is not aware of its requirements for information in the future. As the business evolves, its needs also change; therefore the data warehouse must be designed to keep pace with these changes. Hence data warehouse systems need to be flexible.

There should be a delivery process to deliver the data warehouse. However, data warehouse projects suffer from many issues that make it difficult to complete tasks and deliverables in the strict, ordered fashion demanded by the waterfall method, because the requirements are hardly ever fully understood. Only when the requirements are complete can the architecture, designs, and build components be completed.

Delivery Method


The delivery method is a variant of the joint application development approach, adopted for the delivery of a data warehouse. We stage the data warehouse delivery process to minimize risk. The approach discussed here does not reduce the overall delivery time-scales, but it ensures that business benefits are delivered incrementally through the development process.

Note: The delivery process is broken into phases to reduce the project and delivery risk.

The following diagram explains the stages in the delivery process:

IT Strategy

Data warehouses are strategic investments that require a business process to generate the projected benefits. An IT strategy is required to procure and retain funding for the project.

Business Case

The objective of the business case is to identify the projected business benefits that should be derived from using the data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.

Education and Prototyping

The organization will experiment with the concept of data analysis and educate itself on the value of a data warehouse before determining that a data warehouse is the right solution. This is addressed by prototyping. The prototyping activity helps in understanding the feasibility and benefits of a data warehouse. Prototyping on a small scale can further the educational process as long as:

The prototype addresses a defined technical objective.

The prototype can be thrown away after the feasibility concept has been shown.

The activity addresses a small subset of the eventual data content of the data warehouse.


The activity timescale is non-critical.

Points to remember to produce an early release of a part of the data warehouse that delivers business benefits:

Identify an architecture that is capable of evolving.

Focus on the business requirements and technical blueprint phases.

Limit the scope of the first build phase to the minimum that delivers business benefits.

Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

To provide quality deliverables, we should make sure the overall requirements are understood. The business requirements and technical blueprint stages are required for the following reasons:

If we understand the business requirements for both the short and medium term, then we can design a solution that satisfies the short-term need.

This solution would then be capable of growing into the full solution.

The things to determine in this stage are the following:

The business rules to be applied to the data.

The logical model for information within the data warehouse.

The query profiles for the immediate requirement.

The source systems that provide this data.

Technical Blueprint

This phase needs to deliver an overall architecture satisfying the long-term requirements. It also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following:

The overall system architecture.

The data retention policy.

The backup and recovery strategy.

The server and data mart architecture.

The capacity plan for hardware and infrastructure.

The components of database design.

Building the version

In this stage, the first production deliverable is produced.

This production deliverable is the smallest component of the data warehouse.

This smallest component adds business benefit.


History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase we do not add new entities, but additional physical tables would probably be created to store the increased data volumes.

For example, suppose the build-version phase has delivered a retail sales analysis data warehouse with two months' worth of history. This information allows the user to analyse only recent trends and address short-term issues. The user cannot identify annual and seasonal trends. So two years' worth of sales history could be loaded from the archive to allow the user to analyse yearly and seasonal sales trends. Now the 40 GB of data is extended to 400 GB.

Note: The backup and recovery procedures may become complex; therefore it is recommended to perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool.

This ad hoc query tool is used to operate the data warehouse.

These tools can generate the database queries.

Note: It is recommended not to use these access tools while the database is being substantially modified.

Automation

In this phase, the operational management processes are fully automated. These would include:

Transforming the data into a form suitable for analysis.

Monitoring query profiles and determining the appropriate aggregations to maintain system performance.

Extracting and loading the data from different source systems.

Generating aggregations from predefined definitions within the data warehouse.

Backing up, restoring, and archiving the data.

Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways:

By loading additional data into the data warehouse.

By introducing new data marts using the existing information.

Note: This phase should be performed separately, since it involves substantial effort and complexity.

Requirements Evolution

From the perspective of the delivery process, the requirements are always changing; they are not static. The delivery process must support this and allow these changes to be reflected within the system.


This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a pseudo application development process, where the new requirements are continually fed into the development activities. Partial deliverables are produced; these partial deliverables are fed back to users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

Data Warehousing - System Processes

We have a fixed number of operations to be applied to operational databases, and we have well-defined techniques for them, such as using normalized data and keeping tables small. These techniques are suitable for delivering that kind of solution. But in the case of a decision support system, we do not know what queries and operations will need to be executed in the future. Therefore the techniques applied to operational databases are not suitable for data warehouses.

In this chapter we'll focus on designing a data warehousing solution built on top of open-system technologies like Unix and relational databases.

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse. Here is the list of the four processes:

Extracting and loading the data.

Cleaning and transforming the data.

Backing up and archiving the data.

Managing queries and directing them to the appropriate data sources.

Extract and Load Process

Data extraction takes data from the source systems.

Data load takes the extracted data and loads it into the data warehouse.

Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

Points to remember about the extract and load process:


Controlling the process

When to Initiate Extract

Loading the Data

Controlling the process

Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules, and programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunications sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.
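A check of this kind can be expressed as a simple query over the two extracts before they are merged. The table names below are illustrative assumptions:

-- Customers present in the later customer extract but with no associated
-- subscription events in the earlier subscription extract; a large result
-- suggests the two extracts were cut at different points in time.
SELECT c.customer_id
FROM extract_customers c
LEFT JOIN extract_subscriptions s ON s.customer_id = c.customer_id
WHERE s.customer_id IS NULL;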

Loading the Data

After extracting the data, it is loaded into a temporary data store. There, in the temporary data store, it is cleaned up and made consistent.

Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process

Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved in cleaning and transforming:

Clean and transform the loaded data into a structure.

Partition the data.

Aggregation

Clean and Transform the loaded data into a structure

This will speed up the queries. It can be done in the following ways:

Make sure the data is consistent within itself.

Make sure the data is consistent with other data within the same data source.

Make sure the data is consistent with data in other source systems.

Make sure the data is consistent with the data already in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases query performance and decreases operational cost. The information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.


Partition the data

Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that the most common queries will analyse a subset or an aggregation of the detailed data.
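A common way to realize such aggregations is to precompute summary tables from the detailed fact data, for example with a statement like the following sketch (table and column names are illustrative assumptions):

-- Precomputed aggregation: monthly sales per item, derived from the detailed fact table.
CREATE TABLE agg_sales_by_month AS
SELECT item_key,
       EXTRACT(YEAR FROM sale_date)  AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(dollars_sold) AS dollars_sold,
       SUM(units_sold)   AS units_sold
FROM fact_sales
GROUP BY item_key, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);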

Backup and Archive the data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to back it up on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In this kind of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. In that case we require some data to be restored from the archive.

Query Management Process

This process performs the following functions:

It manages the queries.

It speeds up query execution.

It directs the queries to the most effective data sources.

It should also ensure that all system sources are used in the most effective way.

It is also required to monitor actual query profiles.

The information gathered in this process is used by the warehouse management process to determine which aggregations to generate.

This process does not generally operate during the regular load of information into the data warehouse.

Data Warehousing - Architecture

In this chapter, we will discuss the business analysis framework for data warehouse design and the architecture of a data warehouse.

Business Analysis Framework

The business analyst gets information from the data warehouse to measure performance and make critical adjustments in order to win over other businesses in the market. Having a data warehouse offers the following advantages for the business:

Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.

A data warehouse provides a consistent view of customers and items, and hence helps in managing customer relationships.


A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.

To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has a different view regarding the design of a data warehouse. These views are as follows:

The top-down view - This view allows the selection of relevant information needed for the data warehouse.

The data source view - This view presents the information being captured, stored, and managed by the operational system.

The data warehouse view - This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.

The business query view - This is the view of the data from the viewpoint of the end user.

Three-Tier Data Warehouse Architecture

Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture:

Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.

Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:

By a relational OLAP (ROLAP) server, which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.

By a multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.

Top Tier - This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools, and data mining tools.

The following diagram explains the three-tier architecture of a data warehouse:


Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data warehouse models:

Virtual Warehouse

Data mart

Enterprise Warehouse

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse.

Building a virtual warehouse requires excess capacity on operational database servers.

Data Mart

A data mart contains a subset of organization-wide data.

This subset of data is valuable to a specific group within an organization.

Note: In other words, we can say that a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to remember about data marts

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.


The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.

Data marts are small in size.

Data marts are customized by department.

The source of a data mart is a departmentally structured data warehouse.

Data marts are flexible.

Enterprise Warehouse

An enterprise warehouse collects all the information on all the subjects spanning the entire organization.

It provides enterprise-wide data integration.

The data is integrated from operational systems and external information providers.

This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

Load Manager

This component performs the operations required for the extract and load process.

The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions:

Extract the data from the source systems.

Fast-load the extracted data into a temporary data store.

Perform simple transformations into a structure similar to the one in the data warehouse.


Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

The transformations affect the speed of data processing.

It is more effective to load the data into a relational database prior to applying transformations and checks.

Gateway technology proves not to be suitable, since gateways tend not to be performant when large data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks (a minimal sketch follows the list):

Strip out all the columns that are not required within the warehouse.

Convert all the values to the required data types.
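The following SQL sketch illustrates such a transformation. The EPOS source table, the staging table, and their columns are illustrative assumptions:

-- Keep only the columns needed in the warehouse and convert values to the
-- required data types while loading EPOS sales transactions.
INSERT INTO stg_sales (sale_date, item_key, branch_key, amount, quantity)
SELECT CAST(t.txn_timestamp AS DATE),
       t.item_code,
       t.branch_code,
       CAST(t.amount AS DECIMAL(12,2)),
       CAST(t.qty AS INTEGER)
FROM epos_transactions t;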

Warehouse Manager

The warehouse manager is responsible for the warehouse management process.

The warehouse manager consists of third-party system software, C programs, and shell scripts.


Warehouse Manager Architecture

The warehouse manager includes the following:

The controlling process

Stored procedures or C with SQL

Backup/recovery tool

SQL scripts

Operations Performed by the Warehouse Manager

The warehouse manager analyses the data to perform consistency and referential integrity checks.

It creates the indexes, business views, and partition views against the base data.

It generates new aggregations and updates the existing aggregations. It also generates the normalizations.

It transforms and merges the source data from the temporary store into the published data warehouse.

It backs up the data in the data warehouse.

It archives the data that has reached the end of its captured life.

Note: The warehouse manager also analyses query profiles to determine whether the indexes and aggregations are appropriate.

Query Manager

The query manager is responsible for directing the queries to the suitable tables.


By directing the queries to the appropriate tables, the query request and response process is sped up.

The query manager is also responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture

The query manager includes the following:

Query redirection via a C tool or RDBMS.

Stored procedures.

Query management tool.

Query scheduling via a C tool or RDBMS.

Query scheduling via third-party software.

Detailed information

The following diagram shows the detailed information:


The detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. The detailed information is loaded into the data warehouse to supplement the aggregated data.

Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.

Summary Information

This area of the data warehouse keeps the predefined aggregations.

These aggregations are generated by the warehouse manager.

This area changes on an ongoing basis in order to respond to changing query profiles.

This area of the data warehouse must be treated as transient.

Points to remember about summary information:

The summary data speeds up the performance of common queries.

It increases the operational cost.

It needs to be updated whenever new data is loaded into the data warehouse.

It may not need to be backed up, since it can be generated fresh from the detailed information.

Data Warehousing - OLAP

Introduction


An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, interactive access to information. In this chapter we will discuss the types of OLAP servers, OLAP operations, and the differences between OLAP and statistical databases and OLTP.

Types of OLAP Servers

We have four types of OLAP servers, listed below:

Relational OLAP (ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

Specialized SQL Servers

Relational OLAP (ROLAP)

Relational OLAP servers are placed between a relational back-end server and client front-end tools. To store and manage warehouse data, relational OLAP uses a relational or extended-relational DBMS.

ROLAP includes the following:

Implementation of aggregation navigation logic.

Optimization for each DBMS back end.

Additional tools and services.

Multidimensional OLAP (MOLAP)

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP (HOLAP)

The hybrid OLAP technique is a combination of ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows storing large volumes of detail data, while the aggregations are stored separately in a MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations

Since the OLAP server is based on a multidimensional view of data, we will discuss the OLAP operations in terms of multidimensional data.

Here is the list of OLAP operations:


Roll-up

Drill-down

Slice and dice

Pivot (rotate)

Roll-up

This operation performs aggregation on a data cube in either of the following ways:

By climbing up a concept hierarchy for a dimension.

By dimension reduction.

Consider the following diagram, showing the roll-up operation.

The roll-up operation is performed by climbing up the concept hierarchy for the dimension location.

Initially the concept hierarchy was "street < city < province < country".

On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

The data is grouped into countries rather than cities.


When the roll-up operation is performed, one or more dimensions from the data cube are removed.
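In relational terms, rolling up along the location hierarchy simply means aggregating at a coarser level. A minimal SQL sketch, with illustrative fact and dimension table names, is:

-- Roll-up: re-aggregate sales from the city level to the country level.
SELECT d.country, SUM(f.dollars_sold) AS dollars_sold
FROM fact_sales f
JOIN dim_location d ON d.location_key = f.location_key
GROUP BY d.country;   -- before the roll-up this was GROUP BY d.city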

Drill-down

Drill-down is the reverse of roll-up. This operation is performed in either of the following ways:

By stepping down a concept hierarchy for a dimension.

By introducing a new dimension.

Consider the following diagram, showing the drill-down operation:

The drill-down operation is performed by stepping down the concept hierarchy for the dimension time.

Initially the concept hierarchy was "day < month < quarter < year".

On drilling down, the time dimension is descended from the level of quarter to the level of month.

When the drill-down operation is performed, one or more dimensions from the data cube are added.

It navigates the data from less detailed data to highly detailed data.
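The corresponding SQL sketch for a drill-down, again with illustrative table names, descends the time hierarchy by grouping at a finer level:

-- Drill-down: descend the time hierarchy from quarter to month.
SELECT d.month, SUM(f.dollars_sold) AS dollars_sold
FROM fact_sales f
JOIN dim_time d ON d.time_key = f.time_key
GROUP BY d.month;     -- before the drill-down this was GROUP BY d.quarter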

Slice

The slice operation performs a selection on one dimension of a given cube, producing a new sub-cube. Consider the following diagram, showing the slice operation.


The slice operation is performed for the dimension time using the criterion time = "Q1".

It forms a new sub-cube by selecting a single dimension value.
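A slice corresponds to fixing a single dimension value in the WHERE clause. A minimal sketch with illustrative table names:

-- Slice: keep only the cells where time = 'Q1'.
SELECT l.city, i.item_type, SUM(f.dollars_sold) AS dollars_sold
FROM fact_sales f
JOIN dim_time t ON t.time_key = f.time_key
JOIN dim_location l ON l.location_key = f.location_key
JOIN dim_item i ON i.item_key = f.item_key
WHERE t.quarter = 'Q1'
GROUP BY l.city, i.item_type;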

Dice

The dice operation performs a selection on two or more dimensions of a given cube, producing a new sub-cube. Consider the following diagram, showing the dice operation:

The dice operation on the cube is based on the following selection criteria, which involve three dimensions:

(location = "Toronto" or "Vancouver")

(time = "Q1" or "Q2")

(item = "Mobile" or "Modem")
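A dice applies selection criteria on several dimensions at once; the criteria above translate directly into a WHERE clause (illustrative table names):

-- Dice: restrict location, time, and item simultaneously.
SELECT l.city, t.quarter, i.item_name, SUM(f.dollars_sold) AS dollars_sold
FROM fact_sales f
JOIN dim_time t ON t.time_key = f.time_key
JOIN dim_location l ON l.location_key = f.location_key
JOIN dim_item i ON i.item_key = f.item_key
WHERE l.city IN ('Toronto', 'Vancouver')
  AND t.quarter IN ('Q1', 'Q2')
  AND i.item_name IN ('Mobile', 'Modem')
GROUP BY l.city, t.quarter, i.item_name;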

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram, showing the pivot operation.


Here, the item and location axes of the 2-D slice are rotated.
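In SQL, a pivot can be sketched with conditional aggregation, turning the item values into columns so that locations become the rows (illustrative names):

-- Pivot: rotate the item axis into columns.
SELECT l.city,
       SUM(CASE WHEN i.item_name = 'Mobile' THEN f.dollars_sold ELSE 0 END) AS mobile_sales,
       SUM(CASE WHEN i.item_name = 'Modem'  THEN f.dollars_sold ELSE 0 END) AS modem_sales
FROM fact_sales f
JOIN dim_item i ON i.item_key = f.item_key
JOIN dim_location l ON l.location_key = f.location_key
GROUP BY l.city;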

OLAP vs OLTP

SN | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Used to analyse the business. | Used to run the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on the Star, Snowflake, and Fact Constellation schemas. | Based on the Entity-Relationship model.
6 | Subject oriented. | Application oriented.
7 | Contains historical data. | Contains current data.
8 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
9 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
10 | The number of users is in the hundreds. | The number of users is in the thousands.
11 | The number of records accessed is in the millions. | The number of records accessed is in the tens.
12 | The database size ranges from 100 GB to TB. | The database size ranges from 100 MB to GB.
13 | Highly flexible. | Provides high performance.

Data Warehousing - Relational OLAP

Introduction

Relational OLAP servers are placed between a relational back-end server and client front-end tools. To store and manage warehouse data, relational OLAP uses a relational or extended-relational DBMS.

ROLAP includes the following:

Implementation of aggregation navigation logic.

Optimization for each DBMS back end.

Additional tools and services.

Note: ROLAP servers are highly scalable.

Points to remember

ROLAP tools need to analyse large volumes of data across multiple dimensions.

ROLAP tools need to store and analyse highly volatile and changeable data.

Relational OLAP Architecture

ROLAP includes the following components:

Database Server

ROLAP Server

Front end tool


Advantages

ROLAP servers are highly scalable.

They can be easily used with existing RDBMSs.

Data can be stored efficiently, since no zero facts need to be stored.

ROLAP tools do not use pre-calculated data cubes.

The DSS server of MicroStrategy adopts the ROLAP approach.

Disadvantages

Poor query performance.

Some limitations of scalability, depending on the technology architecture that is utilized.

Data Warehousing - Multidimensional OLAP

Introduction

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Points to remember:

MOLAP tools need to process information with a consistent response time, regardless of the level of summarization or the calculations selected.

MOLAP tools need to avoid many of the complexities of creating a relational database to store data for analysis.

MOLAP tools need the fastest possible performance.

A MOLAP server adopts two levels of storage representation to handle dense and sparse data sets.

Denser sub-cubes are identified and stored as array structures.

Sparse sub-cubes employ compression technology.

MOLAP Architecture

MOLAP includes the following components:

Database server

MOLAP server

Front end tool


Advantages

Here is the list of advantages of multidimensional OLAP:

MOLAP allows the fastest indexing to the precomputed summarized data.

It helps users who are connected to a network and need to analyse larger, less-defined data.

It is easier to use; therefore MOLAP is best suited for inexperienced users.

Disadvantages

MOLAP is not capable of containing detailed data.

The storage utilization may be low if the data set is sparse.

MOLAP vs ROLAP

SN | MOLAP | ROLAP
1 | Information retrieval is fast. | Information retrieval is comparatively slow.
2 | It uses a sparse array to store the data sets. | It uses relational tables.
3 | MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
4 | It maintains a separate database for data cubes. | It may not require space other than that available in the data warehouse.
5 | DBMS facility is weak. | DBMS facility is strong.

Data Warehousing - Schemas

Introduction

The schema is a logical description of the entire database. It includes the names and descriptions of records of all record types, including all associated data items and aggregates. Like a database, a data warehouse also requires a schema. A database uses the relational model, while a data warehouse uses the star, snowflake, and fact constellation schemas. In this chapter we will discuss the schemas used in a data warehouse.

Star Schema

In the star schema, each dimension is represented with only one dimension table.

This dimension table contains the set of attributes.

In the following diagram, we show the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.

There is a fact table at the centre. This fact table contains the keys to each of the four dimensions.

The fact table also contains the attributes, namely dollars sold and units sold.

Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
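The same star schema can be sketched in SQL as one fact table carrying foreign keys to the dimension tables. The data types below are illustrative assumptions; the remaining dimension tables (time, item, branch) would be defined in the same way as the location dimension:

-- Location dimension of the star schema.
CREATE TABLE dim_location (
    location_key      INTEGER PRIMARY KEY,
    street            VARCHAR(100),
    city              VARCHAR(50),
    province_or_state VARCHAR(50),
    country           VARCHAR(50)
);

-- Fact table at the centre of the star, holding one key per dimension
-- plus the measures dollars_sold and units_sold.
CREATE TABLE fact_sales (
    time_key     INTEGER,
    item_key     INTEGER,
    branch_key   INTEGER,
    location_key INTEGER,
    dollars_sold DECIMAL(12,2),
    units_sold   INTEGER
);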

Snowflake Schema

In the snowflake schema, some dimension tables are normalized.

The normalization splits up the data into additional tables.

Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.


Therefore the item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key.

The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note: Due to the normalization in the snowflake schema, redundancy is reduced; therefore the schema becomes easier to maintain and saves storage space.

Fact Constellation Schema

In the fact constellation schema there are multiple fact tables. This schema is also known as a galaxy schema.

In the following diagram we have two fact tables, namely sales and shipping.

The sales fact table is the same as that in the star schema.


The shipping fact table also contains two measures, namely dollars cost and units shipped.

It is also possible for dimension tables to be shared between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Schema Definition

A multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.

Syntax for cube definition:

define cube < cube_name > [ < dimension_list > ]: < measure_list >

Syntax for dimension definition:

define dimension < dimension_name > as ( < attribute_or_dimension_list > )

Star Schema Definition

The star schema that we have discussed can be defined using Data Mining Query Language (DMQL) as follows:

define cube sales star [time, item, branch, location]:
    dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

Snowflake Schema Definition

The snowflake schema that we have discussed can be defined using Data Mining Query Language (DMQL) as follows:

define cube sales snowflake [time, item, branch, location]:
    dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))

Fact Constellation Schema Definition


The fact constellation schema that we have discussed can be defined using Data Mining Query Language (DMQL) as follows:

define cube sales [time, item, branch, location]:
    dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

define cube shipping [time, item, shipper, from location, to location]:
    dollars cost = sum(cost in dollars), units shipped = count(*)

define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales

Data Warehousing - Partitioning Strategy

Introduction

Partitioning is done to enhance performance and make management easier. Partitioning also helps in balancing the various requirements of the system. It optimizes hardware performance and simplifies the management of the data warehouse; here we partition each fact table into multiple separate partitions. In this chapter we will discuss the partitioning strategies.

Why to Partition

Here is the list of reasons:

For easy management

To assist backup/recovery

To enhance performance

For easy management

The fact table in a data warehouse can grow to many hundreds of gigabytes in size. A fact table this large is very hard to manage as a single entity; therefore it needs partitioning.

To assist backup/recovery

If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only the data that is required on a regular basis. This reduces the time to load and also enhances the performance of the system.


Note: To cut down on the backup size, all partitions other than the current partition can be marked read-only. We can then put these partitions into a state where they cannot be modified and back them up. This means that only the current partition needs to be backed up regularly.

To enhance performance

By partitioning the fact table into sets of data, query performance is enhanced because a query now scans only the partitions that are relevant, rather than the whole table.

Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period, where each period represents a significant retention period within the business. For example, if the user mostly queries for month-to-date data, then it is appropriate to partition into monthly segments. We can reuse the partitioned tables by removing the data in them once the retention period has passed.
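
As a rough illustration only, routing fact rows to monthly segments could look like the following Python sketch; the table-naming scheme and the 24-month retention period are assumptions made for the example, not part of the text above.

from datetime import date

def partition_name(sale_date):
    # Route a fact row to its monthly segment, e.g. fact_sales_201309.
    return "fact_sales_{:04d}{:02d}".format(sale_date.year, sale_date.month)

def is_reusable(partition_month, current_month, retention_months=24):
    # A segment older than the assumed retention period can be emptied and reused.
    age = (current_month.year - partition_month.year) * 12 + (current_month.month - partition_month.month)
    return age >= retention_months

print(partition_name(date(2013, 9, 3)))                  # fact_sales_201309
print(is_reusable(date(2011, 8, 1), date(2013, 9, 1)))   # True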

Partitioning by Time into Different-sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and larger partitions for inactive data.

Following is the list of advantages.

The detailed information remains available online.

The number of physical tables is kept relatively small, which reduces the operating cost.

This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.

Following is the list of disadvantages.

This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let's have an example.

Suppose a marketing function that is structured into distinct regional departments, for example on a state-by-state basis. If each region mostly queries information captured within its own region, it proves more effective to partition the fact table into regional partitions. This speeds up the queries because they do not need to scan information that is not relevant.

Following is the list of advantages.

The query does not have to scan irrelevant data, which speeds up the query process.

Following is the list of disadvantages.


This technique is not appropriate where the dimension is likely to change in the future, so it is worth determining that the dimension will not change before using it as the basis for partitioning.

If the dimension changes, then the entire fact table would have to be repartitioned.

Note: We recommend partitioning only on the basis of the time dimension unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.

Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size. We can set a predetermined size as a critical point; when the table exceeds that size, a new table partition is created.

Following is the list of disadvantages.

This partitioning is complex to manage.

Note: This partitioning requires metadata to identify what data is stored in each partition, as the sketch below illustrates.
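
A minimal Python sketch of the idea, assuming a row-count threshold as the predetermined size and a plain dictionary standing in for the partition metadata (both are illustrative assumptions):

MAX_ROWS = 50_000_000        # assumed critical point for one partition

partition_metadata = {}      # partition name -> description of the data it stores

def target_partition(current_name, current_row_count, sequence):
    # Keep loading into the current partition until it exceeds the predetermined size.
    if current_row_count < MAX_ROWS:
        return current_name, sequence
    new_name = "fact_sales_p{:03d}".format(sequence + 1)
    # Metadata records what data each partition holds.
    partition_metadata[new_name] = "rows loaded after {} filled up".format(current_name)
    return new_name, sequence + 1

name, seq = target_partition("fact_sales_p001", 51_000_000, 1)
print(name, partition_metadata)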

Partitioning Dimensions

If a dimension contains a large number of entries, then it is required to partition the dimension. Here we have to check the size of the dimension.

Suppose a large dimension that keeps changing over time. If we need to store all the variations in order to apply comparisons, that dimension may become very large. This would definitely affect the response time.

Round Robin Partitions

In the round robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user access tools to refer to the correct table partition.

Following is the list of advantages.

This technique makes it easy to automate table management facilities within the data warehouse.

Vertical Partitioning

In vertical partitioning, the data is split vertically.


Vertical partitioning can be performed in the following two ways:

Normalization

Row Splitting

Normalization

Normalization is the standard relational method of database organization. In this method, repeating rows are collapsed into a single row in a separate table, which reduces the space used.

Table before normalization

Product_id Quantity Value sales_date Store_id Store_name Location Region

30 5 3.67 3-Aug-13 16 sunny Bangalore S

35 4 5.33 3-Sep-13 16 sunny Bangalore S

40 5 2.50 3-Sep-13 64 san Mumbai W

45 7 5.66 3-Sep-13 16 sunny Bangalore S

Table after normalization

Store_id Store_name Location Region

16 sunny Bangalore S

64 san Mumbai W

Product_id Quantity Value sales_date Store_id

30 5 3.67 3-Aug-13 16

35 4 5.33 3-Sep-13 16

40 5 2.50 3-Sep-13 64

45 7 5.66 3-Sep-13 16
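
The same split can be expressed as a small Python sketch; the dictionaries below simply mirror the tables above, with the store dimension collapsed to one row per store:

sales = [
    {"Product_id": 30, "Quantity": 5, "Value": 3.67, "sales_date": "3-Aug-13",
     "Store_id": 16, "Store_name": "sunny", "Location": "Bangalore", "Region": "S"},
    {"Product_id": 40, "Quantity": 5, "Value": 2.50, "sales_date": "3-Sep-13",
     "Store_id": 64, "Store_name": "san", "Location": "Mumbai", "Region": "W"},
]

stores = {}       # Store_id -> store attributes (one row per store after normalization)
fact_rows = []    # the slimmed-down table keeps only Store_id as a reference

for row in sales:
    stores[row["Store_id"]] = {k: row[k] for k in ("Store_name", "Location", "Region")}
    fact_rows.append({k: row[k] for k in ("Product_id", "Quantity", "Value", "sales_date", "Store_id")})

print(stores[16])   # {'Store_name': 'sunny', 'Location': 'Bangalore', 'Region': 'S'}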

Row Splitting

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.

Note: While using vertical partitioning, make sure that there is no requirement to perform major join operations between the two partitions.

Identify Key to Partition

It is very crucial to choose the right partition key. Choosing the wrong partition key will force you to reorganize the fact table. Let's have an example. Suppose we want to partition the following table:

Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name

We can choose to partition on any key. The two possible keys could be:

region

transaction_date

Now suppose the business is organised into 30 geographical regions, and each region has a different number of branches. Partitioning by region gives us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that the vast majority of queries are restricted to the user's own business region.

If instead we partition by transaction_date, then the latest transactions from every region will be in one partition, and a user who wants to look at data within his own region has to query across multiple partitions.

Hence it is worth determining the right partitioning key.

Data Warehousing - Metadata Concepts

What is Metadata

Metadata is simply defined as data about data. The data that is used to describe other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of the data warehouse, we can define metadata as follows:

Metadata is a road map to data warehouse.


Metadata in a data warehouse defines the warehouse objects.

The metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.

Note: In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for timestamping any extracted data and recording the source of the extracted data.

Categories of Metadata

The metadata can be broadly categorized into three categories, illustrated by the sketch after this list:

Business Metadata - This metadata contains the data ownership information, business definition and changing policies.

Technical Metadata - This metadata includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.

Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data as it is migrated and the transformations applied to it.
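
Purely as an illustration, one entry of such metadata could be held as plain Python dictionaries grouped by these three categories (the field names are assumed, not a standard):

sales_fact_metadata = {
    "business": {
        "owner": "Sales department",
        "definition": "One row per point-of-sale transaction",
        "change_policy": "Appended nightly, never updated in place",
    },
    "technical": {
        "table": "sales_fact",
        "columns": {"dollars_sold": "NUMBER(12,2)", "units_sold": "INTEGER"},
        "primary_key": ["time_key", "item_key", "branch_key", "location_key"],
    },
    "operational": {
        "currency": "active",   # active, archived or purged
        "lineage": ["extracted from EPOS feed", "cleaned", "loaded 2013-09-04"],
    },
}

print(sales_fact_metadata["operational"]["currency"])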

Role of Metadata

Metadata has a very important role in a data warehouse. The role of metadata is different from that of the warehouse data, yet it is no less important. The various roles of metadata are explained below.

The metadata acts as a directory that helps the decision support system to locate the contents of the data warehouse.

Metadata helps the decision support system in mapping data when data is transformed from the operational environment to the data warehouse environment.

Metadata helps in summarization between current detailed data and highly summarized data.

Metadata also helps in summarization between lightly summarized data and highly summarized data.

Metadata is also used by query tools.

Metadata are used in reporting tools.


Metadata are used in extraction and cleansing tools.

Metadata are used in transformation tools.

Metadata also plays an important role in loading functions.

Diagram to understand role of Metadata.

Metadata Repository

The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

Definition of data warehouse - This includes the description of the structure of the data warehouse. The description is defined by schema, views, hierarchies, derived data definitions, and data mart locations and contents.

Business Metadata - This includes the data ownership information, business definition and changing policies.

Operational Metadata - This includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data as it is migrated and the transformations applied to it.

Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning and cleaning, transformation rules, and data refresh and purging rules.

The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Challenges for Metadata Management

The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformations and ensures the accuracy of calculations. Metadata also enforces consistent definitions of business terms for business end users. Alongside all these uses, metadata management has its own challenges, some of which are discussed below.


The metadata in a big organization is scattered across the organization. This metadata is spread across spreadsheets, databases, and applications.

The metadata could be present in text files or multimedia files. To use this data for information management solutions, it needs to be correctly defined.

There are no industry-wide accepted standards. Data management solution vendors have a narrow focus.

There are no easy and accepted methods of passing metadata.

Data Warehousing - Data Marting

Why to create Datamart

The following are the reasons to create a data mart:

To partit ion data in order to impose access control strategies.

To speed up the queries by reducing the volume of data to be scanned.

To segment data into different hardware platforms.

To structure data in a form suitable for a user access tool.

Note: Do not create a data mart for any other reason, since the operating cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.

Steps to determine whether the data mart fits the bill

The following steps need to be followed to make data marting cost-effective:

Identify the Functional Splits

Identify User Access Tool Requirements

Identify Access Control Issues

Identify the Functional Splits

In this step, we determine whether the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tends to be in isolation from the rest of the organization. Let's have an example.

Suppose a retail organization where each merchant is accountable for maximizing the sales of a group of products. For this, the valuable information is:

sales transactions on a daily basis

sales forecasts on a weekly basis

stock position on a daily basis

stock movements on a daily basis


As the merchant is not interested in products they are not dealing with, the data mart is a subset of the data covering the product group of interest. The following diagram shows data marting for different users.
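
A minimal Python sketch of carving out such a mart, assuming each warehouse fact row carries a product_group attribute (an assumption made only for this example):

def build_product_group_mart(warehouse_rows, product_group):
    # The mart is just the subset of the warehouse data the merchant deals with.
    return [row for row in warehouse_rows if row["product_group"] == product_group]

warehouse_rows = [
    {"product_group": "beverages", "units_sold": 120},
    {"product_group": "dairy", "units_sold": 45},
]
print(build_product_group_mart(warehouse_rows, "beverages"))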

Issues in determining the functional split:

The structure of the department may change.

The products might switch from one department to another.

The merchant could query the sales trends of other products to analyse what is happening to the sales.

These are issues that need to be taken into account while determining the functional split.

Note: We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements

For user access tools that require internal data structures, we need data marts to support such tools. The data in such structures is outside the control of the data warehouse but needs to be populated and updated on a regular basis.

There are some tools that populate directly from the source system, but some cannot. Therefore, additional requirements outside the scope of the tool need to be identified for the future.

Note: In order to ensure consistency of data across all access tools, the data should not be directly populated from the data warehouse; rather, each tool must have its own data mart.

Identify Access Control Issues


There need to be privacy rules to ensure that the data is accessed by authorised users only. For example, in a data warehouse for a retail banking institution, ensure that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.

Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data.

Designing Data Marts

The data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. This helps in maintaining control over database instances.

The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all the dimension data in the starflake schema.

Cost Of Data Marting

The following are the cost measures for data marting:

Hardware and Software Cost

Network Access

Time Window Constraints

Hardware and Software Cost


Although data marts are created on the same hardware, they still require some additional hardware and software. To handle the user queries, additional processing power and disk storage are needed. If the detailed data and the data mart both exist within the data warehouse, then we face additional cost to store and manage the replicated data.

Note: Data marting is more expensive than aggregation, therefore it should be used as an additional strategy, not as an alternative strategy.

Network Access

The data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.

Time Window Constraints

The extent to which the data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The feasible number of data marts depends on:

Network Capacity.

Time Window Available

Volume of data being transferred

Mechanisms being used to insert data into data mart

Data Warehousing - System Managers

Introduction

System management is a must for the successful implementation of a data warehouse. In this chapter, we will discuss the most important system managers, mentioned below:

System Configuration Manager

System Scheduling Manager

System Event Manager

System Database Manager

System Backup Recovery Manager

System Configuration Manager

The system configuration manager is responsible for the management of the setup and configuration of the data warehouse.

The structure of the configuration manager varies from one operating system to another.

In Unix, the structure of the configuration manager varies from vendor to vendor.


The interface of the configuration manager allows us to control all aspects of the system.

Note: The most important configuration tool is the I/O manager.

System Scheduling Manager

The system scheduling manager is also responsible for the successful implementation of the data warehouse. The purpose of the scheduling manager is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The features of a system scheduling manager are as follows:

Work across cluster or MPP boundaries.

Deal with international time differences.

Handle job failure.

Handle multiple queries.

Support job priorities.

Restart or requeue failed jobs.

Notify the user or a process when a job is completed.

Maintain the job schedules across system outages.

Requeue jobs to other queues.

Support the stopping and starting of queues.

Log Queued jobs.

Deal with interqueue processing.

Note: The above list can be used as evaluation parameters for a good scheduler.

Some important jobs that a scheduler must be able to handle are as follows:

Daily and ad hoc query scheduling.

Execution of regular report requirements.

Data load

Data Processing

Index creation

Backup

Aggregation creation

Data transformation

Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across that architecture.

System Event Manager


The event manager is a kind of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore we need a tool that automatically handles all the events without intervention from the user.

Note: The event manager monitors the occurrences of events and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.

Events

The question arises: what is an event? An event is nothing but an action generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.

The following are the common events that are required to be tracked:

Hardware failure.

Running out of space on certain key disks.

A process dying.

A process returning an error.

CPU usage exceeding an 80% threshold.

Internal contention on database serialization points.

Buffer cache hit ratios exceeding or falling below a threshold.

A table reaching the maximum of its size.

Excessive memory swapping.

A table failing to extend due to lack of space.

A disk exhibiting I/O bottlenecks.

Usage of temporary or sort areas reaching certain thresholds.

Any other database shared memory usage.

The most important thing about events is that they should be capable of being handled on their own, i.e. there are event packages that define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever an event occurs.
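
A skeletal Python sketch of an event handler registry; the event names, thresholds and handler bodies are invented for illustration, and a real event manager is far richer than this:

event_handlers = {}   # event name -> event handler (the code run when the event occurs)

def on_event(name):
    def register(handler):
        event_handlers[name] = handler
        return handler
    return register

@on_event("disk_space_low")
def handle_disk_space_low(details):
    print("Key disk running out of space:", details)

def raise_event(name, details):
    # Executed automatically, without user intervention, whenever the event fires.
    handler = event_handlers.get(name)
    if handler is not None:
        handler(details)

raise_event("disk_space_low", {"disk": "/dw/data01", "free_mb": 512})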

System and Database Manager

The system manager and the database manager are two separate pieces of software, but they do a similar job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and database manager are the ability to:

increase a user's quota.

assign and deassign roles to the users.

assign and deassign profiles to the users.

perform database space management.


monitor and report on space usage.

tidy up fragmented and unused space.

add and expand the space.

add and remove users.

manage user passwords.

manage summary or temporary tables.

assign or deassign temporary space to and from the user.

reclaim the space from old or out-of-date temporary tables.

manage error and trace logs.

browse log and trace files.

redirect error or trace information.

switch on and off error and trace logging.

perform system space management.

monitor and report on space usage.

clean up old and unused file directories.

add or expand space.

System Backup Recovery Manager

The backup and recovery tool makes it easy for operations and management staff to back up the data. It is worth noting that the system backup manager must be integrated with the schedule manager software being used. The important features that are required for the management of backups are as follows:

Scheduling

Backup data tracking

Database awareness.

Backups are taken only to protect against data loss. Following are the important points to remember:

The backup software will keep some form of database of where and when each piece of data was backed up.

The backup recovery manager must have a good front end to that database.

The backup recovery software should be database aware.

Being aware of the database, the software can then be addressed in database terms, and will not perform backups that would not be viable.

Data Warehousing - Process Managers

Data Warehouse Load Manager


This component performs the operations required for the extract and load process.

The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions:

Extract the data from the source system.

Fast-load the extracted data into a temporary data store.

Perform simple transformations into a structure similar to the one in the data warehouse.

Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
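
As a hedged illustration only, pulling data through an ODBC gateway from Python might look like the sketch below; the DSN, credentials and table name are invented, and pyodbc is assumed to be installed as the ODBC client:

import pyodbc   # assumed ODBC client library; any gateway with a similar API would do

# Connection details are purely illustrative.
connection = pyodbc.connect("DSN=operational_db;UID=etl_user;PWD=secret")
cursor = connection.cursor()

# The gateway lets the client program send SQL to be executed at the server.
cursor.execute("SELECT product_id, quantity, value, sales_date FROM epos_sales")
extracted_rows = cursor.fetchall()
connection.close()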

Fast Load

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

The transformations affect the speed of data processing.

It is more effective to load the data into a relational database prior to applying transformations and checks.

Gateway technology proves not to be suitable, since gateways tend not to be performant when large data volumes are involved.

Simple Transformations


While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks (a sketch follows this list):

Strip out all the columns that are not required within the warehouse.

Convert all the values to required data types.
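
A small Python sketch of those two checks for an EPOS record; the column list and the target data types are assumptions made for the example:

REQUIRED_COLUMNS = {"product_id": int, "quantity": int, "value": float, "sales_date": str}

def simple_transform(raw_record):
    # Strip the columns not required within the warehouse and cast the rest.
    return {column: cast(raw_record[column]) for column, cast in REQUIRED_COLUMNS.items()}

raw = {"product_id": "30", "quantity": "5", "value": "3.67",
       "sales_date": "3-Aug-13", "till_operator": "not needed"}
print(simple_transform(raw))   # till_operator is stripped, the numeric values are cast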

Warehouse Manager

The warehouse manager is responsible for the warehouse management process.

The warehouse manager consists of third-party system software, C programs and shell scripts.

The size and complexity of the warehouse manager varies between specific solutions.

Warehouse Manager Architecture

The warehouse manager includes the following:

The Controlling process

Stored procedures or C with SQL

Backup/Recovery tool

SQL Scripts

Operations Performed by the Warehouse Manager

The warehouse manager analyses the data to perform consistency and referential integrity checks.

Creates indexes, business views and partition views against the base data.

Generates new aggregations and updates the existing aggregations.

Generates the normalizations.


The warehouse manager transforms and merges the source data from the temporary store into the published data warehouse.

Backs up the data in the data warehouse.

The warehouse manager archives the data that has reached the end of its captured life.

Note: The warehouse manager also analyses query profiles to determine which indexes and aggregations are appropriate.

Query Manager

The query manager is responsible for directing the queries to the suitable tables.

By directing the queries to the appropriate tables, the query request and response process is sped up.

The query manager is also responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture

The query manager includes the following:

The query redirection via C tool or RDBMS.

Stored procedures.

Query Management tool.

Query Scheduling via C tool or RDBMS.

Query scheduling via third-party software.

Operations Performed by the Query Manager

The query manager directs the queries to the appropriate tables.

The query manager schedules the execution of the queries posed by the end user.

The query manager stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

Data Warehousing - Security

Introduction

The objective of a data warehouse is to make large amounts of data easily accessible to the users, hence allowing the users to extract information about the business as a whole. But we know that there could be some security restrictions applied on the data that can prove an obstacle to accessing the information. If the analyst has a restricted view of the data, then it is impossible to capture a complete picture of the trends within the business.

The data from each analyst can be summarised and passed on to management, where the different summaries can be combined. As the aggregation of summaries cannot be the same as the aggregation of the data as a whole, it is possible to miss some information trends in the data unless someone is analysing the data as a whole.


Requirements

Adding security will affect the performance of the data warehouse, therefore it is worth determining the security requirements as early as possible. Adding security after the data warehouse has gone live is very difficult.

During the design phase of the data warehouse, we should keep in mind what data sources may be added later and what the impact of adding those data sources would be. We should consider the following possibilities during the design phase:

Will the new data sources require new security and/or audit restrictions to be implemented?

Will new users be added who have restricted access to data that is already generally available?

This situation arises when the future users and the data sources are not well known. In such a situation, we need to use our knowledge of the business and the objective of the data warehouse to determine the likely requirements.

Factors to Consider for Security Requirements

The following are the areas that are affected by security, hence it is worth considering these factors:

User Access

Data Load

Data Movement

Query Generation

User Access

We need to classify the data first and then classify the users by what data they can access. In other words, the users are classified according to the data they can access.

Data Classification

The following are the two approaches that can be used to classify the data:

The data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted and less sensitive data is classified as less restrictive.

The data can also be classified according to job function. This restriction allows only specific users to view particular data. Here we restrict the users to view only the data in which they are interested and for which they are responsible.

There are some issues with the second approach. To understand this, let's have an example. Suppose you are building a data warehouse for a bank, and suppose further that the data being stored in the data warehouse is the transaction data for all the accounts. The question here is who is allowed to see the transaction data. The solution lies in classifying the data according to the function.

User Classification

The following are the approaches that can be used to classify the users:

The users can be classified as per the hierarchy of users in an organisation, i.e. users can be classified by department, section, group, and so on.


The users can also be classified according to their role, with people grouped across departments based on their role.

Classification on the Basis of Department

Let's have an example of a data warehouse where the users are from the sales and marketing departments. We can design the security with a top-down company view, with access centred around the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram.

But if each department accesses different data, then we should design the security access for each department separately. This can be achieved by departmental data marts. Since these data marts are separated from the data warehouse, we can enforce separate security restrictions on each data mart. This approach is shown in the following figure.

Classification on the Basis of Role

If the data is generally available to all the departments, then it is worth following a role access hierarchy. In other words, if the data is generally accessed by all the departments, then apply the security restrictions as per the role of the user. The role access hierarchy is shown in the following figure.
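
A toy Python sketch of a role access hierarchy; the role names and subject areas are invented purely to illustrate the restriction:

role_access = {
    "analyst": {"sales"},
    "regional_manager": {"sales", "stock"},
    "head_office": {"sales", "stock", "customer_accounts"},
}

def can_access(role, subject_area):
    # Apply the security restriction as per the role of the user.
    return subject_area in role_access.get(role, set())

print(can_access("analyst", "customer_accounts"))      # False
print(can_access("head_office", "customer_accounts"))  # True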


Audit Requirements

Auditing is a subset of security. Auditing is a costly activity, therefore it is worth understanding the audit requirements and the reason for each audit requirement. Auditing can cause heavy overheads on the system, and completing auditing in time may require more hardware; therefore, it is recommended that, where possible, auditing should be switched off. Audit requirements can be categorized into the following:

Connections

Disconnections

Data access

Data change

Note: For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important because it can highlight unauthorised or fraudulent access.

Network Requirements

Network security is as important as other kinds of security, and we cannot ignore the network security requirements. We need to consider the following issues:

Is it necessary to encrypt data before transferring it to the data warehouse machine?

Are there restrictions on which network routes the data can take?

These restrictions need to be considered carefully. Following are the points to remember:

The process of encryption and decryption will increase overheads. It requires more processing power and processing time.

The cost of encryption can be high if the system is already heavily loaded, because the encryption overhead is borne by the source system.

Data Movement

There exist potential security implications while moving the data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised:


Where is the flat file stored?

Who has access to that disk space?

If we talk about the backup of these flat files, the following questions are raised:

Do you back up encrypted or decrypted versions?

Do these backups need to be made to special tapes that are stored separately?

Who has access to these tapes?

Some other forms of data movement, such as query result sets, also need to be considered. The questions raised when creating temporary tables are as follows:

Where is that temporary table to be held?

How do you make such a table visible?

We should avoid the accidental flouting of security restrictions. If a user with access to restricted data can generate accessible temporary tables, data can be made visible to non-authorized users. We can overcome this by having a separate temporary area for users with access to restricted data.

Documentation

The audit and security requirements need to be properly documented, as this will be treated as part of the justification. This document should contain all the information gathered on the following:

Data classification

User classification

Network requirements

Data movement and storage requirements

All auditable actions

Impact of Security on Design

Security affects the application code and the development timescales. Security affects the following:

Application development

Database design

Testing

Application Development

Security affects the overall application development and it also affects the design of the important components of the data warehouse, such as the load manager, the warehouse manager and the query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. There may also be a requirement for extra metadata to handle any extra objects.


To create and maintain the extra views, the warehouse manager may require extra code to enforce security. Extra checks may need to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available. The query manager requires changes to handle any access restrictions, and it will need to be aware of all the extra views and aggregations.

Database design

The database layout is also affected, because when security is added there is an increase in the number of views and tables. Adding security increases the size of the database and hence the complexity of the database design and management. It will also add complexity to backup management and the recovery plan.

Testing

Testing the data warehouse is a complex and lengthy process. Adding security to the data warehouse also affects the testing time complexity. It affects the testing in the following two ways:

It will increase the time required for integration and system testing.

There is added functionality to be tested, which will increase the size of the testing suite.

Data Warehousing - Backup

Introduction

A data warehouse holds a large volume of data and the data warehouse system is very complex, hence it is important to have a backup of all the data so that it is available for future recovery as required. In this chapter, we will discuss the issues in designing the backup strategy.

Backup Terminologies

Before proceeding further, we should know some of the backup terminology discussed below.

Complete backup - In a complete backup, the entire database is backed up at the same time. This backup includes all the database files, control files and journal files.

Partial backup - A partial backup is not a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-by-day basis, so that the whole database is effectively backed up once a week (a sketch follows this list).

Cold backup - A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.

Hot backup - A hot backup is taken when the database engine is up and running. The hot backup requirements that need to be considered vary from RDBMS to RDBMS. Hot backups are extremely useful.

Online backup - It is the same as a hot backup.
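
A sketch of the round-robin idea behind partial backups, assuming the database files are divided into seven groups so that the whole database is covered once a week:

FILE_GROUPS = ["file_group_{}".format(i) for i in range(7)]   # one assumed group per weekday

def group_to_back_up(day_of_week):
    # day_of_week: 0 = Monday ... 6 = Sunday; a different part is backed up each day.
    return FILE_GROUPS[day_of_week % 7]

# Over seven days every group is backed up exactly once.
print([group_to_back_up(day) for day in range(7)])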

Hardware Backup


It is important to decide which hardware to use for the backup, as it sets the upper bound on the speed at which a backup can be processed. The speed of processing backups and restores depends not only on the hardware being used but also on how the hardware is connected, the bandwidth of the network, the backup software and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available and their pros and cons. These choices are as follows:

Tape Technology

Disk Backups

Tape Technology

The tape choices can be categorized as follows:

Tape media

Standalone tape drives

Tape stackers

Tape silos

Tape Media

There exist several varieties of tape media. Some tape media standards are listed in the table below:

Tape Media Capacity I/O rates

DLT 40 GB 3 MB/s

3490e 1.6 GB 3 MB/s

8 mm 14 GB 1 MB/s

Other factors that need to be considered are as follows:

Reliability of the tape medium.

Cost of tape medium per unit.

scalability.

Cost of upgrades to tape system.

Shelf life of tape medium.

Standalone tape drives

The tape drives can be connected in the following ways:

Direct to the server.

As network-available devices.

Remotely to another machine.


Issues in connecting the tape drives

Suppose the server is a 48-node MPP machine: to which node do you connect the tape drive, and how do you spread the drives over the server nodes to get the optimal performance with the least disruption of the server and the least internal I/O latency?

Connecting the tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates needed. Make sure that sufficient bandwidth is available during the time you require it.

Connecting the tape drives remotely also requires high bandwidth.

Tape Stackers

A tape stacker is a method of loading multiple tapes into a single tape drive. The stacker dismounts the current tape when it has finished with it and loads the next tape, hence only one tape is available at a time to be accessed. The price and the capabilities may vary, but the common ability is that they can perform unattended backups.

Tape Silos

Tape silos provide large storage capacities; they can store and manage thousands of tapes and can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for the silo to be connected remotely over a network or a dedicated link, and we should ensure that the bandwidth of that connection is up to the job.

Other Technologies

The technologies other than tape are mentioned below:

Disk Backups

Optical jukeboxes

Disk Backups

Methods of disk backups are listed below.

Disk-to-disk backups

Mirror breaking

These methods are used in OLTP systems. These methods minimize the database downtime and maximize the availability.

Disk-to-disk backups

In this kind of backup, the backup is taken onto disk rather than to tape. The reasons for doing disk-to-disk backups are:

Speed of initial backups

Speed of restore

Backing up the data from disk to disk is much faster than backing up to tape. However, it is an intermediate step; the data is later backed up on tape. The other advantage of disk-to-disk backups is that it gives you an online copy of the latest backup.


Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.

Note: The database may need to be shutdown to guarantee the consistency of the backup.

Optical jukeboxes

Optical jukeboxes allow the data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or tape silo. The drawback of this technique is that its write speed is slower than that of disks. But optical media provide long life and reliability, which make them a good choice of medium for archiving.

Software Backups

There are software tools available that help in the backup process. These software tools come as packages. These tools not only take backups, they can also effectively manage and control the backup strategies. There are many software packages available in the market; some of them are listed in the following table:

Package Name Vendor

Networker Legato

ADSM IBM

Epoch Epoch Systems

Omniback II HP

Alexandria Sequent

Criteria For Choosing Software Packages

The criteria for choosing the best software package are listed below:

How scalable is the product as tape drives are added?

Does the package have a client-server option, or must it run on the database server itself?

Will it work in cluster and MPP environments?

What degree of parallelism is required?

What platforms are supported by the package?

Does the package support easy access to information about tape contents?

Is the package database aware?

What tape drive and tape media are supported by package?

Data Warehousing - Tuning

Introduction


A data warehouse keeps evolving over time and it is unpredictable what queries the user is going to produce in the future. Therefore it becomes difficult to tune a data warehouse system. In this chapter, we will discuss how to tune the different aspects of a data warehouse such as performance, data load and queries.

Difficulties in Data Warehouse Tuning

Here is the list of difficulties that can occur while tuning a data warehouse:

The data warehouse never remains constant over time.

It is very difficult to predict what queries the user is going to produce in the future.

The needs of the business also change with time.

The users and their profiles never remain the same over time.

The user can switch from one group to another.

The data load on the warehouse also changes with time.

Note: It is very important to have complete knowledge of the data warehouse.

Performance Assessment

Here is the list of objective measures of performance:

Average query response time

Scan rates.

Time used per day query.

Memory usage per process.

I/O throughput rates

Following are the points to be remembered.

It is necessary to specify the measures in the service level agreement (SLA), as the sketch after this list illustrates.

It is of no use trying to tune response times if they are already better than those required.

It is essential to have realistic expectations while making the performance assessment.

It is also essential that the users have feasible expectations.

To hide the complexity of the system from the user, aggregations and views should be used.

It is also possible that a user can write a query you had not tuned for.
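
As a hedged illustration, checking one of these measures, the average query response time, against an assumed SLA target could look like this (all numbers are invented):

SLA_AVG_RESPONSE_SECONDS = 30   # assumed target taken from the service level agreement

def average_response_time(durations_in_seconds):
    return sum(durations_in_seconds) / len(durations_in_seconds)

measured = [12.4, 48.0, 9.7, 22.3]      # elapsed seconds per query, e.g. from a query log
average = average_response_time(measured)
print("within SLA" if average <= SLA_AVG_RESPONSE_SECONDS else "SLA breached", round(average, 1))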

Data Load Tuning

Data load is a very critical part of overnight processing.

Nothing else can run until the data load is complete.

This is the entry point into the system.


Note: If there is a delay in transferring the data, or in the arrival of data, then the entire system is affected badly. Therefore it is very important to tune the data load first.

There are various approaches to tuning the data load, discussed below:

The most common approach is to insert data using the SQL layer. In this approach, the normal checks and constraints are performed. When the data is inserted into the table, code runs to check whether there is enough space available to insert the data; if sufficient space is not available, then more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU, but they pack the data tightly by making maximal use of space.

The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data, which can lead to some space wastage.

The third approach is that, while loading data into a table that already contains data, we can maintain the indexes during the load.

The fourth approach is that, to load data into tables that already contain data, we can drop the indexes and recreate them when the data load is complete (see the sketch after this list). Which of the third and fourth approaches is better depends on how much data is already loaded and how many indexes need to be rebuilt.
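
A minimal Python sketch of the fourth approach; the three helpers are stand-ins for the real DDL and load calls, so this only illustrates the ordering of the steps:

def drop_indexes(table):
    print("dropping indexes on", table)               # stand-in for the real DDL

def bulk_load(table, rows):
    print("loading", len(rows), "rows into", table)   # stand-in for the fast load path

def recreate_indexes(table):
    print("recreating indexes on", table)             # rebuilt once the load is complete

def load_with_index_rebuild(table, rows):
    drop_indexes(table)
    bulk_load(table, rows)
    recreate_indexes(table)

load_with_index_rebuild("sales_fact", [{"product_id": 30}, {"product_id": 35}])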

Integrity Checks

Integrity checking highly affects the performance of the load.

Following are the points to be remembered:

The integrity checks need to be limited because the processing required can be heavy.

The integrity checks should be applied on the source system to avoid performance degradation of the data load.

Tuning Queries

We have two kinds of queries in data warehouse:

Fixed Queries

Ad hoc Queries

Fixed Queries

The fixed queries are well defined. The following are examples of fixed queries:

regular reports

Canned queries

Common aggregations

Tuning fixed queries in a data warehouse is the same as in relational database systems; the only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing the fixed queries. Storing these execution plans allows us to spot changing data size and data skew, as these will cause the execution plan to change.


Note: We cannot do much more on the fact table, but while dealing with the dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanisms and access methods can be used to tune these queries.

Ad hoc Queries

To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. Here is the list of points that need to be understood about the users of the data warehouse:

The number of users in the group.

Whether they use ad hoc queries at regular intervals of time.

Whether they use ad hoc queries frequently.

Whether they use ad hoc queries occasionally at unknown intervals.

The maximum size of query they tend to run

The average size of query they tend to run.

Whether they require drill-down access to the base data.

The elapsed login time per day

The peak time of daily usage

The number of queries they run per peak hour.

Following are the points to be remembered.

It is important to track the users' profiles and identify the queries that are run on a regular basis.

It is also important to ensure that the tuning performed does not adversely affect performance.

Identify similar ad hoc queries that are frequently run.

If these queries are identified, then the database can be changed and new indexes can be added for those queries.

If these queries are identified, then new aggregations can be created specifically for those queries, resulting in their efficient execution.

Data Warehousing - Testing

Introduction

Testing is very important for data warehouse systems, to make them work correctly and efficiently. There are three basic levels of testing, listed below:

Unit Testing

Integration Testing

System testing

Unit Testing


In unit testing, each component is tested separately.

In this kind of testing, each module, i.e. procedure, program, SQL script or Unix shell script, is tested.

This testing is performed by the developer.

Integration Testing

In this kind of testing, the various modules of the application are brought together and then tested against a number of inputs.

It is performed to test whether the various components work well together after integration.

System Testing

In this kind of testing, the whole data warehouse application is tested together.

The purpose of this testing is to check whether the entire system works correctly together or not.

This testing is performed by the testing team.

Since the size of the whole data warehouse is very large, it is usually only possible to perform minimal system testing before the test plan proper can be enacted.

Test Schedule

First of all, the test schedule is created as part of developing the test plan.

In the test schedule, we predict the estimated time required for the testing of the entire data warehouse system.

Difficulties in Scheduling the Testing

There are different methodologies available, but none of them is perfect, because the data warehouse is very complex and large. Also, the data warehouse system is evolving in nature.

A simple problem may involve a very large query, which can take a day or more to complete, i.e. the query does not complete in the desired time scale.

There may be hardware failures, such as losing a disk, or human errors, such as accidentally deleting a table or overwriting a large table.

Note: Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing.

Testing the backup recovery

This is a very important test that needs to be performed. Here is the list of scenarios for which this testing is needed:

Media failure.

Loss or damage of tablespace or data file.

Loss or damage of redo log file.

Loss or damage of control file.

Instance failure.

Loss or damage of archive file.

Loss or damage of table.

Failure during data load.

Testing Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

Security - A separate security document is required for security testing. This document contains the list of disallowed operations and devised tests for each of them.

Scheduler - Scheduling software is required to control the daily operations of the data warehouse. This needs to be tested during system testing. The scheduling software requires an interface with the data warehouse, through which the scheduler controls the overnight processing and the management of aggregations.

Disk Configuration - The disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.

Management Tools - All the management tools need to be tested during system testing. Here is the list of tools that need to be tested:

Event manager

System manager

Database manager

Configuration manager

Backup recovery manager

Testing the Database

There are three sets of tests, listed below:

Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running and management of a test database.

Testing database features - Here is the list of features that we have to test:

Querying in parallel

Create index in parallel

Data load in parallel

Testing database performance - Query execution plays a very important role in data warehouse performance measures. There is a set of fixed queries that need to be run regularly and they should be tested. To test ad hoc queries, one should go through the user requirements document and understand the business completely. Take the time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.

Testing The Application


All the managers should be integrated correctly and work together to ensure that the end-to-end load, index, aggregation and queries work as per expectations.

Each function of each manager should work correctly.

It is also necessary to test the application over a period of time.

The week-end and month-end tasks should also be tested.

Logistics of the Test

A question arises: what are you really testing? The answer is that you are testing a suite of data warehouse application code.

The aim of the system test is to test all of the following areas:

Scheduling software.

Day-to-day operational procedures.

Backup recovery strategy.

Management and scheduling tools.

Overnight processing.

Query performance.

Note: The most important point is to test scalability. Failure to do so will leave us with a system design that does not work when the system grows.

Data Warehousing - Future Aspects

Following are the future aspects of data warehousing.

As we have seen, the size of open databases has grown approximately twofold in magnitude in the last few years. This change in magnitude is of great significance.

As the size of the databases grows, the estimate of what constitutes a very large database continues to grow.

The hardware and software that are available today do not allow a large amount of data to be kept online. For example, telco call records require 10 TB of data to be kept online, and that is just the size of one month's records. If records of sales, marketing, customers, employees, etc. also have to be kept, then the size will be more than 100 TB.

Records contain not only textual information but also some multimedia data. Multimedia data cannot be manipulated as easily as text data. Searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.

Apart from size planning, building and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system.

With the growth of the internet, there is a requirement for users to access data online.

Hence the future shape of the data warehouse will be very different from what is being created today.


Data Warehousing - Interview Questions

Dear readers, these Data Warehousing interview questions have been designed especially to acquaint you with the nature of questions you may encounter during your interview on the subject of Data Warehousing. In my experience, good interviewers hardly plan to ask any particular question during an interview; normally, questions start with some basic concept of the subject and later continue based on further discussion and what you answer.

Q: Define Data Warehouse?

A: Data warehouse is Subject Oriented, Integrated, Time-Variant and Nonvolatile collection of data thatsupport management's decision making process.

Q: What does subject-oriented signify for a data warehouse?

A: Subject-oriented signifies that the data warehouse stores the information around a particular subject such as product, customer, sales, etc.

Q: List any five applications of data warehouses.

A: Some applications include financial services, banking services, consumer goods, retail sectors and controlled manufacturing.

Q: What do OLAP and OLTP stand for?

A: OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transaction Processing.

Q: What is the very basic difference between a data warehouse and operational databases?

A: A data warehouse contains historical information that is made available for analysis of the business, whereas an operational database contains the current information that is required to run the business.

Q: List the schemas that a data warehouse system can implement.

A: A data warehouse can implement a star schema, a snowflake schema or a fact constellation schema.

Q: What is Data Warehousing?

A: Data Warehousing is the process of constructing and using the data warehouse.

Q: List the processes that are involved in Data Warehousing?

A: Data Warehousing involves data cleaning, data integration, and data consolidation.

Q: List the functions of data warehouse tools and utilities?

A: The functions performed by data warehouse tools and utilities are data extraction, data cleaning, data transformation, data loading, and refreshing.
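
A minimal Python sketch of these five functions, assuming an in-memory warehouse and invented source layouts, field names, and cleaning rules (they are illustrative only, not part of any standard toolset):

    # Hypothetical ETL sketch; sources, field names, and cleaning rules are
    # invented purely to illustrate the five functions listed above.

    def extract(sources):
        # Data extraction: gather records from multiple heterogeneous sources.
        return [record for source in sources for record in source]

    def clean(records):
        # Data cleaning: drop incomplete records and normalize the product name.
        return [
            {**r, "product": r["product"].strip().title()}
            for r in records
            if r.get("product") and r.get("amount") is not None
        ]

    def transform(records):
        # Data transformation: reshape records into the warehouse structure.
        return [{"product": r["product"], "sales_amount": float(r["amount"])} for r in records]

    def load(warehouse, records):
        # Data loading: append the transformed records to the warehouse store.
        warehouse.extend(records)

    def refresh(warehouse, sources):
        # Refreshing: rerun the pipeline to bring the warehouse up to date.
        load(warehouse, transform(clean(extract(sources))))

    warehouse = []
    sources = [
        [{"product": " laptop ", "amount": "999.00"}],
        [{"product": "phone", "amount": "499.50"}, {"product": None, "amount": "10"}],
    ]
    refresh(warehouse, sources)
    print(warehouse)  # [{'product': 'Laptop', 'sales_amount': 999.0}, {'product': 'Phone', 'sales_amount': 499.5}]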

Q: What do you mean by Data Extraction?

A: Data extraction means gathering data from multiple heterogeneous sources.

Q: Define Metadata?

A: Metadata is simply defined as data about data. In other words, metadata is the summarized data that leads us to the detailed data.

Q: What does the metadata repository contain?

A: The metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.

Q: How does a Data Cube help?

A: A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts.
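
For example, a tiny cube with three dimensions (time, item, location) and one fact (units sold) could be sketched in Python as follows; the dimension values and figures are invented for illustration:

    # Each cell of the cube holds the fact (units sold) for one combination of
    # dimension values (time, item, location).
    cube = {
        ("Q1", "Keyboard", "Delhi"):  120,
        ("Q1", "Keyboard", "Mumbai"):  80,
        ("Q2", "Keyboard", "Delhi"):  150,
        ("Q1", "Mouse",    "Delhi"):  200,
    }

    # Aggregating over cells answers questions such as "total units sold in Q1".
    total_q1 = sum(units for (time, item, city), units in cube.items() if time == "Q1")
    print(total_q1)  # 400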

Q: Define Dimension?

A: Dimensions are the entities with respect to which an enterprise keeps its records.

Q: Explain Data mart?

A: A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group within an organization. In other words, a data mart contains only the data that is specific to a particular group.

Q: What is Virtual Warehouse?

A: A view over an operational data warehouse is known as a virtual warehouse.

Q: List the phases involved in the data warehouse delivery process?

A: The stages are IT Strategy, Education, Business Case Analysis, Technical Blueprint, Build the Version, History Load, Ad hoc Query, Requirement Evolution, Automation, and Extending Scope.

Q: Explain Load Manager?

A: This component performs the operations required for the extract and load process. The size and complexity of a load manager varies between specific solutions, from data warehouse to data warehouse.

Q: Define the functions of the Load Manager?

A: It extracts the data from the source system, fast-loads the extracted data into a temporary data store, and performs simple transformations into a structure similar to the one in the data warehouse.

Q: Explain Warehouse Manager?

A: The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.

Q: Define the functions of the Warehouse Manager?

A: The warehouse manager performs consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.

Q: What is Summary Information?

A: Summary information is the area in the data warehouse where the predefined aggregations are kept.

Q: What is the Query Manager responsible for?

A: The query manager is responsible for directing queries to the suitable tables.
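
As a rough sketch of this routing idea in Python, assuming a hypothetical mapping from a (fact, grain) pair to a summary table; the table names are invented:

    # Predefined aggregations are answered from a summary table; anything else
    # falls through to the detail (base) table. Table names are hypothetical.
    SUMMARY_TABLES = {("sales", "month"): "sales_by_month"}

    def route_query(fact, grain):
        # Return the table a query should be directed to.
        return SUMMARY_TABLES.get((fact, grain), fact + "_detail")

    print(route_query("sales", "month"))  # sales_by_month (predefined aggregation)
    print(route_query("sales", "day"))    # sales_detail   (no matching summary)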

Q: List the types of OLAP servers?

A: There are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid OLAP, and Specialized SQL Servers.

Q: Which one is faster, Multidimensional OLAP or Relational OLAP?

A: Multidimensional OLAP is faster than Relational OLAP.

Q: List the functions performed by OLAP?

A: OLAP performs functions such as roll-up, drill-down, slice, dice, and pivot.
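
These operations can be mimicked on a small flat fact table. The following Python sketch uses pandas, with invented column names and figures; drill-down is simply the reverse of roll-up (moving to finer granularity), so it is not shown separately:

    import pandas as pd

    # Toy fact table; column names and values are invented for illustration.
    df = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "item":    ["Keyboard", "Keyboard", "Mouse", "Mouse"],
        "units":   [120, 80, 150, 60],
    })

    # Roll-up: aggregate away the city and item dimensions, leaving totals per quarter.
    rollup = df.groupby("quarter")["units"].sum()

    # Slice: fix exactly one dimension (quarter == "Q1") to cut out a sub-cube.
    slice_q1 = df[df["quarter"] == "Q1"]

    # Dice: select on two or more dimensions to get a smaller sub-cube.
    dice = df[(df["quarter"] == "Q1") & (df["city"] == "Delhi")]

    # Pivot: rotate the view so that one dimension becomes the columns.
    pivot = df.pivot_table(values="units", index="quarter", columns="city", aggfunc="sum")

    print(rollup, slice_q1, dice, pivot, sep="\n\n")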

Q: How many dimensions are selected in Slice operation?

A: Only one dimension is selected for the slice operation.

Q: How many dimensions are selected in dice operation?

A: For the dice operation, two or more dimensions of a given cube are selected.

Q: How many fact tables are there in Star Schema?

A: There is only one fact table in a Star Schema.

Q: What is Normalization?

A: Normalization splits up the data into additional tables.

Q: Out of the Star Schema and the Snowflake Schema, in which one are the dimension tables normalized?

A: The Snowflake Schema uses the concept of normalization.

Q: What is the benefit of Normalization?

A: Normalization helps to reduce data redundancy.
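
For example, if a dimension table repeats the same city and region on many rows, normalizing that repeating pair into its own table removes the duplication. A small Python sketch, with invented keys and values:

    # Denormalized (star-style) dimension: city and region repeat on every row.
    location_star = [
        {"location_key": 1, "city": "Delhi",  "region": "North"},
        {"location_key": 2, "city": "Delhi",  "region": "North"},
        {"location_key": 3, "city": "Mumbai", "region": "West"},
    ]

    # Normalized (snowflake-style): the repeating city/region pair is split into
    # its own table and referenced by key, reducing redundancy.
    city_table = {
        10: {"city": "Delhi",  "region": "North"},
        20: {"city": "Mumbai", "region": "West"},
    }
    location_snowflake = [
        {"location_key": 1, "city_key": 10},
        {"location_key": 2, "city_key": 10},
        {"location_key": 3, "city_key": 20},
    ]

    # Joining the two normalized tables back together recovers the original rows.
    rejoined = [{"location_key": row["location_key"], **city_table[row["city_key"]]}
                for row in location_snowflake]
    print(rejoined == location_star)  # True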

Q: Which language is used for schema definition?

A: Data Mining Query Language (DMQL) is used for schema definition.

Q: What language is DMQL based on?

A: DMQL is based on Structured Query Language (SQL)

Q: What are the reasons for partitioning?

A: Partitioning is done for various reasons, such as easy management, to assist backup and recovery, and to enhance performance.

Q: What kind of costs are involved in Data Marting?

A: Data marting involves hardware and software costs, network access costs, and time costs.

What is Next?

Further, you can go through the past assignments you have done on the subject and make sure you are able to speak confidently about them. If you are a fresher, the interviewer does not expect you to answer very complex questions; rather, you have to make your basic concepts very strong.

Second, it really doesn't matter much if you could not answer a few questions, but it matters that whatever you answered, you answered it with confidence. So just feel confident during your interview. We at tutorialspoint wish you the best of luck in getting a good interviewer, and all the very best for your future endeavors. Cheers :-)