Chapter Six: Business Intelligence


    Business Intelligence:

    Data warehousing

    Chapter Objectives

    How data warehousing evolved.

    The main concepts and benefits associated with data warehousing.

    How Online Transaction Processing (OLTP) systems differ from data warehousing.

    The problems associated with data warehousing.

    The architecture and main components of a data warehouse.

    The important data flows or processes of a data warehouse.

    The main tools and technologies associated with data warehousing.

    The issues associated with the integration of a data warehouse and the importance of managing

    metadata.

    The concept of a data mart and the main reasons for implementing a data mart.

    The main issues associated with the development and management of data marts.

    How Oracle supports data warehousing.

    Introduction to Data Warehousing

    Since the 1970s, organizations have mostly focused their investment in new computer systems that automate business processes. In this way, organizations gained competitive advantage through systems that offered more efficient and cost-effective services to the customer. Throughout this period, organizations accumulated growing amounts of data stored in their operational databases. However, in recent times, where such systems are commonplace, organizations are focusing on ways to use operational data to support decision-making, as a means of regaining competitive advantage.

    Operational systems were never designed to support such business activities, and so using these systems for decision-making may never be an easy solution. The legacy is that a typical organization may have numerous operational systems with overlapping and sometimes contradictory definitions, such as data types. The challenge for an organization is to turn its archives of data into a source of knowledge, so that a single integrated, consolidated view of the organization's data is presented to the user. The concept of a data warehouse was deemed the solution to meet the requirements of a system capable of supporting decision-making, receiving data from multiple operational data sources.

    Data Warehousing Concepts

    Data warehousing

    A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. (Inmon, 1993)


    From this definition, the data is:

    Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.

    Integrated, because of the coming together of source data from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data source must be made consistent to present a unified view of the data to the users.

    Time-variant, because data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.

    Non-volatile, as the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
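    To make the time-variant and non-volatile properties concrete, here is a minimal sketch (not from the chapter) in which each periodic refresh appends new snapshot rows keyed by a load date and never updates rows in place; the table and column names are invented for illustration.

```python
import sqlite3
from datetime import date

con = sqlite3.connect(":memory:")  # stand-in for the warehouse database
con.execute("""CREATE TABLE sales_snapshot (
                   snapshot_date TEXT,   -- time is explicitly associated with every row
                   product       TEXT,
                   units_sold    INTEGER)""")

def refresh(rows):
    """Non-volatile refresh: new data supplements, and never replaces, existing data."""
    today = date.today().isoformat()
    con.executemany("INSERT INTO sales_snapshot VALUES (?, ?, ?)",
                    [(today, product, units) for product, units in rows])
    con.commit()

# Each periodic load adds another snapshot; earlier snapshots remain untouched.
refresh([("widget", 120), ("gadget", 75)])
```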

    Data Webhouse

    A distributed data warehouse that is implemented over the Web with no central data repository.

    The Web is an immense source of behavioral data as individuals interact through their Web browsers with remote Web sites.

    Benefits of Data Warehousing

    Potential high returns on investment: An organization must commit a huge amount of resources to ensure the successful implementation of a data warehouse, and the cost can vary enormously, from 50,000 to over 10 million, due to the variety of technical solutions available. However, a study by the International Data Corporation (IDC) in 1996 reported that average three-year returns on investment (ROI) in data warehousing reached 401%, with over 90% of the companies surveyed achieving over 40% ROI, half the companies achieving over 160% ROI, and a quarter with more than 600% ROI (IDC, 1996).

    Competitive advantage: The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.

    Increased productivity of corporate decision-makers: Data warehousing improves the productivity of corporate decision-makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision-makers to perform more substantive, accurate, and consistent analysis.


    Data Warehouse Architecture

    Operational Data: the source for the data warehouse is supplied from:

    Mainframe operational data held in first generation hierarchical and network databases.

    Departmental data held in proprietary file systems such as VSAM, RMS, and relational DBMSs such as Informix and Oracle.

    Private data held on workstations and private servers.

    External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.

    Operational Data Store

    An Operational Data Store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact act simply as a staging area for data to be moved into the warehouse.

    The ODS is often created when legacy operational systems are found to be incapable of achieving

    reporting requirements.

    The ODS provides users with the ease of use of a relational database while remaining distant from

    the decision support functions of the data warehouse.


    Load manager

    The load manager (also called the frontend component) performs all the operations associated with

    the extraction and loading of data into the warehouse.

    The data may be extracted directly from the data sources or, more commonly, from the operational data store.

    The operations performed by the load manager may include simple transformations of the data to

    prepare the data for entry into the warehouse.

    The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom-built programs.
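    As a rough sketch of what a custom-built load manager might do, the code below extracts rows from a hypothetical operational data store, applies a simple transformation (normalizing a DD/MM/YYYY date to ISO format), and loads the result into a warehouse staging table. The sqlite3 back end and all table and column names are assumptions made for the example.

```python
import sqlite3
from datetime import datetime

ods = sqlite3.connect(":memory:")        # hypothetical operational data store
warehouse = sqlite3.connect(":memory:")  # hypothetical data warehouse

ods.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
ods.execute("INSERT INTO orders VALUES (1, '03/08/2019', 99.50)")
warehouse.execute("CREATE TABLE stage_orders (order_id INTEGER, order_date TEXT, amount REAL)")

def load_orders():
    """Extract from the ODS, apply a simple transformation, and load into staging."""
    rows = ods.execute("SELECT order_id, order_date, amount FROM orders").fetchall()
    transformed = [(order_id, datetime.strptime(d, "%d/%m/%Y").date().isoformat(), amount)
                   for order_id, d, amount in rows]
    warehouse.executemany("INSERT INTO stage_orders VALUES (?, ?, ?)", transformed)
    warehouse.commit()

load_orders()
```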

    Warehouse Manager

    The warehouse manager performs all the operations associated with the management of the data in

    the warehouse. This component is constructed using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include the following (see the sketch after this list):

    analysis of data to ensure consistency;

    transformation and merging of source data from temporary storage into data warehouse tables;

    creation of indexes and views on base tables;

    generation of denormalizations (if necessary);

    generation of aggregations (if necessary);

    backing up and archiving data.
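    The sketch below illustrates a few of these operations against an in-memory sqlite3 database: a simple consistency check, merging staged rows into a warehouse table, creating an index, and generating a daily aggregation. The schema and names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stage_sales (sale_date TEXT, product TEXT, amount REAL);
    CREATE TABLE sales       (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO stage_sales VALUES ('2019-08-01', 'widget', 10.0), ('2019-08-01', 'widget', 12.5);
""")

# Analysis of data to ensure consistency: reject staged rows with negative amounts.
assert db.execute("SELECT COUNT(*) FROM stage_sales WHERE amount < 0").fetchone()[0] == 0

# Transformation and merging of source data from temporary storage into warehouse tables.
db.execute("INSERT INTO sales SELECT sale_date, product, amount FROM stage_sales")
db.execute("DELETE FROM stage_sales")

# Creation of an index on a base table, and generation of an aggregation.
db.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sale_date)")
db.execute("""CREATE TABLE sales_daily_summary AS
              SELECT sale_date, product, SUM(amount) AS total_amount
              FROM sales GROUP BY sale_date, product""")
db.commit()
```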

    Query Manager

    The query manager (also called the backend component) performs all the operations associated with

    the management of user queries.

    This component is typically constructed using vendor end-user data access tools, data warehouse

    monitoring tools, database facilities, and custom-built programs.

    The complexity of the query manager is determined by the facilities provided by the end-user access tools and the database.

    The operations performed by this component include directing queries to the appropriate tables and

    scheduling the execution of queries.

    In some cases, the query manager also generates query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
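    A minimal sketch of directing a query to the most appropriate table: a small, invented metadata catalogue records which columns each table can answer and its relative cost, and the query manager routes the request to the cheapest table that satisfies it, preferring a pre-computed summary when possible.

```python
# Hypothetical metadata: the columns each table can answer, and a relative access cost.
TABLES = {
    "sales_daily_summary": {"columns": {"sale_date", "product", "total_amount"}, "cost": 1},
    "sales": {"columns": {"sale_date", "product", "amount", "customer_id"}, "cost": 100},
}

def direct_query(requested_columns):
    """Route the query to the cheapest table that contains all requested columns."""
    candidates = [(meta["cost"], name) for name, meta in TABLES.items()
                  if requested_columns <= meta["columns"]]
    if not candidates:
        raise ValueError("no table can satisfy the query")
    return min(candidates)[1]

print(direct_query({"sale_date", "total_amount"}))  # -> sales_daily_summary
print(direct_query({"sale_date", "customer_id"}))   # -> sales (detailed data is needed)
```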

    Detailed Data

    This area of the warehouse stores all the detailed data in the database schema. In most cases, the detailed data is not stored online but is made available by aggregating the data to the next level of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

    Lightly and highly summarized data

    This area of the warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to changing query profiles.

    The purpose of summary information is to speed up the performance of queries.

    Archive or Backup data

    This area of the warehouse stores the detailed and summarized data for the purposes of archiving and backup. Even though summary data is generated from detailed data, it may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data. The data is transferred to storage archives such as magnetic tape or optical disk.

    Metadata

    This area of the warehouse stores all the metadata (data about data) definitions used by all the

    processes in the warehouse. Metadata is used for a variety of purposes including:

    The extraction and loading processes - metadata is used to map data sources to a common view of the data within the warehouse;

    The warehouse management process - metadata is used to automate the production of summary tables;

    As part of the query management process - metadata is used to direct a query to the most

    appropriate data source.
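    As a rough illustration of the first use (mapping source fields to a common view), the snippet below keeps the mapping as plain metadata and applies it during extraction; the source system identifiers and field names are hypothetical.

```python
# Hypothetical metadata: how fields from two source systems map onto the
# warehouse's common view of a customer record.
FIELD_MAP = {
    "billing_system": {"cust_no": "customer_id", "cust_nm": "customer_name"},
    "crm_system": {"id": "customer_id", "full_name": "customer_name"},
}

def to_common_view(source, record):
    """Rename source-specific fields to the warehouse's common column names."""
    mapping = FIELD_MAP[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

print(to_common_view("billing_system", {"cust_no": 42, "cust_nm": "Acme Ltd"}))
print(to_common_view("crm_system", {"id": 42, "full_name": "Acme Ltd"}))
```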

    End-User Access Tools

    The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the warehouse using end-user access tools.

    The data warehouse must efficiently support ad hoc and routine analysis. High performance is

    achieved by pre-planning the requirements for joins, summations, and periodic reports by end-users.

    Five main groups of end-user tools (Berson and Smith, 1997):

    Reporting and query tools;


    Application development tools;

    Executive Information System (EIS) tools;

    Online Analytical Processing (OLAP) tools;

    Data mining tools.

    Data Warehouse Data Flows

    Data warehousing focuses on the management of five primary data flows, namely:

    o Inflow

    o Upflow

    o Downflow

    o Outflow

    o Metaflow (Hackathorn, 1995)

    Inflow: The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

    The inflow is concerned with taking data from the source systems to load into the data warehouse.

    Alternatively, the data may be first loaded into the operational data store before being transferred to the

    data warehouse. As the source data is generated predominantly by OLTP systems, the data must be reconstructed for the purposes of the data warehouse. The reconstruction of data involves (a minimal sketch follows this list):

    Cleansing dirty data;

    Restructuring data to suit the new requirements of the data warehouse including, for example, adding and/or removing fields, and denormalizing data;

    Ensuring that the source data is consistent with itself and with the data already in the warehouse.
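    The sketch below shows this reconstruction in miniature for one record: dirty values are cleansed, an application-only field is dropped, and a derived field is added before loading. The record layout is invented for illustration.

```python
def reconstruct(record):
    """Cleanse and restructure one source record for the warehouse (illustrative only)."""
    cleaned = {
        "customer_id": int(record["customer_id"]),
        "customer_name": record["customer_name"].strip().title(),  # cleanse dirty data
        "country": record.get("country", "UNKNOWN").upper(),       # make values consistent
    }
    # Restructure: the application-only field is dropped and a derived field is added.
    cleaned["name_length"] = len(cleaned["customer_name"])
    return cleaned

dirty = {"customer_id": "42", "customer_name": "  acme ltd ", "internal_flag": "x"}
print(reconstruct(dirty))
```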

    Upflow:

    The processes associated with adding value to the data in the warehouse through summarizing,

    packaging, and distribution of the data.

    The activities associated with the upflow include (see the sketch after this list):

    o Summarizing the data by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end-users. Summarizing extends beyond simple relational operations to involve sophisticated statistical analysis, including identifying trends, clustering, and sampling the data.

    o Packaging the data by converting the detailed or summarized data into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.

    o Distributing the data to appropriate groups to increase its availability and accessibility.
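    A minimal sketch of the summarizing and packaging activities, assuming a hypothetical sales table: relational grouping produces a more convenient view, and the result is packaged as a spreadsheet-friendly CSV file.

```python
import csv
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES ('2019-08-01', 'widget', 10.0), ('2019-08-02', 'widget', 12.5);
""")

# Summarize: select, project, and group the relational data into a more useful view.
summary = db.execute(
    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product").fetchall()

# Package: convert the summary into a spreadsheet-friendly format.
with open("product_totals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["product", "total_amount"])
    writer.writerows(summary)
```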

    Downflow: The processes associated with archiving and backing up of data in the warehouse.


    Archiving old data plays an important role in maintaining the effectiveness and performance of the

    warehouse by transferring the older data of limited value to a storage archive such as magnetic tape or

    optical disk.

    However, if the correct partitioning scheme is selected for the database, the amount of data online

    should not affect performance.
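    A rough sketch of a downflow step: rows older than a retention cutoff are copied to an archive file and then removed from the online table. The retention policy, table, and archive format are assumptions for the example.

```python
import csv
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES ('2017-01-15', 'widget', 5.0), ('2019-08-01', 'widget', 10.0);
""")

RETENTION_CUTOFF = "2018-01-01"  # hypothetical retention policy

# Archive: copy older data of limited value out of the online database...
old_rows = db.execute("SELECT * FROM sales WHERE sale_date < ?", (RETENTION_CUTOFF,)).fetchall()
with open("sales_archive.csv", "w", newline="") as f:
    csv.writer(f).writerows(old_rows)

# ...then remove it, so the online tables hold only data within the retention period.
db.execute("DELETE FROM sales WHERE sale_date < ?", (RETENTION_CUTOFF,))
db.commit()
```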

    Outflow: The processes associated with making the data available to the end-users.

    The outflow is where the real value of warehousing is realized by the organization. This may require re-engineering the business processes to achieve competitive advantage (Hackathorn, 1995).

    The two key activities involved in the outflow include:

    Accessing, which is concerned with satisfying the end-users' requests for the data they need. The main issue is to create an environment so that users can effectively use the query tools to access the most appropriate data source. The frequency of user accesses can vary from ad hoc, to routine, to real-time. It is important to ensure that the system's resources are used in the most effective way in scheduling the execution of user queries.

    Delivering, which is concerned with proactively delivering information to the end-users' workstations and is referred to as a type of 'publish-and-subscribe' process. The warehouse publishes various 'business objects' that are revised periodically by monitoring usage patterns.

    Users subscribe to the set of business objects that best meets their needs.
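    The following sketch illustrates the publish-and-subscribe idea: the warehouse publishes named business objects, users subscribe to the ones that meet their needs, and each revision is delivered to the subscribers. The object name and the delivery mechanism (a simple callback) are invented.

```python
from collections import defaultdict

subscriptions = defaultdict(list)  # business object name -> delivery callbacks

def subscribe(business_object, deliver):
    """A user subscribes to the business objects that best meet their needs."""
    subscriptions[business_object].append(deliver)

def publish(business_object, content):
    """The warehouse publishes a revised business object to every subscriber."""
    for deliver in subscriptions[business_object]:
        deliver(content)

subscribe("weekly_sales_report", lambda report: print("delivered to analyst:", report))
publish("weekly_sales_report", {"week": "2019-W31", "total_sales": 1234.5})
```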

    Metaflow: The previous flows describe the management of the data warehouse with regard to how the data moves in and out of the warehouse.

    Metaflow is the process that moves metadata (data about the other flows). Metadata is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing.

    Data warehousing tools and techniques (Extraction, Cleansing, and Transformation tools)

    Selecting the correct extraction, cleansing, and transformation tools is a critical step in the construction

    of a data warehouse. There are an increasing number of vendors that are focused on fulfilling the

    requirements of data warehouse implementations as opposed to simply moving data between hardware

    platforms. Some of these tools include:

    Code generators: create customized 3GL/4GL transformation programs based

    on source and target data definitions.

    Database data replication tools: employ database triggers or a recovery log to capture changes to a single data source on one system and apply the changes to a copy of the source data located on a different system.

    Dynamic transformation engines: rule-driven dynamic transformation engines capture data from source systems at user-defined intervals, transform the data, and then send and load the results into a target environment.
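    As a toy illustration of a rule-driven engine, the code below applies user-defined transformation rules to captured records at a fixed interval and loads the results into an in-memory target; the rules, the interval, and the 'target environment' are all assumptions.

```python
import time

# User-defined transformation rules: field name -> function applied to the value.
RULES = {"amount": lambda v: round(float(v), 2), "currency": lambda v: v.upper()}

source_queue = [{"amount": "10.456", "currency": "usd"}]  # records captured from a source system
target = []                                               # stand-in for the target environment

def run_engine(cycles, interval_seconds=0.1):
    """Capture, transform, and load at user-defined intervals."""
    for _ in range(cycles):
        while source_queue:
            record = source_queue.pop(0)
            target.append({field: RULES.get(field, lambda v: v)(value)
                           for field, value in record.items()})
        time.sleep(interval_seconds)

run_engine(cycles=1)
print(target)  # [{'amount': 10.46, 'currency': 'USD'}]
```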


    Data Warehouse DBMS

    There are a few integration issues associated with the data warehouse database. Due to the maturity of such products, most relational databases will integrate predictably with other types of software. However, there are issues associated with the potential size of the data warehouse database. Parallelism in the database becomes an important issue, as well as the usual issues such as performance, scalability, availability, and manageability, which must all be taken into consideration when choosing a DBMS.

    Requirements for data warehouse DBMS

    Load performance

    Data warehouses require incremental loading of new data on a periodic basis within narrow time

    windows. Performance of the load process should be measured in hundreds of millions of rows or gigabytes of data per hour, and there should be no maximum limit that constrains the business.

    Load processing

    Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. Although each step may in practice be atomic, the load process should appear to execute as a single, seamless unit of work.
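    A minimal sketch of treating a multi-step load as one seamless unit of work, using an sqlite3 transaction so that either every step commits or none does; the steps, table, and integrity rule are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (sale_date TEXT, amount REAL CHECK (amount >= 0))")

def load(rows):
    """Conversion, integrity checks, insertion, and indexing commit or roll back together."""
    try:
        with db:  # one transaction wraps every step of the load
            converted = [(d, float(a)) for d, a in rows]                  # data conversion
            db.executemany("INSERT INTO sales VALUES (?, ?)", converted)  # integrity-checked insert
            db.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sale_date)")  # indexing
    except (ValueError, sqlite3.IntegrityError) as exc:
        print("load rolled back:", exc)

load([("2019-08-01", "10.0"), ("2019-08-02", "-5.0")])  # second row fails the check
print(db.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 0: nothing was kept
```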

    Data quality management

    The shift to fact-based management demands the highest data quality. The warehouse must ensure local

    consistency, global consistency, and referential integrity despite 'dirty' sources and massive database sizes. While loading and preparation are necessary steps, they are not sufficient. The ability to answer

    end-users' queries is the measure of success for a data warehouse application. As more questions are

    answered, analysts tend to ask more creative and complex questions.

    Query performance

    Fact-based management and ad hoc analysis must not be slowed or inhibited by the performance of the

    data warehouse RDBMS. Large, complex queries for key business operations must complete in

    reasonable time periods.

    Terabyte scalability

    Data warehouse sizes are growing at enormous rates, with sizes ranging from a few gigabytes, to hundreds of gigabytes, to terabyte-sized (10^12 bytes) and petabyte-sized (10^15 bytes) databases.


    Mass user scalability

    Current thinking is that access to a data warehouse is limited to relatively low numbers of managerial users. This is unlikely to remain true as the value of data warehouses is realized. It is predicted that the data warehouse RDBMS should be capable of supporting hundreds, or even thousands, of concurrent users while maintaining acceptable query performance.

    Network data warehouse

    Data warehouse systems should be capable of cooperating in a larger network of data warehouses. The data warehouse must include tools that coordinate the movement of subsets of data between warehouses. Users should be able to look at, and work with, multiple data warehouses from a single client workstation.

    Warehouse Administration

    The very large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource limits, chargeback accounting to allocate costs back to users, and query prioritization to address the needs of different user classes and activities. The RDBMS must also provide for workload tracking and tuning so that system resources may be optimized for maximum performance and throughput. The most visible and measurable value of implementing a data warehouse is evidenced in the uninhibited, creative access to data it provides for end-users.

    Integrated dimensional analysis

    The power of multi-dimensional views is widely accepted, and dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools (see Chapter 33). The RDBMS must support fast, easy creation of pre-computed summaries common in large data warehouses, and provide maintenance tools to automate the creation of these pre-computed aggregates. Dynamic calculation of aggregates should be consistent with the interactive performance needs of the end-user.

    Advanced query functionality

    End-users require advanced analytical calculations, sequential and comparative analysis, and consistent

    access to detailed and summarized data. Using SQL in a client-server 'point-and-click' tool environment may sometimes be impractical or even impossible due to the complexity of the users' queries. The RDBMS must provide a complete and advanced set of analytical operations.

    Parallel DBMS

    Parallel DBMSs must be capable of running parallel queries. In other words, they must be able to decompose large complex queries into subqueries, run the separate subqueries simultaneously, and reassemble the results at the end. The capability of such DBMSs must also include parallel data loading, table scanning, and data archiving and backup. There are two main parallel hardware architectures commonly used as database server platforms for data warehousing (a minimal sketch of parallel subquery execution follows this list):

    Symmetric Multi-Processing (SMP) - a set of tightly coupled processors that share memory and

    disk storage;

    Massively Parallel Processing (MPP) - a set of loosely coupled processors, each of which has its

    own memory and disk storage.
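    The sketch below mimics, at a very small scale and purely in application code, what a parallel DBMS does internally: a query over several months is decomposed into per-month subqueries, the subqueries run concurrently on separate connections, and the partial results are reassembled at the end. The table, the partitioning by month, and the file path are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB_PATH = "warehouse_demo.db"  # hypothetical warehouse database file

setup = sqlite3.connect(DB_PATH)
setup.executescript("""
    DROP TABLE IF EXISTS sales;
    CREATE TABLE sales (sale_month TEXT, amount REAL);
    INSERT INTO sales VALUES ('2019-06', 10.0), ('2019-07', 20.0), ('2019-08', 30.0);
""")
setup.close()

def subquery(month):
    """Each worker runs one subquery on its own connection (one 'processor')."""
    con = sqlite3.connect(DB_PATH)
    total = con.execute("SELECT COALESCE(SUM(amount), 0) FROM sales WHERE sale_month = ?",
                        (month,)).fetchone()[0]
    con.close()
    return total

months = ["2019-06", "2019-07", "2019-08"]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_totals = list(pool.map(subquery, months))

print(sum(partial_totals))  # reassemble the partial results: 60.0
```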

    Data Marts

    A subset of a data warehouse that supports the requirements of a particular department or business function.

    Characteristics of a data mart

    A data mart focuses only on the requirements of users associated with one department or business function (see the sketch after this list);

    Data marts do not normally contain detailed operational data, unlike data warehouses;

    As data marts contain less data compared with data warehouses, data marts are more easily

    understood and navigated.
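    To make the idea concrete, the sketch below builds a small departmental data mart as a subset extracted from a warehouse table, keeping only the rows and the level of summarization that one business function needs; the department, tables, and columns are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (sale_date TEXT, region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2019-08-01', 'EMEA', 'widget', 10.0),
        ('2019-08-01', 'APAC', 'widget', 99.0);
""")

mart = sqlite3.connect(":memory:")  # the (hypothetical) EMEA sales department's data mart
mart.execute("CREATE TABLE emea_sales_summary (sale_date TEXT, product TEXT, total REAL)")

# Populate the mart with only the subset (and summarization) this department needs.
rows = warehouse.execute("""
    SELECT sale_date, product, SUM(amount)
    FROM sales WHERE region = 'EMEA'
    GROUP BY sale_date, product""").fetchall()
mart.executemany("INSERT INTO emea_sales_summary VALUES (?, ?, ?)", rows)
mart.commit()
```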

    Reasons for creating a data mart

    There are many reasons for creating a data mart, which include:

    To give users access to the data they need to analyze most often.

    To provide data in a form that matches the collective view of the data by a group of users in a

    department or business function.

    To improve end-user response time due to the reduction in the volume of data to be accessed.

    To provide appropriately structured data as dictated by the requirements of end-user access tools

    such as Online Analytical Processing (OLAP) and data mining tools, which may require their

    own internal database structures. In practice, these tools often create their own data mart designed to support their specific functionality.

    Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than

    establishing a corporate data warehouse.

    The cost of implementing data marts is normally less than that required to establish a data warehouse.

    The potential users of a data mart are more clearly defined and can be more easily targeted to

    obtain support for a data mart project rather than a corporate data warehouse project.

    Data Mart Issues

    Data mart functionality


    Data mart size

    Data mart load performance

    Users' access to data in multiple data marts

    Data mart Internet/intranet access

    Data mart administration

    Data mart installation
