Chapter Six: Business Intelligence


    Business Intelligence:

    Data warehousing

    Chapter Objectives

    How data warehousing evolved.

    The main concepts and benefits associated with data warehousing.

    How Online Transaction Processing (OLTP) systems differ from data warehousing.

    The problems associated with data warehousing.

    The architecture and main components of a data warehouse.

    The important data flows or processes of a data warehouse.

    The main tools and technologies associated with data warehousing.

    The issues associated with the integration of a data warehouse and the importance of managing

    metadata.

    The concept of a data mart and the main reasons for implementing a data mart.

    The main issues associated with the development and management of data marts.

    How Oracle supports data warehousing.

    Introduction to Data Warehousing

    Since the 1970s, organizations have mostly focused their investment in new computer systems that automate business processes. In this way, organizations gained competitive advantage through systems that offered more efficient and cost-effective services to the customer. Throughout this period, organizations accumulated growing amounts of data stored in their operational databases. However, in recent times, where such systems are commonplace, organizations are focusing on ways to use operational data to support decision-making, as a means of regaining competitive advantage.

    Operational systems were never designed to support such business activities, and so using these systems for decision-making may never be an easy solution. The legacy is that a typical organization may have numerous operational systems with overlapping and sometimes contradictory definitions, such as data types. The challenge for an organization is to turn its archives of data into a source of knowledge, so that a single integrated, consolidated view of the organization's data is presented to the user. The concept of a data warehouse was deemed the solution to meet the requirements of a system capable of supporting decision-making, receiving data from multiple operational data sources.

    Data Warehousing Concepts

    Data warehousing

    A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. (Inmon, 1993)


    From this definition, the data is:

    Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.

    Integrated, because of the coming together of source data from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data source must be made consistent to present a unified view of the data to the users.

    Time-variant, because data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.

    Non-volatile, as the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
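    To make the time-variant and non-volatile properties concrete, here is a minimal sketch (not from the chapter) in which each periodic refresh appends new snapshot rows keyed by a load date and never updates rows in place; the table and column names are invented for illustration.

```python
import sqlite3
from datetime import date

con = sqlite3.connect(":memory:")  # stand-in for the warehouse database
con.execute("""CREATE TABLE sales_snapshot (
                   snapshot_date TEXT,   -- time is explicitly associated with every row
                   product       TEXT,
                   units_sold    INTEGER)""")

def refresh(rows):
    """Non-volatile refresh: new data supplements, and never replaces, existing data."""
    today = date.today().isoformat()
    con.executemany("INSERT INTO sales_snapshot VALUES (?, ?, ?)",
                    [(today, product, units) for product, units in rows])
    con.commit()

# Each periodic load adds another snapshot; earlier snapshots remain untouched.
refresh([("widget", 120), ("gadget", 75)])
```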

    Data Webhouse

    A distributed data warehouse that is implemented over the Web with no central data repository.

    The Web is an immense source of behavioral data as individuals interact through their Web browsers with remote Web sites.

    Benefits of Data Warehousing

    Potential high returns on investment: An organization must commit a huge amount of resources to ensure the successful implementation of a data warehouse, and the cost can vary enormously, from 50,000 to over 10 million, due to the variety of technical solutions available. However, a study by the International Data Corporation (IDC) in 1996 reported that average three-year returns on investment (ROI) in data warehousing reached 401%, with over 90% of the companies surveyed achieving over 40% ROI, half the companies achieving over 160% ROI, and a quarter with more than 600% ROI (IDC, 1996).

    Competitive advantage: The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.

    Increased productivity of corporate decision-makers: Data warehousing improves the productivity of corporate decision-makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision-makers to perform more substantive, accurate, and consistent analysis.


    Data Warehouse Architecture

    Operational Data: the source for the data warehouse is supplied from:

    Mainframe operational data held in first generation hierarchical and network databases.

    Departmental data held in proprietary file systems such as VSAM, RMS, and relational DBMSs such as Informix and Oracle.

    Private data held on workstations and private servers.

    External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.

    Operational Data Store

    An Operational Data Store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact act simply as a staging area for data to be moved into the warehouse.

    The ODS is often created when legacy operational systems are found to be incapable of achieving

    reporting requirements.

    The ODS provides users with the ease of use of a relational database while remaining distant from

    the decision support functions of the data warehouse.


    Load manager

    The load manager (also called the frontend component) performs all the operations associated with

    the extraction and loading of data into the warehouse.

    The data may be extracted directly from the data sources or, more commonly, from the operational data store.

    The operations performed by the load manager may include simple transformations of the data to

    prepare the data for entry into the warehouse.

    The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom-built programs.
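    As a rough sketch of what a custom-built load manager might do, the code below extracts rows from a hypothetical operational data store, applies a simple transformation (normalizing a DD/MM/YYYY date to ISO format), and loads the result into a warehouse staging table. The sqlite3 back end and all table and column names are assumptions made for the example.

```python
import sqlite3
from datetime import datetime

ods = sqlite3.connect(":memory:")        # hypothetical operational data store
warehouse = sqlite3.connect(":memory:")  # hypothetical data warehouse

ods.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
ods.execute("INSERT INTO orders VALUES (1, '03/08/2019', 99.50)")
warehouse.execute("CREATE TABLE stage_orders (order_id INTEGER, order_date TEXT, amount REAL)")

def load_orders():
    """Extract from the ODS, apply a simple transformation, and load into staging."""
    rows = ods.execute("SELECT order_id, order_date, amount FROM orders").fetchall()
    transformed = [(order_id, datetime.strptime(d, "%d/%m/%Y").date().isoformat(), amount)
                   for order_id, d, amount in rows]
    warehouse.executemany("INSERT INTO stage_orders VALUES (?, ?, ?)", transformed)
    warehouse.commit()

load_orders()
```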

    Warehouse Manager

    The warehouse manager performs all the operations associated with the management of the data in

    the warehouse. This component is constructed using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include the following (see the sketch after this list):

    analysis of data to ensure consistency;

    transformation and merging of source data from temporary storage into data warehouse tables;

    creation of indexes and views on base tables;

    generation of denormalizations (if necessary);

    generation of aggregations (if necessary);

    backing up and archiving data.
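    The sketch below illustrates a few of these operations against an in-memory sqlite3 database: a simple consistency check, merging staged rows into a warehouse table, creating an index, and generating a daily aggregation. The schema and names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stage_sales (sale_date TEXT, product TEXT, amount REAL);
    CREATE TABLE sales       (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO stage_sales VALUES ('2019-08-01', 'widget', 10.0), ('2019-08-01', 'widget', 12.5);
""")

# Analysis of data to ensure consistency: reject staged rows with negative amounts.
assert db.execute("SELECT COUNT(*) FROM stage_sales WHERE amount < 0").fetchone()[0] == 0

# Transformation and merging of source data from temporary storage into warehouse tables.
db.execute("INSERT INTO sales SELECT sale_date, product, amount FROM stage_sales")
db.execute("DELETE FROM stage_sales")

# Creation of an index on a base table, and generation of an aggregation.
db.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sale_date)")
db.execute("""CREATE TABLE sales_daily_summary AS
              SELECT sale_date, product, SUM(amount) AS total_amount
              FROM sales GROUP BY sale_date, product""")
db.commit()
```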

    Query Manager

    The query manager (also called the backend component) performs all the operations associated with

    the management of user queries.

    This component is typically constructed using vendor end-user data access tools, data warehouse

    monitoring tools, database facilities, and custom-built programs.

    The complexity of the query manager is determined by the facilities provided by the end-user access tools and the database.

    The operations performed by this component include directing queries to the appropriate tables and

    scheduling the execution of queries.

    In some cases, the query manager also generates query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
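    A minimal sketch of directing a query to the most appropriate table: a small, invented metadata catalogue records which columns each table can answer and its relative cost, and the query manager routes the request to the cheapest table that satisfies it, preferring a pre-computed summary when possible.

```python
# Hypothetical metadata: the columns each table can answer, and a relative access cost.
TABLES = {
    "sales_daily_summary": {"columns": {"sale_date", "product", "total_amount"}, "cost": 1},
    "sales": {"columns": {"sale_date", "product", "amount", "customer_id"}, "cost": 100},
}

def direct_query(requested_columns):
    """Route the query to the cheapest table that contains all requested columns."""
    candidates = [(meta["cost"], name) for name, meta in TABLES.items()
                  if requested_columns <= meta["columns"]]
    if not candidates:
        raise ValueError("no table can satisfy the query")
    return min(candidates)[1]

print(direct_query({"sale_date", "total_amount"}))  # -> sales_daily_summary
print(direct_query({"sale_date", "customer_id"}))   # -> sales (detailed data is needed)
```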

    Detailed Data

    This area of the warehouse stores all the detailed data in the database schema. In most cases, the detailed data is not stored online but is made available by aggregating the data to the next level of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

    Lightly and highly summarized data

    This area of the warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to changing query profiles.

    The purpose of summary information is to speed up the performance of queries.

    Archive or Backup data

    This area of the warehouse stores the detailed and summarized data for the purposes of archiving and backup. Even though summary data is generated from detailed data, it may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data. The data is transferred to storage archives such as magnetic tape or optical disk.

    Metadata

    This area of the warehouse stores all the metadata (data about data) definitions used by all the

    processes in the warehouse. Metadata is used for a variety of purposes including:

    The extraction and loading processes - metadata is used to map data sources to a common view of the data within the warehouse;

    The warehouse management process - metadata is used to automate the production of summary tables;

    As part of the query management process - metadata is used to direct a query to the most

    appropriate data source.
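    As a rough illustration of the first use (mapping source fields to a common view), the snippet below keeps the mapping as plain metadata and applies it during extraction; the source system identifiers and field names are hypothetical.

```python
# Hypothetical metadata: how fields from two source systems map onto the
# warehouse's common view of a customer record.
FIELD_MAP = {
    "billing_system": {"cust_no": "customer_id", "cust_nm": "customer_name"},
    "crm_system": {"id": "customer_id", "full_name": "customer_name"},
}

def to_common_view(source, record):
    """Rename source-specific fields to the warehouse's common column names."""
    mapping = FIELD_MAP[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

print(to_common_view("billing_system", {"cust_no": 42, "cust_nm": "Acme Ltd"}))
print(to_common_view("crm_system", {"id": 42, "full_name": "Acme Ltd"}))
```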

    End-User Access Tools

    The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the warehouse using end-user access tools.

    The data warehouse must efficiently support ad hoc and routine analysis. High performance is

    achieved by pre-planning the requirements for joins, summations, and periodic reports by end-users.

    Five main groups of end-user tools (Berson and Smith, 1997):

    Reporting and query tools;


    Application development tools;

    Executive Information System (EIS) tools;

    Online Analytical Processing (OLAP) tools;

    Data mining tools.

    Data Warehouse Data Flows

    Data warehousing focuses on the management of five primary data flows, namely:

    o Inflow

    o Upflow

    o Downflow

    o Outflow

    o Metaflow (Hackathorn, 1995)

    Inflow: The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

    The inflow is concerned with taking data from the source systems to load into the data warehouse.

    Alternatively, the data may be first loaded into the operational data store before being transferred to the

    data warehouse. As the source data is generated predominantly by OLTP systems, the data must be reconstructed for the purposes of the data warehouse. The reconstruction of data involves (a minimal sketch follows this list):

    Cleansing dirty data;

    Restructuring data to suit the new requirements of the data warehouse including, for example, adding and/or removing fields, and denormalizing data;

    Ensuring that the source data is consistent with itself and with the data already in the warehouse.
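    The sketch below shows this reconstruction in miniature for one record: dirty values are cleansed, an application-only field is dropped, and a derived field is added before loading. The record layout is invented for illustration.

```python
def reconstruct(record):
    """Cleanse and restructure one source record for the warehouse (illustrative only)."""
    cleaned = {
        "customer_id": int(record["customer_id"]),
        "customer_name": record["customer_name"].strip().title(),  # cleanse dirty data
        "country": record.get("country", "UNKNOWN").upper(),       # make values consistent
    }
    # Restructure: the application-only field is dropped and a derived field is added.
    cleaned["name_length"] = len(cleaned["customer_name"])
    return cleaned

dirty = {"customer_id": "42", "customer_name": "  acme ltd ", "internal_flag": "x"}
print(reconstruct(dirty))
```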

    Upflow:

    The processes associated with adding value to the data in the warehouse through summarizing,

    packaging, and distribution of the data.

    The activities associated with the upflow include (see the sketch after this list):

    o Summarizing the data by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end-users. Summarizing extends beyond simple relational operations to involve sophisticated statistical analysis, including identifying trends, clustering, and sampling the data.

    o Packaging the data by converting the detailed or summarized data into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.

    o Distributing the data to appropriate groups to increase its availability and accessibility.
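    A minimal sketch of the summarizing and packaging activities, assuming a hypothetical sales table: relational grouping produces a more convenient view, and the result is packaged as a spreadsheet-friendly CSV file.

```python
import csv
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES ('2019-08-01', 'widget', 10.0), ('2019-08-02', 'widget', 12.5);
""")

# Summarize: select, project, and group the relational data into a more useful view.
summary = db.execute(
    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product").fetchall()

# Package: convert the summary into a spreadsheet-friendly format.
with open("product_totals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["product", "total_amount"])
    writer.writerows(summary)
```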

    Downflow: The processes associated with archiving and backing up of data in the warehouse.


    Archiving old data plays an important role in maintaining the effectiveness and performance of the

    warehouse by transferring the older data of limited value to a storage archive such as magnetic tape or

    optical disk.

    However, if the correct partitioning scheme is selected for the database, the amount of data online

    should not affect performance.
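    A rough sketch of a downflow step: rows older than a retention cutoff are copied to an archive file and then removed from the online table. The retention policy, table, and archive format are assumptions for the example.

```python
import csv
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES ('2017-01-15', 'widget', 5.0), ('2019-08-01', 'widget', 10.0);
""")

RETENTION_CUTOFF = "2018-01-01"  # hypothetical retention policy

# Archive: copy older data of limited value out of the online database...
old_rows = db.execute("SELECT * FROM sales WHERE sale_date < ?", (RETENTION_CUTOFF,)).fetchall()
with open("sales_archive.csv", "w", newline="") as f:
    csv.writer(f).writerows(old_rows)

# ...then remove it, so the online tables hold only data within the retention period.
db.execute("DELETE FROM sales WHERE sale_date < ?", (RETENTION_CUTOFF,))
db.commit()
```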

    Outflow: The processes associated with making the data available to the end-users.

    The outflow is where the real value of warehousing is realized by the organization. This may require re-engineering the business processes to achieve competitive advantage (Hackathorn, 1995).

    The two key activities involved in the outflow include:

    Accessing, which is concerned with satisfying the end-users' requests for the data they need. The main issue is to create an environment so that users can effectively use the query tools to access the most appropriate data source. The frequency of user accesses can vary from ad hoc, to routine, to real-time. It is important to ensure that the system's resources are used in the most effective way in scheduling the execution of user queries.

    Delivering, which is concerned with proactively delivering information to the end-users' workstations and is referred to as a type of 'publish-and-subscribe' process. The warehouse publishes various 'business objects' that are revised periodically by monitoring usage patterns.

    Users subscribe to the set of business objects that best meets their needs.
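    The following sketch illustrates the publish-and-subscribe idea: the warehouse publishes named business objects, users subscribe to the ones that meet their needs, and each revision is delivered to the subscribers. The object name and the delivery mechanism (a simple callback) are invented.

```python
from collections import defaultdict

subscriptions = defaultdict(list)  # business object name -> delivery callbacks

def subscribe(business_object, deliver):
    """A user subscribes to the business objects that best meet their needs."""
    subscriptions[business_object].append(deliver)

def publish(business_object, content):
    """The warehouse publishes a revised business object to every subscriber."""
    for deliver in subscriptions[business_object]:
        deliver(content)

subscribe("weekly_sales_report", lambda report: print("delivered to analyst:", report))
publish("weekly_sales_report", {"week": "2019-W31", "total_sales": 1234.5})
```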

    Metaflow: The previous flows describe the management of the data warehouse with regard to how the data moves in and out of the warehouse.

    Metaflow is the process that moves metadata (data about the other flows). Metadata is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing.

    Data warehousing tools and techniques (Extraction, Cleansing, and Transformation tools)

    Selecting the correct extraction, cleansing, and transformation tools is a critical step in the construction

    of a data warehouse. There are an increasing number of vendors that are focused on fulfilling the

    requirements of data warehouse implementations as opposed to simply moving data between hardware

    platforms. Some of these tools include:

    Code generators: create customized 3GL/4GL transformation programs based

    on source and target data definitions.

    Database data replication tools: employ database triggers or a recovery log to capture changes to a single data source on one system and apply the changes to a copy of the source data located on a different system.

    Dynamic transformation engines: rule-driven dynamic transformation engines capture data from source systems at user-defined intervals, transform the data, and then send and load the results into a target environment.
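    As a toy illustration of a rule-driven engine, the code below applies user-defined transformation rules to captured records at a fixed interval and loads the results into an in-memory target; the rules, the interval, and the 'target environment' are all assumptions.

```python
import time

# User-defined transformation rules: field name -> function applied to the value.
RULES = {"amount": lambda v: round(float(v), 2), "currency": lambda v: v.upper()}

source_queue = [{"amount": "10.456", "currency": "usd"}]  # records captured from a source system
target = []                                               # stand-in for the target environment

def run_engine(cycles, interval_seconds=0.1):
    """Capture, transform, and load at user-defined intervals."""
    for _ in range(cycles):
        while source_queue:
            record = source_queue.pop(0)
            target.append({field: RULES.get(field, lambda v: v)(value)
                           for field, value in record.items()})
        time.sleep(interval_seconds)

run_engine(cycles=1)
print(target)  # [{'amount': 10.46, 'currency': 'USD'}]
```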


    Data Warehouse DBMS

    There are a few integration issues associated with the data warehouse database. Due to the maturity of such products, most relational databases will integrate predictably with other types of software. However, there are issues associated with the potential size of the data warehouse database. Parallelism in the database becomes an important issue, as well as the usual issues such as performance, scalability, availability, and manageability, which must all be taken into consideration when choosing a DBMS.

    Requirements for data warehouse DBMS

    Load performance

    Data warehouses require incremental loading of new data on a periodic basis within narrow time

    windows. Performance of the load process should be measured in hundreds of millions of rows or gigabytes of data per hour, and there should be no maximum limit that constrains the business.

    Load processing

    Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. Although each step may in practice be atomic, the load process should appear to execute as a single, seamless unit of work.
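    A minimal sketch of treating a multi-step load as one seamless unit of work, using an sqlite3 transaction so that either every step commits or none does; the steps, table, and integrity rule are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (sale_date TEXT, amount REAL CHECK (amount >= 0))")

def load(rows):
    """Conversion, integrity checks, insertion, and indexing commit or roll back together."""
    try:
        with db:  # one transaction wraps every step of the load
            converted = [(d, float(a)) for d, a in rows]                  # data conversion
            db.executemany("INSERT INTO sales VALUES (?, ?)", converted)  # integrity-checked insert
            db.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sale_date)")  # indexing
    except (ValueError, sqlite3.IntegrityError) as exc:
        print("load rolled back:", exc)

load([("2019-08-01", "10.0"), ("2019-08-02", "-5.0")])  # second row fails the check
print(db.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 0: nothing was kept
```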

    Data quality management

    The shift to fact-based management demands the highest data quality. The warehouse must ensure local

    consistency, global consistency, and referential integrity despite 'dirty' sources and massive database sizes. While loading and preparation are necessary steps, they are not sufficient. The ability to answer

    end-users' queries is the measure of success for a data warehouse application. As more questions are

    answered, analysts tend to ask more creative and complex questions.

    Query performance

    Fact-based management and ad hoc analysis must not be slowed or inhibited by the performance of the

    data warehouse RDBMS. Large, complex queries for key business operations must complete in

    reasonable time periods.

    Terabyte scalability

    Data warehouse sizes are growing at enormous rates, with sizes ranging from a few gigabytes, to hundreds of gigabytes, to terabyte-sized (10^12 bytes) and petabyte-sized (10^15 bytes) databases.


    Mass user scalability

    Current thinking is that access to a data warehouse is limited to relatively low numbers of managerial users. This is unlikely to remain true as the value of data warehouses is realized. It is predicted that the data warehouse RDBMS should be capable of supporting hundreds, or even thousands, of concurrent users while maintaining acceptable query performance.

    Network data warehouse

    Data warehouse systems should be capable of cooperating in a larger network of data warehouses. The data warehouse must include tools that coordinate the movement of subsets of data between warehouses. Users should be able to look at, and work with, multiple data warehouses from a single client workstation.

    Warehouse Administration

    The very large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource limits, chargeback accounting to allocate costs back to users, and query prioritization to address the needs of different user classes and activities. The RDBMS must also provide for workload tracking and tuning so that system resources may be optimized for maximum performance and throughput. The most visible and measurable value of implementing a data warehouse is evidenced in the uninhibited, creative access to data it provides for end-users.

    Integrated dimensional analysis

    The power of multi-dimensional views is widely accepted, and dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools (see Chapter 33). The RDBMS must support fast, easy creation of pre-computed summaries common in large data warehouses, and provide maintenance tools to automate the creation of these pre-computed aggregates. Dynamic calculation of aggregates should be consistent with the interactive performance needs of the end-user.

    Advanced query functionality

    End-users require advanced analytical calculations, sequential and comparative analysis, and consistent

    access to detailed and summarized data. Using SQL in a client-server 'point-and-click' tool environment may sometimes be impractical or even impossible due to the complexity of the users' queries. The RDBMS must provide a complete and advanced set of analytical operations.

    Parallel DBMS

    Parallel DBMSs must be capable of running parallel queries. In other words, they must be able to decompose large complex queries into subqueries, run the separate subqueries simultaneously, and reassemble the results at the end. The capability of such DBMSs must also include parallel data loading, table scanning, and data archiving and backup. There are two main parallel hardware architectures commonly used as database server platforms for data warehousing (a minimal sketch of parallel subquery execution follows this list):

    Symmetric Multi-Processing (SMP) - a set of tightly coupled processors that share memory and

    disk storage;

    Massively Parallel Processing (MPP) - a set of loosely coupled processors, each of which has its

    own memory and disk storage.
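    The sketch below mimics, at a very small scale and purely in application code, what a parallel DBMS does internally: a query over several months is decomposed into per-month subqueries, the subqueries run concurrently on separate connections, and the partial results are reassembled at the end. The table, the partitioning by month, and the file path are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB_PATH = "warehouse_demo.db"  # hypothetical warehouse database file

setup = sqlite3.connect(DB_PATH)
setup.executescript("""
    DROP TABLE IF EXISTS sales;
    CREATE TABLE sales (sale_month TEXT, amount REAL);
    INSERT INTO sales VALUES ('2019-06', 10.0), ('2019-07', 20.0), ('2019-08', 30.0);
""")
setup.close()

def subquery(month):
    """Each worker runs one subquery on its own connection (one 'processor')."""
    con = sqlite3.connect(DB_PATH)
    total = con.execute("SELECT COALESCE(SUM(amount), 0) FROM sales WHERE sale_month = ?",
                        (month,)).fetchone()[0]
    con.close()
    return total

months = ["2019-06", "2019-07", "2019-08"]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_totals = list(pool.map(subquery, months))

print(sum(partial_totals))  # reassemble the partial results: 60.0
```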

    Data Marts

    A subset of a data warehouse that supports the requirements of a particular department or business function.

    Characteristics of a data mart

    A data mart focuses only on the requirements of users associated with one department or business function (see the sketch after this list);

    Data marts do not normally contain detailed operational data, unlike data warehouses;

    As data marts contain less data compared with data warehouses, data marts are more easily

    understood and navigated.
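    To make the idea concrete, the sketch below builds a small departmental data mart as a subset extracted from a warehouse table, keeping only the rows and the level of summarization that one business function needs; the department, tables, and columns are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (sale_date TEXT, region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2019-08-01', 'EMEA', 'widget', 10.0),
        ('2019-08-01', 'APAC', 'widget', 99.0);
""")

mart = sqlite3.connect(":memory:")  # the (hypothetical) EMEA sales department's data mart
mart.execute("CREATE TABLE emea_sales_summary (sale_date TEXT, product TEXT, total REAL)")

# Populate the mart with only the subset (and summarization) this department needs.
rows = warehouse.execute("""
    SELECT sale_date, product, SUM(amount)
    FROM sales WHERE region = 'EMEA'
    GROUP BY sale_date, product""").fetchall()
mart.executemany("INSERT INTO emea_sales_summary VALUES (?, ?, ?)", rows)
mart.commit()
```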

    Reasons for creating a data mart

    There are many reasons for creating a data mart, which include:

    To give users access to the data they need to analyze most often.

    To provide data in a form that matches the collective view of the data by a group of users in a

    department or business function.

    To improve end-user response time due to the reduction in the volume of data to be accessed.

    To provide appropriately structured data as dictated by the requirements of end-user access tools

    such as Online Analytical Processing (OLAP) and data mining tools, which may require their

    own internal database structures. In practice, these tools often create their own data mart designed to support their specific functionality.

    Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than

    establishing a corporate data warehouse.

    The cost of implementing data marts is normally less than that required to establish a data warehouse.

    The potential users of a data mart are more clearly defined and can be more easily targeted to

    obtain support for a data mart project rather than a corporate data warehouse project.

    Data Mart Issues

    Data mart functionality


    Data mart size

    Data mart load performance

    Users' access to data in multiple data marts

    Data mart Internet/intranet access

    Data mart administration

    Data mart installation
