8/11/2019 2. NOTES.doc
1/70
CS2032 DATA WAREHOUSING AND DATA MINING
Department of Information Technology
UNIT I
DATA WAREHOUSING
Data Warehouse Introduction
A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and data analysis.
Table design, dimensions and organization should be consistent throughout a data warehouse so that
reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a
database for historical data from different functions within a company.
The term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following
way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision making process."
He defined the terms in the sentence as follows:
Subject-Oriented: Data that gives information about a particular subject instead of about a company's
ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.
This enables management to gain a consistent picture of the business. A data warehouse is a single,
complete and consistent store of data obtained from a variety of different sources, made available to end
users in a form they can understand and use in a business context. It can be
Used for decision support
Used to manage and control business
Used by managers and end users to understand the business and make judgments
Data Warehousing is an architectural construct of information systems that provides users with current
and historical decision support information that is hard to access or present in traditional operational data
stores.
Other important terminology
Enterprise Data warehouse: It collects all information about subjects (customers, products, sales,
assets, personnel) that span the entire organization.
Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data
warehouse that can provide data for reporting and analysis on a section, unit, department or operation in
the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data
warehouses which are usually smaller than the corporate data warehouse.
Decision Support System (DSS): Information technology to help the knowledge worker (executive,
manager, and analyst) make faster and better decisions.
Drill-down: Traversing the summarization levels from highly summarized data to the underlying
current or old detail.
Metadata: Data about data. It contains the location and description of warehouse system components:
names, definitions, structure, etc.
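The terminology above is easier to picture with a small sketch. The following is a minimal, hypothetical metadata-repository entry in Python; every name in it (the table, its columns and the record fields) is invented for illustration, not taken from any real product:

```python
# A minimal, illustrative metadata record: "data about data" for one
# warehouse table -- its location, definition and structure.
metadata_repository = {
    "sales_fact": {
        "location": "warehouse_db.sales_fact",     # where the data lives
        "definition": "Daily sales totals by product and store",
        "structure": ["date_key", "product_key", "store_key", "amount"],
        "source_systems": ["order_entry", "pos"],  # operational origins
        "refresh": "nightly",                      # load schedule
    }
}

def describe(table_name):
    """Look up a table in the metadata repository, as an end user would."""
    entry = metadata_repository[table_name]
    cols = ", ".join(entry["structure"])
    return f"{table_name}: {entry['definition']} (columns: {cols})"

print(describe("sales_fact"))
```

A real informational directory would of course hold far richer descriptions, but the idea is the same: a searchable store that tells users where warehouse data is and what it means.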
Benefits of data warehousing
Data warehouses are designed to perform well with aggregate queries running on large
amounts of data.
The structure of data warehouses is easier for end users to navigate, understand and query
against, unlike the relational databases primarily designed to handle lots of transactions.
Data warehouses enable queries that cut across different segments of a company's operation.
E.g. production data could be compared against inventory data even if they were originally
stored in different databases with different structures.
Queries that would be complex in very normalized databases can be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that comes from a variety of
sources, is non-uniform, and is scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from lots of
users.
Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with a competitive advantage.
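The aggregate-query benefit, and in particular a query that cuts across two operational segments, can be sketched with Python's built-in sqlite3. The tables, products and figures below are invented for the example:

```python
import sqlite3

# Tiny in-memory "warehouse": production and inventory data, originally from
# different operational systems, consolidated into one queryable store.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE production (product TEXT, units_built INTEGER);
    CREATE TABLE inventory  (product TEXT, units_on_hand INTEGER);
    INSERT INTO production VALUES ('widget', 120), ('gadget', 80);
    INSERT INTO inventory  VALUES ('widget', 45),  ('gadget', 95);
""")

# One aggregate query that compares production against inventory,
# i.e. cuts across both segments in a single pass.
rows = con.execute("""
    SELECT p.product, p.units_built, i.units_on_hand,
           p.units_built - i.units_on_hand AS net_shipped
    FROM production p JOIN inventory i ON p.product = i.product
    ORDER BY p.product
""").fetchall()

for product, built, on_hand, shipped in rows:
    print(product, built, on_hand, shipped)
```

In a real warehouse the two source tables would first have been extracted and transformed into a common structure; here they are created side by side only to keep the sketch self-contained.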
Operational and informational Data
Operational Data:
Focusing on transactional functions such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data
Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non-updateable
Data Warehouse Characteristics
A data warehouse can be viewed as an information system with the following attributes:
It is a database designed for analytical tasks
Its content is periodically updated
It contains current and historical data to provide a historical perspective of information
Operational data store (ODS)
ODS is an architecture concept to support day-to-day operational decision support; it contains
current-value data propagated from operational applications.
ODS is subject-oriented, similar to the classic definition of a Data warehouse.
ODS is integrated.
[Figure: ODS vs. Data Warehouse]
[Figure: Data warehouse architecture and its components]
A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data.
The central repository is surrounded by a number of key components designed to make
the environment functional, manageable and accessible.
The data sources for the data warehouse are the operational applications. The data entered into
the data warehouse is transformed into an integrated structure and format. The transformation process
involves conversion, summarization, filtering and condensation. The data warehouse must be capable of
holding and managing large volumes of data, as well as different data structures, over time.
1. Data warehouse database
This is the central part of the data warehousing environment (item number 2 in the
above architecture diagram). It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Clean-up, and Transformation Tools
This is item number 1 in the above architecture diagram. These tools perform conversions, summarization, key
changes, structural changes and condensation. The data transformation is required so that the information
can be used by decision support tools. The transformation produces programs, control statements, JCL
code, COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse from
multiple operational systems.
The functionalities of these tools are listed below:
To remove unwanted data from operational databases
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
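The clean-up functions listed above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's tool; the field names, the default values and the "VOID" status flag are all assumptions made for the example:

```python
# Toy clean-up/transformation pass over extracted operational records:
# convert to common attribute names, drop unwanted rows, establish
# defaults for missing data, and calculate a derived summary.

RENAME = {"cust_nm": "customer_name", "amt": "amount"}  # common names
DEFAULTS = {"region": "UNKNOWN"}                        # for missing data

def transform(records):
    cleaned = []
    for rec in records:
        if rec.get("status") == "VOID":          # remove unwanted data
            continue
        out = {RENAME.get(k, k): v for k, v in rec.items() if k != "status"}
        for field, default in DEFAULTS.items():  # establish defaults
            out.setdefault(field, default)
        cleaned.append(out)
    total = sum(r["amount"] for r in cleaned)    # derived summary
    return cleaned, total

records = [
    {"cust_nm": "Acme", "amt": 100, "region": "EU", "status": "OK"},
    {"cust_nm": "Zeta", "amt": 50, "status": "OK"},
    {"cust_nm": "Bad",  "amt": 999, "status": "VOID"},
]
cleaned, total = transform(records)
print(cleaned, total)
```

Real transformation tools generate this kind of logic as code (COBOL, scripts, SQL) rather than running it directly, but the operations they encode are the ones shown here.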
Issues to be considered during data sourcing, cleanup, extraction and transformation:
Database heterogeneity: DBMSs differ in nature; they may have different data models, different data
access languages, different data navigation methods, operations, concurrency, integrity and recovery
processes, etc.
Data heterogeneity: It refers to the different ways the data is defined and used in different models.
Some vendors involved in the development of such tools:
Prism Solutions, Evolutionary Technologies Inc.
3. Meta data
Meta data helps the users to understand the content and find the data. Meta data is stored in a
separate data store known as the informational directory or Meta data repository, which helps to
integrate, maintain and view the contents of the data warehouse. The following lists the characteristics of
the informational directory / Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and availability
It should be searchable by business-oriented key words
It should act as a launch platform for end users to access data and analysis tools
It should support the sharing of information
It should support scheduling options for requests
It should support and provide interfaces to other applications
It should support end-user monitoring of the status of the data warehouse environment
4. Access tools
Their purpose is to provide information to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive information system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools.
They are:
Production reporting tools, used to generate regular operational reports
Desktop report writers, inexpensive desktop tools designed for end users
Managed Query tools: used to generate SQL queries. They use a Meta layer software between users
and databases which offers point-and-click creation of SQL statements. These tools are a preferred choice
for users performing segment identification, demographic analysis, territory management, preparation of
customer mailing lists, etc.
Application development tools: These provide a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to multidimensional
databases and MRDB refers to multirelational databases.
Data mining tools: are used to discover knowledge from data warehouse data; they can also be used
for data visualization and data correction purposes.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a
dedicated user group. They are used for rapid delivery of enhanced decision support functionality
to end users. A data mart is used in the following situations:
Extremely urgent user requirements
The absence of a budget for a full-scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mind-sized project
Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it, the organization has to pay more attention to system scalability, consistency
and manageability issues.
2. Data integration
6. Data warehouse admin and management
The management of a data warehouse includes:
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating meta data
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery
Data warehouse storage management, which includes capacity planning, hierarchical storage
management, purging of aged data, etc.
7. Information delivery system
It is used to enable the process of subscribing for data warehouse information.
It delivers to one or more destinations according to a specified scheduling algorithm.
2. Building a Data warehouse
There are two reasons why organizations consider data warehousing a critical need. In
other words, there are two factors that drive you to build and use a data warehouse. They are:
Business factors:
Business users want to make decisions quickly and correctly, using all available data.
Technological factors:
To address the incompatibility of operational data stores.
IT infrastructure is changing rapidly; its capacity is increasing and its cost is decreasing, so
building a data warehouse is easy.
There are several things to be considered while building a successful data warehouse.
Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following
two approaches:
1. Top-Down Approach (suggested by Bill Inmon)
2. Bottom-Up Approach (suggested by Ralph Kimball)
1. Top-Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house
corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the
EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate-wide data helps us maintain one version of the truth of the
data. The data in the EDW is stored at the most detailed level, in order to gain:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented, we start building subject-area-specific data marts which contain
data in a denormalized form, also called a star schema. The data in the marts is usually summarized based
on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide
faster access to the data for end-user analytics. If we were to query a normalized schema for the
same analytics, we would end up with complex multi-level joins that would be much slower than
queries on the denormalized schema.
We should implement the top-down approach when:
1. The business has complete clarity on the data warehouse requirements for all or multiple subject areas.
2. The business is ready to invest considerable time and money.
The advantage of using the top-down approach is that we build a centralized repository catering
for one version of the truth for business data. This is very important for the data to be reliable and consistent
across subject areas, and for reconciliation in case of data-related contention between subject areas.
The disadvantage of using the top-down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by the building of the data
marts, before they can access their reports.
2. Bottom-Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data
warehouse. Here we build the data marts separately, at different points in time, as and when the specific
subject area requirements are clear. The data marts are then integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed dimensions and conformed
facts. A conformed dimension or conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent
values across separate data marts. A conformed dimension means exactly the same thing with every fact table
it is joined to. A conformed fact has the same definition of measures, the same dimensions joined to it, and the
same granularity across data marts.
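One way to picture a conformed dimension is as a single table of keys and attributes shared unchanged by the fact tables of separate marts. A minimal Python sketch follows; the products, keys and figures are invented for illustration:

```python
# A conformed Product dimension: one set of keys and attribute names,
# shared unchanged by two separately built data marts.
product_dim = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "gadget", "category": "hardware"},
}

# A sales mart and an inventory mart both reference the same product keys.
sales_fact = [{"product_key": 1, "revenue": 500},
              {"product_key": 2, "revenue": 300}]
inventory_fact = [{"product_key": 1, "units": 45},
                  {"product_key": 2, "units": 95}]

def by_product_name(facts):
    """Because the dimension is conformed, either mart resolves its keys
    to exactly the same product names."""
    return {product_dim[f["product_key"]]["name"]: f for f in facts}

print(by_product_name(sales_fact)["widget"]["revenue"])    # 500
print(by_product_name(inventory_fact)["widget"]["units"])  # 45
```

Because both marts agree on the dimension, results from one can be combined with results from the other without any key translation, which is exactly what makes the marts integrable into a warehouse.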
The bottom-up approach helps us incrementally build the warehouse by developing and integrating
data marts as and when the requirements are clear. We don't have to wait until the overall
requirements of the warehouse are known. We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity for only one data mart.
The advantage of using the bottom-up approach is that it does not require high initial costs and
has a faster implementation time; hence the business can start using the marts much earlier than with
the top-down approach.
The disadvantage of using the bottom-up approach is that it stores data in denormalized
format, hence there is high space usage for detailed data. There is also a tendency not to keep
detailed data in this approach, losing the advantage of having detailed data, i.e. the flexibility to easily
cater to future requirements. The bottom-up approach is more realistic, but the complexity of the integration
may become a serious obstacle.
DESIGN CONSIDERATIONS
To be successful, a data warehouse designer must adopt a holistic approach: consider all
data warehouse components as parts of a single complex system, and take into account all possible data
sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
Heterogeneity of data sources
Use of historical data
Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering
approach. In addition to the general considerations, the following specific points are relevant to data
warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is
the template that describes how information will be organized within the integrated warehouse framework.
The data warehouse data must be detailed data. It must be formatted, cleaned up and transformed to fit
the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to
find definitions or subject areas. In other words, it must provide decision-support-oriented pointers to
warehouse data, and thus provide a logical link between warehouse data and decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know
how the data should be divided across multiple servers, and which users should get access to which types of
data. The data can be distributed based on the subject area, location (geographical region), or time (current,
month, year).
Tools
A number of tools are available that are specifically designed to help in the
implementation of a data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta data
repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and query models
TECHNICAL CONSIDERATIONS
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data
The communication infrastructure that connects the data marts, operational systems and end
users
The hardware and software to support the meta data repository
The systems management framework that enables administration of the entire environment
IMPLEMENTATION CONSIDERATIONS
The following logical steps are needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the database technology and platform
Extract the data from the operational databases, transform it, clean it up and load it into the warehouse
Choose database access and reporting tools
Choose database connectivity software
Choose data analysis and presentation software
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose
is based on the type of data that can be selected using the tool and the kind of access it permits for a
particular user. The following lists the various types of data that can be accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data
Data for testing of hypotheses, trends and patterns
Predefined repeatable queries
Ad hoc user-specified queries
Reporting and analysis data
Complex queries with multiple joins, multi-level subqueries and sophisticated search criteria
Data extraction, clean up, transformation and migration
Proper attention must be paid to data extraction, which represents a success factor for a data
warehouse architecture. When implementing a data warehouse, the following selection criteria, which
affect the ability to transform, consolidate, integrate and repair the data, should be considered:
Timeliness of data delivery to the warehouse
The tool must have the ability to identify the particular data that can be read by the conversion tool
The tool must support flat files and indexed files, since corporate data is still stored in these formats
The tool must have the capability to merge data from multiple data stores
The tool should have a specification interface to indicate the data to be extracted
The tool should have the ability to read data from the data dictionary
The code generated by the tool should be completely maintainable
The tool should permit the user to extract the required data
The tool must have the facility to perform data type and character set translation
The tool must have the capability to create summarization, aggregation and derivation of records
The data warehouse database system must be able to load data directly from these tools
Data placement strategies
As a data warehouse grows, there are at least two options for data placement. One is to put some of
the data warehouse data onto another storage medium.
The second option is to distribute the data in the data warehouse across multiple servers.
User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the
warehouse. There are three classes of users:
Casual users: are most comfortable retrieving information from the warehouse in predefined formats and
running pre-existing queries and reports. These users do not need tools that allow for building standard and
ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc
reports. These users can engage in drill-down operations. These users may have experience with
reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard analysis
on the information they retrieve. These users have knowledge about the use of query and reporting tools.
Benefits of data warehousing
Data warehouse usage includes:
Locating the right information
Presentation of information
Testing of hypotheses
Discovery of information
Sharing the analysis
The benefits can be classified into two:
Tangible benefits (quantified / measurable): these include,
Improvement in product inventory
Decrement in production cost
Improvement in selection of target markets
Enhancement in asset and liability management
Intangible benefits (not easy to quantify): these include,
Improvement in productivity by keeping all data in a single location and eliminating rekeying of
data
Reduced redundant processing
Enhanced customer relations
3. Mapping the data warehouse architecture to Multiprocessor architecture
The functions of a data warehouse are based on relational database technology, which is
implemented in a parallel manner. There are two advantages of having parallel relational database
technology for a data warehouse:
Linear Speed-up: refers to the ability to increase the number of processors in order to reduce response time.
Linear Scale-up: refers to the ability to provide the same performance on the same request as the
database size increases.
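These two properties are commonly expressed as simple ratios. A small Python sketch, with made-up timings, shows the ideal (linear) case:

```python
def speed_up(time_1_cpu, time_n_cpus):
    """Speed-up: how much faster the same job runs with more processors."""
    return time_1_cpu / time_n_cpus

def scale_up(throughput_small, throughput_large):
    """Scale-up: whether an n-times-larger system sustains the same
    performance on an n-times-larger problem (1.0 means linear)."""
    return throughput_large / throughput_small

# Ideal linear behavior: 4 CPUs run the query in one quarter of the time,
# and the 4x-larger system keeps throughput constant on the 4x workload.
print(speed_up(100.0, 25.0))   # 4.0
print(scale_up(200.0, 200.0))  # 1.0
```

In practice, coordination and communication overheads keep real systems below these ideals, which is why the vendor comparison later in this unit speaks of "non-linear" speed-up and scale-up.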
Types of parallelism
There are two types of parallelism:
Inter-query Parallelism: different server threads or processes handle multiple requests at
the same time.
Intra-query Parallelism: this form of parallelism decomposes a serial SQL query into lower-level
operations such as scan, join, sort etc. These lower-level operations are then executed concurrently,
in parallel.
Intra-query parallelism can be done in either of two ways:
Horizontal parallelism: the database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently on different processors
against different sets of data.
Vertical parallelism: occurs among different tasks. All query components such as scan, join,
sort etc. are executed in parallel in a pipelined fashion; in other words, the output from one task becomes the
input to another task.
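Horizontal parallelism can be illustrated in miniature: partition the data, then run the same low-level scan task concurrently against each partition. The Python sketch below uses threads to stand in for the processors, and the partitions and predicate are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Horizontal parallelism in miniature: the "table" is split into four
# partitions (standing in for disks), and the same scan operation runs
# concurrently against each partition on its own worker.
partitions = [
    [3, 14, 9], [27, 5, 11], [8, 40, 2], [6, 1, 30],
]

def scan(partition, predicate):
    """The low-level 'scan' operation applied to one partition."""
    return [row for row in partition if predicate(row)]

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    pieces = pool.map(scan, partitions,
                      [lambda r: r > 10] * len(partitions))

# Merge the per-partition results, as the query coordinator would.
result = sorted(row for piece in pieces for row in piece)
print(result)  # [11, 14, 27, 30, 40]
```

Vertical parallelism would instead chain the operations (scan feeding join feeding sort) as a pipeline, with each stage consuming the previous stage's output as it is produced.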
Data partitioning
Data partitioning is the key component for effective parallel execution of database operations.
Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another
option for random partitioning is round-robin partitioning, in which each record is placed on the next
disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not
waste time searching for it across all disks. The various intelligent partitioning schemes include:
Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of
the partitioning key for each row.
Key range partitioning: rows are placed and located in the partitions according to the value of the
partitioning key, e.g. all rows with key values from A to K in partition 1, L to T in
partition 2, and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk,
etc. This is useful for small reference tables.
User-defined partitioning: allows a table to be partitioned on the basis of a user-defined
expression.
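Two of these schemes are easy to sketch in Python. Both functions below are toy illustrations; the partition count and the key ranges are assumptions for the example, not any DBMS's actual rules:

```python
# Sketches of two intelligent partitioning schemes for a 4-partition system.
NUM_PARTITIONS = 4

def hash_partition(key):
    """Hash partitioning: the partition number is computed by applying a
    hash function to the partitioning key."""
    return hash(key) % NUM_PARTITIONS

def key_range_partition(key):
    """Key range partitioning: rows are placed by the range their key falls
    into, e.g. names starting A-K in partition 0, L-T in partition 1,
    and everything else in partition 2."""
    first = key[0].upper()
    if "A" <= first <= "K":
        return 0
    if "L" <= first <= "T":
        return 1
    return 2

print(key_range_partition("Kimball"))  # 0
print(key_range_partition("Smith"))    # 1
print(0 <= hash_partition("Kimball") < NUM_PARTITIONS)  # True
```

Note the trade-off visible even in the sketch: hash partitioning spreads arbitrary keys evenly but scatters adjacent key values, while key range partitioning keeps ranges together at the risk of uneven (skewed) partitions.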
Database architectures for parallel processing
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture
Tightly coupled shared memory systems, illustrated in the following figure, have the following
characteristics:
Multiple CPUs share memory.
Each CPU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be
used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the multiple
CPUs and is accessible by all the CPUs through a memory bus. Examples of tightly coupled systems include
the Pyramid, Sequent, and Sun SPARCserver.
Performance is potentially limited in a tightly coupled system by a number of factors. These include
various system components such as the memory bandwidth, CPU-to-CPU communication bandwidth, the
memory available on the system, the I/O bandwidth, and the bandwidth of the common bus.
Parallel processing advantages of shared memory systems are these:
Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have
the following characteristics:
Each node consists of one or more CPUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The Distributed Lock
Manager (DLM) is required. Examples of loosely coupled systems are
Parallel processing disadvantages of shared disk systems are these:
Inter-node synchronization is required, involving DLM overhead and greater dependency on the
high-speed interconnect.
If the workload is not partitioned well, there may be high synchronization overhead.
There is operating system overhead in running shared disk software.
Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is
connected to a given disk. If a table or database is located on that disk, access depends entirely on the CPU
which owns it. Shared nothing systems can be represented as follows:
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless,
adding more CPUs and disks can improve scale-up. Oracle Parallel Server can access the disks on a shared
nothing system as long as the operating system provides transparent disk access, but this access is
expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system, it
may be worthwhile to consider data-dependent routing to alleviate contention.
Parallel DBMS features
Scope and techniques of parallel DBMS operations
Optimizer implementation
Application transparency
A parallel environment which allows the DBMS server to take full advantage of the existing facilities
on a very low level
DBMS management tools to help configure, tune, administer and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS
Price/Performance: the parallel RDBMS can demonstrate a non-linear speed-up and scale-up at
reasonable costs.
Parallel DBMS vendors
Oracle: Parallel Query Option (PQO)
Architecture: shared disk architecture
Data partition$ Hey range, hash, round robin
=arallel operations$ hash &oins, scan and sort
nformi)$ eGtended =arallel +erver G=+/
Architecture$ +hared memory, shared dis- and shared nothing models
Data partition$ round robin, hash, schema, -ey range and user defined
=arallel operations$ 9+38T, *=DAT3, D3F3FT3
B($ DB> =arallel 3dition DB> =3/
Architecture$ +hared nothing modelsData partition$ hash
=arallel operations$ 9+38T, *=DAT3, D3F3FT3, load, recovery, inde) creation, bac-up, table
reorganization
SYBASE: SYBASE MPP
Architecture: shared nothing models
Data partition: hash, key range, schema
Parallel operations: horizontal and vertical parallelism
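The three data-partitioning strategies that recur in the vendor list above (hash, key range, round robin) can be sketched as simple routing functions. This is a minimal illustration, assuming a hypothetical 3-node setup and integer keys; it is not taken from any vendor's implementation.

```python
# Sketch of the three common data-partitioning strategies listed above.
# The 3-node configuration and sample keys are hypothetical.

def hash_partition(key, n_nodes):
    """Route a row to a node by hashing its partitioning key."""
    return hash(key) % n_nodes

def key_range_partition(key, boundaries):
    """Route by key range: `boundaries` holds the upper bound per node."""
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)  # the last node takes everything above the top bound

def round_robin_partition(row_number, n_nodes):
    """Spread rows evenly across nodes regardless of their content."""
    return row_number % n_nodes

# Example: distribute rows keyed 0..9 across 3 nodes.
ids = range(10)
print([round_robin_partition(i, 3) for i in ids])    # cycles 0,1,2,0,1,2,...
print([key_range_partition(i, [3, 6]) for i in ids]) # ranges 0-3, 4-6, 7+
```

Round robin balances load but destroys locality; key range keeps related keys together (good for range scans) at the risk of skew; hashing is the usual compromise.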
DBMS schemas for decision support
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents business
items or business transactions. A dimension is a collection of data that describes one business dimension.
Dimensions determine the contextual background for the facts; they are the parameters over which we
want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions.
Considering the relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
Because the star schema is the simplest architecture, it is most commonly used nowadays and is
recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables. A fact table
typically has two types of columns: foreign keys to dimension tables and measures, those that contain
numeric facts. A fact table can contain fact data at detail or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year),
Region dimension (profit by country, state, city), Product dimension (profit for product1,
product2).
A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a
dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of
the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to
describe the dimensional value. They are normally descriptive, textual values. Dimension tables are
generally small in size compared to the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic regions
(markets, cities), clients, products, times, channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data in which
end users are interested. E.g. a sales fact table may contain a profit measure which represents profit on
each sale.
Aggregations are pre-calculated numeric data. By calculating and storing the answers to a query before
users ask for it, the query processing time can be reduced. This is key in providing fast query performance
in OLAP.
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical capabilities to clients.
The main characteristics of the star schema:
Simple structure - easy to understand schema
Great query effectiveness - small number of tables to join
Relatively long time of loading data into dimension tables - denormalization and redundancy of
data mean that the size of the table could be large.
The most commonly used in data warehouse implementations - widely supported by a
large number of business intelligence tools
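The fact/dimension layout described above can be sketched as a tiny star schema in SQL, run here through Python's built-in sqlite3 module. The table names, columns and sample rows are illustrative assumptions, not from the text:

```python
import sqlite3

# A minimal star schema: one fact table with foreign keys into two
# dimension tables, plus a typical star-join query over a measure.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,   -- measure
    profit     REAL       -- measure
);
""")
con.executemany("INSERT INTO dim_time VALUES (?,?,?)",
                [(1, 2019, 1), (2, 2019, 2)])
con.executemany("INSERT INTO dim_product VALUES (?,?)",
                [(1, "widget"), (2, "gadget")])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 10, 50.0), (2, 1, 5, 25.0), (2, 2, 8, 40.0)])

# Profit viewed by the Time and Product dimensions: a star-join query.
rows = con.execute("""
    SELECT t.year, p.name, SUM(f.profit)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.name
""").fetchall()
print(rows)  # profit summed per (year, product)
```

Note how the query only ever joins the small dimension tables directly to the fact table; this single level of joins is what gives the star schema its simplicity.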
Snowflake schema:
The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in
a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a
level in the dimensional hierarchy.
For example, consider a Time Dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then
connected to Day. Week is only connected to Day.
The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables.
The main disadvantage of the snowflake schema is the additional maintenance effort needed due
to the increased number of lookup tables.
It is the result of decomposing one or more of the dimensions. The many-to-one relationships
among sets of attributes of a dimension can separate new dimension tables, forming a hierarchy. The
decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.
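The Year → Month → Day hierarchy described above can be sketched as normalized lookup tables chained by foreign keys. The table and column names are illustrative assumptions:

```python
import sqlite3

# Snowflake form of the Time dimension: each hierarchy level becomes
# its own lookup table (Year <- Month <- Day via foreign keys).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lu_year  (year_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE lu_month (month_id INTEGER PRIMARY KEY, month INTEGER,
                       year_id INTEGER REFERENCES lu_year(year_id));
CREATE TABLE lu_day   (day_id INTEGER PRIMARY KEY, day INTEGER,
                       month_id INTEGER REFERENCES lu_month(month_id));
""")
con.execute("INSERT INTO lu_year VALUES (1, 2019)")
con.execute("INSERT INTO lu_month VALUES (1, 8, 1)")
con.execute("INSERT INTO lu_day VALUES (1, 11, 1)")

# Resolving a day back to its year requires joining up the hierarchy;
# these extra joins are the maintenance/complexity cost noted above.
row = con.execute("""
    SELECT d.day, m.month, y.year
    FROM lu_day d
    JOIN lu_month m ON d.month_id = m.month_id
    JOIN lu_year  y ON m.year_id  = y.year_id
""").fetchone()
print(row)  # (11, 8, 2019)
```

In the star form, the same information would sit denormalized in one wide time table, trading storage and redundancy for fewer joins.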
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example by splitting the original star schema into more star schemas, each of them describing
facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact
tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design, because
many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension
tables are still large.
Data Extraction, Cleanup, and Transformation Tools
ETL stands for Extract, Transform, Load. It is the data warehouse acquisition process that involves:
Extracting the data from outside sources,
Transforming the data to fit business needs, and ultimately
Loading the transformed data into the data warehouse.
For example:
1. Informatica.
2. DataStage.
3. Oracle Warehouse Builder.
4. Ab Initio.
ETL can also be used for integration with legacy systems. ETL is the data warehouse
acquisition process of Extracting, Transforming and Loading data from source systems into the data
warehouse.
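The three ETL steps can be sketched as plain functions. The source records, field names and the cleanup rule (normalizing country codes) are hypothetical illustrations:

```python
# A minimal extract-transform-load sketch. The source rows and the
# business rule (consistent country codes and types) are hypothetical.

def extract():
    """Extract: pull raw records from an outside source (hard-coded here)."""
    return [
        {"customer": "alice", "country": "in",  "amount": "120.50"},
        {"customer": "bob",   "country": "IND", "amount": "75"},
    ]

def transform(rows):
    """Transform: fit the data to business needs - uniform codes and types."""
    country_map = {"in": "IN", "ind": "IN"}
    out = []
    for r in rows:
        out.append({
            "customer": r["customer"].title(),
            "country": country_map.get(r["country"].lower(), r["country"]),
            "amount": float(r["amount"]),
        })
    return out

def load(rows, warehouse):
    """Load: append the transformed rows into the warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'customer': 'Alice', 'country': 'IN', 'amount': 120.5}
```

Real tools such as those listed above wrap the same three stages in scheduling, metadata management and error handling.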
Extraction
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After the extraction, this data can be
transformed and loaded into the data warehouse.
Introduction to Extraction Methods in Data Warehouses
The extraction method you should choose is highly dependent on the source system and also on
the business needs in the target data warehouse environment.
users to perform segment identification, demographic analysis, territory management and preparation of
customer mailing lists etc.
Application development tools: This is a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all DB systems.
OLAP Tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties it uses MDDB and MRDB, where MDDB refers to multidimensional databases
and MRDB refers to multirelational databases.
Data mining tools: are used to discover knowledge from the data warehouse data and can also be used
for data visualization and data correction purposes.
Metadata
Meta data: data about data.
Meta Data in Data Warehouse
Meta Data is one of the most important aspects of data warehousing. It is the data about data stored
in the data warehouse and about its users.
Meta Data provides decision-support-oriented pointers to warehouse data and thus provides a logical
link between warehouse data and decision support applications.
Meta Data is the key to providing users and applications with a road map to the information stored
in the warehouse.
Meta Data can define all attributes, data sources and timing, and rules that govern data use and
data transformation of all data elements.
Metadata (metacontent) is defined as data providing information about one or more aspects of the
data, such as:
Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of data
Location on a computer network where the data was created
Standards used
Types:
Technical Meta data:
It contains information about data warehouse data used by the warehouse designer and administrator to carry out
development and management tasks. It includes:
Info about data stores
Transformation descriptions, that is, mapping methods from operational DB to warehouse DB
Warehouse object and data structure definitions for target data
The rules used to perform clean-up and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data acquisition history,
data access etc.
Business Meta data:
It contains info that gives users insight into the info stored in the data warehouse. It includes:
Subject areas, and info object types including queries, reports, images, video, audio clips etc.
Internet home pages
Info related to the info delivery system
Data warehouse operational info such as ownerships, audit trails etc.
Other Types:
Structural metadata is used to describe the structure of computer systems such as tables,
columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed
as a set of keywords in a natural language.
According to Ralph Kimball, metadata can be divided into 2 similar categories: technical
metadata and business metadata. Technical metadata corresponds to internal metadata, business
metadata to external metadata.
Kimball adds a third category named Process metadata. On the other hand, NISO distinguishes
between three types of metadata: descriptive, structural and administrative.
Descriptive metadata is the information used to search and locate an object, such as title, author,
subjects, keywords, publisher; structural metadata gives a description of how the components of the
object are organized; and administrative metadata refers to the technical information, including file type.
Two subtypes of administrative metadata are rights management metadata and preservation metadata.
Types of Data Warehouse
There are mainly three types of Data Warehouse:
1). Enterprise Data Warehouse.
2). Operational Data Store.
3). Data Mart.
An Enterprise Data Warehouse provides a central database for decision support throughout the
enterprise.
An Operational Data Store has a broad, enterprise-wide scope but, unlike a real enterprise DW, its data is refreshed in near real time and used for routine business activity.
A Data Mart is a sub-part of a Data Warehouse. It supports a particular purpose, or it is designed for
particular lines of business such as sales, marketing or finance; in any organization, the documents of a
particular department can form a data mart.
UNIT II
BUSINESS ANALYSIS
Reporting and Query Tools and Applications - Tool Categories - the Need for
Applications
Data query and reporting tools
Query and reporting tools are divided into two parts:
Reporting tools
Managed query tools
Reporting tools are further divided into two parts:
Production reporting tools let companies generate regular operational reports or support
high-volume batch jobs, such as calculating and printing paychecks.
Report writers, on the other hand, are inexpensive desktop tools designed for end users.
• Interactive reporting capability
• Enterprise-wide scalability
• Superior user interface
• Fastest time to result
• Lowest cost of ownership
Catalogs
Impromptu stores metadata in subject-related folders. This metadata is what will be used to
develop a query for a report. The metadata set is stored in a file called a 'catalog'. The catalog does not
contain any data. It just contains information about connecting to the database and the fields that will be
accessible for reports.
A catalog contains:
• Folders - meaningful groups of information representing columns from one or more tables
• Columns - individual data elements that can appear in one or more folders
• Calculations - expressions used to compute required values from existing data
• Conditions - used to filter information so that only a certain type of information is displayed
• Prompts - predefined selection criteria prompts that users can include in reports they create
• Other components, such as metadata, a logical database name, join information, and user classes
You can use catalogs to:
• view, run, and print reports
• export reports to other applications
• disconnect from and connect to the database
• create reports
• change the contents of the catalog
• add user classes
Prompts
You can use prompts to:
• filter reports
• calculate data items
• format data
Picklist Prompts
A picklist prompt presents you with a list of data items from which you select one or more values,
so you need not be familiar with the database. The values listed in picklist prompts can be retrieved from
One of the limitations of SQL is that it cannot represent these complex problems. A query will be
translated into several SQL statements. These SQL statements will involve multiple joins, intermediate
tables, sorting, aggregations and a huge temporary memory to store these tables. These procedures
require a lot of computation, which will take a long time. The second limitation of SQL is
its inability to use mathematical models in these SQL statements. Even if an analyst could create these complex
statements using SQL, there would still be a large number of computations and a huge amount of memory
needed. Therefore the use of OLAP is preferable to solve this kind of problem.
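As a small illustration of the point above, even a modest analytical question ("each product's share of total sales") already forces SQL into layered aggregation, with an inner aggregate feeding an outer one. The schema and data here are hypothetical:

```python
import sqlite3

# Shows how an analytical question expands into nested SQL:
# percent-of-total needs a subquery producing an intermediate result.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?,?)",
                [("widget", 60.0), ("widget", 40.0), ("gadget", 100.0)])

rows = con.execute("""
    SELECT product,
           SUM(amount) AS total,
           SUM(amount) * 100.0 / (SELECT SUM(amount) FROM sales) AS pct
    FROM sales
    GROUP BY product
    ORDER BY product
""").fetchall()
print(rows)  # [('gadget', 100.0, 50.0), ('widget', 100.0, 50.0)]
```

A multidimensional (OLAP) engine answers the same question by navigating precomputed aggregates instead of rescanning and re-aggregating the base table.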
Categories of OLAP Tools
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats. That is,
data is stored in array-based structures.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube
is built, it is not possible to include a large amount of data in the cube itself. This is not to say that
the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in
this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist in
the organization. Therefore, to adopt MOLAP technology, chances are additional investments in
human and capital resources are needed.
Examples: Hyperion Essbase, Fusion (Information Builders)
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause in the SQL statement. Data is stored in relational tables.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation
on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already
comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational
database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple
SQL queries) in the relational database, the query time can be long if the underlying data size is
large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for example, it
is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building
into the tool out-of-the-box complex functions as well as the ability to allow users to define their
own functions.
Examples: MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
HOLAP (MQE: Managed Query Environment)
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and
aggregations in the multidimensional form, while the rest of the data is stored in the relational database.
Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic Services
Multidimensional Versus Multirelational OLAP
These relational implementations of multidimensional database systems are sometimes referred to
as multirelational database systems. To achieve the required speed, these products use the star or snowflake
schemas - specially optimized and denormalized data models that involve data restructuring and
aggregation. (The snowflake schema is an extension of the star schema that supports multiple fact tables
and joins between them.)
One benefit of the star schema approach is reduced complexity in the data model, which increases
data "legibility", making it easier for users to pose business questions of an OLAP nature. Data warehouse
queries can be answered up to 10 times faster because of improved navigations.
Two types of database activity:
1. OLTP: On-Line Transaction Processing
Short transactions, both queries and updates
(e.g., update account balance, enroll in course)
Queries are simple
(e.g., find account balance, find grade in course)
Updates are frequent
(e.g., concert tickets, seat reservations, shopping carts)
2. OLAP: On-Line Analytical Processing
• Long transactions, usually complex queries
(e.g., all statistics about all sales, grouped by dept and month)
• "Data mining" operations
• Infrequent updates
OLTP vs OLAP
OLTP stands for On-Line Transaction Processing and is a data modeling approach typically used to
facilitate and manage usual business applications. Most of the applications you see and use are OLTP based.
OLTP technology is used to perform updates on operational or transactional systems (e.g., point of
sale systems).
OLAP stands for On-Line Analytic Processing and is an approach to answer multidimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems. OLAP technology is used to perform complex analysis of the data in a data warehouse.
The following table summarizes the major differences between OLTP and OLAP system design:
OLTP System: Online Transaction Processing (operational system)
OLAP System: Online Analytical Processing (data warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.
What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multidimensional views of various kinds of business activities.
Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.
Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.
Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
Database design
  OLTP: Highly normalized with many tables.
  OLAP: Typically denormalized with fewer tables; use of star and/or snowflake schemas.
Backup and recovery
  OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
The Multidimensional Data Model
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is online, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries
are complex. The multidimensional data model is designed to solve complex queries in real time.
The multidimensional data model views data as a cube. The table at the left contains detailed sales
data by product, market and time. The cube on the right associates sales numbers (units sold) with
dimensions - product type, market and time - with the unit variables organized as cells in an array.
This cube can be expanded to include another array - price - which can be associated with all or only
some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.
Dimensions are hierarchical in nature: the time dimension may contain hierarchies for years,
quarters, months, weeks and days. GEOGRAPHY may contain country, state, city etc.
In this cube we can observe that each side of the cube represents one of the elements of the
question. The x-axis represents the time, the y-axis represents the products and the z-axis represents the
different centers. The cells of the cube represent the number of products sold, or can represent the price
of the items.
This figure also gives a different understanding of the drill-down operations. The relations
defined need not be directly related; they can be related indirectly.
As the size of the dimensions increases, the size of the cube will also increase exponentially. The
response time of the cube depends on the size of the cube.
Operations in the Multidimensional Data Model:
• Aggregation (roll-up)
- dimension reduction: e.g., total sales by city
- summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
• Selection (slice) defines a subcube
- e.g., sales where city = Palo Alto and date = 1/1/96
• Navigation to detailed data (drill-down)
- e.g., (sales - expense) by city, top 3% of cities by average income
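The roll-up and slice operations above can be sketched over a toy cube held as a dict keyed by (product, city, year); the dimension names and data are hypothetical, and drill-down simply means returning from an aggregate back to the finer-grained cells:

```python
from collections import defaultdict

# Toy cube: cells keyed by (product, city, year) -> units sold.
cube = {
    ("pen", "Palo Alto", 1996): 10,
    ("pen", "Berkeley",  1996): 7,
    ("ink", "Palo Alto", 1996): 4,
    ("pen", "Palo Alto", 1997): 12,
}

IDX = {"product": 0, "city": 1, "year": 2}

def roll_up(cube, keep):
    """Aggregation (roll-up): sum out every dimension not in `keep`."""
    out = defaultdict(int)
    for cell, value in cube.items():
        out[tuple(cell[IDX[d]] for d in keep)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Selection (slice): keep only cells matching one dimension value."""
    return {c: v for c, v in cube.items() if c[IDX[dim]] == value}

# Total sales by city (dimension reduction):
print(roll_up(cube, ["city"]))  # {('Palo Alto',): 26, ('Berkeley',): 7}
# Subcube where city = Palo Alto:
print(slice_(cube, "city", "Palo Alto"))
```

A MOLAP engine stores such cells in dense arrays and precomputes the roll-ups; this dict version only shows the semantics of the operations.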
• Unlimited dimensions and aggregation levels: this depends on the kind of business, where
multiple dimensions and defined hierarchies can be made.
In addition to these guidelines, an OLAP system should also support:
Comprehensive database management tools: this gives the database management the ability to control
distributed businesses.
The ability to drill down to detail (source record) level: this requires that the OLAP tool
allow smooth transitions in the multidimensional database.
Incremental database refresh: the OLAP tool should provide partial refresh.
Structured Query Language (SQL interface): the OLAP system should be able to integrate effectively
in the surrounding enterprise environment.
UNIT III
DATA MINING
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Data mining is the practice of automatically searching large stores of data to discover patterns and
trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment
the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery
in Data (KDD).
The key properties of data mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases
Data mining can answer questions that cannot be addressed through simple query and reporting
techniques.
Data Mining Functions
A basic understanding of data mining functions and algorithms is required for using Oracle Data
Mining. This section introduces the concept of data mining functions. Algorithms are introduced in "Data
Mining Algorithms".
Each data mining function specifies a class of problems that can be modeled and solved. Data
mining functions fall generally into two categories: supervised and unsupervised. Notions of supervised
and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.
Artificial intelligence refers to the implementation and study of systems that exhibit autonomous
intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn
from their own performance and modify their own functioning. Data mining applies machine learning
concepts to data.
Supervised Data Mining:
Supervised learning is also known as directed learning. The learning process is directed by a
previously known dependent attribute or target. Directed data mining attempts to explain the behavior of
the target as a function of a set of independent attributes or predictors.
Supervised learning generally results in predictive models. This is in contrast to unsupervised
learning, where the goal is pattern detection.
The building of a supervised model involves training, a process whereby the software analyzes
many cases where the target value is already known. In the training process, the model "learns" the logic
for making the prediction. For example, a model that seeks to identify the customers who are likely to
respond to a promotion must be trained by analyzing the characteristics of many customers who are known
to have responded or not responded to a promotion in the past.
Unsupervised Data Mining
Unsupervised learning is non-directed. There is no distinction between dependent and independent
attributes. There is no previously known result to guide the algorithm in building the model.
Unsupervised learning can be used for descriptive purposes. It can also be used to make
predictions.
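The training process described above can be shown in miniature with a hypothetical nearest-neighbor rule: each training case carries a known target ("responded" yes/no), and the model predicts for a new case by copying the target of the closest known case. The features and data are invented for illustration:

```python
# Supervised learning in miniature: training cases carry a known target;
# the "model" predicts for a new case from those labeled examples.
# The (age, income) features and the data are hypothetical.

training = [  # ((age, income), responded?)
    ((25, 30000), "no"),
    ((52, 90000), "yes"),
    ((48, 85000), "yes"),
    ((30, 40000), "no"),
]

def predict(case):
    """1-nearest-neighbour: copy the target of the closest known case."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training, key=lambda t: dist(t[0], case))
    return nearest[1]

print(predict((50, 88000)))  # "yes" - close to the known responders
```

An unsupervised method, by contrast, would receive the same (age, income) points without the yes/no labels and could only group them into clusters.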
Data pre-processing
Data pre-processing is an often neglected but important step in the data mining process. The
phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100),
impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that
has not been carefully screened for such problems can produce misleading results. Thus, the representation
and quality of data come first and foremost before running an analysis.
If there is much irrelevant and redundant information present, or noisy and unreliable data, then
knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can
take a considerable amount of processing time. Data pre-processing includes cleaning, normalization,
transformation, feature extraction and selection, etc. The product of data pre-processing is the final training
set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.
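A minimal cleaning pass matching the three problem types listed above (out-of-range values, impossible combinations, missing values) can be sketched as a filter; the field names and validity rules are hypothetical:

```python
# Screens raw records for the three problem types named in the text:
# out-of-range values, impossible combinations, and missing values.
# The field names and validity rules are hypothetical.

raw = [
    {"gender": "F", "pregnant": "yes", "income": 52000},
    {"gender": "M", "pregnant": "yes", "income": 48000},  # impossible combination
    {"gender": "F", "pregnant": "no",  "income": -100},   # out-of-range value
    {"gender": "M", "pregnant": "no",  "income": None},   # missing value
]

def is_clean(rec):
    if rec["income"] is None:                              # missing value
        return False
    if rec["income"] < 0:                                  # out of range
        return False
    if rec["gender"] == "M" and rec["pregnant"] == "yes":  # impossible
        return False
    return True

training_set = [r for r in raw if is_clean(r)]
print(len(training_set))  # 1 - only the first record survives
```

Real pipelines would also impute missing values or normalize fields rather than simply dropping rows, but the screening step looks like this.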
Classification of Data Mining Systems
Data mining classification scheme:
1. Decisions in data mining
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted
2. Data mining tasks
- Descriptive data mining
- Predictive data mining
Decisions in data mining
Databases to be mined
o Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
o Characterization, discrimination, association, classification, clustering, trend,
deviation and outlier analysis, etc.
o Multiple/integrated functions and mining at multiple levels
- Techniques utilized
o Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, neural network, etc.
- Applications adapted
o Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Data mining tasks
- Prediction Tasks
o Use some variables to predict unknown or future values of other variables
- Description Tasks
o Find human-interpretable patterns that describe the data.
Common data mining tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Classifications of data mining systems:
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations. New data is classified based on the training set.
Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of
classes or clusters in the data.
Classification
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute, and uses it in classifying new data
Numeric Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
- Credit/loan approval
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?
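The idea of classifying new data from a labelled training set can be made concrete with a tiny classifier. The following is a minimal sketch, with all data and function names invented for illustration, of a 1-nearest-neighbour classifier:

```python
# Minimal 1-nearest-neighbour classification: new data is classified
# by the label of its closest example in the training set.
def classify_1nn(training, new_point):
    """training: list of ((x, y), label) pairs; returns the label
    of the training example nearest to new_point."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    _, label = min(training, key=lambda ex: dist2(ex[0], new_point))
    return label

# Toy training set: (transaction amount, daily frequency) -> class label
train = [((900, 9), "fraud"), ((850, 8), "fraud"),
         ((30, 1), "legit"), ((45, 2), "legit")]
print(classify_1nn(train, (880, 7)))   # nearest examples are fraudulent
```

A real system would normalize the attributes first, since the raw amount dominates the distance here.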
Data Mining Task Primitives
The set of task-relevant data to be mined: This specifies the portions of the database or the set of
data in which the user is interested. This includes the database attributes or data warehouse dimensions of
interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain
to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
Concept hierarchies are a popular form of background knowledge, which allow data to be mined
at multiple levels of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is
shown in the figure. User beliefs regarding relationships in the data are another form of background
knowledge.
The interestingness measures and thresholds for pattern evaluation: These may be used to guide the
mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may
have different interestingness measures. For example, interestingness measures for association rules
include support and confidence. Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which
discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
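Support and confidence, the two interestingness measures named above, can be computed directly from a transaction list. A minimal sketch, with toy transactions invented for illustration:

```python
# Support and confidence of an association rule A -> B over toy transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A union B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))        # 2 of 4 transactions -> 0.5
print(confidence({"beer"}, {"diapers"}))   # 2 of the 3 beer baskets
```

A rule such as beer -> diapers would be kept only if both values clear the user-specified thresholds.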
Data Preprocessing
The real-world data that is to be analyzed by data mining techniques are typically:
1. Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate
data. Missing data, particularly for tuples with missing values for some attributes, may need to be
inferred.
2. Noisy: containing errors, or outlier values that deviate from the expected. Incorrect data may also
result from inconsistencies in naming conventions or data codes used, or inconsistent formats for
input fields, such as date. It is hence necessary to use some techniques to replace the noisy data.
3. Inconsistent: containing discrepancies between different data items. Some attributes representing
a given concept may have different names in different databases, causing inconsistencies and
redundancies. Naming inconsistencies may also occur for attribute values. The inconsistency in
the data needs to be removed.
4. Aggregate Information: it would be useful to obtain aggregate information such as the sales
per customer region, something that is not part of any precomputed data cube in the data
warehouse.
5. Enhancing the mining process: a large number of data sets may make the data mining process
slow. Hence, reducing the number of data sets to enhance the performance of the mining process
is important.
6. Improved data quality: data preprocessing techniques can improve the quality of the data,
thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data
preprocessing is an important step in the knowledge discovery process, because quality decisions
must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the
data to be analyzed can lead to huge payoffs for decision making.
Different forms of Data Preprocessing
Data Cleaning:
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the results of any data mining that
has been applied to it. Also, dirty data can cause confusion for the mining procedure,
resulting in unreliable output, and mining methods are not always robust to it.
Therefore, a useful preprocessing step is to run some data-cleaning routines.
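One simple cleaning routine, filling in missing values with the attribute mean, can be sketched as follows (the records are hypothetical):

```python
# Data cleaning: fill missing values (None) with the attribute mean.
records = [23.0, None, 31.0, None, 26.0]

known = [v for v in records if v is not None]
mean = sum(known) / len(known)            # (23 + 31 + 26) / 3 = 26.666...

cleaned = [v if v is not None else round(mean, 2) for v in records]
print(cleaned)   # [23.0, 26.67, 31.0, 26.67, 26.0]
```

Other choices include filling with a global constant, the class-wise mean, or the most probable value.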
Data Integration:
Data integration involves integrating data from multiple databases, data cubes, or files.
Some attributes representing a given concept may have different names in different databases,
causing inconsistencies and redundancies. For example, the attribute for customer
identification may be referred to as customer_id in one data store and cust_id in another.
Naming inconsistencies may also occur for attribute values.
Also, some attributes may be inferred from others (e.g., annual revenue).
Having a large amount of redundant data may slow down or confuse the knowledge
discovery process. Additional data cleaning can be performed to detect and remove
redundancies that may have resulted from data integration.
Data Transformation:
Data transformation operations, such as normalization and aggregation, are additional data
preprocessing procedures that contribute toward the success of the mining process.
Normalization: scaling the data to be analyzed to a specific range, such as
[0.0, 1.0], for providing better results.
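Min-max normalization to the range [0.0, 1.0] can be sketched in a few lines (the values are invented for illustration):

```python
# Min-max normalization: rescale values to the range [0.0, 1.0].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([200, 300, 400, 600, 1000]))
# [0.0, 0.125, 0.25, 0.5, 1.0]
```

Note that the minimum maps to 0.0 and the maximum to 1.0; a constant column would divide by zero and needs a guard in practice.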
Aggregation: it is often useful for data analysis to obtain aggregate information, such
as the sales per customer region. As this is not part of any precomputed data cube, it
needs to be computed; this process is called aggregation.
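Aggregating sales per customer region from row-level records might look like this minimal sketch (region names and amounts are invented):

```python
# Aggregation: roll row-level sales records up to per-region totals.
sales = [("north", 120), ("south", 80), ("north", 60), ("east", 40)]

totals = {}
for region, amount in sales:
    totals[region] = totals.get(region, 0) + amount

print(totals)   # {'north': 180, 'south': 80, 'east': 40}
```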
Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results. There are a number of strategies for
data reduction:
- data aggregation (e.g., building a data cube),
- attribute subset selection (e.g., removing irrelevant attributes through correlation analysis),
- dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or
wavelets),
- numerosity reduction (e.g., replacing the data by alternative, smaller representations
such as clusters or parametric models),
- generalization with the use of concept hierarchies, by organizing the concepts into varying
levels of abstraction.
Data discretization is very useful for the automatic generation of concept hierarchies from
numerical data.
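Discretization onto a concept hierarchy can be sketched as simple binning; the age ranges and labels below are illustrative assumptions, not taken from the notes:

```python
# Discretization: map numeric ages onto a higher level of a concept
# hierarchy using ordered bins (bin boundaries and labels invented).
def discretize(value, bins):
    """bins: list of (upper_bound, label) pairs, checked in order."""
    for upper, label in bins:
        if value <= upper:
            return label
    return "senior"   # fallback for values above the last bound

age_bins = [(12, "child"), (19, "youth"), (65, "adult")]
print([discretize(a, age_bins) for a in [8, 15, 40, 70]])
# ['child', 'youth', 'adult', 'senior']
```

Equal-width or equal-frequency boundaries can be derived automatically from the data rather than fixed by hand.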
UNIT IV
ASSOCIATION RULE MINING AND CLASSIFICATION
Frequent Pattern Analysis
A frequent pattern is a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and
association rule mining.
Motivation: finding inherent regularities in data.
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
Applications:
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-
stream) analysis, and DNA sequence analysis.
Why Is Frequent Pattern Mining Important?
- Dimension/level constraint
o in relevance to region, price, brand, customer category
- Rule (or pattern) constraint
o small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint
o strong rules: min_support >= 3%, min_confidence >= 60%
Constrained Mining vs. Constraint-Based Search
Constrained mining vs. constraint-based search/reasoning
o Both are aimed at reducing the search space
o Finding all patterns satisfying constraints vs. finding some (or one) answer in
constraint-based search in AI
o Constraint-pushing vs. heuristic search
o It is an interesting research problem how to integrate them
Constrained mining vs. query processing in DBMS
o Database query processing requires finding all answers
o Constrained pattern mining shares a similar philosophy with pushing selections
deeply into query processing
The Apriori Algorithm: Example
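The algorithm can be followed in code. Below is a straightforward, unoptimized sketch of Apriori's level-wise join-and-prune loop, run on invented transactions:

```python
from itertools import combinations

# Apriori sketch: grow frequent itemsets level by level, keeping only
# candidates whose every (k-1)-subset is frequent and whose support
# meets the minimum threshold.
def apriori(transactions, min_support):
    n = len(transactions)
    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    result = list(frequent)
    k = 2
    while frequent:
        # Join step: size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        frequent = [c for c in candidates
                    if all(frozenset(s) in result
                           for s in combinations(c, k - 1))
                    and is_frequent(c)]
        result.extend(frequent)
        k += 1
    return result

txns = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
        {"beer", "chips"}, {"diapers", "chips"}]
for itemset in apriori(txns, min_support=0.5):
    print(sorted(itemset))
```

With min_support = 0.5 this finds the three frequent single items and all three frequent pairs; the triple {beer, diapers, chips} appears in only one of four transactions and is pruned.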
Decision Tree Induction
Information produced by data mining techniques can be represented in many different
ways. Decision tree structures are a common way to organize classification schemes. In
classification tasks, decision trees visualize what steps are taken to arrive at a classification. Every
decision tree begins with what is termed a root node, considered to be the "parent" of every other
node. Each node in the tree evaluates an attribute in the data and determines which path it should
follow. Typically, the decision test is based on comparing a value against some constant.
Classification using a decision tree is performed by routing from the root node until arriving at a
leaf node.
The illustration provided here is a canonical example in data mining, involving the
decision to play or not play based on weather conditions. In this case, outlook is in the position of
the root node. The branches of the node correspond to attribute values. In this example, the child
nodes are tests of humidity and windy, leading to the leaf nodes, which are the actual
classifications. This example also includes the corresponding data, also referred to as instances.
In our example, there are 9 "play" days and 5 "no play" days.
Decision trees can represent diverse types of data. The simplest and most familiar is
numerical data. It is often desirable to organize nominal data as well. Nominal quantities are
formally described by a discrete set of symbols. For example, weather can be described in either
numeric or nominal fashion. We can quantify the temperature by saying that it is 11 degrees
Celsius or 52 degrees Fahrenheit. We could also say that it is cold, cool, mild, warm or hot. The
former is an example of numeric data, and the latter is a type of nominal data. More accurately,
the example of cold, cool, mild, warm and hot is a special type of nominal data, described as
ordinal data. Ordinal data has an implicit assumption of ordered relationships between the values.
Continuing with the weather example, we could also have a purely nominal description like
sunny, overcast and rainy. These values have no relationships or distance measures.
The type of data organized by a tree is important for understanding how the tree works at
the node level. Recalling that each node is effectively a test, numeric data is often evaluated in
terms of a simple mathematical inequality. For example, numeric weather data could be tested by
finding whether it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion;
in other words, whether or not it has a particular value. The illustration shows both types of tests.
In the weather example, outlook is a nominal data type. The test simply asks which attribute value
is represented and routes accordingly. The humidity node reflects a numeric test, with an
inequality of less than or equal to 70, or greater than 70.
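The two kinds of node tests described above, equality on nominal attributes and a threshold inequality on numeric ones, can be sketched as a hard-coded version of the weather tree. The attribute values and the humidity threshold here follow the common textbook weather example and are assumptions for illustration:

```python
# Node-level tests in a decision tree: nominal attributes are tested
# for equality, numeric ones against a constant threshold.
def classify(instance):
    if instance["outlook"] == "sunny":                   # nominal test
        return "no play" if instance["humidity"] > 70 else "play"  # numeric test
    if instance["outlook"] == "overcast":
        return "play"                                    # leaf node
    # rainy branch: another nominal (Boolean) test
    return "no play" if instance["windy"] else "play"

print(classify({"outlook": "sunny", "humidity": 65}))    # play
print(classify({"outlook": "rainy", "windy": True}))     # no play
```

An induction algorithm builds exactly this structure automatically by choosing, at each node, the attribute and threshold that best split the data.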
Decision tree induction algorithms function recursively. First, an attribute must be selected
as the root node. In order to create the most efficient (i.e., smallest) tree, the root node must
effectively split the data. Each split attempts to pare down a set of instances (the actual data) until
they all have the same classification. The best split is the one that provides what is termed the
most information gain.
Information in this context comes from the concept of entropy in information theory, as
developed by Claude Shannon. Although "information" has many contexts, it has a very specific
mathematical meaning relating to certainty in decision making. Ideally, each split in the decision
tree should bring us closer to a classification. One way to conceptualize this is to see each step
along the tree as removing randomness, or entropy. Information, expressed as a mathematical
quantity, reflects this. For example, consider a very simple classification problem that requires
creating a decision tree to decide yes or no based on some data. This is exactly the scenario
visualized in the decision tree. Each attribute's values will have a certain number of yes or no
classifications. If there are equal numbers of yeses and no's, then there is a great deal of entropy in
that value. In this situation, entropy reaches a maximum and the value tells us little. Conversely, if
there are only yeses or only no's, the entropy is zero. When the entropy is low, the attribute value
is very useful for making a decision.
The formula for calculating entropy is as follows: for a set S whose instances fall into classes with
proportions p1, ..., pn,
Entropy(S) = -(p1 log2 p1 + ... + pn log2 pn),
and the information gain of a split is the entropy of the parent set minus the weighted average
entropy of the subsets it produces.
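These quantities can be computed directly. The sketch below evaluates the entropy of the 9-yes/5-no weather data and the information gain of splitting on outlook; the per-branch counts follow the common textbook version of the example:

```python
from math import log2

# Entropy of a class-count distribution, and the information gain of a
# split, for the 9-yes / 5-no weather data.
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def info_gain(parent_counts, child_counts):
    """parent_counts: class counts before the split;
    child_counts: one class-count list per branch."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child)
                    for child in child_counts)
    return entropy(parent_counts) - remainder

print(round(entropy([9, 5]), 3))   # 0.94
# Splitting on outlook: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2]
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```

The induction algorithm computes this gain for every candidate attribute and selects the one with the highest value as the node's test.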
Machine Learning
The general problem of machine learning is to search a (usually very large) space of potential
hypotheses to determine the one that will best fit the data and any prior knowledge. The data may be
labelled or unlabelled. If labels are given then the problem is one of supervised learning, in that the true
answer is known for a given set of data. If the labels are categorical then the problem is one of
classification, e.g. predicting the species of a flower given petal and sepal measurements. If the labels are
real-valued, the problem is one of regression, e.g. predicting property values from crime, pollution, etc.
statistics. If labels are not given then the problem is one of unsupervised learning, and the aim is to
characterize the structure of the data, e.g. by identifying groups of examples in the data that are
collectively similar to each other and distinct from the other data.
Supervised Learning
Given some examples, we wish to predict certain properties. In the case where there is available a
set of examples whose properties have already been characterized, the task is to learn the relationship
between the two. One common early approach was to present the examples in turn to a learner. The
learner makes a prediction of the property of interest, the correct answer is presented, and the learner
adjusts its hypothesis accordingly. This is known as learning with a teacher, or supervised learning.
In supervised learning there is necessarily the assumption that the available descriptors are in some
way related to a quantity of interest. For instance, suppose that a bank wishes to detect fraudulent credit
card transactions. In order to do this, some domain knowledge is required to identify factors that are likely
to be indicative of fraudulent use. These may include frequency of usage, amount of transaction, spending
patterns, type of business engaging in the transaction, and so forth. These variables are the predictive, or
independent, variables X. It would be hoped that these are in some way related to the target, or
dependent, variable Y. Deciding which variables to use in a model is a very difficult problem in general;
this is known as the problem of feature selection and is NP-complete. Many methods exist for choosing
the predictive variables; if domain knowledge is available then it can be very useful in this context. Here
we assume that at least some of the predictive variables are in fact predictive. Assume, then, that the
relationship between X and Y is given by their joint probability density.
UNIT V
CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
Cluster Analysis
Data clustering is a method in which we form clusters of objects that are somehow similar in
characteristics. The criterion for checking the similarity is implementation dependent.
Clustering is often confused with classification, but there is a difference between the two. In
classification the objects are assigned to predefined classes, whereas in clustering the classes are also to
be defined.
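To make the contrast concrete, here is a deliberately simple sketch of clustering with no predefined classes: group 1-D points into clusters wherever the gap between neighbours exceeds a threshold (data and threshold are invented):

```python
# Minimal 1-D clustering: start a new cluster whenever the gap between
# consecutive sorted points exceeds max_gap. No class labels are given;
# the groups emerge from the data itself.
def cluster_1d(points, max_gap):
    pts = sorted(points)
    clusters = [[pts[0]]]
    for p in pts[1:]:
        if p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)   # close enough: same cluster
        else:
            clusters.append([p])     # large gap: new cluster
    return clusters

print(cluster_1d([1, 2, 3, 10, 11, 25], max_gap=3))
# [[1, 2, 3], [10, 11], [25]]
```

Real clustering algorithms such as k-means generalize this idea to many dimensions and to iterative refinement of the groups.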
Precisely, data clustering is a technique in which information that is logically similar is
physically stored together. In order to increase efficiency in database systems, the number of disk
accesses is to be minimized. In clustering, objects with similar properties are placed in one class of
objects, and a single access to the disk makes the entire cl