8/11/2019 2. NOTES.doc
1/70
CS2032 DATA WAREHOUSING AND DATA MINING
Department of Information Technology
UNIT I
DATA WAREHOUSING
Data Warehouse Introduction
A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and data analysis.
Table design, dimensions and organization should be consistent throughout a data warehouse so that
reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a
database for historical data from different functions within a company.
The term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following
way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision making process."
He defined the terms in the sentence as follows:
Subject-Oriented: Data that gives information about a particular subject instead of about a company's
ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.
This enables management to gain a consistent picture of the business. A data warehouse is a single,
complete and consistent store of data obtained from a variety of different sources, made available to end
users in a form they can understand and use in a business context. It can be
Used for decision support
Used to manage and control business
Used by managers and end users to understand the business and make judgments
Data Warehousing is an architectural construct of information systems that provides users with current
and historical decision support information that is hard to access or present in traditional operational data
stores.
Other important terminology
Enterprise Data warehouse: It collects all information about subjects (customers, products, sales,
assets, personnel) that span the entire organization.
Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data
warehouse that can provide data for reporting and analysis on a section, unit, department or operation in
the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data
warehouses which are usually smaller than the corporate data warehouse.
Decision Support System (DSS): Information technology to help the knowledge worker (executive,
manager, and analyst) make faster and better decisions.
Drill-down: Traversing the summarization levels from highly summarized data to the underlying
current or old detail.
Metadata: Data about data. It contains the location and description of warehouse system components:
names, definitions, structure, etc.
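The terminology above is easier to picture with a small sketch. The following is a minimal, hypothetical metadata-repository entry in Python; every name in it (the table, its columns and the record fields) is invented for illustration, not taken from any real product:

```python
# A minimal, illustrative metadata record: "data about data" for one
# warehouse table -- its location, definition and structure.
metadata_repository = {
    "sales_fact": {
        "location": "warehouse_db.sales_fact",     # where the data lives
        "definition": "Daily sales totals by product and store",
        "structure": ["date_key", "product_key", "store_key", "amount"],
        "source_systems": ["order_entry", "pos"],  # operational origins
        "refresh": "nightly",                      # load schedule
    }
}

def describe(table_name):
    """Look up a table in the metadata repository, as an end user would."""
    entry = metadata_repository[table_name]
    cols = ", ".join(entry["structure"])
    return f"{table_name}: {entry['definition']} (columns: {cols})"

print(describe("sales_fact"))
```

A real informational directory would of course hold far richer descriptions, but the idea is the same: a searchable store that tells users where warehouse data is and what it means.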
Benefits of data warehousing
Data warehouses are designed to perform well with aggregate queries running on large
amounts of data.
The structure of data warehouses is easier for end users to navigate, understand and query
against, unlike the relational databases primarily designed to handle lots of transactions.
Data warehouses enable queries that cut across different segments of a company's operation.
E.g. production data could be compared against inventory data even if they were originally
stored in different databases with different structures.
Queries that would be complex in very normalized databases can be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that comes from a variety of
sources, is non-uniform, and is scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from lots of
users.
Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with a competitive advantage.
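The aggregate-query benefit, and in particular a query that cuts across two operational segments, can be sketched with Python's built-in sqlite3. The tables, products and figures below are invented for the example:

```python
import sqlite3

# Tiny in-memory "warehouse": production and inventory data, originally from
# different operational systems, consolidated into one queryable store.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE production (product TEXT, units_built INTEGER);
    CREATE TABLE inventory  (product TEXT, units_on_hand INTEGER);
    INSERT INTO production VALUES ('widget', 120), ('gadget', 80);
    INSERT INTO inventory  VALUES ('widget', 45),  ('gadget', 95);
""")

# One aggregate query that compares production against inventory,
# i.e. cuts across both segments in a single pass.
rows = con.execute("""
    SELECT p.product, p.units_built, i.units_on_hand,
           p.units_built - i.units_on_hand AS net_shipped
    FROM production p JOIN inventory i ON p.product = i.product
    ORDER BY p.product
""").fetchall()

for product, built, on_hand, shipped in rows:
    print(product, built, on_hand, shipped)
```

In a real warehouse the two source tables would first have been extracted and transformed into a common structure; here they are created side by side only to keep the sketch self-contained.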
Operational and informational Data
Operational Data:
Focusing on transactional functions such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data
Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non-updateable
Data Warehouse Characteristics
A data warehouse can be viewed as an information system with the following attributes:
It is a database designed for analytical tasks
Its content is periodically updated
It contains current and historical data to provide a historical perspective of information
Operational data store (ODS)
ODS is an architecture concept to support day-to-day operational decision support; it contains
current-value data propagated from operational applications.
ODS is subject-oriented, similar to the classic definition of a Data warehouse.
ODS is integrated.
[Figure: ODS vs. Data Warehouse]
[Figure: Data warehouse architecture and its components]
A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data.
The central repository is surrounded by a number of key components designed to make
the environment functional, manageable and accessible.
The data sources for the data warehouse are the operational applications. The data entered into
the data warehouse is transformed into an integrated structure and format. The transformation process
involves conversion, summarization, filtering and condensation. The data warehouse must be capable of
holding and managing large volumes of data, as well as different data structures, over time.
1. Data warehouse database
This is the central part of the data warehousing environment (item number 2 in the
above architecture diagram). It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Clean-up, and Transformation Tools
This is item number 1 in the above architecture diagram. These tools perform conversions, summarization, key
changes, structural changes and condensation. The data transformation is required so that the information
can be used by decision support tools. The transformation produces programs, control statements, JCL
code, COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse from
multiple operational systems.
The functionalities of these tools are listed below:
To remove unwanted data from operational databases
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
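The clean-up functions listed above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's tool; the field names, the default values and the "VOID" status flag are all assumptions made for the example:

```python
# Toy clean-up/transformation pass over extracted operational records:
# convert to common attribute names, drop unwanted rows, establish
# defaults for missing data, and calculate a derived summary.

RENAME = {"cust_nm": "customer_name", "amt": "amount"}  # common names
DEFAULTS = {"region": "UNKNOWN"}                        # for missing data

def transform(records):
    cleaned = []
    for rec in records:
        if rec.get("status") == "VOID":          # remove unwanted data
            continue
        out = {RENAME.get(k, k): v for k, v in rec.items() if k != "status"}
        for field, default in DEFAULTS.items():  # establish defaults
            out.setdefault(field, default)
        cleaned.append(out)
    total = sum(r["amount"] for r in cleaned)    # derived summary
    return cleaned, total

records = [
    {"cust_nm": "Acme", "amt": 100, "region": "EU", "status": "OK"},
    {"cust_nm": "Zeta", "amt": 50, "status": "OK"},
    {"cust_nm": "Bad",  "amt": 999, "status": "VOID"},
]
cleaned, total = transform(records)
print(cleaned, total)
```

Real transformation tools generate this kind of logic as code (COBOL, scripts, SQL) rather than running it directly, but the operations they encode are the ones shown here.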
Issues to be considered during data sourcing, cleanup, extraction and transformation:
Database heterogeneity: DBMSs differ in nature; they may have different data models, different data
access languages, different data navigation methods, operations, concurrency, integrity and recovery
processes, etc.
Data heterogeneity: It refers to the different ways the data is defined and used in different models.
Some vendors involved in the development of such tools:
Prism Solutions, Evolutionary Technologies Inc.
3. Meta data
Meta data helps the users to understand the content and find the data. Meta data is stored in a
separate data store known as the informational directory or Meta data repository, which helps to
integrate, maintain and view the contents of the data warehouse. The following lists the characteristics of
the informational directory / Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and availability
It should be searchable by business-oriented key words
It should act as a launch platform for end users to access data and analysis tools
It should support the sharing of information
It should support scheduling options for requests
It should support and provide interfaces to other applications
It should support end-user monitoring of the status of the data warehouse environment
4. Access tools
Their purpose is to provide information to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive information system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools.
They are:
Production reporting tools, used to generate regular operational reports
Desktop report writers, inexpensive desktop tools designed for end users
Managed Query tools: used to generate SQL queries. They use a Meta layer software between users
and databases which offers point-and-click creation of SQL statements. These tools are a preferred choice
for users performing segment identification, demographic analysis, territory management, preparation of
customer mailing lists, etc.
Application development tools: These provide a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to multidimensional
databases and MRDB refers to multirelational databases.
Data mining tools: are used to discover knowledge from data warehouse data; they can also be used
for data visualization and data correction purposes.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a
dedicated user group. They are used for rapid delivery of enhanced decision support functionality
to end users. A data mart is used in the following situations:
Extremely urgent user requirements
The absence of a budget for a full-scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mind-sized project
Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it, the organization has to pay more attention to system scalability, consistency
and manageability issues.
2. Data integration
6. Data warehouse admin and management
The management of a data warehouse includes:
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating meta data
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery
Data warehouse storage management, which includes capacity planning, hierarchical storage
management, purging of aged data, etc.
7. Information delivery system
It is used to enable the process of subscribing for data warehouse information.
It delivers to one or more destinations according to a specified scheduling algorithm.
2. Building a Data warehouse
There are two reasons why organizations consider data warehousing a critical need. In
other words, there are two factors that drive you to build and use a data warehouse. They are:
Business factors:
Business users want to make decisions quickly and correctly, using all available data.
Technological factors:
To address the incompatibility of operational data stores.
IT infrastructure is changing rapidly; its capacity is increasing and its cost is decreasing, so
building a data warehouse is easy.
There are several things to be considered while building a successful data warehouse.
Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following
two approaches:
1. Top-Down Approach (suggested by Bill Inmon)
2. Bottom-Up Approach (suggested by Ralph Kimball)
1. Top-Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house
corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the
EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate-wide data helps us maintain one version of the truth of the
data. The data in the EDW is stored at the most detailed level, in order to gain:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented, we start building subject-area-specific data marts which contain
data in a denormalized form, also called a star schema. The data in the marts is usually summarized based
on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide
faster access to the data for end-user analytics. If we were to query a normalized schema for the
same analytics, we would end up with complex multi-level joins that would be much slower than
queries on the denormalized schema.
We should implement the top-down approach when:
1. The business has complete clarity on the data warehouse requirements for all or multiple subject areas.
2. The business is ready to invest considerable time and money.
The advantage of using the top-down approach is that we build a centralized repository catering
for one version of the truth for business data. This is very important for the data to be reliable and consistent
across subject areas, and for reconciliation in case of data-related contention between subject areas.
The disadvantage of using the top-down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by the building of the data
marts, before they can access their reports.
2. Bottom-Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data
warehouse. Here we build the data marts separately, at different points in time, as and when the specific
subject area requirements are clear. The data marts are then integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed dimensions and conformed
facts. A conformed dimension or conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent
values across separate data marts. A conformed dimension means exactly the same thing with every fact table
it is joined to. A conformed fact has the same definition of measures, the same dimensions joined to it, and the
same granularity across data marts.
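One way to picture a conformed dimension is as a single table of keys and attributes shared unchanged by the fact tables of separate marts. A minimal Python sketch follows; the products, keys and figures are invented for illustration:

```python
# A conformed Product dimension: one set of keys and attribute names,
# shared unchanged by two separately built data marts.
product_dim = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "gadget", "category": "hardware"},
}

# A sales mart and an inventory mart both reference the same product keys.
sales_fact = [{"product_key": 1, "revenue": 500},
              {"product_key": 2, "revenue": 300}]
inventory_fact = [{"product_key": 1, "units": 45},
                  {"product_key": 2, "units": 95}]

def by_product_name(facts):
    """Because the dimension is conformed, either mart resolves its keys
    to exactly the same product names."""
    return {product_dim[f["product_key"]]["name"]: f for f in facts}

print(by_product_name(sales_fact)["widget"]["revenue"])    # 500
print(by_product_name(inventory_fact)["widget"]["units"])  # 45
```

Because both marts agree on the dimension, results from one can be combined with results from the other without any key translation, which is exactly what makes the marts integrable into a warehouse.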
The bottom-up approach helps us incrementally build the warehouse by developing and integrating
data marts as and when the requirements are clear. We don't have to wait until the overall
requirements of the warehouse are known. We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity for only one data mart.
The advantage of using the bottom-up approach is that it does not require high initial costs and
has a faster implementation time; hence the business can start using the marts much earlier than with
the top-down approach.
The disadvantage of using the bottom-up approach is that it stores data in denormalized
format, hence there is high space usage for detailed data. There is also a tendency not to keep
detailed data in this approach, losing the advantage of having detailed data, i.e. the flexibility to easily
cater to future requirements. The bottom-up approach is more realistic, but the complexity of the integration
may become a serious obstacle.
DESIGN CONSIDERATIONS
To be successful, a data warehouse designer must adopt a holistic approach: consider all
data warehouse components as parts of a single complex system, and take into account all possible data
sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
Heterogeneity of data sources
Use of historical data
Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering
approach. In addition to the general considerations, the following specific points are relevant to data
warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is
the template that describes how information will be organized within the integrated warehouse framework.
The data warehouse data must be detailed data. It must be formatted, cleaned up and transformed to fit
the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to
find definitions or subject areas. In other words, it must provide decision-support-oriented pointers to
warehouse data, and thus provide a logical link between warehouse data and decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know
how the data should be divided across multiple servers, and which users should get access to which types of
data. The data can be distributed based on the subject area, location (geographical region), or time (current,
month, year).
Tools
A number of tools are available that are specifically designed to help in the
implementation of a data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta data
repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and query models
TECHNICAL CONSIDERATIONS
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data
The communication infrastructure that connects the data marts, operational systems and end
users
The hardware and software to support the meta data repository
The systems management framework that enables administration of the entire environment
IMPLEMENTATION CONSIDERATIONS
The following logical steps are needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the database technology and platform
Extract the data from the operational databases, transform it, clean it up and load it into the warehouse
Choose database access and reporting tools
Choose database connectivity software
Choose data analysis and presentation software
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose
is based on the type of data that can be selected using the tool and the kind of access it permits for a
particular user. The following lists the various types of data that can be accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data
Data for testing of hypotheses, trends and patterns
Predefined repeatable queries
Ad hoc user-specified queries
Reporting and analysis data
Complex queries with multiple joins, multi-level subqueries and sophisticated search criteria
Data extraction, clean up, transformation and migration
Proper attention must be paid to data extraction, which represents a success factor for a data
warehouse architecture. When implementing a data warehouse, the following selection criteria, which
affect the ability to transform, consolidate, integrate and repair the data, should be considered:
Timeliness of data delivery to the warehouse
The tool must have the ability to identify the particular data that can be read by the conversion tool
The tool must support flat files and indexed files, since corporate data is still stored in these formats
The tool must have the capability to merge data from multiple data stores
The tool should have a specification interface to indicate the data to be extracted
The tool should have the ability to read data from the data dictionary
The code generated by the tool should be completely maintainable
The tool should permit the user to extract the required data
The tool must have the facility to perform data type and character set translation
The tool must have the capability to create summarization, aggregation and derivation of records
The data warehouse database system must be able to load data directly from these tools
Data placement strategies
As a data warehouse grows, there are at least two options for data placement. One is to put some of
the data warehouse data onto another storage medium.
The second option is to distribute the data in the data warehouse across multiple servers.
User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the
warehouse. There are three classes of users:
Casual users: are most comfortable retrieving information from the warehouse in predefined formats and
running pre-existing queries and reports. These users do not need tools that allow for building standard and
ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc
reports. These users can engage in drill-down operations. These users may have experience with
reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard analysis
on the information they retrieve. These users have knowledge about the use of query and reporting tools.
Benefits of data warehousing
Data warehouse usage includes:
Locating the right information
Presentation of information
Testing of hypotheses
Discovery of information
Sharing the analysis
The benefits can be classified into two:
Tangible benefits (quantified / measurable): these include,
Improvement in product inventory
Decrement in production cost
Improvement in selection of target markets
Enhancement in asset and liability management
Intangible benefits (not easy to quantify): these include,
Improvement in productivity by keeping all data in a single location and eliminating rekeying of
data
Reduced redundant processing
Enhanced customer relations
3. Mapping the data warehouse architecture to Multiprocessor architecture
The functions of a data warehouse are based on relational database technology, which is
implemented in a parallel manner. There are two advantages of having parallel relational database
technology for a data warehouse:
Linear Speed-up: refers to the ability to increase the number of processors in order to reduce response time.
Linear Scale-up: refers to the ability to provide the same performance on the same request as the
database size increases.
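These two properties are commonly expressed as simple ratios. A small Python sketch, with made-up timings, shows the ideal (linear) case:

```python
def speed_up(time_1_cpu, time_n_cpus):
    """Speed-up: how much faster the same job runs with more processors."""
    return time_1_cpu / time_n_cpus

def scale_up(throughput_small, throughput_large):
    """Scale-up: whether an n-times-larger system sustains the same
    performance on an n-times-larger problem (1.0 means linear)."""
    return throughput_large / throughput_small

# Ideal linear behavior: 4 CPUs run the query in one quarter of the time,
# and the 4x-larger system keeps throughput constant on the 4x workload.
print(speed_up(100.0, 25.0))   # 4.0
print(scale_up(200.0, 200.0))  # 1.0
```

In practice, coordination and communication overheads keep real systems below these ideals, which is why the vendor comparison later in this unit speaks of "non-linear" speed-up and scale-up.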
Types of parallelism
There are two types of parallelism:
Inter-query Parallelism: different server threads or processes handle multiple requests at
the same time.
Intra-query Parallelism: this form of parallelism decomposes a serial SQL query into lower-level
operations such as scan, join, sort etc. These lower-level operations are then executed concurrently,
in parallel.
Intra-query parallelism can be done in either of two ways:
Horizontal parallelism: the database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently on different processors
against different sets of data.
Vertical parallelism: occurs among different tasks. All query components such as scan, join,
sort etc. are executed in parallel in a pipelined fashion; in other words, the output from one task becomes the
input to another task.
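Horizontal parallelism can be illustrated in miniature: partition the data, then run the same low-level scan task concurrently against each partition. The Python sketch below uses threads to stand in for the processors, and the partitions and predicate are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Horizontal parallelism in miniature: the "table" is split into four
# partitions (standing in for disks), and the same scan operation runs
# concurrently against each partition on its own worker.
partitions = [
    [3, 14, 9], [27, 5, 11], [8, 40, 2], [6, 1, 30],
]

def scan(partition, predicate):
    """The low-level 'scan' operation applied to one partition."""
    return [row for row in partition if predicate(row)]

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    pieces = pool.map(scan, partitions,
                      [lambda r: r > 10] * len(partitions))

# Merge the per-partition results, as the query coordinator would.
result = sorted(row for piece in pieces for row in piece)
print(result)  # [11, 14, 27, 30, 40]
```

Vertical parallelism would instead chain the operations (scan feeding join feeding sort) as a pipeline, with each stage consuming the previous stage's output as it is produced.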
Data partitioning
Data partitioning is the key component for effective parallel execution of database operations.
Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another
option for random partitioning is round-robin partitioning, in which each record is placed on the next
disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not
waste time searching for it across all disks. The various intelligent partitioning schemes include:
Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of
the partitioning key for each row.
Key range partitioning: rows are placed and located in the partitions according to the value of the
partitioning key, e.g. all rows with key values from A to K in partition 1, L to T in
partition 2, and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk,
etc. This is useful for small reference tables.
User-defined partitioning: allows a table to be partitioned on the basis of a user-defined
expression.
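Two of these schemes are easy to sketch in Python. Both functions below are toy illustrations; the partition count and the key ranges are assumptions for the example, not any DBMS's actual rules:

```python
# Sketches of two intelligent partitioning schemes for a 4-partition system.
NUM_PARTITIONS = 4

def hash_partition(key):
    """Hash partitioning: the partition number is computed by applying a
    hash function to the partitioning key."""
    return hash(key) % NUM_PARTITIONS

def key_range_partition(key):
    """Key range partitioning: rows are placed by the range their key falls
    into, e.g. names starting A-K in partition 0, L-T in partition 1,
    and everything else in partition 2."""
    first = key[0].upper()
    if "A" <= first <= "K":
        return 0
    if "L" <= first <= "T":
        return 1
    return 2

print(key_range_partition("Kimball"))  # 0
print(key_range_partition("Smith"))    # 1
print(0 <= hash_partition("Kimball") < NUM_PARTITIONS)  # True
```

Note the trade-off visible even in the sketch: hash partitioning spreads arbitrary keys evenly but scatters adjacent key values, while key range partitioning keeps ranges together at the risk of uneven (skewed) partitions.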
Database architectures for parallel processing
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture
Tightly coupled shared memory systems, illustrated in the following figure, have the following
characteristics:
Multiple CPUs share memory.
Each CPU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be
used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the multiple
CPUs and is accessible by all the CPUs through a memory bus. Examples of tightly coupled systems include
the Pyramid, Sequent, and Sun SPARCserver.
Performance is potentially limited in a tightly coupled system by a number of factors. These include
various system components such as the memory bandwidth, CPU-to-CPU communication bandwidth, the
memory available on the system, the I/O bandwidth, and the bandwidth of the common bus.
Parallel processing advantages of shared memory systems are these:
Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have
the following characteristics:
Each node consists of one or more CPUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The Distributed Lock
Manager (DLM) is required. Examples of loosely coupled systems are
Parallel processing disadvantages of shared disk systems are these:
Inter-node synchronization is required, involving DLM overhead and greater dependency on the
high-speed interconnect.
If the workload is not partitioned well, there may be high synchronization overhead.
There is operating system overhead in running shared disk software.
Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is
connected to a given disk. If a table or database is located on that disk, access depends entirely on the CPU
which owns it. Shared nothing systems can be represented as follows:
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless,
adding more CPUs and disks can improve scale-up. Oracle Parallel Server can access the disks on a shared
nothing system as long as the operating system provides transparent disk access, but this access is
expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system, it
may be worthwhile to consider data-dependent routing to alleviate contention.
Parallel DBMS features
Scope and techniques of parallel DBMS operations
Optimizer implementation
Application transparency
A parallel environment which allows the DBMS server to take full advantage of the existing facilities
on a very low level
DBMS management tools to help configure, tune, administer and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS
Price/Performance: the parallel RDBMS can demonstrate a non-linear speed-up and scale-up at
reasonable costs.
Parallel DBMS vendors
Oracle: Parallel Query Option (PQO)
Architecture: shared disk architecture
Data partition$ Hey range, hash, round robin
=arallel operations$ hash &oins, scan and sort
nformi)$ eGtended =arallel +erver G=+/
Architecture$ +hared memory, shared dis- and shared nothing models
Data partition$ round robin, hash, schema, -ey range and user defined
=arallel operations$ 9+38T, *=DAT3, D3F3FT3
B($ DB> =arallel 3dition DB> =3/
Architecture$ +hared nothing modelsData partition$ hash
=arallel operations$ 9+38T, *=DAT3, D3F3FT3, load, recovery, inde) creation, bac-up, table
reorganization
SYBASE: SYBASE MPP
Architecture: shared nothing models
Data partition: hash, key range, schema
Parallel operations: horizontal and vertical parallelism
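The three data-partitioning strategies that recur in the vendor list above (hash, key range, round robin) can be sketched as simple routing functions. This is a minimal illustration, assuming a hypothetical 3-node setup and integer keys; it is not taken from any vendor's implementation.

```python
# Sketch of the three common data-partitioning strategies listed above.
# The 3-node configuration and sample keys are hypothetical.

def hash_partition(key, n_nodes):
    """Route a row to a node by hashing its partitioning key."""
    return hash(key) % n_nodes

def key_range_partition(key, boundaries):
    """Route by key range: `boundaries` holds the upper bound per node."""
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)  # the last node takes everything above the top bound

def round_robin_partition(row_number, n_nodes):
    """Spread rows evenly across nodes regardless of their content."""
    return row_number % n_nodes

# Example: distribute rows keyed 0..9 across 3 nodes.
ids = range(10)
print([round_robin_partition(i, 3) for i in ids])    # cycles 0,1,2,0,1,2,...
print([key_range_partition(i, [3, 6]) for i in ids]) # ranges 0-3, 4-6, 7+
```

Round robin balances load but destroys locality; key range keeps related keys together (good for range scans) at the risk of skew; hashing is the usual compromise.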
DBMS schemas for decision support
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents business
items or business transactions. A dimension is a collection of data that describes one business dimension.
Dimensions determine the contextual background for the facts; they are the parameters over which we
want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions.
Considering the relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
Because the star schema is the simplest architecture, it is most commonly used nowadays and is
recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables. A fact table
typically has two types of columns: foreign keys to dimension tables and measures, those that contain
numeric facts. A fact table can contain fact data at detail or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year),
Region dimension (profit by country, state, city), Product dimension (profit for product1,
product2).
A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a
dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of
the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to
describe the dimensional value. They are normally descriptive, textual values. Dimension tables are
generally small in size compared to the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic regions
(markets, cities), clients, products, times, channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data in which
end users are interested. E.g. a sales fact table may contain a profit measure which represents profit on
each sale.
Aggregations are pre-calculated numeric data. By calculating and storing the answers to a query before
users ask for it, the query processing time can be reduced. This is key in providing fast query performance
in OLAP.
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical capabilities to clients.
The main characteristics of the star schema:
Simple structure - easy to understand schema
Great query effectiveness - small number of tables to join
Relatively long time of loading data into dimension tables - denormalization and redundancy of
data mean that the size of the table could be large.
The most commonly used in data warehouse implementations - widely supported by a
large number of business intelligence tools
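The fact/dimension layout described above can be sketched as a tiny star schema in SQL, run here through Python's built-in sqlite3 module. The table names, columns and sample rows are illustrative assumptions, not from the text:

```python
import sqlite3

# A minimal star schema: one fact table with foreign keys into two
# dimension tables, plus a typical star-join query over a measure.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,   -- measure
    profit     REAL       -- measure
);
""")
con.executemany("INSERT INTO dim_time VALUES (?,?,?)",
                [(1, 2019, 1), (2, 2019, 2)])
con.executemany("INSERT INTO dim_product VALUES (?,?)",
                [(1, "widget"), (2, "gadget")])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 10, 50.0), (2, 1, 5, 25.0), (2, 2, 8, 40.0)])

# Profit viewed by the Time and Product dimensions: a star-join query.
rows = con.execute("""
    SELECT t.year, p.name, SUM(f.profit)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.name
""").fetchall()
print(rows)  # profit summed per (year, product)
```

Note how the query only ever joins the small dimension tables directly to the fact table; this single level of joins is what gives the star schema its simplicity.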
Snowflake schema:
The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in
a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a
level in the dimensional hierarchy.
For example, consider a Time Dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then
connected to Day. Week is only connected to Day.
The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables.
The main disadvantage of the snowflake schema is the additional maintenance effort needed due
to the increased number of lookup tables.
It is the result of decomposing one or more of the dimensions. The many-to-one relationships
among sets of attributes of a dimension can separate new dimension tables, forming a hierarchy. The
decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.
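The Year → Month → Day hierarchy described above can be sketched as normalized lookup tables chained by foreign keys. The table and column names are illustrative assumptions:

```python
import sqlite3

# Snowflake form of the Time dimension: each hierarchy level becomes
# its own lookup table (Year <- Month <- Day via foreign keys).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lu_year  (year_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE lu_month (month_id INTEGER PRIMARY KEY, month INTEGER,
                       year_id INTEGER REFERENCES lu_year(year_id));
CREATE TABLE lu_day   (day_id INTEGER PRIMARY KEY, day INTEGER,
                       month_id INTEGER REFERENCES lu_month(month_id));
""")
con.execute("INSERT INTO lu_year VALUES (1, 2019)")
con.execute("INSERT INTO lu_month VALUES (1, 8, 1)")
con.execute("INSERT INTO lu_day VALUES (1, 11, 1)")

# Resolving a day back to its year requires joining up the hierarchy;
# these extra joins are the maintenance/complexity cost noted above.
row = con.execute("""
    SELECT d.day, m.month, y.year
    FROM lu_day d
    JOIN lu_month m ON d.month_id = m.month_id
    JOIN lu_year  y ON m.year_id  = y.year_id
""").fetchone()
print(row)  # (11, 8, 2019)
```

In the star form, the same information would sit denormalized in one wide time table, trading storage and redundancy for fewer joins.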
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example by splitting the original star schema into more star schemas, each of them describing
facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact
tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design, because
many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension
tables are still large.
Data Extraction, Cleanup, and Transformation Tools
ETL stands for Extract, Transform, Load. It is the data warehouse acquisition process that involves:
Extracting the data from outside sources,
Transforming the data to fit business needs, and ultimately
Loading the transformed data into the data warehouse.
For example:
1. Informatica.
2. DataStage.
3. Oracle Warehouse Builder.
4. Ab Initio.
ETL can also be used for integration with legacy systems. ETL is the data warehouse
acquisition process of Extracting, Transforming and Loading data from source systems into the data
warehouse.
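The three ETL steps can be sketched as plain functions. The source records, field names and the cleanup rule (normalizing country codes) are hypothetical illustrations:

```python
# A minimal extract-transform-load sketch. The source rows and the
# business rule (consistent country codes and types) are hypothetical.

def extract():
    """Extract: pull raw records from an outside source (hard-coded here)."""
    return [
        {"customer": "alice", "country": "in",  "amount": "120.50"},
        {"customer": "bob",   "country": "IND", "amount": "75"},
    ]

def transform(rows):
    """Transform: fit the data to business needs - uniform codes and types."""
    country_map = {"in": "IN", "ind": "IN"}
    out = []
    for r in rows:
        out.append({
            "customer": r["customer"].title(),
            "country": country_map.get(r["country"].lower(), r["country"]),
            "amount": float(r["amount"]),
        })
    return out

def load(rows, warehouse):
    """Load: append the transformed rows into the warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'customer': 'Alice', 'country': 'IN', 'amount': 120.5}
```

Real tools such as those listed above wrap the same three stages in scheduling, metadata management and error handling.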
Extraction
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After the extraction, this data can be
transformed and loaded into the data warehouse.
Introduction to Extraction Methods in Data Warehouses
The extraction method you should choose is highly dependent on the source system and also on
the business needs in the target data warehouse environment.
users to perform segment identification, demographic analysis, territory management and preparation of
customer mailing lists etc.
Application development tools: This is a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all DB systems.
OLAP Tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties it uses MDDB and MRDB, where MDDB refers to multidimensional databases
and MRDB refers to multirelational databases.
Data mining tools: are used to discover knowledge from the data warehouse data and can also be used
for data visualization and data correction purposes.
Metadata
Meta data: data about data.
Meta Data in Data Warehouse
Meta Data is one of the most important aspects of data warehousing. It is the data about data stored
in the data warehouse and about its users.
Meta Data provides decision-support-oriented pointers to warehouse data and thus provides a logical
link between warehouse data and decision support applications.
Meta Data is the key to providing users and applications with a road map to the information stored
in the warehouse.
Meta Data can define all attributes, data sources and timing, and rules that govern data use and
data transformation of all data elements.
Metadata (metacontent) is defined as data providing information about one or more aspects of the
data, such as:
Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of data
Location on a computer network where the data was created
Standards used
Types:
Technical Meta data:
It contains information about data warehouse data used by the warehouse designer and administrator to carry out
development and management tasks. It includes:
Info about data stores
Transformation descriptions, that is, mapping methods from operational DB to warehouse DB
Warehouse object and data structure definitions for target data
The rules used to perform clean-up and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data acquisition history,
data access etc.
Business Meta data:
It contains info that gives users insight into the info stored in the data warehouse. It includes:
Subject areas, and info object types including queries, reports, images, video, audio clips etc.
Internet home pages
Info related to the info delivery system
Data warehouse operational info such as ownerships, audit trails etc.
Other Types:
Structural metadata is used to describe the structure of computer systems such as tables,
columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed
as a set of keywords in a natural language.
According to Ralph Kimball, metadata can be divided into 2 similar categories: technical
metadata and business metadata. Technical metadata corresponds to internal metadata, business
metadata to external metadata.
Kimball adds a third category named Process metadata. On the other hand, NISO distinguishes
between three types of metadata: descriptive, structural and administrative.
Descriptive metadata is the information used to search and locate an object, such as title, author,
subjects, keywords, publisher; structural metadata gives a description of how the components of the
object are organized; and administrative metadata refers to the technical information, including file type.
Two subtypes of administrative metadata are rights management metadata and preservation metadata.
Types of Data Warehouse
There are mainly three types of Data Warehouse:
1). Enterprise Data Warehouse.
2). Operational Data Store.
3). Data Mart.
An Enterprise Data Warehouse provides a central database for decision support throughout the
enterprise.
An Operational Data Store has a broad, enterprise-wide scope but, unlike a real enterprise DW, its data is refreshed in near real time and used for routine business activity.
A Data Mart is a sub-part of a Data Warehouse. It supports a particular purpose, or it is designed for
particular lines of business such as sales, marketing or finance; in any organization, the documents of a
particular department can form a data mart.
UNIT II
BUSINESS ANALYSIS
Reporting and Query Tools and Applications - Tool Categories - the Need for
Applications
Data query and reporting tools
Query and reporting tools are divided into two parts:
Reporting tools
Managed query tools
Reporting tools are further divided into two parts:
Production reporting tools let companies generate regular operational reports or support
high-volume batch jobs, such as calculating and printing paychecks.
Report writers, on the other hand, are inexpensive desktop tools designed for end users.
• Interactive reporting capability
• Enterprise-wide scalability
• Superior user interface
• Fastest time to result
• Lowest cost of ownership
Catalogs
Impromptu stores metadata in subject-related folders. This metadata is what will be used to
develop a query for a report. The metadata set is stored in a file called a 'catalog'. The catalog does not
contain any data. It just contains information about connecting to the database and the fields that will be
accessible for reports.
A catalog contains:
• Folders - meaningful groups of information representing columns from one or more tables
• Columns - individual data elements that can appear in one or more folders
• Calculations - expressions used to compute required values from existing data
• Conditions - used to filter information so that only a certain type of information is displayed
• Prompts - predefined selection criteria prompts that users can include in reports they create
• Other components, such as metadata, a logical database name, join information, and user classes
You can use catalogs to:
• view, run, and print reports
• export reports to other applications
• disconnect from and connect to the database
• create reports
• change the contents of the catalog
• add user classes
Prompts
You can use prompts to:
• filter reports
• calculate data items
• format data
Picklist Prompts
A picklist prompt presents you with a list of data items from which you select one or more values,
so you need not be familiar with the database. The values listed in picklist prompts can be retrieved from
One of the limitations of SQL is that it cannot represent these complex problems. A query will be
translated into several SQL statements. These SQL statements will involve multiple joins, intermediate
tables, sorting, aggregations and a huge temporary memory to store these tables. These procedures
require a lot of computation, which will take a long time. The second limitation of SQL is
its inability to use mathematical models in these SQL statements. Even if an analyst could create these complex
statements using SQL, there would still be a large number of computations and a huge amount of memory
needed. Therefore the use of OLAP is preferable to solve this kind of problem.
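As a small illustration of the point above, even a modest analytical question ("each product's share of total sales") already forces SQL into layered aggregation, with an inner aggregate feeding an outer one. The schema and data here are hypothetical:

```python
import sqlite3

# Shows how an analytical question expands into nested SQL:
# percent-of-total needs a subquery producing an intermediate result.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?,?)",
                [("widget", 60.0), ("widget", 40.0), ("gadget", 100.0)])

rows = con.execute("""
    SELECT product,
           SUM(amount) AS total,
           SUM(amount) * 100.0 / (SELECT SUM(amount) FROM sales) AS pct
    FROM sales
    GROUP BY product
    ORDER BY product
""").fetchall()
print(rows)  # [('gadget', 100.0, 50.0), ('widget', 100.0, 50.0)]
```

A multidimensional (OLAP) engine answers the same question by navigating precomputed aggregates instead of rescanning and re-aggregating the base table.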
Categories of OLAP Tools
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats. That is,
data is stored in array-based structures.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube
is built, it is not possible to include a large amount of data in the cube itself. This is not to say that
the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in
this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist in
the organization. Therefore, to adopt MOLAP technology, chances are additional investments in
human and capital resources are needed.
Examples: Hyperion Essbase, Fusion (Information Builders)
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause in the SQL statement. Data is stored in relational tables.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation
on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already
comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational
database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple
SQL queries) in the relational database, the query time can be long if the underlying data size is
large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for example, it
is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building
into the tool out-of-the-box complex functions as well as the ability to allow users to define their
own functions.
Examples: MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
HOLAP (MQE: Managed Query Environment)
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and
aggregations in the multidimensional form, while the rest of the data is stored in the relational database.
Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic Services
Multidimensional Versus Multirelational OLAP
These relational implementations of multidimensional database systems are sometimes referred to
as multirelational database systems. To achieve the required speed, these products use the star or snowflake
schemas - specially optimized and denormalized data models that involve data restructuring and
aggregation. (The snowflake schema is an extension of the star schema that supports multiple fact tables
and joins between them.)
One benefit of the star schema approach is reduced complexity in the data model, which increases
data "legibility", making it easier for users to pose business questions of an OLAP nature. Data warehouse
queries can be answered up to 10 times faster because of improved navigations.
Two types of database activity:
1. OLTP: On-Line Transaction Processing
Short transactions, both queries and updates
(e.g., update account balance, enroll in course)
Queries are simple
(e.g., find account balance, find grade in course)
Updates are frequent
(e.g., concert tickets, seat reservations, shopping carts)
2. OLAP: On-Line Analytical Processing
• Long transactions, usually complex queries
(e.g., all statistics about all sales, grouped by dept and month)
• "Data mining" operations
• Infrequent updates
OLTP vs OLAP
OLTP stands for On-Line Transaction Processing and is a data modeling approach typically used to
facilitate and manage usual business applications. Most of the applications you see and use are OLTP based.
OLTP technology is used to perform updates on operational or transactional systems (e.g., point of
sale systems).
OLAP stands for On-Line Analytic Processing and is an approach to answer multidimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems. OLAP technology is used to perform complex analysis of the data in a data warehouse.
The following table summarizes the major differences between OLTP and OLAP system design:
OLTP System: Online Transaction Processing (operational system)
OLAP System: Online Analytical Processing (data warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.
What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multidimensional views of various kinds of business activities.
Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.
Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.
Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
Database design
  OLTP: Highly normalized with many tables.
  OLAP: Typically denormalized with fewer tables; use of star and/or snowflake schemas.
Backup and recovery
  OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
The Multidimensional Data Model
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is online, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries
are complex. The multidimensional data model is designed to solve complex queries in real time.
The multidimensional data model views data as a cube. The table at the left contains detailed sales
data by product, market and time. The cube on the right associates sales numbers (units sold) with
dimensions - product type, market and time - with the unit variables organized as cells in an array.
This cube can be expanded to include another array - price - which can be associated with all or only
some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.
Dimensions are hierarchical in nature: the time dimension may contain hierarchies for years,
quarters, months, weeks and days. GEOGRAPHY may contain country, state, city etc.
In this cube we can observe that each side of the cube represents one of the elements of the
question. The x-axis represents the time, the y-axis represents the products and the z-axis represents the
different centers. The cells of the cube represent the number of products sold, or can represent the price
of the items.
This figure also gives a different understanding of the drill-down operations. The relations
defined need not be directly related; they can be related indirectly.
As the size of the dimensions increases, the size of the cube will also increase exponentially. The
response time of the cube depends on the size of the cube.
Operations in the Multidimensional Data Model:
• Aggregation (roll-up)
- dimension reduction: e.g., total sales by city
- summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
• Selection (slice) defines a subcube
- e.g., sales where city = Palo Alto and date = 1/1/96
• Navigation to detailed data (drill-down)
- e.g., (sales - expense) by city, top 3% of cities by average income
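The roll-up and slice operations above can be sketched over a toy cube held as a dict keyed by (product, city, year); the dimension names and data are hypothetical, and drill-down simply means returning from an aggregate back to the finer-grained cells:

```python
from collections import defaultdict

# Toy cube: cells keyed by (product, city, year) -> units sold.
cube = {
    ("pen", "Palo Alto", 1996): 10,
    ("pen", "Berkeley",  1996): 7,
    ("ink", "Palo Alto", 1996): 4,
    ("pen", "Palo Alto", 1997): 12,
}

IDX = {"product": 0, "city": 1, "year": 2}

def roll_up(cube, keep):
    """Aggregation (roll-up): sum out every dimension not in `keep`."""
    out = defaultdict(int)
    for cell, value in cube.items():
        out[tuple(cell[IDX[d]] for d in keep)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Selection (slice): keep only cells matching one dimension value."""
    return {c: v for c, v in cube.items() if c[IDX[dim]] == value}

# Total sales by city (dimension reduction):
print(roll_up(cube, ["city"]))  # {('Palo Alto',): 26, ('Berkeley',): 7}
# Subcube where city = Palo Alto:
print(slice_(cube, "city", "Palo Alto"))
```

A MOLAP engine stores such cells in dense arrays and precomputes the roll-ups; this dict version only shows the semantics of the operations.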
• Unlimited dimensions and aggregation levels: this depends on the kind of business, where
multiple dimensions and defined hierarchies can be made.
In addition to these guidelines, an OLAP system should also support:
Comprehensive database management tools: this gives the database management the ability to control
distributed businesses.
The ability to drill down to detail (source record) level: this requires that the OLAP tool
allow smooth transitions in the multidimensional database.
Incremental database refresh: the OLAP tool should provide partial refresh.
Structured Query Language (SQL interface): the OLAP system should be able to integrate effectively
in the surrounding enterprise environment.
UNIT III
DATA MINING
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Data mining is the practice of automatically searching large stores of data to discover patterns and
trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment
the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery
in Data (KDD).
The key properties of data mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases
Data mining can answer questions that cannot be addressed through simple query and reporting
techniques.
Data Mining Functions
A basic understanding of data mining functions and algorithms is required for using Oracle Data
Mining. This section introduces the concept of data mining functions. Algorithms are introduced in "Data
Mining Algorithms".
Each data mining function specifies a class of problems that can be modeled and solved. Data
mining functions fall generally into two categories: supervised and unsupervised. Notions of supervised
and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.
Artificial intelligence refers to the implementation and study of systems that exhibit autonomous
intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn
from their own performance and modify their own functioning. Data mining applies machine learning
concepts to data.
Supervised Data Mining:
Supervised learning is also known as directed learning. The learning process is directed by a
previously known dependent attribute or target. Directed data mining attempts to explain the behavior of
the target as a function of a set of independent attributes or predictors.
Supervised learning generally results in predictive models. This is in contrast to unsupervised
learning, where the goal is pattern detection.
The building of a supervised model involves training, a process whereby the software analyzes
many cases where the target value is already known. In the training process, the model "learns" the logic
for making the prediction. For example, a model that seeks to identify the customers who are likely to
respond to a promotion must be trained by analyzing the characteristics of many customers who are known
to have responded or not responded to a promotion in the past.
Unsupervised Data Mining
Unsupervised learning is non-directed. There is no distinction between dependent and independent
attributes. There is no previously known result to guide the algorithm in building the model.
Unsupervised learning can be used for descriptive purposes. It can also be used to make
predictions.
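The training process described above can be shown in miniature with a hypothetical nearest-neighbor rule: each training case carries a known target ("responded" yes/no), and the model predicts for a new case by copying the target of the closest known case. The features and data are invented for illustration:

```python
# Supervised learning in miniature: training cases carry a known target;
# the "model" predicts for a new case from those labeled examples.
# The (age, income) features and the data are hypothetical.

training = [  # ((age, income), responded?)
    ((25, 30000), "no"),
    ((52, 90000), "yes"),
    ((48, 85000), "yes"),
    ((30, 40000), "no"),
]

def predict(case):
    """1-nearest-neighbour: copy the target of the closest known case."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training, key=lambda t: dist(t[0], case))
    return nearest[1]

print(predict((50, 88000)))  # "yes" - close to the known responders
```

An unsupervised method, by contrast, would receive the same (age, income) points without the yes/no labels and could only group them into clusters.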
Data pre-processing
Data pre-processing is an often neglected but important step in the data mining process. The
phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100),
impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that
has not been carefully screened for such problems can produce misleading results. Thus, the representation
and quality of data come first and foremost before running an analysis.
If there is much irrelevant and redundant information present, or noisy and unreliable data, then
knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can
take a considerable amount of processing time. Data pre-processing includes cleaning, normalization,
transformation, feature extraction and selection, etc. The product of data pre-processing is the final training
set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.
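A minimal cleaning pass matching the three problem types listed above (out-of-range values, impossible combinations, missing values) can be sketched as a filter; the field names and validity rules are hypothetical:

```python
# Screens raw records for the three problem types named in the text:
# out-of-range values, impossible combinations, and missing values.
# The field names and validity rules are hypothetical.

raw = [
    {"gender": "F", "pregnant": "yes", "income": 52000},
    {"gender": "M", "pregnant": "yes", "income": 48000},  # impossible combination
    {"gender": "F", "pregnant": "no",  "income": -100},   # out-of-range value
    {"gender": "M", "pregnant": "no",  "income": None},   # missing value
]

def is_clean(rec):
    if rec["income"] is None:                              # missing value
        return False
    if rec["income"] < 0:                                  # out of range
        return False
    if rec["gender"] == "M" and rec["pregnant"] == "yes":  # impossible
        return False
    return True

training_set = [r for r in raw if is_clean(r)]
print(len(training_set))  # 1 - only the first record survives
```

Real pipelines would also impute missing values or normalize fields rather than simply dropping rows, but the screening step looks like this.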
Classification of Data Mining Systems
Data mining classification scheme:
1. Decisions in data mining
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted
2. Data mining tasks
- Descriptive data mining
- Predictive data mining
Decisions in data mining
Databases to be mined
o Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
o Characterization, discrimination, association, classification, clustering, trend,
deviation and outlier analysis, etc.
o Multiple/integrated functions and mining at multiple levels
- Techniques utilized
o Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, neural network, etc.
- Applications adapted
o Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Data mining tasks
- Prediction Tasks
o Use some variables to predict unknown or future values of other variables
- Description Tasks
o Find human-interpretable patterns that describe the data.
Common data mining tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Classifications of data mining systems:
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations. New data is classified based on the training set.
Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of
classes or clusters in the data.
Classification
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute, and uses it in classifying new data
Numeric Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
- Credit/loan approval
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?
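The idea of classifying new data from a labelled training set can be made concrete with a tiny classifier. The following is a minimal sketch, with all data and function names invented for illustration, of a 1-nearest-neighbour classifier:

```python
# Minimal 1-nearest-neighbour classification: new data is classified
# by the label of its closest example in the training set.
def classify_1nn(training, new_point):
    """training: list of ((x, y), label) pairs; returns the label
    of the training example nearest to new_point."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    _, label = min(training, key=lambda ex: dist2(ex[0], new_point))
    return label

# Toy training set: (transaction amount, daily frequency) -> class label
train = [((900, 9), "fraud"), ((850, 8), "fraud"),
         ((30, 1), "legit"), ((45, 2), "legit")]
print(classify_1nn(train, (880, 7)))   # nearest examples are fraudulent
```

A real system would normalize the attributes first, since the raw amount dominates the distance here.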
Data Mining Task Primitives
The set of task-relevant data to be mined: This specifies the portions of the database or the set of
data in which the user is interested. This includes the database attributes or data warehouse dimensions of
interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain
to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
Concept hierarchies are a popular form of background knowledge, which allow data to be mined
at multiple levels of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is
shown in the figure. User beliefs regarding relationships in the data are another form of background
knowledge.
The interestingness measures and thresholds for pattern evaluation: These may be used to guide the
mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may
have different interestingness measures. For example, interestingness measures for association rules
include support and confidence. Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which
discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
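Support and confidence, the two interestingness measures named above, can be computed directly from a transaction list. A minimal sketch, with toy transactions invented for illustration:

```python
# Support and confidence of an association rule A -> B over toy transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A union B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))        # 2 of 4 transactions -> 0.5
print(confidence({"beer"}, {"diapers"}))   # 2 of the 3 beer baskets
```

A rule such as beer -> diapers would be kept only if both values clear the user-specified thresholds.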
Data Preprocessing
The real-world data that is to be analyzed by data mining techniques are typically:
1. Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate
data. Missing data, particularly for tuples with missing values for some attributes, may need to be
inferred.
2. Noisy: containing errors, or outlier values that deviate from the expected. Incorrect data may also
result from inconsistencies in naming conventions or data codes used, or inconsistent formats for
input fields, such as date. It is hence necessary to use some techniques to replace the noisy data.
3. Inconsistent: containing discrepancies between different data items. Some attributes representing
a given concept may have different names in different databases, causing inconsistencies and
redundancies. Naming inconsistencies may also occur for attribute values. The inconsistency in
the data needs to be removed.
4. Aggregate Information: it would be useful to obtain aggregate information such as the sales
per customer region, something that is not part of any precomputed data cube in the data
warehouse.
5. Enhancing the mining process: a large number of data sets may make the data mining process
slow. Hence, reducing the number of data sets to enhance the performance of the mining process
is important.
6. Improved data quality: data preprocessing techniques can improve the quality of the data,
thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data
preprocessing is an important step in the knowledge discovery process, because quality decisions
must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the
data to be analyzed can lead to huge payoffs for decision making.
Different forms of Data Preprocessing
Data Cleaning:
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the results of any data mining that
has been applied to it. Also, dirty data can cause confusion for the mining procedure,
resulting in unreliable output, and mining methods are not always robust to it.
Therefore, a useful preprocessing step is to run some data-cleaning routines.
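One simple cleaning routine, filling in missing values with the attribute mean, can be sketched as follows (the records are hypothetical):

```python
# Data cleaning: fill missing values (None) with the attribute mean.
records = [23.0, None, 31.0, None, 26.0]

known = [v for v in records if v is not None]
mean = sum(known) / len(known)            # (23 + 31 + 26) / 3 = 26.666...

cleaned = [v if v is not None else round(mean, 2) for v in records]
print(cleaned)   # [23.0, 26.67, 31.0, 26.67, 26.0]
```

Other choices include filling with a global constant, the class-wise mean, or the most probable value.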
Data Integration:
Data integration involves integrating data from multiple databases, data cubes, or files.
Some attributes representing a given concept may have different names in different databases,
causing inconsistencies and redundancies. For example, the attribute for customer
identification may be referred to as customer_id in one data store and cust_id in another.
Naming inconsistencies may also occur for attribute values.
Also, some attributes may be inferred from others (e.g., annual revenue).
Having a large amount of redundant data may slow down or confuse the knowledge
discovery process. Additional data cleaning can be performed to detect and remove
redundancies that may have resulted from data integration.
Data Transformation:
Data transformation operations, such as normalization and aggregation, are additional data
preprocessing procedures that contribute toward the success of the mining process.
Normalization: scaling the data to be analyzed to a specific range, such as
[0.0, 1.0], for providing better results.
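Min-max normalization to the range [0.0, 1.0] can be sketched in a few lines (the values are invented for illustration):

```python
# Min-max normalization: rescale values to the range [0.0, 1.0].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([200, 300, 400, 600, 1000]))
# [0.0, 0.125, 0.25, 0.5, 1.0]
```

Note that the minimum maps to 0.0 and the maximum to 1.0; a constant column would divide by zero and needs a guard in practice.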
Aggregation: it is often useful for data analysis to obtain aggregate information, such
as the sales per customer region. As this is not part of any precomputed data cube, it
needs to be computed; this process is called aggregation.
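Aggregating sales per customer region from row-level records might look like this minimal sketch (region names and amounts are invented):

```python
# Aggregation: roll row-level sales records up to per-region totals.
sales = [("north", 120), ("south", 80), ("north", 60), ("east", 40)]

totals = {}
for region, amount in sales:
    totals[region] = totals.get(region, 0) + amount

print(totals)   # {'north': 180, 'south': 80, 'east': 40}
```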
Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results. There are a number of strategies for
data reduction:
- data aggregation (e.g., building a data cube),
- attribute subset selection (e.g., removing irrelevant attributes through correlation analysis),
- dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or
wavelets),
- numerosity reduction (e.g., replacing the data by alternative, smaller representations
such as clusters or parametric models),
- generalization with the use of concept hierarchies, by organizing the concepts into varying
levels of abstraction.
Data discretization is very useful for the automatic generation of concept hierarchies from
numerical data.
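Discretization onto a concept hierarchy can be sketched as simple binning; the age ranges and labels below are illustrative assumptions, not taken from the notes:

```python
# Discretization: map numeric ages onto a higher level of a concept
# hierarchy using ordered bins (bin boundaries and labels invented).
def discretize(value, bins):
    """bins: list of (upper_bound, label) pairs, checked in order."""
    for upper, label in bins:
        if value <= upper:
            return label
    return "senior"   # fallback for values above the last bound

age_bins = [(12, "child"), (19, "youth"), (65, "adult")]
print([discretize(a, age_bins) for a in [8, 15, 40, 70]])
# ['child', 'youth', 'adult', 'senior']
```

Equal-width or equal-frequency boundaries can be derived automatically from the data rather than fixed by hand.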
UNIT IV
ASSOCIATION RULE MINING AND CLASSIFICATION
Frequent Pattern Analysis
A frequent pattern is a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and
association rule mining.
Motivation: finding inherent regularities in data.
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
Applications:
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-
stream) analysis, and DNA sequence analysis.
Why Is Frequent Pattern Mining Important?
- Dimension/level constraint
o in relevance to region, price, brand, customer category
- Rule (or pattern) constraint
o small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint
o strong rules: min_support >= 3%, min_confidence >= 60%
Constrained Mining vs. Constraint-Based Search
Constrained mining vs. constraint-based search/reasoning
o Both are aimed at reducing the search space
o Finding all patterns satisfying constraints vs. finding some (or one) answer in
constraint-based search in AI
o Constraint-pushing vs. heuristic search
o It is an interesting research problem how to integrate them
Constrained mining vs. query processing in DBMS
o Database query processing requires finding all answers
o Constrained pattern mining shares a similar philosophy with pushing selections
deeply into query processing
The Apriori Algorithm: Example
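The algorithm can be followed in code. Below is a straightforward, unoptimized sketch of Apriori's level-wise join-and-prune loop, run on invented transactions:

```python
from itertools import combinations

# Apriori sketch: grow frequent itemsets level by level, keeping only
# candidates whose every (k-1)-subset is frequent and whose support
# meets the minimum threshold.
def apriori(transactions, min_support):
    n = len(transactions)
    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    result = list(frequent)
    k = 2
    while frequent:
        # Join step: size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        frequent = [c for c in candidates
                    if all(frozenset(s) in result
                           for s in combinations(c, k - 1))
                    and is_frequent(c)]
        result.extend(frequent)
        k += 1
    return result

txns = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
        {"beer", "chips"}, {"diapers", "chips"}]
for itemset in apriori(txns, min_support=0.5):
    print(sorted(itemset))
```

With min_support = 0.5 this finds the three frequent single items and all three frequent pairs; the triple {beer, diapers, chips} appears in only one of four transactions and is pruned.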
Decision Tree Induction
Information produced by data mining techniques can be represented in many different
ways. Decision tree structures are a common way to organize classification schemes. In
classification tasks, decision trees visualize what steps are taken to arrive at a classification. Every
decision tree begins with what is termed a root node, considered to be the "parent" of every other
node. Each node in the tree evaluates an attribute in the data and determines which path it should
follow. Typically, the decision test is based on comparing a value against some constant.
Classification using a decision tree is performed by routing from the root node until arriving at a
leaf node.
The illustration provided here is a canonical example in data mining, involving the
decision to play or not play based on weather conditions. In this case, outlook is in the position of
the root node. The branches of the node correspond to attribute values. In this example, the child
nodes are tests of humidity and windy, leading to the leaf nodes, which are the actual
classifications. This example also includes the corresponding data, also referred to as instances.
In our example, there are 9 "play" days and 5 "no play" days.
Decision trees can represent diverse types of data. The simplest and most familiar is
numerical data. It is often desirable to organize nominal data as well. Nominal quantities are
formally described by a discrete set of symbols. For example, weather can be described in either
numeric or nominal fashion. We can quantify the temperature by saying that it is 11 degrees
Celsius or 52 degrees Fahrenheit. We could also say that it is cold, cool, mild, warm or hot. The
former is an example of numeric data, and the latter is a type of nominal data. More accurately,
the example of cold, cool, mild, warm and hot is a special type of nominal data, described as
ordinal data. Ordinal data has an implicit assumption of ordered relationships between the values.
Continuing with the weather example, we could also have a purely nominal description like
sunny, overcast and rainy. These values have no relationships or distance measures.
The type of data organized by a tree is important for understanding how the tree works at
the node level. Recalling that each node is effectively a test, numeric data is often evaluated in
terms of a simple mathematical inequality. For example, numeric weather data could be tested by
finding whether it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion;
in other words, whether or not it has a particular value. The illustration shows both types of tests.
In the weather example, outlook is a nominal data type. The test simply asks which attribute value
is represented and routes accordingly. The humidity node reflects a numeric test, with an
inequality of less than or equal to 70, or greater than 70.
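The two kinds of node tests described above, equality on nominal attributes and a threshold inequality on numeric ones, can be sketched as a hard-coded version of the weather tree. The attribute values and the humidity threshold here follow the common textbook weather example and are assumptions for illustration:

```python
# Node-level tests in a decision tree: nominal attributes are tested
# for equality, numeric ones against a constant threshold.
def classify(instance):
    if instance["outlook"] == "sunny":                   # nominal test
        return "no play" if instance["humidity"] > 70 else "play"  # numeric test
    if instance["outlook"] == "overcast":
        return "play"                                    # leaf node
    # rainy branch: another nominal (Boolean) test
    return "no play" if instance["windy"] else "play"

print(classify({"outlook": "sunny", "humidity": 65}))    # play
print(classify({"outlook": "rainy", "windy": True}))     # no play
```

An induction algorithm builds exactly this structure automatically by choosing, at each node, the attribute and threshold that best split the data.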
Decision tree induction algorithms function recursively. First, an attribute must be selected
as the root node. In order to create the most efficient (i.e., smallest) tree, the root node must
effectively split the data. Each split attempts to pare down a set of instances (the actual data) until
they all have the same classification. The best split is the one that provides what is termed the
most information gain.
Information in this context comes from the concept of entropy in information theory, as
developed by Claude Shannon. Although "information" has many contexts, it has a very specific
mathematical meaning relating to certainty in decision making. Ideally, each split in the decision
tree should bring us closer to a classification. One way to conceptualize this is to see each step
along the tree as removing randomness, or entropy. Information, expressed as a mathematical
quantity, reflects this. For example, consider a very simple classification problem that requires
creating a decision tree to decide yes or no based on some data. This is exactly the scenario
visualized in the decision tree. Each attribute's values will have a certain number of yes or no
classifications. If there are equal numbers of yeses and no's, then there is a great deal of entropy in
that value. In this situation, entropy reaches a maximum and the value tells us little. Conversely, if
there are only yeses or only no's, the entropy is zero. When the entropy is low, the attribute value
is very useful for making a decision.
The formula for calculating entropy is as follows: for a set S whose instances fall into classes with
proportions p1, ..., pn,
Entropy(S) = -(p1 log2 p1 + ... + pn log2 pn),
and the information gain of a split is the entropy of the parent set minus the weighted average
entropy of the subsets it produces.
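These quantities can be computed directly. The sketch below evaluates the entropy of the 9-yes/5-no weather data and the information gain of splitting on outlook; the per-branch counts follow the common textbook version of the example:

```python
from math import log2

# Entropy of a class-count distribution, and the information gain of a
# split, for the 9-yes / 5-no weather data.
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def info_gain(parent_counts, child_counts):
    """parent_counts: class counts before the split;
    child_counts: one class-count list per branch."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child)
                    for child in child_counts)
    return entropy(parent_counts) - remainder

print(round(entropy([9, 5]), 3))   # 0.94
# Splitting on outlook: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2]
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```

The induction algorithm computes this gain for every candidate attribute and selects the one with the highest value as the node's test.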
Machine Learning
The general problem of machine learning is to search a (usually very large) space of potential
hypotheses to determine the one that will best fit the data and any prior knowledge. The data may be
labelled or unlabelled. If labels are given then the problem is one of supervised learning, in that the true
answer is known for a given set of data. If the labels are categorical then the problem is one of
classification, e.g. predicting the species of a flower given petal and sepal measurements. If the labels are
real-valued, the problem is one of regression, e.g. predicting property values from crime, pollution, etc.
statistics. If labels are not given then the problem is one of unsupervised learning, and the aim is to
characterize the structure of the data, e.g. by identifying groups of examples in the data that are
collectively similar to each other and distinct from the other data.
Supervised Learning
Given some examples, we wish to predict certain properties. In the case where there is available a
set of examples whose properties have already been characterized, the task is to learn the relationship
between the two. One common early approach was to present the examples in turn to a learner. The
learner makes a prediction of the property of interest, the correct answer is presented, and the learner
adjusts its hypothesis accordingly. This is known as learning with a teacher, or supervised learning.
In supervised learning there is necessarily the assumption that the available descriptors are in some
way related to a quantity of interest. For instance, suppose that a bank wishes to detect fraudulent credit
card transactions. In order to do this, some domain knowledge is required to identify factors that are likely
to be indicative of fraudulent use. These may include frequency of usage, amount of transaction, spending
patterns, type of business engaging in the transaction, and so forth. These variables are the predictive, or
independent, variables X. It would be hoped that these are in some way related to the target, or
dependent, variable Y. Deciding which variables to use in a model is a very difficult problem in general;
this is known as the problem of feature selection and is NP-complete. Many methods exist for choosing
the predictive variables; if domain knowledge is available then it can be very useful in this context. Here
we assume that at least some of the predictive variables are in fact predictive. Assume, then, that the
relationship between X and Y is given by their joint probability density.
UNIT V
CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
Cluster Analysis
Data clustering is a method in which we form clusters of objects that are somehow similar in
characteristics. The criterion for checking the similarity is implementation dependent.
Clustering is often confused with classification, but there is a difference between the two. In
classification the objects are assigned to predefined classes, whereas in clustering the classes are also to
be defined.
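To make the contrast concrete, here is a deliberately simple sketch of clustering with no predefined classes: group 1-D points into clusters wherever the gap between neighbours exceeds a threshold (data and threshold are invented):

```python
# Minimal 1-D clustering: start a new cluster whenever the gap between
# consecutive sorted points exceeds max_gap. No class labels are given;
# the groups emerge from the data itself.
def cluster_1d(points, max_gap):
    pts = sorted(points)
    clusters = [[pts[0]]]
    for p in pts[1:]:
        if p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)   # close enough: same cluster
        else:
            clusters.append([p])     # large gap: new cluster
    return clusters

print(cluster_1d([1, 2, 3, 10, 11, 25], max_gap=3))
# [[1, 2, 3], [10, 11], [25]]
```

Real clustering algorithms such as k-means generalize this idea to many dimensions and to iterative refinement of the groups.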
Precisely, data clustering is a technique in which information that is logically similar is
physically stored together. In order to increase efficiency in database systems, the number of disk
accesses is to be minimized. In clustering, objects with similar properties are placed in one class of
objects, and a single access to the disk makes the entire cl