
B.Sc.(IT) - 5th Semester

BSIT - 53 Data Warehousing & Data Mining

INFORMATION TECHNOLOGY PROGRAMMES
Bachelor of Science in Information Technology - B.Sc.(IT)
Master of Science in Information Technology - M.Sc.(IT)

In collaboration with

KUVEMPU UNIVERSITY

Directorate of Distance Education
Kuvempu University
Shankaraghatta, Shimoga District, Karnataka

Universal Education Trust
Bangalore



Titles in this Volume : BSIT - 53 Data Warehousing & Data Mining

Prepared by UNIVERSAL EDUCATION TRUST (UET), Bangalore

First Edition : May 2005
Second Edition : May 2012

Copyright © by UNIVERSAL EDUCATION TRUST, Bangalore
All rights reserved

No part of this book may be reproduced in any form or by any means without the written permission from Universal Education Trust, Bangalore.

All product names and company names mentioned herein are the property of their respective owners.

NOT FOR SALE
For personal use of Kuvempu University IT Programme students only.

Corrections & suggestions for improvement of the study material are invited by Universal Education Trust, Bangalore.

E-mail : [email protected]

Printed at : Pragathi Print Communications, Bangalore - 20
Ph : 080-23340100


DATA WAREHOUSING & DATA MINING

(BSIT - 53)

: Contributing Authors :

Dr. K. S. Shreedhara
UBDT College, Davangere

&

Indira S. P.
GMIT, Davangere


Contents

DATA WAREHOUSING

COURSE SUMMARY 1

Chapter 1

INTRODUCTION TO DATA MANAGEMENT 3

1.0 Introduction ........................ 3
1.1 The Concept of Data Bases ........................ 5
1.2 Management Information Systems ........................ 6
1.3 The Concept of Data Warehouse and Data Mining ........................ 6
1.4 Concept of Views ........................ 7
1.5 Concept of Normalization ........................ 8

Chapter 2

DEFINITION OF DATA WAREHOUSING 11

2.0 Introduction ........................ 11
2.1 Definition of a Data Warehouse ........................ 12
2.2 The Data Warehouse Delivery Process ........................ 12
2.3 Typical Process Flow in a Data Warehouse ........................ 16
2.3.1 Extract and Load Process ........................ 17
2.3.2 Data Cleanup and Transformation ........................ 18
2.3.3 Backup and Archiving ........................ 18
2.3.4 Query Management ........................ 19


2.4 Architecture for a Data Warehouse ........................ 19
2.4.1 The Load Manager ........................ 20
2.4.2 The Warehouse Manager ........................ 21
2.4.3 Query Manager ........................ 23
2.5 The Concept of Detailed Information ........................ 23
2.6 Data Warehouse Schemas ........................ 23
2.7 Partitioning of Data ........................ 24
2.8 Summary Information ........................ 25
2.9 Meta Data ........................ 25
2.10 Data Marts ........................ 26
BLOCK SUMMARY ........................ 27

Chapter 3

DATABASE SCHEMA 30

3.1 Star Flake Schemas ........................ 31
3.1.1 What are the Facts and what are the Dimensions? ........................ 31
3.2 Designing of Fact Tables ........................ 36
3.3 Designing Dimension Tables ........................ 39
3.4 Designing the Star-Flake Schema ........................ 41
3.5 Query Redirection ........................ 42
3.6 Multi Dimensional Schemas ........................ 43
BLOCK SUMMARY ........................ 44

Chapter 4

PARTITIONING STRATEGY 46

4.1 Horizontal Partitioning ....................................................................... 47

4.2 Vertical Partitioning ........................ 49
4.2.1 Normalisation ........................ 50
4.2.2 Row Splitting ........................ 51
4.3 Hardware Partitioning ........................ 52
4.3.1 Maximising the processing and avoiding bottlenecks ........................ 52
4.3.2 Striping data across the nodes ........................ 53
4.3.3 Horizontal hardware partitioning ........................ 54
BLOCK SUMMARY ........................ 54


Chapter 5

AGGREGATIONS 56

5.1 The Need for Aggregation ........................ 56
5.2 Definition of Aggregation ........................ 57
5.3 Aspects to be looked into while Designing the Summary Tables ........................ 59

BLOCK SUMMARY ....................................................................... 63

Chapter 6

DATA MART 65

6.1 The Need for Data Marts ........................ 65
6.2 Identify the Splits in Data ........................ 68
6.3 Identify the Access Tool Requirements ........................ 68
6.4 Role of Access Control Issues in Data Mart Design ........................ 68

BLOCK SUMMARY ....................................................................... 70

Chapter 7

META DATA 72

7.1 Data Transformation and Loading ........................ 72
7.2 Data Management ........................ 74
7.3 Query Generation ........................ 76

BLOCK SUMMARY ....................................................................... 78

Chapter 8

PROCESS MANAGERS 79

8.1 Need for Managers to a Data Warehouse ........................ 80
8.2 System Management Tools ........................ 80
8.2.1 Configuration Manager ........................ 81
8.2.2 Schedule Manager ........................ 81
8.2.3 Event Manager ........................ 82
8.2.4 Database Manager ........................ 83
8.2.5 Backup and Recovery Manager ........................ 84
8.3 Data Warehouse Process Managers ........................ 85
8.3.1 Load Manager ........................ 85
8.3.2 Warehouse Manager ........................ 86
8.3.3 Query Manager ........................ 88
BLOCK SUMMARY ........................ 89


DATA MINING

COURSE SUMMARY 92

Chapter 9

INTRODUCTION TO DATA MINING 94

9.0 Introduction ........................ 94
9.1 What is Data Mining? ........................ 95
9.2 What Kind of Data can be Mined? ........................ 98
9.3 What can Data Mining do? ........................ 100
9.4 How do we Categorize Data Mining Systems? ........................ 102
9.5 What are the Issues in Data Mining? ........................ 102
9.6 Reasons for the Growing Popularity of Data Mining ........................ 105
9.7 Applications ........................ 105
9.8 Exercise ........................ 106

Chapter 10

DATA PREPROCESSING AND DATA MINING PRIMITIVES 108

10.0 Introduction ........................ 108
10.1 Data Preparation ........................ 108
10.1.1 Select data ........................ 108
10.1.2 Data Cleaning ........................ 109
10.1.3 New data construction ........................ 110
10.1.4 Data formatting ........................ 110
10.2 Data Mining Primitives ........................ 111
10.2.1 Defining data mining primitives ........................ 111
10.3 A Data Mining Query Language ........................ 113
10.3.1 Syntax for Task-relevant data specification ........................ 114
10.4 Designing Graphical User Interfaces Based on a Data Mining Query Language ........................ 117
10.5 Architectures of Data Mining Systems ........................ 119
10.6 Exercise ........................ 120

Chapter 11

DATA MINING TECHNIQUES 122

11.0 Introduction ...................................................................................... 122


11.1 Associations ........................ 122
11.1.1 Data Mining with the Apriori algorithm ........................ 123
11.1.2 Implementation Steps ........................ 124
11.1.3 Improving the efficiency of Apriori ........................ 125
11.2 Data Mining with Decision Trees ........................ 125
11.2.1 Decision tree working concept ........................ 126
11.2.2 Other Classification Methods ........................ 128
11.2.3 Prediction ........................ 130
11.2.4 Nonlinear Regression ........................ 132
11.2.5 Other Regression Models ........................ 133
11.3 Classifier Accuracy ........................ 133
11.3.1 Estimating Classifier Accuracy ........................ 134
11.4 Bayesian Classification ........................ 134
11.4.1 Bayes Theorem ........................ 135
11.4.2 Naive Bayesian Classification ........................ 135
11.4.3 Bayesian Belief Networks ........................ 137
11.4.4 Training Bayesian Belief Networks ........................ 139
11.5 Neural Networks for Data Mining ........................ 140
11.5.1 Neural Network Topologies ........................ 140
11.5.2 Feed-Forward Networks ........................ 141
11.5.3 Classification by Backpropagation ........................ 142
11.5.4 Backpropagation ........................ 142
11.5.5 Backpropagation and Interpretability ........................ 146
11.6 Clustering in Data Mining ........................ 147
11.6.1 Requirements for clustering ........................ 147
11.6.2 Types of Data in Cluster Analysis ........................ 149
11.6.3 Interval-Scaled Variables ........................ 150
11.6.4 Binary Variables ........................ 152
11.6.5 Nominal, Ordinal and Ratio-Scaled Variables ........................ 154
11.6.6 Variables of Mixed Types ........................ 156
11.7 A Categorization of Major Clustering Methods ........................ 157
11.8 Clustering Algorithms ........................ 158
11.8.1 K-means algorithm ........................ 158
11.8.2 Important issues in automatic cluster detection ........................ 159
11.8.3 Application Issues ........................ 160
11.9 Genetic Algorithms ........................ 160


11.10 Exercise ......................................................................................... 161

Chapter 12

GUIDELINES FOR KDD ENVIRONMENT 163

12.0 Introduction ..................................................................................... 163

12.1 Guidelines ....................................................................................... 163

12.2 Exercise .......................................................................................... 165

Chapter 13

DATA MINING APPLICATION 167

13.0 Introduction ...................................................................................... 167

13.1 Data Mining for Biomedical and DNA Data Analysis ........................ 167
13.2 Data Mining for Financial Data Analysis ........................ 169

13.3 Data Mining for the Retail Industry .................................................... 170

13.4 Other Applications ........................ 171
13.5 Exercise ........................ 171


DATA WAREHOUSING

COURSE SUMMARY

The use of computers for data storage and manipulation is a fairly old phenomenon. In fact, one of the main reasons for the popularity of computers is their ability to store and provide data accurately over long periods of time. Of late, computers are also being used for decision making. The main use of historical data is to provide trends, so that future sequences can be predicted. This task can also be done by computers which have sufficient capabilities in terms of hardware and software.

Once this aspect was explored, it became possible to use computers as a "storehouse" or "warehouse" of data. As the name suggests, huge volumes of data are collected from different sources and are stored in a manner that is congenial for retrieval. This is the basic concept of data warehousing.

By definition, a data warehouse stores huge volumes of data which pertain to various locations and times. The most primary task of the warehouse manager is to properly "label" them and to be "able" to provide them on demand. Of course, at the next level, it becomes desirable that the manager is able to do some amount of scanning, filtering etc., so that the user can ask for data that satisfies specific questions – like the number of blue coloured shirts sold in a particular location last summer – and get the data. Most database management systems also provide queries to do this job, but a warehouse will have to cater to much larger and rather ad hoc types of data.

In this course, we take you through the preliminaries of a warehouse operation. You are introduced to the concept of a data warehouse as a process – how the computer looks at the entire operation. You will be introduced to the various concepts of data extraction, loading, transformation and archiving. The process architecture, with the concept of software components called managers, is also introduced.

Then the concept of actual storage of data – the database schema – is dealt with in some detail. We look at star flake schemas, fact tables, methods of designing them and also the concept of query redirection.

There is also the concept of partitioning the data, to reduce the amount of scanning needed to answer the queries. We talk of horizontal and vertical partitioning as well as hardware partitioning processes. An introduction to the concept of aggregations is also provided.

Data warehouses store huge volumes of data – like the storehouse of a huge business organization. There is a need for smaller "retail" shops – which can provide the more frequently used data without going back to the storehouse. This is the concept of "data marts". We look at several aspects of data marting.

The data in a warehouse is not static – it keeps changing. There is a need to order and maintain the data. This leads us to the concept of metadata – "data about data". The metadata help us to have information about the type of data that we have in the warehouse and to arrive at suitable decisions about their manipulation.

Finally, an insight into the working of the various managers – the load manager, the warehouse manager and the query manager – is provided.

Chapter 1

Introduction to Data Management

BLOCK INTRODUCTION

In this introductory chapter, you will be introduced to the use of computers for data management. This will give a sound background to initiate you into the concepts of data warehousing and data mining. In brief, data is any useful piece of information – but most often data is a set of facts and figures, like the total expenditure on books during the last year, the total number of computers sold in two months, or the number of railway accidents in 6 months – it can be anything. The introduction in this chapter expects no previous background on the student's part, and it starts from level zero. We start with how computers could be used for storing and manipulating data, how this can become useful to various applications, and what terminologies are used in different contexts. It also deals, at the conceptual level, with several concepts that are used in the subsequent chapters, so that the student is comfortable as he proceeds. He can also come back to this introductory chapter if, at a later stage, he finds some background is missing for him to continue his studies. For those students who already have some background in the subject, some of the concepts dealt with here may appear redundant, but it is advised that they go through this block at least once to ensure a continuity of ideas. Though some of the ideas – like the simple concept of the parts of a computer – are already known to the student, a different perspective on the same topic is made available. Hence the need to "begin at the beginning".

1.0 INTRODUCTION

We begin with the parts of a computer and its primary role. The computer was originally designed to be a fast computing device – one that can perform arithmetic and logical operations quickly. The concept of using computers for data manipulation came much later.


A computer has three basic parts:

i. The central processing unit (CPU) does all the arithmetic and logical operations. It can be thought of as the heart of any computer, and computers are identified by the type of CPU that they use.

ii. The memory holds the programs and data. All the computers that we come across these days are what are known as "stored program computers". The programs are stored beforehand in the memory, and the CPU accesses these programs line by line and executes them.

iii. The input/output devices facilitate the interaction of the users with the computer. The input devices are used to send information to the computer, while the output devices accept the processed information from the computer and make it available to the user.

One can note that the I/O devices interact with the CPU and not directly with the memory. Any memory access is to be done through the CPU only.

We are not discussing the various features of each of these devices at this stage.

A typical computer works in the following way: the user sends in the programs as well as the data, which are stored in the memory. The CPU accesses the program line by line and executes it. Any data that is needed to execute the program is also drawn from the memory. The output (or the results) is either written back to the memory or sent to a designated output device, as the case may be. Looking at it the other way, the programs "modify" the data. For example, if two numbers 6 and 8 are added, giving 14 as the answer, we can consider that the two input data items are modified into their sum.

The other concept that we should consider is that even though computers were originally used to operate on numbers only, they can also store and manipulate characters and sentences – though in a very limited sense.

Incidentally, there are two types of memories – the primary memory, which is embedded in the computer and which is the main source of data to the computer, and the secondary memory, like floppy disks, CDs etc., which can be carried around and used in different computers. Secondary memories cost much less than the primary memory, but the CPU can access data only from the primary memory. The main advantage of computer memories, both primary and secondary, is that they can store data indefinitely and accurately.

The other aspect we need to consider, before moving on, is that of communication. Computers can communicate amongst themselves – through local area / wide area networks – which means data, once entered into one computer, can be accessed by other computers, either by the use of a secondary storage device like a CD/floppy or through a network. With the advent of the internet, the whole world has become a network. So, theoretically, data on any computer can be made available on any other computer (subject to restrictions, of course).

There is one more aspect that we need to consider. With the advent of computerization, most offices, including sales counters etc., use computers. It is the data generated by these computers – like the sales details, salary paid, account details etc. – that actually form the databases, the details of which we will consider in a short while. Even aspects like the speeds of machines etc. are nowadays convertible directly into data and stored in computers, which means the data need not be entered even once. As and when the data are generated, they are stored on the computer.

One more aspect we have to consider is that the cost of memory has come down drastically over the years. In the initial stages of computer evolution, memory, especially the main memory, was very costly and people were always trying to minimize memory usage. But with the advances in technology, the cost of memory has become quite low and it is possible for us to store huge amounts of data over long durations of time at an affordable rate.

We shall now list the various aspects we have discussed that become helpful in the context of this course.

Computers carry memory with them, which is quite cheap.

Since secondary memories retain data until they are erased (mostly), they form a good place for storing data.

The data stored can also be transmitted from one computer to another, either through secondary devices or through networks.

The computers can use this data in their calculations.

1.1 THE CONCEPT OF DATA BASES

We have seen in the previous section how data can be stored in a computer. Such stored data becomes a "database" – a collection of data. For example, if all the marks scored by all the students of a class are stored in the computer memory, it can be called a database. From such a database, we can answer questions like: Who has scored the highest marks? In which subject have the maximum number of students failed? Which students are weak in more than one subject? Of course, appropriate programs have to be written to do these computations. Also, as the database becomes very large and more and more data keeps getting included at different periods of time, there are several other problems about "maintaining" these data, which will not be dealt with here.

Since handling of such databases has become one of the primary jobs of the computer in recent years, it becomes difficult for the average user to keep writing such programs. Hence, special languages – called database query languages – have been devised, which make such programming easy. These languages help in getting specific "queries" answered easily.
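To make the idea concrete, here is a minimal sketch in Python using the built-in sqlite3 module (the table name, columns and figures are hypothetical, used only for illustration): a question about the stored marks is answered by a short statement in a query language (SQL) instead of a hand-written program.

import sqlite3

# Hypothetical student-marks database, used only to illustrate a query language.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, subject TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("Anil", "Maths", 82), ("Anil", "Physics", 35),
     ("Bina", "Maths", 91), ("Bina", "Physics", 78)],
)

# "Who has scored the highest total marks?" expressed as a query,
# not as a custom program.
row = conn.execute(
    "SELECT name, SUM(marks) AS total FROM student "
    "GROUP BY name ORDER BY total DESC LIMIT 1"
).fetchone()
print(row)   # ('Bina', 169)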

1.2 MANAGEMENT INFORMATION SYSTEMS

The other important job of a computer is producing management reports – details that a senior management person would be interested in knowing about the performance of his organization, without going into the specifics. Things like the average increase/decrease in sales, employee performance details, market conditions etc. can be made available in a concise form like tables, bar charts, pie charts etc., if the relevant details are available. Effectively, this again boils down to handling huge amounts of data, doing simple arithmetic/statistical operations and presenting the data in a relevant form. Traditionally such operations were undertaken by a group of statisticians and secretaries, but now the computer can do them much faster, more conveniently, and make them available to the manager at the click of a button.

1.3 THE CONCEPT OF DATA WAREHOUSE AND DATA MINING

Now we are in a position to look at these concepts. The data, when it becomes abundantly large, gives rise to a "warehouse", a storehouse of data. The one common feature amongst all such warehouses is that they hold large amounts of data; otherwise, the type of data can be anything, from student data to the details of the citizens of a city, the sales of previous years, or the number of patients that came to a hospital with different ailments. Such data becomes a storehouse of information. Most organizations tend to predict the future course of action based on the reports of the previous years. But the queries that arise here are many times more complicated than in a simple database. For example, instead of simply finding the marks scored by the weak students, we would like to analyse the tendency of a student to score low marks in some subject. Instead of simply finding out the sales generated by the different outlets, we would like to know what the similar and dissimilar patterns amongst these outlets are and how they point to the overall prospects of the company. Since the manager need not be an IT professional on the one hand, and since the data to be handled is huge on the other, these concepts are to be dealt with with utmost care.


1.4 CONCEPT OF VIEWS

Data is normally stored in tabular form. Unless storage in other formats becomes advantageous, we store data in what are technically called "relations" or, in simple terms, "tables".

[Table: a simple student table with fields such as name, age, class and marks.]

Inside the memory, the contents of a table are stored one by the side of the other, but still we can imagine it to be the rectangular table that we are accustomed to.

However, when such tables tend to contain a large number of "fields" (in the above table, each column – name, age, class etc. – is a field), several other problems crop up. When there are 100 such fields, not everybody who wants to process the table would be interested in all 100 of them. More importantly, we may also not want all users to be allowed to look into all the fields.

For example, in a table of employee details, we may not want an employee to know the salary of some other employee, or an external person to know the salary of any employee. In the table of students above, if someone is interested in knowing the average age of students of a class, he may not be interested in their marks. Thus, his "view" will not include the concept of marks. Hence, the same table may look different to different people. Each person is entitled to his own "logical view" of the system.
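As a small sketch of this idea (the view name and columns are illustrative assumptions, not taken from the text), a relational database lets each user work through a restricted logical view of the same stored table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER, class TEXT, marks INTEGER)")
conn.execute("INSERT INTO student VALUES ('Anil', 17, 'B.Sc. I', 82)")

# A logical view for someone interested only in ages: the marks column
# is simply not part of this user's view of the data.
conn.execute("CREATE VIEW student_age AS SELECT name, age, class FROM student")
print(conn.execute("SELECT * FROM student_age").fetchall())
# [('Anil', 17, 'B.Sc. I')]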

Quite often, we also look at the data differently. For example, we may like to look at it as a hierarchy, where one or more aspects are subordinate to other objects. Thus the employee table can be looked at in two ways.

[Figure: the same employee table represented in hierarchical form.]


The point is that the same table is sometimes looked at in hierarchical form and sometimes in tabular form. It is to be noted that in all these cases, we are not rewriting the data in a different place or in a different format; we simply use the same set of data but interpret it differently. While the computer programs are perfectly capable of doing this, we, as humans, write it in different formats for our own understanding. In succeeding chapters, we come across a number of instances where views are represented in different pictorial forms, but it should be remembered that this is for our convenience only.

In some cases, it may also be that the data are actually in different tables. For example, in the above case of managers, their family details may be in an altogether different table; it is for the software to select and combine the fields. Such concepts we call a "schema" – a user's view of data – which we use extensively in the coming sections.

We also speak of "dimensions", which are again similar to the fields of the above table. For example, in the student table, a student has the dimensions of name, age, class, marks1, marks2 etc. The understanding is that whenever you represent any student in the table, he is represented by these facts. Similarly, the entity of manager has the dimensions of name, designation, age and salary. We will use the term dimensions quite often.

1.5 CONCEPT OF NORMALIZATION

Normalization is dealt with in several chapters of any book on database management systems. Here, we will take the simplest definition, which suffices for our purpose: no field should have subfields.

Again consider the following student table.

Here, under the field marks, we have 3 subfields: marks for subject 1, marks for subject 2 and marks for subject 3.

However, it is preferable to split these subfields into regular fields, as shown in the sketch below.


Quite often, the original table which comes with subfields will have to be modified suitably, by the process of "normalization".
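The two layouts referred to above can be pictured with a small sketch (the field names and figures are illustrative assumptions): in the first record the marks field itself contains subfields, while in the normalized record every subfield is promoted to a regular field.

# Un-normalized record: the "marks" field itself contains subfields.
student_unnormalized = {
    "name": "Anil", "age": 17, "class": "B.Sc. I",
    "marks": {"subject1": 82, "subject2": 74, "subject3": 68},
}

# Normalized record: each subfield becomes a regular field of its own,
# so no field has subfields.
student_normalized = {
    "name": "Anil", "age": 17, "class": "B.Sc. I",
    "marks_subject1": 82, "marks_subject2": 74, "marks_subject3": 68,
}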

Self Evaluation - I

1. What are the main parts of the computer?

2. How can data be transferred form one computer to another?

3. Can the CPU access data directly from the secondary memory?

4. For how long can the secondary memories store data?

5. What is a database?

6. What is the function of a database query language?

7. What is the use of a management information system?

8. What is a relation?

9. Why are different views of a given data needed?

10. Give the simplest definition of normalized data.

Answers

1. CPU, memory, I/O devices

2. Through floppies/CDs or through network connections.

3. No.

4. Until they are erased by the user.

5. A very large collection of useful data.

6. It makes database programming easy.

7. Produces information in a form that is useful to top managers.

8. Data stored in tabular form.

9. Because not all the users need to/should know all the fields.


10. The relation should not have any subfields (Sub dimensions).

Self Evaluation - II

1. With a neat diagram, explain the parts of a computer.

2. Explain the concept of a view. Give an example.


Chapter 2

Definition of Data Warehousing

BLOCK INTRODUCTION

In this chapter, you will be introduced to the fundamental concepts of a data warehouse. A data warehouse is not only a repository of huge volumes of data, but also a system from which you can get support to draw meaningful conclusions. We begin with a formal definition of a data warehouse and look at the process of evolution of a data warehouse. It involves the cooperation of everybody in the company, starting from the IT strategists down to the average users. There is also scope for future development. We look at the various components that go into the design of a warehouse. Then we look at the typical flow of processes in a warehouse. This traces the movement of data from its acquisition to its archiving.

The next step is to study the architecture of a typical warehouse. The concept of the different warehouse managers and their activities is introduced. We also see the concept of warehouse schemas – methods of storing data in a warehouse. We also briefly introduce some miscellaneous concepts of data warehouses.

2.0 INTRODUCTION

In the last two decades, computerization has reached tremendous scales. New computer systems have been installed to gain a competitive edge in all sorts of business applications, from supermarkets and computerized billing systems to computerized manufacturing and online transactions.

However, it is also realized that enormous knowledge is available in these systems, which can be utilized in several other ways. In fact, in today's world, the competitive edge will come more from the proactive use of information rather than from more and more optimization. The information can be tapped for decision making and for stealing a march over rival organizations.


However, none of these computer systems are designed to support this kind of activity – i.e. to tap the data available and convert it into suitable decisions. They are not able to support the operational and multidimensional requirements. Hence, a new class of systems called data warehouses is being developed. They are able to make use of the available data and present it as information that can improve the quality of decision making and the profitability of the organization.

2.1 DEFINITION OF A DATA WAREHOUSE

In its simplest form, a data warehouse is a collection of key pieces of information used to manage and direct the business towards the most profitable outcome. It would decide the amount of inventory to be held, the number of employees to be hired, the amount to be procured on loan, etc.

The above definition may not be precise – but that is how data warehouse systems are. There are different definitions given by different authors, but we keep this idea in mind and proceed.

A data warehouse is a large collection of data and a set of process managers that use this data to make information available. The data can be metadata, facts, dimensions and aggregations. The process managers can be load managers, warehouse managers or query managers. The information made available is such that it allows the end users to make informed decisions.

2.2 THE DATA WAREHOUSE DELIVERY PROCESS

This section deals with the data warehouse from a different viewpoint – how the different components that go into it enable the building of a data warehouse. The study helps us in two ways:

i) To have a clear view of the data warehouse building process.

ii) To understand the working of the data warehouse in the context of these components.

Now we look at the concepts in detail.

i. IT Strategy

The company must have an overall IT strategy, and data warehousing has to be a part of that overall strategy. This not only ensures that adequate backing in terms of data and investment is available, but also helps in integrating the warehouse into the strategy. In other words, a data warehouse cannot be visualized in isolation.


ii. Business Case Analysis

This looks like an obvious thing, but it is most often misunderstood. An overall understanding of the business and the importance of its various components is a must. This will ensure that one can clearly justify the appropriate level of investment that goes into the data warehouse design and also the amount of returns accruing.

Unfortunately, in many cases, the returns from the warehousing activity are not quantifiable. At the end of the year, one cannot make statements of the sort "I have saved / generated 2.5 crore rupees because of data warehousing". A data warehouse affects the business and strategy plans indirectly – giving scope for undue expectations on one hand and total neglect on the other. Hence, it is essential that the designer has a sound understanding of the overall business and of the scope for his concept (the data warehouse) in the project, so that he can answer the probing questions.

iii. Education

This has two roles to play – one is to make people, especially top-level policy makers, comfortable with the concept; the second is to aid the prototyping activity. To take care of the education aspect, an initial (usually scaled-down) prototype is created and people are encouraged to interact with it. This helps achieve both the activities listed above. The users become comfortable with the use of the system, and the warehouse developer becomes aware of the limitations of his prototype, which can then be improved upon.

Normally, the prototypes can be dispensed with once their usefulness is over.

[Figure: The data warehouse delivery process – IT strategy, business case analysis and education lead to the business requirements and technical blueprint; building the vision is followed by the history load, ad hoc enquiry, automation, requirement evolution and future growth.]

iv. Business Requirements

As discussed earlier, it is essential that the business requirements are fully understood by the data warehouse planner. This ensures that the warehouse is incorporated adequately in the overall setup of the organization. But it is equally essential that a margin of 15-25% for future enhancements, modifications and long-term planning is set apart. This is more easily said than done, because future modifications are hardly clear even to top-level planners, let alone the IT professionals. However, the absence of such leeway may end up making the system too constrained and worthless in the very near future.

Once the business requirements are understood, the following aspects are also to be decided:

i) A logical model to store the data within the data warehouse.

ii) A set of mapping rules – i.e. the ways and means of putting data into and out of the model.

iii) The business rules to be applied.

iv) The format of the queries and the query profile.

Another pitfall is that some of the data may not be available at this stage (because some of the data may get generated only as and when the system is put to use). Normally, using artificially generated data to make up for the unavailable data is not found to be very reliable.

v. Technical Blueprints

This is the stage where the overall architecture that satisfies the requirements is delivered. At this stage, the following items are decided upon.

i) The system architecture for the warehouse.

ii) The server and data mart architecture.

iii) The design of the database.

iv) The data retention strategy.

v) The data backup and recovery mechanism.

vi) Hardware and infrastructure plans.


vi. Building the Vision

Here the first physical infrastructure becomes available. The major infrastructure components are set up, and the first stages of loading and generation of data start. Needless to say, we hasten slowly and start with a minimal set of data. The system becomes operational gradually, over 4-6 months or even more.

vii. History Load

Here the system is made fully operational by loading the required history into the warehouse – i.e. whatever data is available from the previous years is put into the data warehouse to make it fully operational. To take an example, suppose building the vision has been initiated with one year's sales data and is operational. Then the entire previous data – maybe of the previous 5 or 10 years – is loaded. Now the warehouse becomes fully "loaded" and is ready to take on live "queries".

viii. Ad hoc Query

Now we configure a query tool to operate against the data warehouse. The users can ask questions in a typical format (like the number of items sold last month, or the stock level of a particular item during the last fortnight). This is converted into a database query, and the query is answered by the database. The answer is again converted into a suitable form to be made available to the user.

ix. Automation

This phase automates the various operational processes like

i) Extracting and loading of data from the sources.

ii) Transforming the data into a suitable form for analysis.

iii) Backing up, restoration and archiving.

iv) Generating aggregations.

v) Monitoring query profiles.

x. Extending Scope

There is no single mechanism by which this can be achieved. As and when needed, a new set of data may be added, new formats may be included, or even major changes may be involved.


xi. Requirement Evolution

Business requirements will constantly change during the life of the warehouse. Hence, the processes that support the warehouse also need to be constantly monitored and modified. This necessitates that, to begin with, the warehouse should be made capable of capturing these changes and of growing with them. This will extend the life and utility of the system considerably.

In the next two sections, we look at the overall flow of processes and the architecture of a typical data warehouse. The typical data warehouse process follows the delivery system we discussed above. We also keep in mind that data warehouses are typically built with large data volumes (100 GB or more). Cost and time efficiency are vital factors. Hence a perfect tuning of the processes and architecture is essential for ensuring optimal performance of the data warehouse. When you note that typical warehouse queries may run for a few minutes to several hours to elicit an answer, you can understand the importance of these two stages. Unless a perfect fine tuning is done, the efficiency may become too low and the only option available may be to rebuild the system from scratch.

2.3 TYPICAL PROCESS FLOW IN A DATA WAREHOUSE

Any data warehouse must support the following activities:

i) Populating the warehouse (i.e. inclusion of data).

ii) Day-to-day management of the warehouse.

iii) Ability to accommodate changes.

The processes that populate the warehouse have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily / weekly basis, depending on the quantum of data to be incorporated.

The day-to-day management of the data warehouse is not to be confused with maintenance and management of hardware and software. When large amounts of data are stored and new data are continually being added at regular intervals, maintenance of the "quality" of the data becomes an important element.

Ability to accommodate changes implies that the system is structured in such a way as to be able to cope with future changes without the entire system being remodelled. Based on these, we can view the processes that a typical data warehouse scheme should support as follows.


2.3.1 Extract and Load Process

This forms the first stage of the data warehouse. External physical systems, like the sales counters which give the sales data and the inventory systems that give inventory levels, constantly feed data to the warehouse. Needless to say, the format of this external data has to be monitored and modified before loading it into the warehouse. The data warehouse must extract the data from the source systems, load it into its databases, remove unwanted fields (either because they are not needed or because they are already there in the database), add new fields / reference data and finally reconcile it with the other data. We shall see a few more details of these broad actions in the subsequent paragraphs.

A mechanism should be evolved to control the extraction of data, check its consistency etc. For example, in some systems, the data is not authenticated until it is audited. It may also be possible that the data is tentative and likely to change – for example, estimated losses in a natural calamity. In such cases, it is essential to draw up a set of modules which decide when and how much of the available data will actually be extracted.

Having a set of consistent data is equally important. This especially matters when several online systems are feeding the data. When data is being received from several physical locations, unless some type of "tuning up" with respect to time is done, the data becomes inconsistent. For example, sales data for 3-5 P.M. on Thursday may be inconsistent with the same 3-5 P.M. data for Friday. Though it looks trivial, such inconsistencies are thoroughly harmful to the system and, more importantly, very difficult to locate once allowed into the system.

Once data is extracted from the source systems, it is loaded into a temporary data store before it is "cleaned" and loaded into the warehouse. The checks to find whether it is consistent may be complex, and since data keeps changing continuously, errors may go unnoticed unless monitored on a regular basis. In many cases, lost data may be confused with non-existent data. For example, if the purchase details of 10 customers are lost, one may presume that they have not made any purchases at all. Though it is impossible to make a list of all possible inconsistencies, the system should be able to check for such eventualities and, more importantly, correct them automatically. For example, in the case of lost data, if a doubt arises, it may ask for retransmission.

[Figure: Typical process flow in a data warehouse – data sources feed the extract and load process; data movement and transformation take place within the warehouse; data is archived and reloaded as needed; users query the warehouse.]

2.3.2 Data cleanup and Transformation

Data needs to be cleaned up and checked in the following ways

i) It should be consistent with itself.

ii) It should be consistent with other data from the same source.

iii) It should be consistent with other data from other sources.

iv) It should be consistent with the information already available in the data warehouse.

While it is easy to list the requirements of "clean" data, it is more difficult to set up systems that automatically clean up the data. The normal course is to suspect the quality of the data if it does not meet the normal standards of common sense, or if it contradicts data from other sources or data already available in the data warehouse. Normal intuition doubts the validity of such new data, and effective measures like rechecking, retransmission etc. are undertaken. When none of these are possible, one may even resort to ignoring the entire set of data and getting on with the next set of incoming data.

Once we are satisfied with the quality of the data, it is usually transformed into a structure that facilitates its storage in the warehouse. This structural transformation is done basically to ensure operational and query performance efficiency.
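A minimal sketch of such a cleanup step (the field names and rules are assumptions for illustration, not from the text): each incoming record is passed through simple consistency checks before it is transformed and loaded.

# Hypothetical incoming sales records from a source system.
incoming = [
    {"item": "shirt", "qty": 3, "price": 450.0},
    {"item": "shirt", "qty": -2, "price": 450.0},   # fails a sanity check
]

def is_consistent(record):
    # Simple rules standing in for the consistency checks described above:
    # the record must agree with itself and with common sense.
    return record["qty"] >= 0 and record["price"] > 0

clean = [r for r in incoming if is_consistent(r)]
suspect = [r for r in incoming if not is_consistent(r)]
# 'suspect' records would be rechecked or retransmitted rather than loaded.
print(len(clean), len(suspect))   # 1 1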

2.3.3 Backup and Archiving

In a normal system, data within the warehouse is backed up at regular intervals to guard against system crashes, data losses etc. The recovery strategy depends on the type of crash and the amount of loss.

Apart from that, older data needs to be archived. It is normally not required for the day-to-day operations of the data warehouse, but may be needed under extraordinary circumstances. For example, if we normally take decisions based on the last 5 years of data, any data older than this will be archived.
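A brief sketch of such an archiving rule (the 5-year cutoff comes from the example above; the record layout is an assumption):

from datetime import date, timedelta

CUTOFF = date.today() - timedelta(days=5 * 365)   # roughly the last five years

def split_for_archive(rows):
    # Rows newer than the cutoff stay online; older rows go to the archive store.
    online = [r for r in rows if r["sale_date"] >= CUTOFF]
    archive = [r for r in rows if r["sale_date"] < CUTOFF]
    return online, archive

rows = [{"sale_date": date.today()}, {"sale_date": date(2000, 1, 1)}]
online, archive = split_for_archive(rows)   # one row kept online, one archived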


2.3.4 Query Management

This process manages the queries and speeds them up by directing each query to the most effective source. It also ensures that all system resources are used effectively through proper scheduling of execution. It also assists in warehouse management.

The other aspect is monitoring query profiles. Suppose a new type of query is raised by an end user. While the system uses the available resources and query tables to answer it, it will also note the possibility of such queries being raised repeatedly and prepare summary tables to answer them.

One other aspect that the query management process should take care of is ensuring that no single query can affect the overall system performance. Suppose a single query asks for a piece of information that needs exhaustive searching of a large number of tables. This would tie up most of the system resources, thereby slowing down the performance of other queries. This again has to be monitored and remedied by the query management process.
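As an illustrative sketch (the table and column names are assumptions), preparing a summary table for a frequently repeated question lets the query manager answer it without rescanning the detailed data each time:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("shirt", "north", 3), ("shirt", "south", 5), ("shoe", "north", 2)])

# If "total quantity per item" keeps being asked, materialize it once...
conn.execute("CREATE TABLE sales_by_item AS "
             "SELECT item, SUM(qty) AS total_qty FROM sales GROUP BY item")

# ...and redirect the repeated query to the much smaller summary table.
print(conn.execute("SELECT * FROM sales_by_item").fetchall())
# e.g. [('shirt', 8), ('shoe', 2)]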

2.4 ARCHITECTURE FOR A DATA WAREHOUSE

The architecture for a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture. It only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse and the actual details of implementation.

[Figure: Architecture of a data warehouse – operational data and external data feed the load manager; the warehouse manager maintains the meta data, detailed information and summary information; the query manager serves the users.]


Before looking into the details of each of the managers, we can get a broad idea about their functionality by mapping the processes that we studied in the previous section onto the managers. The extract and load processes are taken care of by the load manager. The processes of cleanup and transformation of data, as also of backup and archiving, are the duties of the warehouse manager, while the query manager, as the name implies, takes care of query management.

2.4.1 The Load Manager

The load manager is the system component that performs the operations necessary to support the extract and load processes. It consists of a set of programs written in a language like C, apart from several off-the-shelf tools (readily available program segments).

It performs the following operations:

i) To extract the data from the source(s).

ii) To load the data into a temporary storage device.

iii) To perform simple transformations to map it to the structures of the data warehouse.

Most of these are back-end operations, normally performed after the end of the daily operations, without human intervention.

[Figure: Architecture of the load manager – a data extractor, a copy management tool and a fast loader move data from the source file structures into temporary storage and then into the warehouse data.]

Extracting data from the source depends on the configuration of the source systems, which can normally be expected to be on a LAN or some other similar network. The simple File Transfer Protocol (FTP) should be able to take care of most situations. This data is loaded into a temporary data storage device. Since the sources keep sending data at rates governed by data availability and the speed of the network, it is highly essential that the data is loaded into the storage device as fast as possible. The matter becomes more critical when several sources keep sending data to the same warehouse. Some authors even call the process "fast load" to stress the criticality of time.

The load manager is also expected to perform simple transformations on the data. This is essential because the data received by the load manager from the sources comes in different formats, and the data has to fit a standard format before it is stored in the warehouse database. The load manager should be able to remove unnecessary columns (for example, each source sends details like the name of the item and the code of the item along with the data; when they are being concatenated, the name and code should appear only once, so the extra columns can be removed). It should also convert the data into standard data types of typical lengths, etc.
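A minimal sketch of such a load-time transformation (the field names and the target layout are illustrative assumptions):

# Records arriving from different sources, with a redundant column and
# inconsistently typed quantities.
source_rows = [
    {"item_name": "shirt", "item_code": "S1", "qty": "3"},
    {"item_name": "shirt", "item_code": "S1", "qty": 5},
]

def to_standard(row):
    # Drop the redundant name column (the code is enough once the data
    # is concatenated) and coerce the quantity to a standard integer type.
    return {"item_code": row["item_code"], "qty": int(row["qty"])}

loaded = [to_standard(r) for r in source_rows]
print(loaded)   # [{'item_code': 'S1', 'qty': 3}, {'item_code': 'S1', 'qty': 5}]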

2.4.2 The Warehouse Manager

The warehouse manager is the component that performs all operations necessary to support the warehouse management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data warehouse has been automated.

[Figure: Architecture of the warehouse manager – controlling processes and stored procedures operate on the temporary data storage, the star flake schema and the summary tables, with backup / recovery support.]

The warehouse manager can easily be termed the most complex of the warehouse components, and it performs a variety of tasks. A few of them are listed below.

i) Analyze the data to confirm data consistency and data integrity.

ii) Transform and merge the source data from the temporary data storage into the warehouse.


iii) Create indexes, cross references, partition views etc.

iv) Check for normalization.

v) Generate new aggregations, if needed.

vi) Update all existing aggregations.

vii) Create backups of data.

viii) Archive the data that needs to be archived.

The concept of consistency and integrity checks is extremely important if the data warehouse is to function satisfactorily over a period of time. The effectiveness of the information generated by the warehouse depends on the "quality" of the data available. If new data that is inconsistent with the already existing data is added, the information generated will no longer be dependable. But checking for consistency can be a very tricky thing indeed. It largely depends on the volume and type of data being stored, and at this stage we will simply accept that the warehouse manager is able to take it up successfully.

Once the data is available in different tables, complex transformations have to be done on them to facilitate their merger with the warehouse data. One common transformation is to ensure a common basis of comparison – by reconciling key reference items and rearranging the related items accordingly. The problem is that the related items may be inter-referenced, and hence rearranging them may be a complex process.

The next stage is to transform the data into a format suitable for decision support queries. Normally the bulk of the data is arranged at the centre of the structure, surrounded by the reference data. There are three types of schemas – the star schema, the snowflake schema and the star flake schema. They are dealt with in some detail in a subsequent chapter.

The next stage is to create indexes for the information. This is a time consuming affair. These indexes help to create different views of the data. For example, the same data can be viewed on a daily basis, a quarterly basis or simply on an incremental basis. The ware house manager has to create indexes to facilitate easy accessibility for each of them. At the same time, having a very large number of views can make the addition of new data and the subsequent indexing process time consuming.

The ware house manager also generates summaries automatically. Since most queries select only a subset of the data from a particular dimension, these summaries are of vital importance in improving the performance of the ware house.

Since the type of summaries needed also keeps changing, meta data is used to effect such changes (the concept is dealt with later).
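The following is a small sketch (with assumed column names and figures) of how such a pre-defined summary might be rolled up from detailed sales rows into a week-wise, city-wise summary table of the kind the ware house manager would maintain.

from collections import defaultdict

# Detailed fact rows (city, week, amount) - names and values assumed for illustration.
sales = [
    ("Bangalore", 1, 1200.0), ("Bangalore", 1, 800.0),
    ("Chennai", 1, 950.0), ("Bangalore", 2, 400.0),
]

summary = defaultdict(float)          # (city, week) -> total sales
for city, week, amount in sales:
    summary[(city, week)] += amount   # pre-aggregate so common queries need not scan the facts

for key, total in sorted(summary.items()):
    print(key, total)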

The other operation that the ware house manager should take up is to provide query statistics. The statistics are collected by the query manager as it intercepts any query hitting the database.


2.4.3 Query Manager

This performs all the operations necessary to support the query management process.

[Figure: architecture of the query manager - a query management tool providing query redirection and query scheduling, with procedures to generate views over the detailed information, summary information and meta data.]

The query manager performs the following operations

i) Directs queries to the appropriate table(s)

ii) Schedules the execution of user queries.

2.5 THE CONCEPT OF DETAILED INFORMATION

The idea of a data ware house is to store the detailed information. But obviously all possible details of all the available information cannot be stored online - so the question to be solved is what degree of detail is required; in other words, how much detail is detailed enough for us? There are no fixed answers, and this makes the design process more complex.

2.6 DATA WARE HOUSE SCHEMAS

Star schemas are data base schemas that structure the data to exploit a typical decision support enquiry. When the components of typical enquiries are examined, a few similarities stand out.

i) The queries examine a set of factual transactions - sales for example.

ii) The queries analyze the facts in different ways - by aggregating them on different bases / graphing them in different ways.

The central concept of most such transactions is a "fact table". The surrounding references are called dimension tables. The combination can be called a star schema.


[Figure: a star schema to represent the sales analysis - a central Sales Transactions fact table surrounded by the Suppliers, Products, Time, Location and Customers dimension tables.]

The central Fact data is surrounded by the dimensional data.

The fact table contains the factual information, as collected by the sources. The fact table is the major component of the database. Since the fact data contains data that is used by all the database components, and data keeps getting added to it, it is essential that it is maintained accurately from the beginning. Defining its contents accurately is one of the major focus areas of the business requirements stage. The dimension data is the information used to analyze the facts stored in the fact data. Structuring the information in this way helps in optimizing query performance. The dimension data normally changes with time, as our needs to analyze the fact data change.
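The following is a minimal, hypothetical sketch of such a star schema expressed as SQL tables (created here through Python's sqlite3 module). The table and column names are assumptions chosen to match the sales example; they are not part of the original text.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables surrounding the central fact table (names assumed).
cur.execute("CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT)")
cur.execute("CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, region TEXT)")

# Central fact table: one row per factual sales transaction, referencing the dimensions.
cur.execute("""
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        time_id     INTEGER REFERENCES dim_time(time_id),
        location_id INTEGER REFERENCES dim_location(location_id),
        quantity    INTEGER,
        amount      REAL
    )""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Shirt', 'BrandX')")
cur.execute("INSERT INTO dim_time VALUES (1, '2004-03-24', 'March', 2004)")
cur.execute("INSERT INTO dim_location VALUES (1, 'Bangalore', 'South')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 5, 1250.0)")

# A typical decision-support query: aggregate the facts by a dimension attribute.
cur.execute("""
    SELECT l.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_location l ON f.location_id = l.location_id
    GROUP BY l.region""")
print(cur.fetchall())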

2.7 PARTITIONING OF DATA

In most ware houses, the size of the fact data tables tends to become very large. This leads to several problems of management, backup, processing etc. These difficulties can be overcome by partitioning each fact table into separate partitions.

Most often, the queries tend to be about the recent data rather than old data. We will be more interested in what happened last week or last month than in February two years ago.

Also, the queries themselves tend to be mostly similar in nature. Most of the queries will be of the run-of-the-mill type.

Data ware houses tend to exploit these ideas by partitioning the large volume of data into data sets. For example, data can be partitioned on a weekly / monthly basis, so as to minimize the amount of data scanned before answering a query. This technique allows the data to be scanned to be minimized without the overhead of using an index, and improves the overall efficiency of the system. However, having too many partitions can be counterproductive, and an optimal size of the partitions and number of such partitions is of vital importance.
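A small sketch of the idea, under assumed data: the fact rows are held in monthly partitions, and a query about one month scans only that partition instead of the whole table.

# Fact rows grouped into monthly partitions (keys and rows assumed for illustration).
partitions = {
    "2004-02": [("A01", 3), ("A02", 1)],
    "2004-03": [("A01", 5), ("A03", 2)],
}

def monthly_total(month):
    # Only the relevant partition is scanned; the others are never touched.
    return sum(qty for _, qty in partitions.get(month, []))

print(monthly_total("2004-03"))   # scans one partition, not the full fact table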


Partitioning generally helps in the following ways.

i) Assists in better management of data

ii) Ease of backup / recovery since the volumes are less.

iii) The star schemas with partitions produce better performance.

iv) Since several hardware architectures operate better in a partitioned environment, the overall system performance improves.

2.8 SUMMARY INFORMATION

This area contains all the predefined aggregations generated by the ware house manager. This helps in the following ways.

i) Speeds up the performance of commonly used queries

ii) Need not be backed up, since it can be regenerated afresh if the data is lost.

However, summary data tends to increase the operational costs on the one hand, and needs to be updated every time new data is loaded on the other.

In practice, optimal performance can be achieved in the following manner. Since all types of queries cannot be anticipated beforehand, the summary information covers only the commonly encountered queries. However, any other query need not always be answered afresh from the detailed data; it can often be answered by combining the summary information in different ways.

For example, a system maintains summary information of the sale of its products week-wise in each of the cities. Suppose one wants to know the combined sale in all South Indian cities during last month; one need not start afresh. The summary data available for each South Indian city for the four weeks of last month can be combined to answer the query. Thus, the overall performance of query processing increases manyfold.
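The same idea in a short sketch (city names and figures assumed): the pre-computed week-wise, city-wise summaries are simply added up to answer the monthly, region-wise query without touching the detailed facts.

# Pre-computed summary: (city, week) -> sales last month (values assumed).
weekly_city_sales = {
    ("Bangalore", 1): 1200, ("Bangalore", 2): 900, ("Bangalore", 3): 1100, ("Bangalore", 4): 950,
    ("Chennai", 1): 800,    ("Chennai", 2): 750,   ("Chennai", 3): 820,    ("Chennai", 4): 880,
    ("Delhi", 1): 1500,     ("Delhi", 2): 1600,    ("Delhi", 3): 1400,     ("Delhi", 4): 1550,
}
south_indian_cities = {"Bangalore", "Chennai"}

# Combined sale in all South Indian cities last month, built only from existing summaries.
total = sum(v for (city, week), v in weekly_city_sales.items() if city in south_indian_cities)
print(total)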

But this need not mean that the system's performance itself improves manyfold. The increase in the operational management cost of creating and updating the summary tables eats up a large part of the gain. Thus, it is always essential to strike a balance - there is a number of summaries beyond which they become counterproductive.

2.9 META DATA

This area stores all the meta data definitions used by all processes within the data ware house. Now, what is this meta data? Meta data is simply data about data. Data normally describes objects - their quantity, size, how they are stored etc. Similarly, meta data stores data about how the data (of objects) is stored, etc.

Meta data is useful in a number of ways. It can map data sources to the common view of information within the warehouse. It is helpful in query management, to direct a query to the most appropriate source, etc.

The structure of meta data is different for each process. It means that for each volume of data, there are multiple sets of meta data describing the same volume. While this is a very convenient way of managing data, managing meta data itself is not a very easy task.

2.10 DATA MARTS

A data mart is a subset of the information content of a data ware house, stored in its own data base. The data of a data mart may have been collected through the ware house or, in some cases, directly from the source. In a crude sense, if you consider a data ware house as a wholesale shop of data, a data mart can be thought of as a retailer.

They are normally created along functional or departmental lines, in order to exploit a natural break of data. Ideally, the queries raised in a data mart should not require data from outside the mart, though in some practical cases it may need data from the central ware house (again, compare the wholesale - retail analogy).

Thus the technique of dividing into data marts is to identify those subsets of data which are more or less self-contained and group them into separate marts.


However, all the specialized tools of the ware house cannot be implemented at the data mart level, and hence the specialized operations need to be performed at the central ware house and the data populated to the marts. Also, a single ware house may not be able to support more than a few data marts, for reasons of maintaining data consistency, the lead time needed for populating the marts, etc.

BLOCK SUMMARY

In this chapter, we got an overview of what a data ware house is. We began with the definition of a data ware house and proceeded to see a typical data ware house delivery process. We saw how, starting with the IT strategy of a company, we go through the various stages of a ware house building process, and also how scope for future expansion is made available.

The next stage was to study a typical process flow in a data ware house. This was broadly studied under the heads of extract and load processes, data clean up and transformation, backup and archiving, and query management. We then took a look at a typical ware house architecture. We got ourselves introduced to the concepts of the load manager, the query manager and the ware house manager, and looked into their activities in brief. The next step was to get introduced to the concepts of schemas, data partitions, data summaries and meta data. We also looked at the concept of data marts.

Each of these concepts will be expanded in the coming chapters

SELF EVALUATION - I

1. Define a data ware house.

2. What are the roles of education in a data ware housing delivery process.

3. What is history load?

4. Name the 3 major activities of a data ware house?

5. What is data loading process?

6. What are the different ways in which data is to be consistent?

7. What is archiving?

8. What is the main purpose of query management?

9. Name the functions of the load manager

10. Name any 3 functions of the ware house manager.

11. Name the duties of the query manager.

12. What is meta data?


13. What is a data mart?

14. What is the purpose of summary information.

SELF EVALUATION - II

1. With diagram explain the dataware house delivery process.

2. With diagram explain the architecture of dataware house.

3. With diagram explain the architecture of load manager.

4. With diagram explain the architecture of query manager.

5. With diagram explain the warehouse manager.

ANSWER TO SELF EVALUATION - I

1. Collection of key pieces of information to arrive at suitable managerial decisions.

2. a) to make people comfortable with technology

b) to aid in prototyping.

3. Loading the data of previous years into a newly operational ware house.

4. a) populating the data

b) day to day management

c) accommodating changes.

5. Collecting data from the source, removing unwanted fields, adding new fields / reference data and reconciling with other data.

6. a) consistent with itself

b) consistent with other data from same source.

c) consistent with data from other sources.

d) consistent with data already in the warehouse.

7. Removing old data, not immediately needed, from the ware house and storing it else where.

8. To direct the queries to the most effective sources.

9. a) to extract data from the source.

b) to load data into a temporary storage device.

c) to perform simple transformation

10. a) To analyze data for consistency and integrity.


b) To transform and merge source data into the ware house.

c) create indexes, check normalizations etc.,.

11. a) direct queries to appropriate tables.

b) schedule the execution of queries.

12. Data about data. It helps in data management.

13. A subset of the information content of a ware house, stored in its own database for faster processing.

14. Speeds up the performance of commonly used queries.


Chapter 3

Data Base Schema

BLOCK INTRODUCTION

In this chapter, we look at the concept of schema - a logical arrangement of facts to facilitate storage and retrieval of data. We familiarize ourselves with star flake schemas, fact tables and the ability to distinguish between facts and dimensions. The next stage is to determine the key dimensions that apply to each fact. We also learn that a fact in one context becomes a dimension in a different context, and that one has to be careful in dealing with them. The next stage is to learn to design the fact tables. Several issues like the cost-benefit ratio, the desirable period of retention of data, minimizing the column sizes of the fact table etc. are discussed.

Next we move on to the design of dimension tables and how to represent hierarchies and networks. The other schemas we learn to design are the star flake schema and the multi-dimensional schemas. The other aspect we look into is the concept of query redirection.

A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is only interested in the sales details of that item, whereas a tax practitioner looking at the same data will be interested only in the amounts received by the company and the profits made; he is not worried about the item numbers, part numbers etc. In other words, each of them has his own schema of the same database. The ware house, in turn, should be able to allow each of them to work according to his own schema and get the details needed.

The process of defining a schema involves defining a vision and building it, after a detailed requirement analysis and technical blueprint development.


3.1 STAR FLAKE SCHEMAS

One of the key factors for a data base designer is to ensure that the database is able to answer all types of queries, even those that are not initially visualized by the developer. To do this, it is essential to understand how the data within the database is used.

In a decision support system, which is what a data ware house is basically supposed to provide, a large number of different questions are asked about the same set of facts. For example, given sales data, questions like

i) What is the average sales quantum of a particular item?

ii) Which are the most popular brands in the last week?

iii) Which item has the least turnaround time?

iv) How many customers returned to procure the same item within one month?

can be asked. They are all based on the sales data, but the method of viewing the data to answer the question is different. The answers need to be given by rearranging or cross-referencing different facts.

The basic concept behind the schema (that we briefly introduced in the previous section) is that regardless of how the facts are analyzed, the facts are not going to change. Hence, the facts can be used as a read-only area, and the reference data, which keeps changing over a period of time depending on the type of queries of the customers, will be read / write. Hence, the typical star schema.

[Figure: another star schema, to answer details about customers - customer details at the centre, surrounded by Time, Customer events, Customer accounts and Customer location.]

3.1.1 Which are the Facts and which are the Dimensions?

The star schema looks a good solution to the problem of ware housing. It simply states that one should identify the facts and store them in the read-only area, with the dimensions surrounding that area. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction.

i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.

ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.

iii) Ensure that facts do not include those candidates that are actually dimensions with a set of facts attached to them.

iv) Ensure that dimensions do not include those candidates that are actually facts.

We shall elaborate each with some detail.

LOOK FOR ELEMENTAL TRANSACTION

This step involves understanding the primary objectives for which data is being collected and which transactions define the primary objectives of the business. Depending on the primary business of the company, the facts will change. As per the sales example, when data from the sales outlets keeps coming, it gives details about the items, numbers, amount sold, tax liabilities, the customers who bought the items, etc. It is essential to identify that the sales figures are of primary concern.

To give a contrast, assume that similar data may also come from a production shop. It is essential to understand that the facts about the production numbers are the ones about which the questions are going to be asked. Having said that, we would also note that it is not always easy to identify the facts at a cursory glance.

The next stage is to ask the question whether these facts are going to be modified / operated upon during the process. Again this is a difficult question, because once the facts are in the ware house, they may get changed not only by other transactions within the system, but also from outside the system. For example, things like tax liabilities are likely to change based on government policies, which are external to the system. But the sales volumes do not change once the sales are made. Hence the sales volumes are fit to be considered for inclusion in the fact tables.

It must have become clear by now that identifying the elemental transactions needs an in-depth knowledge of the system for which the ware house is being built, as also of the external environment and the way the data is operated upon. Also, the frequency and mode of attachment of new data to the existing ware house facts is an important factor and needs to be identified at this stage itself.

DETERMINE THE KEY DIMENSIONS THAT APPLY TO EACH FACT

In fact, this is a logical follow up of the previous step. Having identified the facts, one should be able to identify in what ways these facts can be used. More technically, it means finding out which entities are associated with the entity represented in the fact table. For example, entities like profit, turnaround and tax liabilities are associated with the sales entity, in the sense that these entities, when enquired about, would make use of the sales data. But the key problem is to identify those entities that are not directly listed but may become applicable. For example, in the sales data, entities like storage area or transportation costs may not appear directly, but questions like "what is the storage area needed to cater to this trade volume" may be asked.

ENSURE THAT A CANDIDATE FACT IS NOT A DIMENSION TABLE WITH DENORMALISED FACTS

One has to start checking each of the facts and the dimensions to ensure that they match; what appears to be a candidate fact table can indeed be a combination of facts and dimensions. To clear the doubts, look at the following example. Consider the case of a "Customer" field in a sales company. A typical record can be of the following type:

Name of the customer
His address
Dates on which:
    the customer registered with the company
    the customer requested for items
    the items were sent
    bills were sent
    the customer made the payment
    payments were encashed, etc.

The name indicates that it has details about the customer, but most of the details are dates on which certain events took place. As far as the company is concerned, this is the most natural way of storing the facts. Hence, each of the dates should be so represented that each date becomes a row in the fact table. This may slightly increase the size of the database, but it is the most natural way of storing the data as well as retrieving data from it. The star schema for the same is as follows.

[Figure: star schema for the customer example - the operational events / dates form the facts, with the customer and the address of the customer as dimensions.]

One can typically ask queries of the following nature on this type of data.

i. The mean delivery time of items

ii. The normal delay between the delivery of items and receipt of payments

iii. The normal rate of defaults etc.

The key point arising out of the above discussions is as follows.

When data is to be stored, look for items within the candidate fact tables that are actually denormalised tables, i.e. the candidate fact table is a dimension containing repeating groups of factual attributes. For example, in the above case, the fact "dates on which" was actually a set of repeating groups of attributes - "dates of operations".


When such data is encountered, one should ensure that the tables are designed such that the rows will not vary over time. For example, in the above case the customer might have requested for items on different dates. In such a case, when a new date for "item requested" comes up, it should not replace the previous value. Otherwise, as one can clearly see, reports like "how many times did the customer order the items" can never be generated. One simple way of achieving this is to make the items "read only". As and when new dates keep coming, new rows are added.
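A brief sketch of this append-only arrangement (the event names are assumed): each date becomes a new row against the customer, so earlier events are never overwritten and counting queries remain possible.

# Each operational event is appended as a new (customer, event, date) row -
# existing rows are treated as read-only and never replaced.
customer_events = []

def record_event(customer, event, date):
    customer_events.append((customer, event, date))    # add a row; never update in place

record_event("C001", "item requested", "2004-01-05")
record_event("C001", "item requested", "2004-02-11")   # a second request adds a second row
record_event("C001", "items sent",     "2004-02-15")

# Reports like "how many times did the customer order items" stay answerable.
orders = sum(1 for c, e, d in customer_events if c == "C001" and e == "item requested")
print(orders)   # -> 2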

CHECK THAT A CANDIDATE DIMENSION IS NOT ACTUALLY A FACT TABLE

This ensures that key dimensions are not fact tables.

Consider the following example.

Let us elaborate a little on the example. Consider a customer A. If there is a situation where the warehouse is building profiles of customers, then A becomes a fact - against the name A, we can list his address, purchases, debts etc. One can ask questions like "how many purchases has A made in the last 3 months". Then A is a fact. On the other hand, if the data is used to answer questions like "how many customers have made more than 10 purchases in the last 6 months", and one uses the data of A as well as of other customers to give the answer, then customer is being used as a dimension of the sales data (see the table below). The rule is, in such cases, avoid making A a candidate key.

It is left to the candidate to think of examples that make promotions as facts as well as dimensions.

The golden rule is: if an entity can be viewed in 3 or more different ways, it is probably a fact table. A fact table cannot become a key dimension.

Entity       Fact / Dimension    Condition
Customer     Fact                If it appears in a customer profile or customer database
             Dimension           If it appears in the sales analysis of the data warehouse (for ex. no. of customers)
Promotion    Fact                If it appears in the promotion analysis of a data warehouse
             Dimension           In other situations


3.2 DESIGNING OF FACT TABLES

The above listed methods, when iterated repeatedly, will help to finally arrive at the set of entities that go into a fact table. The next question is how big a fact table can be. An answer could be that it should be big enough to store all the facts, while still making the task of collecting data from this table reasonably fast. Obviously, this depends on the hardware architecture as well as the design of the database. A suitable hardware architecture can ensure that the cost of collecting data is reduced by the inherent capability of the hardware; on the other hand, the database design should ensure that whenever data is asked for, the time needed to search for it is minimal. In other words, the designer should be able to balance the value of information made available by the database against the cost of making that data available to the user. A larger database obviously stores more details, so it is definitely useful, but the cost of storing a larger database as well as the cost of searching and evaluating it becomes higher. Technologically, there is perhaps no limit on the size of the database.

How does one optimize the cost-benefit ratio? There are no standard formulae, but some of the following points can be taken note of.

i. Understand the significance of the data stored with respect to time. Only those data that are still needed for processing need to be stored. For example, customer details after a period of time may become irrelevant. Salary details paid in the 1980s may be of little use in analyzing the employee cost of the 21st century, etc. As and when the data becomes obsolete, it can be removed.

(There may arise a special case when somebody asks for a historical fact after, say, 50 years. But such cases are rare and do not warrant maintaining huge volumes of data.)

ii. Find out whether maintaining statistical samples of each of the subsets could be resorted to instead of storing the entire data. For example, instead of storing the sales details of all the 200 towns over the last 5 years, one can store details of 10 smaller towns, five metros, 10 bigger cities and 20 villages. After all, data warehousing is most often resorted to for trends and not the actual figures. The subsets of these individual details can always be extrapolated to get the details, instead of storing the entire data.

iii. Remove certain columns of the data if you feel they are no longer essential. For example, in a railway database the columns of age and sex of the passenger are stored. But to analyse the traffic over a period of time, these two do not really mean much and can be conveniently removed.

iv. Determine the use of intelligent and non intelligent keys.

v. Incorporate time as one of the factors in the data table. This can help in indicating the usefulness of the data over a period of time and the removal of obsolete data.

vi. Partition the fact table. A record may contain a large number of fields, only a few of which are actually needed in each case. It is desirable to group those fields which will be used together into smaller tables and store them separately. For example, while storing data about employees, family details can become a separate table, as also the salary details can be stored in a different one. Normally, when computing an employee's salary, taxes etc., the number of children does not matter.

Now let us look into each of the above in a little more detail in the perspective of data warehousing.

IDENTIFICATION OF PERIOD OF RETENTION OF DATA

Ask a business man, and he says he wants the data to be retained for as long as possible - 5, 10, 15 years; the longer the better. The more data you have, the better the information generated. But such a view of things is unnecessarily simplistic.

One need not have large amounts of data to get accurate results. One should retain relevant data. Having a larger volume of data means larger storage costs and more effort in compiling the data. But does it give more accurate information? Not always.

The database designer should try a judicious mix of detail of data with degrees of aggregation. One should retain only the relevant portions of the data, and only for the appropriate time. Consider the following.

If a company wants to have an idea of the reorder levels, details of the sales of the last 6 months to one year may be enough. The sales pattern of 5 years ago is unlikely to be relevant today.

If an electricity company wants to know the daily load pattern, the loading pattern of about one month may be enough - but one should take care to look at the appropriate month: the load of the winter months may be different from the load of the summer months.

So it is essential to determine the retention period for each function, but once it is drawn up, it becomes easy to decide on the optimum volume of data to be stored. After all, data warehousing deals more with patterns and statistics than with actual figures.

DETERMINE WHETHER SAMPLES CAN REPLACE THE DETAILS

As discussed earlier, instead of storing the entire data, it may be sufficient to store representative samples. An electricity company, instead of storing the load pattern of all the 5 lakh houses of a city, can store data about a few hundred houses each of the lower income, middle income and posh homes. A similar approximation can be done for different classes of industries, and the results can be scaled by a suitable factor. If the subsets are drawn properly, the exercise should give information as accurate as the actual data would have given, at a much lower effort.

Select the appropriate columns. The more the number of columns in the data, the more will be the storage space and also the search time. Hence it is essential to identify all superfluous data and remove it. Typically, status indicators, intermediate values and bits of reference data replicated for query performance can be recommended for removal.

i. Examine each attribute of the data

ii. Find out if it is a new factual event ?

iii. Whether the same data is available else where, even if in a slightly different form ?

iv. Is the data only a control data ?

In principle, all data that is present in some form elsewhere, or can be derived from the data available at other places, can be deleted. Also, all intermediate data can be deleted, since it can always be reproduced or may not be needed at all.

For example, in a tax data base, items like the tax liability can always be derived given the income details. Hence the column on tax liability need not be stored in the warehouse.

MINIMISE THE COLUMN SIZES IN THE FACT TABLE

Efforts need to be made to save every byte of data while representing the facts. Since the data in a typical ware house environment will typically run into a few million rows, even one byte of data saved in each row could save large volumes of storage. Efforts should be made to ensure proper data representation, removal of unnecessary accuracy and elimination of derived data.

Incorporate Time into the Fact Table

Time can be incorporated in different ways into a fact table. The most straightforward way is to create a foreign key to store actual physical dates; dates like 24 March 2004 can be stored as they are. However, in some cases it may be more desirable to store the dates relative to a starting date. For example, 1 January of each year may be the start of a table, so that each date entry indicates the number of days passed since then: 24 indicates 24th January, while 32 indicates 1st February, and so on. Note that some computation is required, but the storage required is reduced drastically. In some cases, where the actual dates are really not needed and only a range of dates is enough, the column can be skipped altogether. For example, in cases like sales analysis over a period of time, the fact that a sale was made between 1st January and 31st January is enough and the actual dates are not important; then all data pertaining to the period can be stored in one table, without bothering to store the actual dates.

Thus, to summarise, the possible techniques for storing dates (a small sketch follows this list) are:

i. Storing the physical date


ii. Storing an offset from a given date

iii. Storing a date range.
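The three techniques can be sketched as follows; the starting date and range labels are assumed for the example.

from datetime import date

sale_date = date(2004, 2, 1)

# i.  Store the physical date as it is.
physical = sale_date.isoformat()                              # '2004-02-01'

# ii. Store an offset from a given starting date (here, 1 January of the year).
offset = (sale_date - date(sale_date.year, 1, 1)).days + 1    # 1 Jan -> 1, 1 Feb -> 32

# iii. Store only a date range label when the exact day does not matter.
date_range = f"{sale_date.year}-{sale_date.month:02d}"        # '2004-02', i.e. "some day in February"

print(physical, offset, date_range)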

3.3 DESIGNING DIMENSION TABLES

After the fact tables have been designed, it is essential to design the dimension tables. However, the design of dimension tables need not be considered a critical activity, though a good design helps in improving the performance. It is also desirable to keep the volumes relatively small, so that the restructuring cost will be less.

Now we see some of the commonly used dimensions.

Star Dimension

They speed up query performance by denormalising reference information into a single table. They presume that the bulk of the queries coming in analyze the facts by applying a number of constraints to a single dimension.

For example, the details of sales from a store can be stored in horizontal rows, and a query selects one or a few of the attributes. Suppose a cloth store stores the details of its sales one below the other and questions like "how many white shirts of size 85" were sold in one week" are asked. All that the query has to do is apply the relevant constraints to get the information.

This technique works well in situations where there are a number of entities, all related to the key dimension entity.

[Figure: one example of a star dimension - a denormalised Product dimension with the columns Unit, Section, Department, Name, Color, Size and Cost.]

The method may not work well where most of the columns are not accessed often. For example, each query asks only about either the name "shirt" or the color "blue" etc., but not things like a blue shirt of size 90" with cost less than Rs. 500.

When a situation arises where only a few columns are accessed often, it is also possible to store those columns in a separate dimension.


Hierarchies and Networks

There are many instances where it is not possible to denormalise all data into relations as detailed above. You may note that a star dimension is easy where all the entities are linked to a key by a one-to-one relation. But if there is a many-to-many relation, such single-dimensional representation is not possible, and one has to resort either to multidimensional storage (network) or a top-down representation (hierarchy). Out of these, the data that is likely to be accessed more often is denormalised into a star product table, as detailed in the above section. All the rest is stored in a snow flake schema.

However, there is another aspect that should be taken care of. In many cases, the dimensions themselves may not be static. At least some of them, if not all, vary over time. This is particularly true for dimensions that use hierarchies or networks to group basic concepts, because the business will probably change the way in which it categorises the dimension over a period of time.

Consider the following example. In a retailing scenario, the product dimension typically contains a hierarchy used to categorise products into departments, sections, business units etc. As the business changes over the years, the products are re-categorised.

Take a more specific example. A department store has taken several steps over the years to upgrade and modify the quality of its men's wear. Now it wants to know whether those efforts were successful, how far and so on. One simple query would be to compare the sales of the present year with the sales of 10 or 15 years ago and hence draw the conclusions. But the definition of men's wear itself might have changed. Things like T-shirts and jeans, which are called "uni-wear" today, were called men's wear previously. So several queries, each looking at a separate set of products, have to be raised and then combined to produce a single answer. To take care of such eventualities, the dimensions in the dimension table have to come with date ranges - the periods over which the rows are valid. For example, over the date range of, say, 1985-1995, T-shirts were categorised as men's wear. When, after 1995, they came to be categorized as uni-wear, a separate row is inserted for the period when T-shirts are categorized as uni-wear (say from 1995 till date). This would ensure that the changes made to dimensions are reflected in the networks.

Sometimes, it may be necessary to answer queries which are not about exact days / dates, but about definitions like "the first week of each month". One would like to compare the sales of the first week of this month with the sales in the first weeks of the previous months (obviously, sales during the first week will be brisker than later in the month). However, in such cases it is desirable not to have ranges like "first week"; instead, have queries on dates 1 through 7 of each month. The above method of simply inserting new date-ranged rows to take care of modifications leads to a problem. Over a period of time, the dimension table grows to sizes where not only are large amounts of memory involved and management becomes difficult, but, more importantly, queries start to take much longer to execute. In such cases, one should think of partitioning the table horizontally (say, data of the previous 10 years into a separate table) and creating a combined view. The clear indication of the need is when the full-table scan of the dimension table starts taking an appreciable amount of time.


3.4 DESIGNING THE STAR-FLAKE SCHEMA

A star flake schema, as we have defined previously, is a schema that uses a combination of denormalised star and normalised snow flake schemas. They are most appropriate in decision support data warehouses. Generally, the detailed transactions are stored within a central fact table, which may be partitioned horizontally or vertically. A series of combined database views is created to allow the user access tools to treat the fact table partitions as a single, large table.

The key reference data is structured into a set of dimensions. These can be referenced from the fact table. Each dimension is stored in a series of normalised tables (snow flakes), with an additional denormalised star dimension table.

But the problem lies elsewhere. It may not be easy to structure all entities in the model into specific sets of dimensions. A single entity / relationship can be common across more than one dimension. For example, the price of a product may be different at different locations. Then, instead of locating the product and then the price directly, one has to take the intersection of the product and the locality and then get the corresponding price. It may also be that the same product is priced differently at different times (say seasonal and off-seasonal prices); then the model may appear as follows.

[Figure: pricing model of a store - basket transaction and basket item at the centre, linked to store, region, product, department, time, price and business unit.]

The basic concept behind designing star flake schemas is that entities are not strictly defined, but a degree of cross-over between dimensions is allowed. This is the way one comes across things in the real world environment. But at the same time, one should not provide an all-pervasive intersection of schemas. It is essential to keep the following in mind.

i) The number of intersecting entities should be relatively small.

ii) The intersecting entities should be clearly defined and understood within the business.

[Figure: a database of star flake schema - it typically stores the sales details of a retail store, possibly of dress materials. A Sales transactions fact table is linked to snow flake dimensions (department, business unit, products, style, size, color, time, week, month, summer sales, Easter sales, locations, region) and to denormalised star dimensions for products, time and locations.]

However, a star flake schema not only takes considerable time to design, but is also likely to keep changing often.

3.5 QUERY REDIRECTION

One of the basic requirements for the successful operation of a star flake schema (or any schema, for that matter) is the ability to direct a query to the most appropriate source. Note that once the available data grows beyond a certain size, partitioning becomes essential. In such a scenario it is essential that, in order to optimize the time spent on querying, the queries are directed to the appropriate partitions that store the data required by the query.

The basic method is to design the access tool in such a way that it automatically decides the locality to which the query is to be redirected. We now discuss in more detail some guidelines on the style of query formation.

Fact tables can be combined in several ways using database views. Each of these views can represent a period of time - say one view for every month, or one for every season, etc. Each of these views should be able to get a union of the required facts from different partitions. The query can be built by simply combining these unions.
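A sketch of this in SQL (run here through Python's sqlite3 module), with assumed partition tables: two monthly fact partitions are combined behind a database view, so a query written against the view is transparently served by the union of the partitions.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two horizontal partitions of the same fact table (names assumed).
cur.execute("CREATE TABLE sales_2004_01 (item TEXT, amount REAL)")
cur.execute("CREATE TABLE sales_2004_02 (item TEXT, amount REAL)")
cur.executemany("INSERT INTO sales_2004_01 VALUES (?, ?)", [("Shirt", 100.0), ("Trouser", 250.0)])
cur.executemany("INSERT INTO sales_2004_02 VALUES (?, ?)", [("Shirt", 120.0)])

# A combined view presents the partitions as one large table.
cur.execute("""
    CREATE VIEW sales_q1 AS
        SELECT item, amount FROM sales_2004_01
        UNION ALL
        SELECT item, amount FROM sales_2004_02""")

cur.execute("SELECT item, SUM(amount) FROM sales_q1 GROUP BY item")
print(cur.fetchall())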

Another aspect is the use of synonyms. For example, the same fact may be viewed differently by different users - "sales of year 2000" is the sales data for the sales department, whereas it is viewed as the off-load by the production department. It can also be used by the auditor under a different name. If possible, the query support system should support these synonyms to ensure proper redirection.

Views should also be able to combine vertically partitioned tables to ensure that all the columns are made available to the query. However, the trick is to ensure that only a few of the queries need to see columns across the vertical partitions - because that is definitely a time consuming exercise. The vertical partitioning is to be done in such a way that most of the data a query needs is available in a single partition.

The same arguments hold good for queries that need to process several tables simultaneously.

3.6 MULTI DIMENSIONAL SCHEMAS

Before we close, we see the interesting concept of multi dimensions. This is a very convenient method of analyzing data when it goes beyond the normal tabular relations.

For example, a store maintains, for each item it sells, a table of its sales over a month in each of its 10 outlets.

[Table: sales of item 1 - outlet numbers (1, 2, 3, 4, ...) across the columns and dates (1, 2, 3, 4, ...) down the rows.]

This is a 2-dimensional table. On the other hand, if the company wants the data of all items sold by its outlets, it can be done simply by superimposing the 2-dimensional tables for each of these items (item 1, item 2, item 3 and so on) - one behind the other. Then it becomes a 3-dimensional view.

Then the query, instead of looking for a 2-dimensional rectangle of data, will look for a 3-dimensional cuboid of data.

There is no reason why the dimensioning should stop at 3 dimensions. In fact, almost all queries can be thought of as approaching a multi-dimensioned unit of data from a multi-dimensioned volume of the schema.
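A tiny sketch of the idea, with assumed figures: the 2-dimensional date x outlet tables of each item are stacked into a 3-dimensional structure, and a query picks out a cuboid of it.

# cube[item][date][outlet] -> quantity sold; a 3-dimensional view built by
# stacking the 2-dimensional (date x outlet) table of each item (figures assumed).
cube = {
    "item1": {1: {1: 5, 2: 3}, 2: {1: 4, 2: 6}},
    "item2": {1: {1: 2, 2: 7}, 2: {1: 1, 2: 0}},
}

# A query carves out a cuboid: all items, date 1, outlets 1 and 2.
cuboid_total = sum(cube[item][1][outlet] for item in cube for outlet in (1, 2))
print(cuboid_total)   # 5 + 3 + 2 + 7 = 17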

A lot of designing effort goes into optimizing such searches.

BLOCK SUMMARY

The chapter began with the definition of schema - a logical arrangement of facts that helps in the storage and retrieval of data. The concepts of star flake schemas and fact tables were discussed. The basic method of distinguishing between facts and dimensions was discussed. It was also indicated that a fact in one context can become a dimension in another and vice versa. We also learnt to determine the key dimensions that apply to each fact, and touched upon the methods of designing the fact tables and dimension tables and of representing hierarchies and networks.

The concepts of star flake schema, query redirection and multi dimensional schemas were also discussed.

SELF EVALUATION - I

1. What is a schema?

2. Distinguish between facts and dimensions?

3. Can a fact become a dimension and vice versa?

4. What basic concept defines the size of a fact table?

5. What is the importance of period of retention of data?


6. Name the 3 methods of incorporating time into the data?

7. Define a star – flake schema?

8. What is query redirection?

ANSWER TO SELF EVALUATION - I

1. A logical arrangement of facts that facilitates ease of storage and retrieval.

2. A fact is a piece of data that does not change with time, whereas a dimension is a description which is likely to change.

3. Yes, if the viewers objective changes.

4. That it should be big enough to store all the facts without compromising on the speed of query processing.

5. It is the period for which data is retained in the warehouse. Later on, it is archived.

6. a) storing the physical date.

b) store an offset from a given date.

c) store a date range.

7. It is a combination of denormalised star and normalised snow flake schemas.

8. Sending the query to the most appropriate part of the ware house.

SELF EVALUATION - II

1. With diagram explain star flake schemas in detail.

2. Explain the designing of fact tables in detail.

3. With diagram explain multidimensional schemas.


Chapter 4

Partitioning Strategy

BLOCK INTRODUCTION

In this chapter, we look into the trade-offs of partitioning. Partitioning is needed in any large data ware house to ensure that performance and manageability are improved. It can help the query redirection to send the queries to the appropriate partition, thereby reducing the overall time taken for query processing.

Partitions can be horizontal or vertical. In horizontal partitioning, we simply put the first few thousand entries in one partition, the second few thousand in the next and so on. This can be done by partitioning by time, wherein all data pertaining to the first month / first year is put in the first partition, the second one in the second partition and so on. The other alternatives can be based on different sized segments, partitioning on other dimensions, partitioning on the size of the table and round robin partitions. Each of them has certain advantages as well as disadvantages.

In vertical partitioning, some columns are stored in one partition and certain other columns of the same row in a different partition. This can again be achieved either by normalization or row splitting. We will look into their relative trade-offs.

Partitioning can also be by hardware. This is aimed at reducing bottlenecks and maximizing CPU utilization.

We have seen in the previous chapters, in different contexts, the need for partitioning. There are a number of performance-related issues as well as manageability issues that decide the partitioning strategy. To begin with, we accept that it has to be resorted to for the simple reason of the bulk of data that is normally handled by any ware house.


4.1 HORIZONTAL PARTITIONING

This essentially means that the table is partitioned after the first few thousand entries, the next few thousand entries, and so on. This is because, in most cases, not all the information in the fact table is needed all the time. Thus horizontal partitioning helps to reduce the query access time, by directly cutting down the amount of data to be scanned by the queries.

The most common methodology would be to partition based on the time factor - each year or each month etc. can be a separate partition. There is no reason that the partitions need to be of the same size. However, if there is too much variation in size between the different partitions, it may affect the performance parameters of the warehouse; as such, one should consider alternative ways of partitioning and not go by the period itself as the only deciding factor.

a) Partition by Time into Equal Segments:

This is the most straightforward method of partitioning - by months or years etc. This will help if the queries often come regarding the fortnightly or monthly performance / sales etc.

The advantage is that the slots are reusable. Suppose we are sure that we will no longer need the data of 10 years back; then we can simply delete the data of that slot and use it again.

Of course, there is a serious drawback in the scheme if the partitions tend to differ too much in size. The number of visitors visiting a hill station, say in the summer months, will be much larger than in the winter months, and hence the size of the segment should be big enough to take care of the summer rush. This, of course, would mean wastage of space in the winter-month partitions.

[Figure: partitioning tables into same-sized segments - hill resort details kept in equal partitions for each month (January, February, ..., December) of Year 1, Year 2, Year 3.]


b) Partitioning by Time into Different Sized Segments

[Figure: partitioning by time into different-sized segments - hill resort details for Year 1, Year 2, Year 3, with larger partitions for the busier months.] The figure gives the details: since more occupancy is reported in the summer months of March, April and May, we use much larger partitions for them.

This is a useful technique to keep the physical tables small and also the operating costs low.

The problem is to find a suitable partitioning strategy. In all cases, the solution may not be as obvious as in the case of hill station occupancy. It may also happen that the sizes have to be varied over a period of time. This would also lead to movement of large portions of data within the warehouse over a period of time. Hence, careful consideration should be given to the likely increase in overall costs due to these factors before adopting this method.

c) Partitioning on other Dimension

Data collection and storage need not always be partitioned based on time, though it is a very safe and relatively straightforward method. Partitioning can be based on the different regions of operation, different items under consideration or any other such dimension. It is beneficial to look into the possible types of queries one may encounter before deciding on the dimension - suppose most of the queries are likely to be about the region-wise performance, region-wise sales etc.; then having the region as the dimension of partition is worthwhile. On the other hand, if most often we are interested in the total performance of all regions, the total sales of a month or the total sales of a product etc., then region-wise partitioning could be a disadvantage, since each such query will have to move across several partitions.

There is a more important problem - suppose our basis of partition, the dimension itself, is going to change in future. Suppose we have partitioned based on the regions, but at a future date the definition of region itself changes - 2 or more regions are redefined. Then we end up building the entire fact table again, moving the data in the process. This should be avoided at all costs.

d) Partition by the Size of the Table

In certain cases, we will not be sure of any dimension on which partitions can be made. Neither the time nor the products or regions etc. serve as a good guide, nor are we sure of the type of queries that we are likely to encounter frequently. In such cases, it is ideal to partition by size: keep loading the data until a prespecified amount of storage is consumed, then create a new partition. However, this creates a very complex situation, similar to simply dumping objects in a room without any labeling - we will not be able to know what data is in which partition. Normally, meta data (data about data) is needed to keep track of the identity of the data stored in each of the partitions.

e) Using Round Robin Partitions:

Once the warehouse is holding its full amount of data, if a new partition is required, it can be created only by reusing the oldest partition. Then meta data is needed to note the beginning and ending of the historical data.

This method, though simple, may land us in trouble if the sizes of the partitions are not the same. Special techniques to hold the overflowing data may become necessary.
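A small sketch of round robin reuse with the accompanying meta data (the partition count and naming are assumed): once all slots are full, the oldest partition is emptied and reused, and the meta data records what period each slot now holds.

from collections import deque

NUM_PARTITIONS = 3
slots = deque(maxlen=NUM_PARTITIONS)         # oldest slot falls out when a new one is added
metadata = {}                                # slot -> period held (the "beginning and ending")

def load_period(period, rows):
    if len(slots) == NUM_PARTITIONS:
        reused = slots[0]                    # the oldest partition is about to be reused
        metadata.pop(reused, None)
    slot = f"partition_{period}"
    slots.append(slot)
    metadata[slot] = period                  # meta data notes which historical period is where
    # the rows would be written into 'slot' here

for p in ["2004-01", "2004-02", "2004-03", "2004-04"]:
    load_period(p, rows=[])
print(list(slots), metadata)                 # the oldest slot has been reused for 2004-04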

4.2 VERTICAL PARTITIONING

As the name suggests, a vertical partitioning scheme divides the table vertically - i.e. each row is divided into 2 or more partitions.

Consider the following table:

Student name   Age   Address   Class   Fees paid   Marks scored in different subjects

Now we may need to split this table because of any one of the following reasons:

i. We may not need to access all the data pertaining to a student all the time. For example, we may need either only his personal details like age, address etc., or only the examination details of marks scored etc. Then we may choose to split them into separate tables, each containing data only about the relevant fields. This will speed up accessing.

ii. The number of fields in a row becomes inconveniently large, with each field itself made up of several subfields etc. In such a scenario, it is always desirable to split it into two or more smaller tables.

The vertical partitioning itself can be achieved in two different ways: (i) normalization and (ii) row splitting.

4.2.1 Normalisation

The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated, it does not lead to anomalies of data (the student is advised to refer to any book on data base management systems for details, if interested). The idea is to ensure that, when combined, the data available is consistent.

However, in data warehousing, one may even tend to break a large table into several "denormalized" smaller tables. This may lead to lots of extra space being used, but it helps in an indirect way - it avoids the overheads of joining the data during queries.

To make things clear consider the following table

The original table is as follows:

Student name   Age   Address   Class   Fees paid   Marks scored in different subjects

We may split it into 2 tables, vertically:

Student name   Age   Address   Class   Fees paid

Student name   Class   Marks scored in different subjects


Note that the fields of student name and class are repeated, but that helps in reducing repeated join operations, since the normally used combination of student name and marks scored is available together in one of the tables.

With only two tables, it may appear to be a trivial case of savings, but when several large tables would otherwise have to be repeatedly joined, it can lead to large savings in computation time.
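A short sketch of that vertical split (field names taken from the example above): the student rows are divided into two tables that both carry the student name, so the frequently used combination can be read without a join.

students = [
    {"name": "Anita", "age": 20, "address": "Davangere", "class": "BSc-IT", "fees_paid": 5000, "marks": {"DWDM": 78}},
    {"name": "Ravi",  "age": 21, "address": "Shimoga",   "class": "BSc-IT", "fees_paid": 4500, "marks": {"DWDM": 65}},
]

# Table 1: personal / fee details.  Table 2: name, class and marks (name and class are
# repeated deliberately, so the commonly used fields are available without a join).
personal = [{k: s[k] for k in ("name", "age", "address", "class", "fees_paid")} for s in students]
marks    = [{k: s[k] for k in ("name", "class", "marks")} for s in students]

print(personal[0])
print(marks[0])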

4.2.2 Row Splitting

The second method of splitting, row splitting, is shown in fig. 3.4 below:

The method involves identifying the not-so-frequently used fields and putting them into another table. This would ensure that the frequently used fields can be accessed more often, at a much lower computation time.

It can be noted that row splitting may not reduce or increase the overall storage needed, but normalization may involve a change in the overall storage space needed. In row splitting, the mapping is 1:1, whereas normalization may produce one-to-many relationships.
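And a matching sketch of row splitting: the less frequently used fields go into a second table, with a strict 1:1 correspondence between the two tables (the field grouping is assumed for the example).

rows = [
    {"name": "Anita", "class": "BSc-IT", "marks": 78, "age": 20, "address": "Davangere", "fees_paid": 5000},
]

frequent_fields   = ("name", "class", "marks")                  # accessed often
infrequent_fields = ("name", "age", "address", "fees_paid")     # rarely needed alongside marks

# Row splitting: every original row yields exactly one row in each split table (1:1 mapping).
hot_table  = [{f: r[f] for f in frequent_fields} for r in rows]
cold_table = [{f: r[f] for f in infrequent_fields} for r in rows]

print(hot_table[0], cold_table[0])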

[Figure: the original table and the tables obtained after normalization, with 1:1 and many:1 mappings between them.]


[Fig. 3.4: Row splitting - the original table is split into two tables with a 1:1 mapping between them.]

4.3 HARDWARE PARTITIONING

Needless to say, the data warehouse design process should try to maximize the performance of the system. One of the ways to ensure this is to optimize the data base design with respect to a specific hardware architecture. Obviously, the exact details of optimization depend on the hardware platform. Normally, the following guidelines are useful:

i. Maximize the processing, disk and I/O operations.

ii. Reduce bottlenecks at the CPU and I/O

The following mechanisms become handy.

4.3.1 Maximising the Processing and Avoiding Bottlenecks

One of the ways of ensuring faster processing is to split the data query into several parallel queries, convert them into parallel threads and run them in parallel. This method will work only when there is a sufficient number of processors, or sufficient processing power, to ensure that they can actually run in parallel. (Again, note that to run five threads it is not always necessary to have five processors. But to ensure optimality, even a smaller number of processors should be able to do the job, provided they can do it fast enough to avoid bottlenecks at the processor.)
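A minimal sketch of the idea using Python threads (the sub-query work is simulated and the partition data is assumed): one query is split into sub-queries, each run in parallel over its own partition, and the partial results are combined at the end.

import threading

partitions = {"p1": [3, 5, 2], "p2": [7, 1], "p3": [4, 4]}   # assumed data per partition
partial_results = {}

def run_subquery(name, rows):
    partial_results[name] = sum(rows)        # each thread scans only its own partition

threads = [threading.Thread(target=run_subquery, args=(n, r)) for n, r in partitions.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(partial_results.values()))         # the sub-results are combined into one answer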

Shared architectures are ideal for such situations, because one can be almost sure that sufficient processing power is available most of the time. A typical shared architecture looks as follows.

[Figure: a typical shared architecture - queries are split into subqueries and sent over a network to several processors, each of which can access the shared disks.]


Of course, in such a networked environment, where each of the processors is able to access data on several active disks, several problems of data contention and data integrity need to be resolved. Those aspects will not be discussed at this stage.

4.3.2 Striping Data Across the Nodes

This mechanism distributes the data by dividing a large table into several smaller units and storing them on each of the disks (the architecture is the same as above). These sub-tables need not be of equal size, but are distributed so as to ensure optimum query performance. The trick is to ensure that the queries are directed to the respective processors, which access the corresponding data disks to service the queries.

It may be noted that in such a scenario there is an overhead of about 5-10% to divide the queries into subqueries, transport them over the network, and so on.

Also, the method is unsuitable for smaller data volumes, since in such a situation the overheads tend to dominate and bring down the overall performance.

Also, if the distribution of data across the disks is not proper, it may lead to an inefficient system, and we may have to redistribute the data in such a scenario.
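A minimal Python sketch of striping rows across nodes and routing a query to the owning node is given below; the hash-based placement rule and the number of nodes are illustrative assumptions (in practice the distribution may be tuned for query performance, as noted above).

```python
# Striping a large table across several disks/nodes and directing a query
# only to the stripe that holds the relevant rows.
NUM_NODES = 3
stripes = {node: [] for node in range(NUM_NODES)}

rows = [("dealer_%d" % i, i * 100) for i in range(10)]   # (dealer, sales)

for dealer, sales in rows:
    node = hash(dealer) % NUM_NODES          # decide which node stores the row
    stripes[node].append((dealer, sales))

def query_dealer(dealer):
    """Route the query to the single node that owns this dealer's rows."""
    node = hash(dealer) % NUM_NODES
    return [r for r in stripes[node] if r[0] == dealer]

print(query_dealer("dealer_4"))
```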


4.3.3 Horizontal Hardware Partitioning

This technique spreads the processing load by horizontally partitioning the fact table into smaller segments and physically storing each segment on a different node. When a query needs to access several partitions, the access is done in a way similar to the above methods.

If the query is parallelized, then each subquery can run on a different node, as long as the total number of subprocesses does not exceed the number of available nodes.

This technique will minimize the traffic on the network. However, if most of the queries pertain to a single data unit or processor, we may run into another type of problem. Since the data unit or the processor in question has limited capabilities, it becomes a bottleneck. This may affect the overall performance of the system. Hence, it is essential to identify such units and try to distribute the data into several units so as to redistribute the load.

Before we conclude, we point out that several parameters, like the size of the partition, the key about which the partition is made, the number of parallel devices, etc., affect the performance. Naturally, a larger number of such parallel units improves the performance, but at a much higher cost. Hence, it is essential to work out the minimum size of the partitions that brings out the best performance from the system.

BLOCK SUMMARY

In this chapter, we started familiarizing ourselves with the need for partitioning. It greatly helps in ensuring better performance and manageability.

We looked at the concepts of horizontal and vertical partitioning. Horizontal partitioning can be done based on time or the size of the block, or both. One can think of storing, for example, the data of each month in one block. This simple method, however, may become wasteful if the amount of data in each month is not the same. The solution is to have different sized partitions, if we know beforehand the amount of data that goes into each of the partitions. Partitioning need not always be on time; it can be on other dimensions as well. It can also be of a round robin sort. Each of these methods has its own merits and demerits.

Vertical partitioning can be done by either normalization or row splitting. We took examples to understand the concepts involved. We also discussed hardware partitioning and the issues involved.

SELF EVALUATION - I

1. What is horizontal partitioning?

2. What is vertical partitioning?


3. Name one advantage and one disadvantage of equal segment partitioning?

4. What is the concept of partitioning on dimensions?

5. Name the disadvantage of partitioning by size?

6. Name the two methods of vertical splitting?

7. What is the need for hardware partitioning?

8. What is parallelizing a query?

ANSWER TO SELF EVALUATION - I

1. The first few entries are in the first block, the second few in the second block, and so on.

2. A few columns are in one block, some other columns in another block, though they belong to the same row.

3. a) Slots are reusable.

b) If the amount of data is varying, it is wasteful.

4. Partitioning can be on any dimension like region, unit size, article etc.,.

5. Searching for a given data becomes very cumbersome.

6. Normalization and row splitting.

7. It helps to

a) optimize processing operations

b) reduce bottlenecks at CPU and I/O.

8. Dividing a query into sub queries and running them in parallel using threads.

SELF EVALUATION - II

1. With diagrams explain the types of partitioning in detail.

2. With example explain the concept of normalization in detail.


Chapter 5

Aggregations

BLOCK INTRODUCTION

In this chapter, we look at the need for aggregation. It is performed to speed up common queries, and the speed-up of the queries should more than offset the cost of aggregation. We first satisfy ourselves that whenever we expect similar types of queries repeatedly arriving at the warehouse, some homework can be done beforehand, instead of processing each query on the fly (as it comes). This means we partially process the data and create summary tables, which become usable for the commonly encountered queries. Of course, an uncommon query still has to be processed on the fly.

Of course, the design of the summary tables is a very important factor that determines the efficiency of operation. We see several guidelines to assist us in the process of developing useful summary tables. We also look at certain thumb rules that guide us in the process of aggregation. Aggregation is performed to speed up the normal queries and, obviously, the cost of creating and managing the aggregations should be less than the benefit of speeding up the queries. Otherwise, the whole exercise turns out to be futile.

5.1 THE NEED FOR AGGREGATION

Data aggregation is an essential component of any decision support data warehouse. It helps us to ensure cost-effective query performance, which in other words means that the costs incurred to get the answers to a query would be more than offset by the benefits of the query answer. Data aggregation attempts to do this by reducing the processing power needed to process the queries. However, too much aggregation would only lead to unacceptable levels of operational costs.

Too little aggregation may not improve the performance to the required levels. A fine balancing of the two is essential to meet the requirements stated above. One thumb rule that is often suggested is that about three out of every four queries would be optimized by the aggregation process, whereas the fourth will take its own time to get processed.

The second, though minor, advantage of aggregations is that they allow us to get the overall trends in the data. While looking at individual data such overall trends may not be obvious, whereas aggregated data helps us draw certain conclusions easily.

5.2 DEFINITION OF AGGREGATION

Most of the common queries will analyze

i) Either a subset of the available data

ii) Combination (aggregation) of the available data

Most of the queries can be answered only by analyzing the data in several dimensions. Thus, simple questions like "how many mobiles were sold last month?" are not often asked. Rather, questions like "how many more mobiles can be sold in the next six months?" are asked. To answer them, one will have to analyze the available database shown in Fig. 4.1 on parameters like:

i) The income of the population

ii) Their occupation and hence the need for communication

iii) Their age groups

iv) Social trends etc.,.

In a simple scenario, to draw the above conclusions, one should be able to get the data about the population of the area, which may be in the following format, and hence draw the necessary conclusions by traversing along the various dimensions, selecting the relevant data and finally aggregating suitably.

One simple way to identify the number of would-be mobile users is to identify that section of the population with income above a threshold, with a particular family size, whose professions need frequent travelling, and preferably of the younger age group; these could be the potential users.

Note that a simple query for each of these, run on the database, produces a rather complicated set of data. The final answer is obtained by combining these sets of data in a suitable format.


[Fig. 4.1: A typical database to process mobile sales - population data with dimensions such as age group, income, profession, family size, location and religion.]

But a detailed look into the setup tells us that by properly arranging the queries and processing them in optimal formats, one could greatly reduce the computations needed. For example, one need not repeatedly search for all citizens above 18 years of age, all families with incomes greater than 15,000 per month, etc. One can simply prepare these tables beforehand, and they can be used as and when required. As and when the data changes, the summaries (like the number of citizens above 18 years of age, families with incomes greater than 15,000, etc.) need to be changed.

The advantage is that the bulk of the subqueries are carried out before the actual execution of the query itself. This reduces the time delay between raising the query and the results being made available (though the total computation time may not be less than if the entire query were to be answered on the fly).

The drawback is that, in many cases, the summaries (or sub-queries, if you want to call them that) are closely coupled to the type of queries being raised, and will have to be changed if the queries change.

The other advantage is that these pre-aggregated summaries also allow us to look at specific trends more easily. The summaries highlight the trends and provide an overall view of the picture at large, rather than isolated views.
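A minimal Python sketch of this idea is given below; the population records, the thresholds and the field names are illustrative assumptions. The summaries are computed beforehand, and the query itself only combines them.

```python
# Summaries prepared ahead of time, so the common query is a cheap combination.
population = [
    {"name": "A", "age": 25, "income": 22000, "travels": True},
    {"name": "B", "age": 17, "income": 30000, "travels": True},
    {"name": "C", "age": 40, "income": 12000, "travels": False},
    {"name": "D", "age": 31, "income": 18000, "travels": True},
]

# Prepared in the background and refreshed whenever the data changes.
adults        = {p["name"] for p in population if p["age"] >= 18}
high_income   = {p["name"] for p in population if p["income"] > 15000}
frequent_trav = {p["name"] for p in population if p["travels"]}

# The actual query now only intersects the pre-computed sets.
potential_mobile_users = adults & high_income & frequent_trav
print(potential_mobile_users)      # {'A', 'D'}
```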


5.3 ASPECTS TO BE LOOKED INTO WHILE DESIGNING THE SUMMARY TABLES

The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered. In other words, partial answers to the query are already made available. For example, in the above cited example of the mobile market, if one expects

i) the citizens above 18 years of age,

ii) with salaries greater than 15,000, and

iii) with professions that involve travelling to be the potential customers, then every time the query is to be processed (maybe every month or every quarter), one will have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries, beforehand, and then combine them as and when the query is raised. It can be noted that the summaries can be prepared in the background (or when the number of queries running is relatively small) and only the final aggregation needs to be done on the fly.

Summary tables are designed by following the steps given below:

i) Decide the dimensions along which aggregation is to be done.

ii) Determine the aggregation of multiple facts.

iii) Aggregate multiple facts into the summary table.

iv) Determine the level of aggregation and the extent of embedding.

v) Design time into the table.

vi) Index the summary table.

i. Determine the Aggregation Dimensions

Summary tables should be created so as to make full use of the existing schema structures. In other words, the summary tables should continue to retain all the dimensions that are not being aggregated. This technique is sometimes referred to as "subsuming the dimension".

The concept is to ensure that all those dimensions that do not get modified due to aggregation continue to be available in the summary table. No doubt, this would ensure that the flexibility of the summary table is maintained for as long as possible.

In some cases, the summarizing may also be done partially. For example, we may like to know the number of people residing in each of the localities, but we may not be interested in all the localities, only in a few privileged localities. In such a case, summarizing is done in respect of only those localities. If any query pertaining to the other localities arises, one will have to go back to the primary data and get the details on the fly.

ii. Determine the Aggregation of Multiple Values

The objective is to include in the summary table any aggregated value that can speed up the query processing. If the query uses more than one aggregated value on the same dimension, combine these common values into a set of columns on the same table. If this looks complicated, look at the following example.

Suppose there is a summary table of sales being set up. The data available, of course, is the daily sales. These data can be used to create a summary table of weekly sales. Suppose the query may also need details about the highest daily sales and the lowest daily sales.

A query like "indicate the week where the weekly sales were good, but one or more days registered very low sales" would be an example.

Note that the weekly sales as well as the high/low sales details can be summarized on the dimension of sales.

In such a situation, the summary table can contain columns that refer to i) weekly sales, ii) the highest daily sale of the week and iii) the lowest daily sale of the week. This ensures that we do not go about computing the same set of data repeatedly.
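A minimal Python sketch of such a summary table is given below; the daily sales figures and the query thresholds are illustrative assumptions.

```python
# Aggregating daily sales to a weekly summary that also keeps the highest
# and lowest daily sale of each week as extra columns.
daily_sales = {          # day number -> sales value
    1: 120, 2: 95, 3: 240, 4: 60, 5: 180, 6: 75, 7: 210,
    8: 140, 9: 155, 10: 90, 11: 300, 12: 110, 13: 85, 14: 130,
}

summary = {}
for day, value in daily_sales.items():
    week = (day - 1) // 7 + 1
    row = summary.setdefault(week, {"weekly_sales": 0,
                                    "highest_daily": value,
                                    "lowest_daily": value})
    row["weekly_sales"] += value
    row["highest_daily"] = max(row["highest_daily"], value)
    row["lowest_daily"]  = min(row["lowest_daily"], value)

# "Good week, but one or more very low days" is now answered from the summary.
print([w for w, r in summary.items()
       if r["weekly_sales"] > 900 and r["lowest_daily"] < 80])   # [1]
```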

However, too many such aggregated values in the same summary table are not desirable. Every such new column would bring down the performance of the system, both while creating the summary table and while operating on it. Though no precise number of such aggregated columns is available, it is essential to look at the amount of time needed to handle / create such new columns in comparison with the associated improvement in performance.

iii. Aggregate the Multiple Facts into the Summary Table

This again is a very tricky issue. One should think of the possibility of amalgamating a number of related facts into the same summary table, if such an amalgamation is desirable. The starting point is to look at the queries that are likely to come up quite often. If the queries look at the same set of facts repeatedly, then it is desirable to place these facts into a single summary table. Consider a case where queries are expected to repeatedly ask for the amount of sales, their cost, the profits made and the variation w.r.t. the previous week's sales. The ideal method is to combine them into a single summary table, so that the actual aggregation effort at the time of query processing is reduced.

As discussed in the previous section, here also one should ensure that too many facts are not combined together, since such a move can reduce the overall performance instead of improving it.


iv. Determine the Level of Aggregation and the Extent of Embedding

Aggregating a dimension at a specific level implies that no further detail is available in that summary. It is also essential that a huge number of summary tables are not created, as they tend to be rather counterproductive. Since summary tables are produced to ensure that repeated computations are reduced, and since the creation of summary tables itself involves a certain amount of computation, not more than 250-300 such tables are recommended.

A few thumbrules could be of use.

i) As far as possible, aggregate at a level below the level required by the frequently encountered queries. This would ensure that some flexibility is available for aggregation but the amount of aggregation to be done on the fly is a minimum. However, the ratio of the number of rows in the data to the aggregated rows should be optimal, i.e. the number of computed (aggregated) rows should not be high w.r.t. the independent data.

ii) Whenever the above condition cannot be satisfied, i.e. a table ends up with too many aggregated rows, try to break the summary table into two tables.

It may be noted that the summary tables need to be recreated every time the basic data changes (since in such a situation the summary also changes). This aspect of creating summary tables consumes quite an amount of time, particularly if the policy is to use non-intelligent keys within the summary table.

Normally, non-intelligent keys are used to avoid the need to restructure the fact data if the key is changed in future; put the other way, whenever the key is changed, the organization of the data table needs to be changed if the organization depends on intelligent keys. But a summary table is, in any case, rebuilt whenever one or more facts change, and hence nothing is gained by using non-intelligent keys. Hence, one can as well use intelligent keys.

v. Design Time into the Summary Table

Remember that, in the case of fact tables, it was suggested that time can be stored in them to speed up the operations. Calculations of weekly, monthly, etc. details can straightaway take place if time is incorporated in the summary table. Similarly, one can make use of the concept of time to speed up the operations.

i) A physical date can be stored: This is the simplest and possibly a very convenient way of storing time. The physical date is normally stored within the summary table and, preferably, intelligent keys are used.

Whenever the dimension of time is aggregated (say daily, weekly, etc.), the data about the actual time (say 12 PM, etc.) is lost. Suppose you store each sales record with the actual time of the sale; if the aggregation is done on, say, the total sales per week, then the value of the actual time becomes useless. In such cases, either special care is to be taken to preserve the dates, or the dates need not be stored in the first place.

ii) Store an offset from a start date or start time: Again, if and when actual dates / times are needed, they may be computed starting from the offset. But it may involve a considerable amount of computation when a large number of such dates / times are to be computed.

Again, as in the previous case, they are lost in the summary tables.

iii) Store a date range: Instead of storing the actual dates / times, one can store the range of dates within which the values are applicable. Again, converting them to actual times is time consuming. The quality of the access tools governs the use of date ranges within the summary. If the tools are not very efficient, then the use of date ranges within the summary tables can become counterproductive. (A small sketch of this option is given below.)
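A minimal Python sketch of storing a date range in a summary row is given below; the field names and the weekly granularity are illustrative assumptions.

```python
# One summary row carries the date range it covers instead of the individual
# dates, which are lost on aggregation.
from datetime import date, timedelta

def weekly_summary_row(week_start: date, weekly_sales: float) -> dict:
    """Build a summary row covering the range [week_start, week_end]."""
    week_end = week_start + timedelta(days=6)
    return {"range_start": week_start,
            "range_end": week_end,
            "weekly_sales": weekly_sales}

row = weekly_summary_row(date(2005, 5, 2), 980.0)
print(row["range_start"], "to", row["range_end"])   # 2005-05-02 to 2005-05-08
```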

vi. Index the Summary Tables

One should always consider using a high level of indexation on summary tables. As the aim is to direct as many queries as possible to the summary tables, it is worth investing some effort in indexing. Since the summary tables are normally of reasonable size, indexing is most often worthwhile. However, if most of the queries scan all the rows of the tables, then such indexes may end up being only an overhead.

vii. Identifying the Summary Tables that need to be created

This is a very tricky, but a very important, issue in summarizing. To some extent it depends on the designer and the extent to which the summaries can be seamlessly created. But a few techniques can be helpful.

One should examine the levels of aggregation within each key dimension and determine the most likely combinations of interest. Then consider these combinations based on the likely queries one can expect in the given environment. Each of these overlapping combinations becomes the content of a summary table.

This can be repeated till a sufficient number of summary tables are created. However, most often, as the system gets used and the normally encountered query profiles become clear, some of these summary tables are dropped and new ones are created.

The size of the summary table is also an important factor. Since the philosophy behind a summary table is to ensure that the amount of data to be scanned is kept relatively small, having a very large summary table defeats this very purpose. Normally, summary tables with a higher degree of aggregation tend to be smaller, and vice versa. The usefulness of a summary table is limited by the average amount of data to be scanned by the queries.


BLOCK SUMMARY

Aggregation is the concept of combining raw data to create summary tables which become useful when processing the normally encountered queries. There is an overhead involved in creating these summary tables. For the concept of aggregation to be beneficial, the cost saved due to the speeding up of the queries should more than offset the cost of creating and maintaining these summary tables.

We discussed the basic process of creating summary tables as

i) Determining the dimensions to be aggregated.

ii) Determining the aggregation of multiple values.

iii) Determining the aggregation of multiple facts.

iv) Determining the level of aggregation and embedding.

v) Incorporating time into the summary table and

vi) Indexing summary tables.

SELF EVALUATION - I

1. What is the need for aggregation?

2. What is a summary table?

3. What is the trade off involved in aggregation?

4. What is subsuming a dimension?

5. What is the golden rule that determines the level of aggregation?

6. Is using intelligent keys in a summary table desirable?

7. Which method of storing time is more appropriate in aggregation?

8. What is the role of indexing?

ANSWER TO SELF EVALUATION - I

1. It helps in speeding up the processing of normal queries.

2. Partially aggregated table which helps in reducing the time of scanning of normal queries.

3. The cost of aggregation should be less than the cost saved due to reduced scanning of normal queries.

4. After aggregation, the summary table will retain all the dimensions that have not been aggregated.

5. Aggregate one level below the level required for known common queries.


6. Yes

7. Store physical dates directly into the summary table.

8. It helps in choosing the appropriate summary table.

SELF EVALUATION - II

1. With suitable example, explain the concept of aggregation.

2. Explain the designed steps for summary tables in detail.


Chapter 6

Data Mart

BLOCK INTRODUCTION

In this chapter, a brief introduction to the concept of data marts is provided. A data mart stores a subset of the data available in the warehouse, so that one need not always scan through the entire content of the warehouse. It is similar to a retail outlet. A data mart speeds up the queries, since the volume of data to be scanned is much less. It also helps to have tailor-made processes for different access tools, to impose control strategies, etc.

The basic problem is to decide when to have a data mart and when to go back to the warehouse. The thumb rule is to make use of the natural splits in the organization / data, or to have one data mart for each of the different access tools, etc. Each of the data marts is to be provided with its own subset of detailed information and also its own summary information; obviously, this is a costly affair. The cost of maintaining the additional hardware and software has to be offset by the faster query processing of the data mart. We also look at the concept of copy management tools.

6.1 THE NEED FOR DATA MARTS

In a crude sense, if you consider a data warehouse as a storehouse of data, a data mart is a retail outlet of data. Searching for any data in a huge storehouse is difficult, but if the data is available, you should positively be able to get it. On the other hand, in a retail outlet, since the volume to be searched is small, you are able to access the data fast. But it is possible that the data you are searching for may not be available there, in which case you have to go back to your main storehouse to search for the data.


Coming back to technical terminology, one can say the following are the reasons for which data marts are created.

i) Since the volume of data scanned is small, they speed up the query processing.

ii) Data can be structured in a form suitable for a user access tool.

iii) Data can be segmented or partitioned so that it can be used on different platforms, and also different control strategies become applicable.

There are certain disadvantages also

i. The cost of setting up and operating data marts is quite high.

ii. Once a data strategy is put in place, the data mart formats become fixed. It may be fairly difficult to change the strategy later, because the data mart formats also have to be changed.

Hence, there are two stages in setting up data marts.

i. To decide whether data marts are needed at all. The above listed facts may help you to decide whether it is worthwhile to set up data marts or to operate from the warehouse itself. The problem is almost similar to that of a merchant deciding whether he wants to set up retail shops or not.

ii. If you decide that setting up data marts is desirable, then the following steps have to be gone through before you can freeze the actual strategy of data marting.

a) Identify the natural functional splits of the organization.

b) Identify the natural splits of data.

c) Check whether the proposed access tools have any special data base structures.

d) Identify the infrastructure issues, if any, that can help in identifying the data marts.

e) Look for restrictions on access control. They can serve to demarcate the warehouse details.

Now, we look into each one of the above in some detail.

A thorough look at the business organization helps us to know whether there is an underlying structure that can help us decide on data marting. The business can be split based on regional organizations, product organizations, the type of data that becomes available, etc. For example, when the organization is set up in several regions and the data warehouse gets details from each of these regions, one simple way of splitting is to set up a data mart for each of these regions. Probably the details or forecasts of one region are available on each of these data marts.


Similarly, if the organization is split into several departments, each of these departments can become the subject of one data mart.

If such physical splits are not obvious, one can even think of the way data needs to be presented: one data mart for daily reports, one for monthly reports, etc.

Once you have drawn up a basis for splitting, you should try to justify it based on the hardware costs, business benefits and feasibility studies. For example, it may appear most natural to split based on the regional organizations, but setting up, say, 100 data marts for 100 regions and interconnecting them may not be a very feasible proposition.

Also, there is a "load window" problem. The data warehouse can be thought of as a huge volume, and each data mart provides a "window" to it. Obviously, each window provides only a partial view of the actual data. The larger the number of data marts, the more such windows there will be, and there will be problems of maintaining the overlaps, managing data consistency, etc. In a large data warehouse, these problems are definitely not trivial and, unless managed in a professional manner, can lead to data inconsistencies. The problem with these inconsistencies is that they are hard to trace and debug.

Then the other problem always remains: if the split of the organization changes for some reason, then the whole structure of data marting needs to be redefined.

[Figure: Organization of data marts. Input data from Department (A) and Department (B) feeds the warehouse (detailed information, meta data and summary information), from which Data Mart 1 and Data Mart 2 are populated, each accessed through its own front ends.]


6.2 IDENTIFY THE SPLITS IN DATA

The issues involved here are similar to those in the splits of organizations. The type of data coming in, or the way it is stored, helps us to identify the splits. For example, one may be storing the consumer items data differently from the capital assets data, or the data may be collected and stored dealer-wise. In such cases, one can set up a mart for each of the identifiable portions. The trade-offs involved are exactly identical to what was discussed in the previous section.

6.3 IDENTIFY THE ACCESS TOOL REQUIREMENTS

Data marts are required to support the internal data structures that support the user access tools. Data within those structures is not actually controlled by the warehouse, but the data is to be rearranged and updated by the warehouse. This arrangement (called populating of data) is suitable for the existing requirements of data analysis. While the requirements are few and not very complicated, any populating method may be suitable, but as the demands increase (as happens over a period of time) the populating methods should match the tools used.

As a rule, this rearrangement (or populating) is to be done by the warehouse after acquiring the data from the source. In other words, the data received from the source should not directly be arranged in the form of structures as needed by the access tools. This is because each piece of data is likely to be used by several access tools which need different populating methods. Also, additional requirements may come up later. Hence each data mart is to be populated from the warehouse based on the access tool requirements. This will ensure data consistency across the different marts.
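A minimal Python sketch of populating a mart from the warehouse is given below; the warehouse rows, the "region" split and the summary computed are illustrative assumptions.

```python
# Populating a regional data mart from the warehouse (not from the source),
# so that every mart sees consistent data.
warehouse_facts = [
    {"region": "North", "month": "Jan", "sales": 1200},
    {"region": "South", "month": "Jan", "sales": 800},
    {"region": "North", "month": "Feb", "sales": 1350},
    {"region": "South", "month": "Feb", "sales": 950},
]

def populate_mart(region: str) -> dict:
    """Copy the region's subset plus its own summary into the mart."""
    detail = [row for row in warehouse_facts if row["region"] == region]
    summary = {"region": region,
               "total_sales": sum(row["sales"] for row in detail)}
    return {"detail": detail, "summary": summary}

north_mart = populate_mart("North")
print(north_mart["summary"])      # {'region': 'North', 'total_sales': 2550}
```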

6.4 ROLE OF ACCESS CONTROL ISSUES IN DATA MART DESIGN

This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls as to who can access which part of the data. The easiest case is where the data is partitioned so clearly that a user of each partition cannot access any other data. In such cases, each of these partitions can be put in a data mart and the user of each can access only his data.

In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants to get an overall view of the data, suitable aggregations can be generated.

However, in certain other cases the demarcation may not be so clear. In such cases, a judicious analysis of the privacy constraints is needed, so that the privacy of each data mart is maintained.


Design of Data Mart

Design based on function

Data marts, as described in the previous sections, can be designed based on several splits noticeable either in the data, in the organization, or in privacy laws. They may also be designed to suit the user access tools. In the latter case, there is not much choice available for design parameters. In the other cases, it is always desirable to design the data mart to suit the design of the warehouse itself. This helps to maintain maximum control on the database instances, by ensuring that the same design is replicated in each of the data marts. Similarly, the summary information on each of the data marts can be a smaller replica of the summary of the data warehouse itself.

It is a good practice to ensure that each summary is designed to utilize all the dimension data in the star flake schema.

[Figure: Three-tier organization of data marts - the warehouse holding detailed and summary information, each data mart holding its own detail and summary, and the consoles of Mart 1 and Mart 2 as the front end.]

In a simple scheme, the summary tables from the data warehouse may be directly copied to the data mart (or the relevant portions), but the data mart is so structured that it operates only on those dimensions that are relevant to the mart.

The second case is when we populate a database design specific to a user access tool. In such a situation, we may probably have to transform the data into the required structure. In some cases, this could simply be a transformation into different database tables, but in other cases new data structures that suit each of the access tools need to be created. Such a transformation may need several degrees of data aggregation using stored procedures.

Before we close this discussion, one warning note needs to be emphasized. You may have noticed that data marting indirectly leads to aggregation, but it should not be used as an alternative to aggregation, since the costs are higher and the data marts will still not be able to provide the "overview" capability of aggregations.

BLOCK SUMMARY

The chapter introduced us to the concept of the data mart, which can be compared to a retail outlet. It speeds up the queries, but can store only a subset of the data, and one will have to go back to the warehouse for any additional data. It also helps to form data structures suitable for the user access tools and to impose access control strategies, etc.

Normally it is possible to identify splits in the functioning of the organization or in the data collected. The ideal method is to use these splits to divide the data between different data marts. Access tool requirements and control strategies can also dictate the setting up of data marts. Data marts need to have their own detailed information and summary information stored in them.

SELF EVALUATION - I

1. Define data marting

2. Name any 4 reasons for data marting

3. Name any 4 methods of splitting the data between data marts.

4. Which is the best schema for data marts?

5. Is data cleaning an important issue in data marts?

ANSWER TO SELF EVALUATION - I

1. Creating a subset of data for easy accessing


2. a) Speed up queries by reducing the data to be scanned.

b) To suit specific access tools.

c) To improve control strategies.

d) Segment data onto different platforms.

3. a) Use Natural split in organization.

b) natural split in data

c) to suit access tools

d) to suit access control issues.

4. Star flake schema

5. No, it is taken care of by the main ware house.

SELF EVALUATION - II

1. With neat diagram explain the organization of Datamart in detail.


Chapter 7

Meta Data

BLOCK INTRODUCTION

Meta data is data about data. Since the data in a data warehouse is both voluminous and dynamic, it needs constant monitoring. This can be done only if a separate set of data about the data is stored. This is the purpose of meta data.

Meta data is useful for data transformation and load, for data management, and for query generation.

This chapter introduces a few of the commonly used meta data functions for each of them.

Meta data, by definition, is "data about data" or "data that describes the data". In simple terms, the data warehouse contains data that describes different situations. But there should also be some data that gives details about the data stored in the data warehouse. This data is "metadata". Metadata, apart from other things, will be used for the following purposes:

1. Data transformation and Loading

2. Data Management

3. Query Generation

7.1 DATA TRANSFORMATION AND LOADING

This type of metadata is used during data transformation. In a simple data warehouse, this type of metadata may not be very important and may not even be present. But as more and more sources start feeding the warehouse, the necessity for metadata is felt. It is also useful in matching the formats of the data source and the data warehouse: the greater the mismatch between the two, the greater the need for this type of metadata. Also, when the data being transformed from the source changes, instead of changing the data warehouse design itself, the metadata can capture these changes and automatically generate the transformation programs.

For each source data field, the following information is required.

Source field

Unique identifier

Name

Type

Location

System

Object

The fields are self-evident. The type field indicates details like the storage type of the data.

The destination field needs the following meta data.

Destination

Name

Type

Table name

The other information to be stored is the transformations that need to be applied to convert the source data into the destination data.

This needs the following fields.

Transformation(s)

Name

Language

Module name

Syntax

The attribute language is the name of the language in which the transformation program is written.

The transformation can be a simple conversion of type (from integer to real, char to integer, etc.) or may involve fairly complex procedures.
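A minimal Python sketch of holding this metadata as simple records is given below. The field names follow the lists above; the example values and the sample transformation are illustrative assumptions.

```python
# Plain records for the source, destination and transformation metadata.
from dataclasses import dataclass

@dataclass
class SourceField:
    unique_identifier: str
    name: str
    type: str
    location_system: str      # the "Location: System" entry
    location_object: str      # the "Location: Object" entry

@dataclass
class DestinationField:
    name: str
    type: str
    table_name: str

@dataclass
class Transformation:
    name: str
    language: str             # language in which the transformation is written
    module_name: str
    syntax: str               # the expression or call to apply

# Illustrative example: converting a character date from a billing system.
src = SourceField("SRC001", "cust_dob", "char(8)", "billing_sys", "CUSTOMER")
dst = DestinationField("date_of_birth", "date", "dim_customer")
tfm = Transformation("char_to_date", "PL/SQL", "load_utils",
                     "to_date(cust_dob, 'YYYYMMDD')")
print(tfm.syntax)
```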


It is evident that most of these transformations are needed to take care of the difference between the format in which data is sent from the source and the format in which it is to be stored in the warehouse. There are other complications like different types of mappings, the accuracy of the data available/stored, and so on. The transformation and mapping tools are able to take care of all this. But the disadvantage is that they are quite costly on one hand, and the resultant code need not be optimal on the other.

7.2 DATA MANAGEMENT

Meta data should be able to describe data as it resides in the data warehouse. This will help the warehouse manager to control data movements. The purpose of the metadata is to describe the objects in the database. Some of the descriptions are listed here.

Tables
   Columns
      Names
      Types

Indexes
   Columns
      Name
      Type

Views
   Columns
      Name
      Type

Constraints
   Name
   Type
   Table
   Columns


The metadata should also allow for cross-referencing of columns of different tables which may contain the same data, whether or not they have the same names. It is equally important to be able to keep track of a particular column as it goes through several aggregations. To take care of such a situation, metadata can be stored in the following format for each of the fields:

Field

Unique identifier

Field name

Description

The unique identifier helps us to distinguish a particular column from other columns of the same name.

Similarly, for each table, the following information is to be stored:

Table

Table name

Columns

Column name

Reference identifier

Again, the names of the fields are self-explanatory. The reference identifier helps to uniquely identify the table. Aggregations are similar to tables, and hence the following format is used:

Aggregation

Aggregation name

Columns

Column name

Reference identifier

Aggregation.

There are certain functions that operate on the aggregations; some of them are:

Min

Max

Average

Sum etc..


Their functions are self explanatory.

Partitions are subsets of tables. They need the following metadata associated with them

Partition

Partition name

Table name

Range allowed

Range contained

Partition key

Reference identifier

The names are again self explanatory.

7.3 QUERY GENERATION

Meta data is also required to generate queries. The query manager uses the metadata to build a history of all queries run and to generate a query profile for each user, or group of users.

We simply list a few of the commonly used meta data items for the query. The names are self-explanatory.

Query

Table accessed

Column accessed

Name

Reference identifier

Restrictions applied

Column name

Table name

Reference identifier

Restrictions


Join criteria applied

Column name

Table name

Reference identifier

Column name

Table name

Reference identifier

Aggregate function used

Column name

Reference identifier

Aggregate function

Group by criteria

Column name

Reference identifier

Sort direction

Syntax

Resources

Disk

Read

Write

Temporary

Each of these metadata items needs to be used with a specific syntax. We shall not be going into the details here.

Before we close, we shall note one point of caution: wherever possible, metadata should be gathered in the background, so that the gathering does not interfere with normal processing.


BLOCK SUMMARY

We have familiarized ourselves with several meta data operations.

SELF EVALUATION

1. Explain all the steps of Data transformation and loading.

2. In detail explain data management.

3. In detail explain query generation.


Chapter 8

Process Managers

BLOCK INTRODUCTION

In this chapter, we look at certain software managers that keep the data warehouse going. We have seen on several previous occasions that the warehouse is a dynamic entity and needs constant maintenance. This was originally done by human managers, but software managers have taken over recently. We look at two categories of managers:

System managers and

Process managers.

The systems managers themselves are divided into various categories.

Configuration manager to take care of system configurations.

Schedule manager to look at scheduling aspects.

Event managers to identify specific “Events” and activate suitable corrective actions.

Database manager and system manager to handle the various user-related aspects.

A backup recovery manager to keep track of backups.

Amongst the process managers, we have the

Load manager to take care of source interaction, data transformation and data load.

Warehouse manager to take care of data movement, meta data management and performance monitoring.

Query manager to control query scheduling and monitoring.


We understand the functions of each of them in some detail.

In this chapter, we briefly discuss the system and warehouse managers. The managers are specific pieces of software and the underlying processes that perform certain specific tasks. A manager can also be looked upon as a tool. Sometimes, we use the terms manager and tool interchangeably.

8.1 NEED FOR MANAGERS FOR A DATA WAREHOUSE

Data warehouses are not just large databases. They are complex environments that integrate many technologies. They are not static, but keep changing continuously, both in content and in structure. Thus, there is a constant need for maintenance and management. Since huge amounts of time, money and effort are involved in the development of data warehouses, sophisticated management tools are always justified in the case of data warehouses.

When computer systems were in their initial stages of development, there used to be an army of human managers who went around doing all the administration and management. But such a scheme became both unwieldy and prone to errors as the systems grew in size and complexity. Further, most of the management principles were ad hoc in nature and were subject to human errors and fatigue.

In such a scenario, the need for complex tools which can go about managing without human intervention was felt, and the concept of "manager tools" came up. But one major problem with such managers is that they need to interact with humans at some stage or the other, and a lot of care has to be taken to allow for this human intervention. Further, when different tools are used for different tasks, the tools should be able to interact amongst themselves, which brings the concept of compatibility into the picture. Taking these factors into account, several standard managers have been devised. They basically fall into two categories:

1. System Management Tools

2. Data Warehouse Process Management Tools.

We shall briefly look into the details of each of these categories.

8.2 SYSTEM MANAGEMENT TOOLS

This class of managers includes the following:

1. Configuration Managers

2. Schedule Managers

3. Event Managers


4. Database Managers

5. Back Up Recovery Managers

6. Resource and Performance Monitors.

We shall look into the working of the first five classes, since the last type of manager is less critical in nature.

8.2.1 Configuration Manager

This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, several concepts like machine configuration, compatibility, etc. have to be taken care of, as well as the platform on which the system operates. Most configuration managers have a single interface to allow control of all these issues.

8.2.2 Schedule Manager

Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system will have its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse. Hence it is more desirable to have specially designed schedulers to manage the operations. Some of the capabilities that such a manager should have include the following:

Handling multiple queues

Interqueue processing capabilities

Maintain job schedules across system outages

Deal with time zone differences

Handle job failures.

Restart failed jobs

Take care of job priorities

Management of queues

Notify a user that a job is completed.

It may be noted that this list of features is not exhaustive; on the other hand, not all schedule managers need to support all of these features either. A small sketch of a scheduler with priority queues and job restart is given below.
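The sketch below illustrates two of the duties listed above (running jobs from priority queues and restarting failed jobs) in Python; the job names, priorities and retry limit are illustrative assumptions.

```python
# A toy schedule manager: priority queue of jobs, restart on failure,
# notification on completion.
import heapq

class ScheduleManager:
    def __init__(self, max_retries=2):
        self.queue = []            # entries: (priority, sequence, job_name, attempts)
        self.seq = 0
        self.max_retries = max_retries

    def submit(self, job_name, priority=10):
        heapq.heappush(self.queue, (priority, self.seq, job_name, 0))
        self.seq += 1

    def run_all(self, run_job):
        while self.queue:
            priority, _, job, attempts = heapq.heappop(self.queue)
            if run_job(job):
                print("completed:", job)            # notify the user
            elif attempts < self.max_retries:
                heapq.heappush(self.queue, (priority, self.seq, job, attempts + 1))
                self.seq += 1                        # restart the failed job later
            else:
                print("failed permanently:", job)

mgr = ScheduleManager()
mgr.submit("overnight data load", priority=1)
mgr.submit("daily report generation", priority=5)
mgr.run_all(lambda job: True)   # pretend every job succeeds
```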


While supporting the above cited jobs, the manager also needs to take care of the following operations, which may be transparent to the user:

Overnight processing

Data load

Data transformation

Index creation

Aggregation creation

Data movement

Back up

Daily scheduling

Report generations

Etc…

8.2.3 Event Manager

An event is defined as a measurable, observable occurrence of a defined action. If this definition seems quite vague, it is because it encompasses a very large set of operations. The event manager is a piece of software that continuously monitors the system for the occurrence of an event and then takes any action that is suitable (note that the event is a "measurable and observable" occurrence). The action to be taken is also normally specific to the event.

Most often, the term event refers to an error, a problem or at least an uncommon occurrence. The event manager starts actions that either correct the problem or limit the damage.

A partial list of the common events that need to be monitored is as follows:

Running out of memory space.

A process dying

A process using excessive resources

I/O errors

Hardware failure


Lack of space for a table

Excessive CPU usage

Buffer cache hit ratios falling below thresholds etc.

It is obvious that, depending on the hardware, the platforms and the type of data being stored, these events can keep changing.

The most common way of resolving the problem is to call a procedure that takes the corrective action for the respective event. Most often, the problem resolution is done automatically and human intervention is needed only in extreme cases. One golden rule while defining the procedures is that solving one event should not, as far as possible, produce "side effects". Suppose a table has run out of space; the procedure to take care of this should provide extra space elsewhere, but the process should not end up snatching away the space from some other table, which may cause problems later on. However, it is very difficult to define and implement such perfect procedures.

The other capability of the event manager is the ability to raise alarms. For example, when space is running out, it is one thing to wait for the event to occur and then take corrective action. But the manager can also raise an alarm after, say, 90% of the space is used up, so that suitable corrective action can be taken early.
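A minimal Python sketch of such an alarm rule is given below; the 90% threshold follows the example above, while the metric names and the corrective message are illustrative assumptions.

```python
# An event manager rule: raise an alarm at 90% space usage, and call the
# corrective procedure only when the event itself occurs.
def check_space(used_mb: float, total_mb: float):
    ratio = used_mb / total_mb
    if ratio >= 1.0:
        # the event itself: invoke the corrective procedure
        print("EVENT: table space exhausted - adding extra space elsewhere")
    elif ratio >= 0.9:
        # early warning so that action can be taken before the event occurs
        print("ALARM: %.0f%% of table space used" % (ratio * 100))

check_space(used_mb=920, total_mb=1000)   # ALARM: 92% of table space used
```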

8.2.4 Database Manager

The database manager will normally also have a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed as follows:

Ability to add/remove users

o User management

o Manipulate user quotas

o Assign and deassign the user profiles

Ability to perform database space management

o Monitor and report space usage

o Garbage management

o Add and expand space


Manage summary tables

Assign or deassign space

Reclaim space from old tables

Ability to manage errors

Etc..

User management is important in a data warehouse because a large number of users, each of whom has the potential to use large amounts of resources, are to be managed. Added to this is the complexity of managing the access controls, and the picture is complete. The managers normally maintain profiles and roles for each user and use them to take care of the access control aspects.

The measure of the success of a manager is its ability to manage space both inside and outside the database. In some cases, an incremental change can trigger off huge changes, and hence space management, the reclamation of unused space, and the consolidation of fragmented chunks are critical factors. The manager should be able to clearly display the quantum and location of the space used, so that proper decisions can be taken.

The need for temporary space to take care of interim storage is another important factor. Though it does not appear in the final tally, insufficient or ad hoc space can lead to inefficient performance of the system. Proper utilization and tracking of such space is a challenging task.

In large databases, huge volumes of error logs and trace files are created. The ability to manage them in the most appropriate form and to archive them at suitable intervals is also an important aspect. The trade-off between archiving and deleting the files is also to be clearly understood.

8.2.5 Back Up Recovery Manager

Since the data stored in a warehouse is invaluable, the need to back up and recover lost data cannot be overemphasized. There are three main features for the management of backups:

Scheduling

Backup Data Tracking

Database Awareness.

Since the only reason backups are taken is to recover accidentally lost data, backups are useless unless the data can be used effectively whenever needed. This needs a very efficient integration with the schedule manager. Hence the backup recovery manager must be able to index and track the stored data efficiently. An idea of the enormity of the task can be got if one notes that the data warehouses themselves are huge and the backups will be several times bigger than the warehouse.


8.3 DATA WAREHOUSE PROCESS MANAGERS

These are responsible for the smooth flow, maintenance and upkeep of data into and out of the database. The main types of process managers are:

Load Manager

Warehouse Manager and

Query Manager

We shall look into each of them briefly. Before that, we look at a schematic diagram that defines the boundaries of the three types of managers.

[Figure: Boundaries of process managers. Operational data and external data flow through the load manager into the warehouse (detailed information, summary info and meta data) under the warehouse manager, and out through the query manager to the front end tools for decision information.]

8.3.1 Load Manager

The load manager is responsible for any data transformations and for loading of data into the database. It should effect the following:


Data source interaction

Data transformation

Data load.

The actual complexity of each of these modules depends on the size of the database.

The load manager should be able to interact with the source systems to verify the received data. This is a very important aspect, and any improper operation leads to invalid data affecting the entire warehouse. This is normally achieved by making the source and data warehouse systems compatible.

The easiest method is to ask the source to send some control information, based on which the data can be verified for relevance. Simple concepts like checksums can go a long way in ensuring error-free operation.

If constant networking is possible between the source and destination systems, then message transfers between the source and the warehouse can be used to ensure that the data transfer is carried out correctly. In case of errors, retransmissions can be asked for.

In more complex cases, a copy management tool can be used to effect more complex tests before admitting the data from the source. The exact nature of the checks is based on the actual type of data being transferred.

One very simple but very useful check is to maintain a count of the number of records, so that no data is lost and, at the same time, no data gets loaded twice. A small sketch of such control checks is given below.
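A minimal Python sketch of these control checks is given below; the checksum scheme (MD5 over the record text) and the record format are illustrative assumptions, not a particular tool's protocol.

```python
# The source sends a record count and a checksum as control information;
# the load manager verifies both before accepting the batch.
import hashlib

def checksum(records):
    digest = hashlib.md5()
    for rec in records:
        digest.update(rec.encode("utf-8"))
    return digest.hexdigest()

def verify_and_load(records, control):
    """Accept the batch only if count and checksum match the control info."""
    if len(records) != control["record_count"]:
        raise ValueError("record count mismatch - ask the source to resend")
    if checksum(records) != control["checksum"]:
        raise ValueError("checksum mismatch - ask the source to resend")
    return list(records)          # here the data would actually be loaded

batch = ["101,mobile,5", "102,tv,2"]
control_info = {"record_count": 2, "checksum": checksum(batch)}
print(len(verify_and_load(batch, control_info)), "records loaded")
```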

The amount of data transformation needed again depends on the context. In simple cases, only the field formats may have to be changed. Single fields may have to be broken down, or multiple fields combined. Extra fields may also have to be added or deleted. For more complex transformation mappings, separate tools to do the job need to be employed.

Data loading is also a very important aspect. The actual operation depends on the software used.

8.3.2 Ware House Manager

The warehouse manager is responsible for maintaining the data of the warehouse. It should also create and maintain a layer of meta data. Some of the responsibilities of the warehouse manager are:

Data movement

Meta data management

Performance monitoring

Archiving.


Data movement includes the transfer of data within the warehouse, aggregation, and the creation and maintenance of tables, indexes and other objects of importance. The warehouse manager should be able to create new aggregations as well as remove old ones. The creation of additional rows / columns, keeping track of the aggregation processes and creating meta data are also its functions.

Most aggregations are created by queries. But a complex query normally needs several aggregations and needs to be broken down to describe them. Also, this may not be the most optimal way of doing things. In such cases, the warehouse manager should be capable of breaking down the query and be able to optimize the resultant set of aggregations. This may also need some human interaction.

The warehouse manager must also be able to devise parallelisms for any given operation. This would ensure the most optimal utilization of resources. But parallelization would also involve additional queuing mechanisms, prioritization, sequencing, etc. When data marts are being used, the warehouse manager is also responsible for their maintenance. Scheduling their refresh sequences and clearing unwanted data will also become its responsibilities.

The other important job of the warehouse manager is to manage the meta data. Whenever old data is archived or new data is loaded, the meta data needs to be updated. The manager should be able to do this automatically. The manager will also be responsible for the use of the metadata, in several cases like identifying the same data being present at different levels of aggregation. Performance monitoring and tuning is also the responsibility of the warehouse manager. This is done by maintaining statistics along with the query history, so that suitable optimizations are done. But the amount of statistics stored and the type of conclusions drawn are highly subjective. The aspect of tuning the system performance is a more complex operation and, as of now, no tool that can do it most effectively is available.

The last aspect of the warehouse manager is archiving. All data is susceptible to aging: over a period of time the usefulness of data becomes less, and it has to be removed to make way for new data. However, it is desirable to hold the data as long as possible, since there is no guarantee that a piece of data will not be needed at all in future. Based on the availability of storage space and previous experience about how long data needs to be preserved, a decision to purge the data is to be taken. In some cases, legal requirements may also require the data to be preserved, even though it is not useful from the business point of view.

The answer is to hold data in archives (normally on tape, but possibly also on disk), so that it can be brought back as and when needed. However, one argument is that since the warehouse always gets data from a source, and the source will anyway have archived the data, the warehouse need not do it again.

In any case, a suitable strategy for archiving, based on various factors, needs to be devised and meticulously implemented by the manager.


When designing the archiving process, several details need to be looked into:

Life expectancy of data

In raw form

In aggregated form

Archiving parameters

Start date

Cycle time

Work load.

It can be pointed out that once the life expectancy of the data in the warehouse is over, it is archived in raw form for some time, and later in aggregated form, before being totally purged from the system. The actual time frames of each of these depend on various factors and previous experience.

The cycle time indicates how often the data is archived – weekly, monthly or quarterly. It is to be noted that archiving puts extra load on the system. If the data to be archived is small, perhaps it can be done overnight, but if it is large it may affect the normal schedule. This is one of the factors that decides the cycle time – the longer the cycle time, the more will be the load per archiving run.
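The idea of holding data first in raw form, then in aggregated form, and finally purging it can be captured as a simple rule. The sketch below is only an illustration, assuming hypothetical retention periods and a load date per data set; actual time frames would come from the factors discussed above.

from datetime import date, timedelta

# Hypothetical retention periods (assumptions, not prescribed by the text).
RAW_RETENTION = timedelta(days=365)            # keep raw detail online for one year
AGGREGATE_RETENTION = timedelta(days=3 * 365)  # keep aggregated archives for three years

def archive_state(load_date: date, today: date) -> str:
    """Decide how a data set loaded on load_date should be held today."""
    age = today - load_date
    if age <= RAW_RETENTION:
        return "online (raw form)"
    if age <= AGGREGATE_RETENTION:
        return "archived (aggregated form)"
    return "purge"

print(archive_state(date(2011, 1, 1), date(2012, 5, 1)))   # archived (aggregated form)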

8.3.3 Query Manager

We shall look at the last of the managers, but one of no less importance: the query manager. Its main responsibilities include the control of the following.

User’s access to data

Query scheduling

Query Monitoring

These jobs are varied in nature and have not been automated as yet.

The main job of the query manager is to control the user's access to data and also to present the results of query processing in a format suitable to the user. The raw data, often from different sources, needs to be compiled in a format suitable for querying. The query manager has to act as a mediator between the user on one hand and the meta data on the other. It is desirable that all the access tools work through the query manager. If not, at least indirect controls need to be set up to ensure proper restrictions on the queries made. This will ensure proper monitoring and control, if nothing else.


Scheduling the ad hoc queries is also the responsibility of the query manager: simultaneous, large, uncontrolled queries affect the system performance. Proper queuing mechanisms that ensure fairness to all queries are of prime importance. The query manager must be able to create, abort and requeue jobs. But the job of performance prediction – when a query in the queue will get completed – is often kept outside its purview, for the simple reason that it is difficult to estimate beforehand.
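A queuing mechanism of the kind described above can be pictured as a simple priority queue. The sketch below is only an illustration, assuming a numeric priority per query; a real query manager would of course integrate with the warehouse scheduler and access tools.

import heapq
import itertools

class QueryQueue:
    """Minimal priority queue for ad hoc queries (lower number = higher priority)."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # preserves submission order within a priority

    def submit(self, priority, sql_text):
        heapq.heappush(self._heap, (priority, next(self._counter), sql_text))

    def next_query(self):
        """Return the next query to run, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, sql_text = heapq.heappop(self._heap)
        return sql_text

q = QueryQueue()
q.submit(2, "SELECT SUM(sales) FROM fact_sales")
q.submit(1, "SELECT COUNT(*) FROM dim_customer")
print(q.next_query())   # the higher-priority COUNT(*) query is served first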

The other important aspect is that the query manager should ideally be able to monitor all the queries. This would help in gathering proper statistics, tuning the ad hoc queries to improve the system performance, and controlling the type of queries made. It is for this reason that all queries should be routed through the query manager.

BLOCK SUMMARY

We have looked at and understood the functioning of the following classes of managers.

System managers

Configuration manager

Schedule manager.

Event manager

Data base manager.

Backup recovery manager.

Process managers

Load manager.

Ware house manager

Query manager.

SELF EVALUATION - I

1. What are the 2 basic classes of managers?

2. Name any 3 duties of schedule manager.

3. What is an event?

4. Name any 4 events.

5. How does the event manager manage the events?


6. Name any 4 functions of data base manager.

7. Name the 3 process managers.

8. Name the functions of the load manager.

9. Name the functions of the warehouse manager.

10. What are the responsibilities of the query manager?

ANSWER TO SELF EVALUATION - I

1. a) Systems Managers.

b) Process Managers.

2. a) Handle multiple queues

b) Maintain job schedules across outages.

c) Support starting and stopping of queries, etc.

3. An event is a measurable, observable occurrence of action.

4. a) disk running out of space.

b) excessive CPU use

c) A dying process

d) Table reaching maximum size, etc.

5. By calling the scripts capable of handling the events.

6. a) To add / remove users.

b) To maintain roles and profile.

c) To perform database space management.

d) To manage temporary tables, etc.

7. a) Load manager.

b) warehouse manager.

c) Query manager.

8. a) Source interaction.

b) data transformation

c) data load.

9. a) data movement


b) meta data management

c) performance monitoring.

d) data archiving.

10. a) User access to data

b) query scheduling.

c) query monitoring.

SELF EVALUATION - II

1. In detail explain the systems management tools.

2. Explain the boundaries of process manager with neat diagram in detail.


Data Mining

COURSE SUMMARY

Data mining is a promising and flourishing frontier in database systems and new database applications. Data mining, also called knowledge discovery in databases (KDD), is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses and other massive information repositories.

Data mining is a multidisciplinary field, drawing work from areas including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge acquisition, information retrieval, high performance computing and data visualization.

The aim of this course is to give the reader an appreciation of the importance and potential of data mining. It is a technique not only for IT managers but for all decision makers, who should be able to exploit the new technology.

Data mining is, in essence, a set of techniques that allows us to access data which is hidden in our database. In large databases especially, it is extremely important to get appropriate, accurate and useful information which we cannot find with standard SQL tools. In order to do this, a structured approach is needed. A step-by-step approach must be adopted. Goals must be identified, and data cleaned and prepared for the queries and analyses to be made. It is essential to begin with a very good data warehouse and the facility to clean the data.

In this course, the reader goes through the basic fundamentals of data mining: what data mining is, issues related to data mining, approaches to data mining, and applications of data mining.

In the subsequent chapters, data mining techniques are explained, including association rules, clustering and neural networks. Advanced data mining issues are also discussed with respect to the World Wide Web. Guidelines on data mining issues are given, and the analysis of the performance of data mining systems is discussed. Here, various factors like the size of the data, the data mining methods used and the error in the system are taken into account in analyzing the performance of the system. Finally we look at the application aspects, that is, the implementation of a data mining system.


Chapter 9

Introduction to Data Mining

9.0 INTRODUCTION

We are in an information technology age. In this information age, we believe that information leads to power and success. With the development of powerful and sophisticated technologies, such as powerful machines like computers, satellites and others, we have been collecting tremendous amounts of information. Initially, with the advent of computers and mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information.

Unfortunately, these massive collections of data stored on different structures very rapidly became overwhelming. This initial chaos led to the creation of structured databases and database management systems (DBMS). Efficient database management systems have been very important assets for the management of a large corpus of data, and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of database management systems has also contributed to the recent massive gathering of all sorts of information. Today, we have far more information than we can handle, from business transactions and scientific data to satellite pictures, text reports and military intelligence.

The overall evolution of database technology is shown in Table 9.1.

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous databases, has far exceeded our human ability for comprehension without powerful tools. Due to the high cost of data collection, people learned to make decisions based on limited information. But this is not possible if the data set is huge. Hence the concept of data mining was developed.


Table 9.1 Evolution of Database Technology

Evolutionary step: Data Collection (1960s)
    Business question: "What was my total revenue in the last five years?"
    Enabling technologies: Computers, tapes, disks
    Characteristics: Retrospective, static data delivery

Evolutionary step: Data Access (1980s)
    Business question: "What were unit sales in New Delhi last March?"
    Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
    Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
    Business question: "What were unit sales in New Delhi last March? Drill down to Mumbai."
    Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
    Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (Emerging Today)
    Business question: "What's likely to happen to Mumbai unit sales next month? Why?"
    Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
    Characteristics: Prospective, proactive information delivery

9.1 WHAT IS DATA MINING?

There are many definitions for Data mining. Few important definitions are given below.

Data mining refers to extracting or mining knowledge from large amounts of data.

Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.

Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.

Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amounts of data, such as a relationship between patient data and their medical diagnoses. These relationships represent valuable knowledge about the database and the objects in the database, provided the database is a faithful mirror of the real world it registers.

Data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value, as no direct use can be made of it; it is the hidden information in the data that is useful".

Some people equate data mining with Knowledge Discovery in Databases, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery is a process, as shown in figure 9.1. It consists of the following iterative sequence of steps.

Figure 9.1 Data Mining: A KDD Process

1. Data cleaning: also known as data cleansing, it is a phase in which noisy and irrelevant data are removed from the collection.

2. Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source.

3. Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.

4. Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.


5. Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.

6. Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.

7. Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. So we conclude that data mining is one step in the knowledge discovery process. Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information repositories.
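The iterative sequence above can be pictured as a small pipeline of functions. The sketch below is purely illustrative, assuming toy in-memory records and trivial implementations of each step; real systems use far more sophisticated techniques at every stage.

def clean(records):
    # Data cleaning: drop records with missing values (a very crude policy).
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Data integration: combine records from several sources.
    return [r for source in sources for r in source]

def select(records, attributes):
    # Data selection: keep only the attributes relevant to the analysis.
    return [{a: r[a] for a in attributes} for r in records]

def transform(records):
    # Data transformation: e.g. discretize age into broad groups.
    for r in records:
        r["age_group"] = "young" if r["age"] < 40 else "senior"
    return records

def mine(records):
    # Data mining: here, just total purchases per age group (a toy "pattern").
    counts = {}
    for r in records:
        counts[r["age_group"]] = counts.get(r["age_group"], 0) + r["purchases"]
    return counts

source_a = [{"age": 25, "purchases": 3}, {"age": 62, "purchases": None}]
source_b = [{"age": 45, "purchases": 7}]

data = integrate(clean(source_a), clean(source_b))
data = transform(select(data, ["age", "purchases"]))
print(mine(data))   # {'young': 3, 'senior': 7}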

The architecture of a typical data mining system may have the following major components, as shown in figure 9.2.

Fig. 9.2 Architecture of a typical Data Mining System (from top to bottom: graphical user interface, pattern evaluation, data mining engine, database or data warehouse server, and knowledge base, drawing on a database and a data warehouse through data cleaning, data integration and filtering)


1. Database, data warehouse or other information repository: this is one or a set of databases, data warehouses, spreadsheets or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2. Database or data warehouse server: the database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

3. Knowledge base: this is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.

4. Data mining engine: this is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.

5. Pattern evaluation module: this component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.

6. Graphical user interface: this module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.

9.2 WHAT KIND OF DATA CAN BE MINED?

In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail.

Flat Files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data or scientific measurements.

Relational Databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL can provide, such as predicting, comparing and detecting deviations. (A small query sketch follows this list.)

Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof. Data from the different stores is loaded, cleaned, transformed and integrated together. To facilitate decision making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure.

Multimedia Databases: Multimedia databases include video, image, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.

Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical information like maps and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.

Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes causes the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolutions of different variables, as well as the prediction of trends and movements of the variables in time.

World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data and even applications. Conceptually, the World Wide Web is comprised of three major components: the content of the Web, which encompasses the documents available; the structure of the Web, which covers the hyperlinks and the relationships between documents; and the usage of the Web, describing how and when the resources are accessed. A fourth dimension can be added, relating to the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or Web Mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.
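As noted in the relational databases example above, SQL already provides aggregate functions such as AVG, SUM, MIN, MAX and COUNT, and data mining starts from such summaries but goes further. The fragment below is a minimal sketch using Python's built-in sqlite3 module and an invented rentals table; the table and values are assumptions for illustration only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rentals (customer TEXT, movies_per_year INTEGER)")
conn.executemany(
    "INSERT INTO rentals VALUES (?, ?)",
    [("anita", 34), ("ravi", 5), ("kiran", 41)],
)

# Standard SQL aggregation: customer count, average and maximum rentals per year.
row = conn.execute(
    "SELECT COUNT(*), AVG(movies_per_year), MAX(movies_per_year) FROM rentals"
).fetchone()
print(row)    # (3, 26.66..., 41)

# A mining task would go further, e.g. characterizing customers renting more than 30 movies.
heavy = conn.execute(
    "SELECT customer FROM rentals WHERE movies_per_year > 30"
).fetchall()
print(heavy)  # [('anita',), ('kiran',)]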

9.3 WHAT CAN DATA MINING DO?

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on inference from the available data. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list.

a) Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction. For example, one may want to characterize the video store customers who regularly rent more than 30 movies a year. Note that with a data cube containing a summarization of data, simple OLAP operations fit the purpose of data characterization.

b) Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental count is lower than 5. The techniques used for data discrimination are very similar to those used for data characterization, with the exception that data discrimination results include comparative measures.

c) Association Analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the video store manager to know what movies are often rented together, or whether there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P → Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present.

For example, the hypothetical association rule:


RentType(X, "game") ∧ Age(X, "13-19") → Buys(X, "pop") [s = 2%, c = 55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
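The support and confidence figures in such a rule can be computed directly from a transaction list. Below is a minimal sketch over an invented set of transactions, with the attribute-value pairs simplified to plain items.

def support_and_confidence(transactions, p_items, q_items):
    """s = fraction of transactions containing P and Q; c = fraction of P-transactions also containing Q."""
    n = len(transactions)
    both = sum(1 for t in transactions if p_items <= t and q_items <= t)
    with_p = sum(1 for t in transactions if p_items <= t)
    support = both / n
    confidence = both / with_p if with_p else 0.0
    return support, confidence

transactions = [
    {"game", "pop"}, {"game"}, {"movie", "popcorn"},
    {"game", "pop", "popcorn"}, {"movie"},
]
s, c = support_and_confidence(transactions, {"game"}, {"pop"})
print(f"support = {s:.0%}, confidence = {c:.0%}")   # support = 40%, confidence = 67%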

d) Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels.

The classification algorithm learns from the training set and builds a model. The model is used to classify new objects. For example, after starting a credit policy, the video store managers could analyze the customers' behaviour and label the customers who received credit with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
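A minimal illustration of learning from a labelled training set is sketched below. It uses an invented customer table and a simple nearest-neighbour rule, which is only one of many possible classification techniques and is not the specific model-building algorithm discussed in the text.

def classify(new_customer, training_set):
    """Label a new customer with the class of the most similar training example."""
    def distance(a, b):
        return sum((a[k] - b[k]) ** 2 for k in a)
    nearest = min(training_set, key=lambda example: distance(new_customer, example[0]))
    return nearest[1]

# Hypothetical training set: (attributes, credit label) pairs.
training_set = [
    ({"income": 60, "late_payments": 0}, "safe"),
    ({"income": 25, "late_payments": 2}, "risky"),
    ({"income": 10, "late_payments": 6}, "very risky"),
]
print(classify({"income": 55, "late_payments": 1}, training_set))   # safe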

e) Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.

f) Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
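A very small sketch of the unsupervised idea follows, grouping one-dimensional values into two clusters with a k-means-style loop; the data and the choice of k are invented for illustration, and k-means is only one of the many clustering approaches mentioned above.

def kmeans_1d(values, k=2, iterations=10):
    """Toy k-means on a list of numbers; returns cluster centres and assignments."""
    centres = sorted(values)[:k]                     # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

ages = [21, 23, 25, 58, 61, 64]
centres, clusters = kmeans_1d(ages)
print(centres)    # [23.0, 61.0]
print(clusters)   # [[21, 23, 25], [58, 61, 64]]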

g) Outlier Analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.
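One common way to flag such exceptions is to measure how far each value lies from the mean of its group. The sketch below uses a simple standard-score rule on invented values; it is only one of many possible outlier detection techniques.

from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

daily_sales = [102, 98, 97, 105, 101, 99, 310]   # one suspicious day
print(outliers(daily_sales))                     # [310]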

h) Evolution and Deviation Analysis: Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which lend themselves to characterizing, comparing, classifying or clustering time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

9.4 HOW DO WE CATEGORIZE DATA MINING SYSTEMS?

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following classifications.

a) Classification According to the Type of Data Source Mined: this classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

b) Classification According to the Data Model Drawn on: this classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.

c) Classification According to the Kind of Knowledge Discovered: this classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

d) Classification According to Mining Techniques Used: Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

9.5 WHAT ARE THE ISSUES IN DATA MINING?

Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are addressed below.


a) Security and Social Issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behaviour understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

b) User Interface Issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical data presentation. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real-estate", information rendering and interaction. Interactivity with the data and data mining results is crucial, since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

c) Mining Methodology Issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, the control and handling of noise in data, etc. are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.


More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.

d) Performance Issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available without having to re-analyze the complete dataset.

e) Data Source Issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.


9.6 REASONS FOR THE GROWING POPULARITY OF DATA MINING

a) Growing Data Volume

The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific, and governmental organizations around the world is daunting. It becomes impossible for human analysts to cope with such overwhelming amounts of data.

b) Limitations of Human Analysis

Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectiveness in such an analysis. A human expert is always a hostage of the previous experience of investigating other systems. Sometimes this helps, sometimes this hurts, but it is almost impossible to get rid of this fact.

c) Low Cost of Machine Learning

One additional benefit of using automated data mining systems is that this process has a much lower cost than hiring many highly trained professional statisticians. While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.

9.7 APPLICATIONS

Data mining has many and varied fields of application, some of which are listed below.

Retail/Marketing

Identify buying patterns from customers

Find associations among customer demographic characteristics

Predict response to mailing campaigns

Market basket analysis

Banking

Detect patterns of fraudulent credit card use


Identify ‘loyal’ customers

Predict customers likely to change their credit card affiliation

Determine credit card spending by customer groups

Find hidden correlations between different financial indicators

Identify stock trading rules from historical market data

Insurance and Health Care

Claims analysis - which medical procedures are claimed together

Predict which customers will buy new policies

Identify behaviour patterns of risky customers

Identify fraudulent behaviour

Transportation

Determine the distribution schedules among outlets

Analyse loading patterns

Medicine

Characterise patient behaviour to predict office visits

Identify successful medical therapies for different illnesses

9.8 EXERCISE

I. FILL UP THE BLANKS

1. Data mining refers to __________________ knowledge from large amount of data.

2. Data cleaning step removes _____________ and ____________ data.

3. GUI module communicates between _______________ and ___________ system

4. Flat files are simple data files in _______________ format.

5. Multimedia databases include ____________, ____________, ___________ and _____________.

6. Descriptive data mining task is to describe _______________

7. Association is characterized by ____________and ____________.


8. Clustering is called ___________________ .

9. Outliers are data elements that cannot be grouped in a given ___________

10. ______________ cost of machine learning makes data mining popular.

ANSWERS FOR FILL UP THE BLANKS.

1. Extracting

2. Noise and irrelevant

3. User and data mining

4. Text / binary

5. Video, images, audio and Text

6. General properties of the existing data.

7. Support and confidence.

8. Unsupervised classification.

9. Class / Clusters.

10. Low.

II. ANSWER THE FOLLOWING QUESTIONS

1. Explain the evolution of database technology.

2. Explain the KDD processes in detail.

3. Explain the architecture of a data mining system.

4. Explain the functions of data mining.

5. Explain the categories of data mining system.

6. Give the reasons for the growing popularity of data mining.


Chapter 10

Data Preprocessing and Data Mining Primitives

10.0 INTRODUCTION

For a successful data mining operation the data must be consistent, reliable and appropriate. The appropriate data must be selected, missing fields and incorrect values rectified, unnecessary information removed and, where data comes from different sources, the format of field values may need to be altered to ensure they are interpreted correctly.

It is rather straightforward to apply DM modelling tools to data and judge the value of the resulting models based on their predictive or descriptive value. This does not diminish the role of careful attention to data preparation efforts.

10.1 DATA PREPARATION

The data preparation process is roughly divided into data selection, data cleaning, formation of new data and data formatting.

10.1.1 Select Data

A subset of the data acquired in previous stages is selected based on criteria stressed in those stages:

Data quality properties: completeness and correctness

Technical constraints such as limits on data volume or data type: this is basically related to the data mining tools planned earlier to be used for modeling


10.1.2 Data Cleaning

This step complements the previous one. It is also the most time consuming, due to the large number of possible techniques that can be implemented so as to optimize data quality for the future modeling stage. Possible techniques for data cleaning include:

Data Normalization. For example, decimal scaling into the range (0,1), or standard deviation normalization.

Data Smoothing. Discretization of numeric attributes is one example; this is helpful or even necessary for logic-based methods.

Treatment of Missing Values. There is no simple and safe solution for the cases where some of the attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out the importance of the missing values. Simple solutions are

a) Replace all missing values with a single global constant

b) Replace a missing value with its feature mean

c) Replace a missing value with its feature and class mean.

The main flaw of the simple solutions is that the substituted value is not the correct value. This means that the data will be biased. If the missing values can be isolated to only a few features, then we can try a solution of deleting the examples containing missing values, or deleting the attributes containing most of the missing values. Another, more sophisticated, solution is to try to predict missing values with a data mining tool. In this case predicting missing values is a special data mining prediction problem. (A small sketch of the simple imputation and normalization options appears after this list.)

Data Reduction. The reasons for data reduction are in most cases twofold: either the data may be too big for the program, or the expected time for obtaining the solution might be too long. The techniques for data reduction are usually effective but imperfect. The most usual step for data dimension reduction is to examine the attributes and consider their predictive potential. Some of the attributes can usually be discarded, either because they are poor predictors or because they are redundant relative to some other good attribute. Some of the methods for data reduction through attribute removal are

a) Attribute selection from means and variances

b) Using principal component analysis

c) Merging features using linear transform.
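The simple options for missing values and for normalization mentioned above can be illustrated in a few lines. This is only a sketch on an invented attribute, assuming mean imputation and decimal scaling; which treatment is appropriate depends on the data at hand.

def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    feature_mean = sum(observed) / len(observed)
    return [feature_mean if v is None else v for v in values]

def decimal_scale(values):
    """Decimal scaling: divide by a power of 10 so all values fall in (-1, 1)."""
    scale = 1
    while any(abs(v) / scale >= 1 for v in values):
        scale *= 10
    return [v / scale for v in values]

incomes = [45, None, 62, 38, None]
print(impute_with_mean(incomes))        # [45, 48.33..., 62, 38, 48.33...]
print(decimal_scale([45, 620, 38]))     # [0.045, 0.62, 0.038]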


10.1.3 New Data Construction

This step represents constructive operations on the selected data, which include:

Derivation of new attributes from two or more existing attributes

Generation of new records (samples)

Data transformation: here data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following

i) Smoothing – which works to remove the noise from data. Such techniques include binning, clustering and regression

ii) Aggregation – where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities. (A small sketch appears after this list.)

iii) Generalization of the Data – where low-level or primitive data are replaced by higher-level concepts through the use of concept hierarchies. For example, a numeric attribute like age may be mapped to higher-level concepts like young, middle-aged and senior.

iv) Normalization – where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0

v) Attribute Construction – where new attributes are constructed and added from the given set of attributes to help the mining process.

Merging Tables: joining together two or more tables having different attributes for the same objects

Aggregations: operations in which new attributes are produced by summarizing information from multiple records and/or tables into new tables with "summary" attributes
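The aggregation step mentioned above (daily sales rolled up to monthly totals) can be sketched as follows, with an invented list of dated sales records standing in for the real source table.

from collections import defaultdict
from datetime import date

def monthly_totals(daily_sales):
    """Roll daily (date, amount) records up to (year, month) totals."""
    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[(day.year, day.month)] += amount
    return dict(totals)

daily_sales = [
    (date(2012, 1, 5), 1200.0),
    (date(2012, 1, 19), 800.0),
    (date(2012, 2, 2), 950.0),
]
print(monthly_totals(daily_sales))   # {(2012, 1): 2000.0, (2012, 2): 950.0}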

10.1.4 Data Formatting

The final data preparation step, which represents syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the DM task. These include:

Reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset: putting the target attribute at the beginning or at the end, randomizing the order of records (required by neural networks, for example)

Changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, replacing special characters with an allowed set of special characters.

There is also what DM practitioners call the standard form of data (although there is no standard format of data that can be readily read by all modelling tools). Standard form refers primarily to readable data types:

Binary variables (1-for true; 0-for false)

Ordered variables (numeric)

Categorical variables are, in the standard form of data, transformed into m binary variables, where m is the number of possible values for the particular variable. Since distinct DM modeling tools usually prefer either categorical or ordered attributes, the standard form is a data presentation that is uniform and effective across a wide spectrum of DM modeling tools and other exploratory tools.
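Transforming a categorical variable into m binary variables, as described above, is a mechanical step; a minimal sketch with an invented payment-type attribute is shown below.

def to_binary_variables(values):
    """Expand a categorical attribute into m binary (1/0) attributes."""
    categories = sorted(set(values))            # the m possible values
    return [
        {f"is_{c}": int(v == c) for c in categories}
        for v in values
    ]

payment_type = ["cash", "card", "cash", "cheque"]
for row in to_binary_variables(payment_type):
    print(row)
# {'is_card': 0, 'is_cash': 1, 'is_cheque': 0}
# {'is_card': 1, 'is_cash': 0, 'is_cheque': 0}  ... and so on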

10.2 DATA MINING PRIMITIVES

Users communicate with the data mining system using a set of data mining primitives designed to facilitate efficient and fruitful knowledge discovery. The primitives include the specification of the portion of the database or the set of data in which the user is interested, the kinds of knowledge to be mined, background knowledge useful in guiding the discovery process, interestingness measures for pattern evaluation, and how the discovered knowledge should be visualized. These primitives allow the user to interactively communicate with the data mining system during discovery in order to examine the findings from different angles or depths and to direct the mining process.

10.2.1 Defining Data Mining Primitives

Each user will have a data mining task in mind. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of the following primitives.

a) Task-Relevant Data – This is the database portion to be investigated. For example, suppose a company XYZ is doing business in two states, Karnataka and Tamilnadu. If the manager of Karnataka wants to know the total number of sales only in Karnataka, then only data related to Karnataka should be accessed. These are referred to as relevant attributes.

b) The Kinds of Knowledge to be Mined – This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering and evolution analysis. For example, if studying the buying habits of customers in Karnataka, we may choose to mine associations between customer profiles and the items that these customers like to buy.

c) Background Knowledge – The user can specify background knowledge, or knowledge about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process and for evaluating the patterns found. There are several kinds of background knowledge. One popular form of background knowledge is concept hierarchies. Concept hierarchies are useful in that they allow data to be mined at multiple levels of abstraction. They can also be used to evaluate the discovered patterns according to their degree of unexpectedness or expectedness.

d) Interestingness Measures – These functions are used to separate uninteresting patterns from knowledge. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support (the percentage of task-relevant data tuples for which the rule pattern appears) and confidence (an estimate of the strength of the implication of the rule). Rules whose support and confidence values are below user-specified thresholds are considered uninteresting. (A small filtering sketch follows this list.)

e) Presentation and Visualization of Discovered Patterns – This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms of knowledge presentation such as rules, tables, charts, graphs, decision trees and cubes.
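Interestingness thresholds of the kind described in (d) amount to a simple filter over candidate patterns, as sketched below with invented rules and invented threshold values.

def interesting_rules(rules, min_support, min_confidence):
    """Keep only rules whose support and confidence meet user-specified thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

candidate_rules = [
    {"rule": "game -> pop",     "support": 0.02,  "confidence": 0.55},
    {"rule": "movie -> sweets", "support": 0.001, "confidence": 0.80},
]
print(interesting_rules(candidate_rules, min_support=0.01, min_confidence=0.50))
# only the "game -> pop" rule survives the thresholds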

Figure 10.1 Primitives for specifying a data mining task:

Task-relevant data: database or data warehouse name; database tables or data warehouse cubes; conditions for data selection; relevant attributes or dimensions; data grouping criteria

Knowledge to be mined: characterization; discrimination; association; classification/prediction; clustering

Background knowledge: concept hierarchies; user beliefs about relationships in the data

Pattern interestingness measures: simplicity; certainty (e.g. confidence); utility (e.g. support); novelty

Visualization of discovered patterns: rules, tables, reports, charts, graphs, decision trees and cubes; drill-down and roll-up

10.3 A DATA MINING QUERY LANGUAGE

A data mining language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.



10.3.1 Syntax for Task-Relevant Data Specification

The first step in defining a data mining task is the specification of the task-relevant data, that is, the data on which mining is to be performed. This involves specifying the database and tables or data warehouse containing the relevant data, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. The Data Mining Query Language (DMQL) provides clauses for the specification of such information, as follows:

use database (database_name) or use data warehouse (data_warehouse_name): The use clause directs the mining task to the database or data warehouse specified.

from (relation(s)/cube(s)) [where (condition)]: The from and where clauses respectively specify the database tables or data cubes involved, and the conditions defining the data to be retrieved.

in relevance to (attribute_or_dimension_list): This clause lists the attributes or dimensions for exploration.

order by (order_list): The order by clause specifies the sorting order of the task-relevant data.

group by (grouping_list): The group by clause specifies criteria for grouping the data.

having (conditions): The having clause specifies the conditions by which groups of data are considered relevant.

Top-Level Syntax of the Data Mining Query Language DMQL

(DMQL) ::= (DMQL_Statement); { (DMQL_Statement)}

(DMQL_Statement) ::= (Data_Mining_Statement)

| (Concept_Hierarchy_Definition_Statement)

| (Visualization_and_Presentation)

(Data_Mining_Statement) ::=

use database (Database_name) | use data warehouse (Data_warehouse_name)

{use hierarchy (hierarchy_name) for (attribute_or_dimension)|


(Mine_Knowledge_Specification)

in relevance to (attribute_or_dimension_list)

from (relation(s)/cube(s))

[where (condition)]

[order by (order_list)]

[group by (grouping_list)]

[having (condition)]

[with [(interest_measure_name)] threshold = (threshold_value)

[for(attribute(s)))]}

(Mine_Knowledge_Specification) ::= (Mine_Char) | (Mine_Discr) | (Mine_Assoc) | (Mine_Class)

(Mine_Char) ::= mine characteristics [as (pattern_name)]

analyze(measure(s))

(Mine_Discr) ::= mine comparison [as (pattern_name) ]

for (target_class) where (target_condition)

{versus (contrast_class_i) where (contrast_condition_i)}

analyze(measure(s))

(Mine_Assoc) ::= mine association [as (pattern_name)]

[matching (metapattern) ]

(Mine_Class) ::= mine classification [as (pattern_name)]

analyze (classifying_attribute_or_dimension)

(Concept_Hierarchy_Definition_Statement) ::=

define hierarchy (hierarchy_name)

[for (attribute_or_dimension)]

on (relation_or_cube_or_hierarchy)

as (hierarchy_description)

[where (condition)]


(Visualization_and_Presentation) ::=

display as (result_form) | { (Multilevel_Manipulation) }

(Multilevel_Manipulation)::= roll up on (attribute_or_dimension)

| drill down on (attribute_or_dimension)

| add (attribute_or_dimension)

| drop (attribute_or_dimension)

Syntax for Specifying the Kind of Knowledge to be Mined

The (Mine_Knowledge_Specification) statement is used to specify the kind of knowledge to be mined. In other words, it indicates the data mining functionality to be performed. Its syntax is defined below for characterization, discrimination, association, and classification.

Characterization:

(Mine_Knowledge_Specification) ::=

mine characteristics [as (pattern_name)]

analyze (measure(s))

This specifies that characteristic descriptions are to be mined. The analyze clause, when used for characterization, specifies aggregate measures, such as count, sum, or count% (percentage count, i.e., the percentage of tuples in the relevant data set with the specified characteristics). These measures are to be computed for each data characteristic found.

Syntax for Concept Hierarchy Specification

Concept hierarchies allow the mining of knowledge at multiple levels of abstraction. In order to accommodate the different viewpoints of users with regard to the data, there may be more than one concept hierarchy per attribute or dimension. For instance, some users may prefer to organize branch locations by provinces and states, while others may prefer to organize them according to the languages used. In such cases, a user can indicate which concept hierarchy is to be used with the statement

use hierarchy (hierarchy_name) for (attribute_or_dimension)

Otherwise, a default hierarchy per attribute or dimension is used.

Syntax for Interestingness Measure Specification

The user can help control the number of uninteresting patterns returned by the data mining system by specifying measures of pattern interestingness and their corresponding thresholds. Interestingness measures include the confidence, support, noise, and novelty measures. Interestingness measures and thresholds can be specified by the user with the statement

with [(interest_measure_name)] threshold = (threshold_value)

Syntax for Pattern Presentation and Visualization Specification.

How can users specify the forms of presentation and visualization to be used in displaying the discovered patterns? Our data mining query language needs syntax that allows users to specify the display of discovered patterns in one or more forms, including rules, tables, crosstabs, pie or bar charts, decision trees, cubes, curves, or surfaces. We define the DMQL display statement for this purpose:

display as (result_form)

where the (result_form) could be any of the knowledge presentation or visualization forms listed above.

Interactive mining should allow the discovered patterns to be viewed at different concept levels or from different angles. This can be accomplished with roll-up and drill-down operations, as described earlier. Patterns can be rolled up, or viewed at a more general level, by climbing up the concept hierarchy of an attribute or dimension (replacing lower-level concept values by higher-level values). Generalization can also be performed by dropping attributes or dimensions. For example, suppose that a pattern contains the attribute city. Given the location hierarchy city < province_or_state < country < continent, then dropping the attribute city from the patterns will generalize the data to the next highest level attribute, province_or_state. Patterns can be drilled down on, or viewed at a less general level, by stepping down the concept hierarchy of an attribute or dimension. Patterns can also be made less general by adding attributes or dimensions to their description. The attribute added must be one of the attributes listed in the in relevance to clause for task-relevant specification. The user can alternately view the patterns at different levels of abstraction with the use of the following DMQL syntax.

(Multilevel_Manipulation) ::= roll up on (attribute_or_dimension)

| drill down on (attribute_or_dimension)

| add (attribute_or_dimension)

| drop (attribute_or_dimension)

10.4 DESIGNING GRAPHICAL USER INTERFACES BASED ON A DATA MINING QUERY LANGUAGE

A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. But novice users may find a data mining query language difficult to use and its syntax



difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:

a) Data collection and data mining query composition - This component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.

b) Presentation of discovered patterns – This component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.

c) Hierarchy specification and manipulation - This component allows for concept hierarchy specification, either manually by the user or automatically. In addition, this component should allow concept hierarchies to be modified by the user or adjusted automatically based on a given data set distribution.

d) Manipulation of data mining primitives – This component may allow the dynamic adjustment of data mining thresholds, as well as the selection, display and modification of concept hierarchies. It may also allow the modification of previous data mining queries or conditions.

e) Interactive multilevel mining – This component should allow roll-up or drill-down operations on discovered patterns.

f) Other miscellaneous information – This component may include on-line help manuals, indexed search, debugging and other interactive graphical facilities.

The design of a GUI should take into consideration the different classes of users of a data mining system. In general, users of data mining systems can be classified into two types:

1) Business analysts

2) Business executives

A business analyst would like to have flexibility and convenience in selecting different portions of data, manipulating dimensions and levels, setting mining parameters and tuning the data mining processes.

Business executives need clear presentation and interpretation of data mining results, flexibility in viewing and comparing different data mining results, and easy integration of data mining results into report writing and presentation processes. A well-designed data mining system should provide friendly user interfaces for both kinds of users.



10.5 ARCHITECTURES OF DATA MINING SYSTEMS

A good system architecture will enable the system to make best use of the software environment, accomplish data mining tasks in an efficient and timely manner, interoperate and exchange information with other information systems, be adaptable to users' different requirements and evolve with time.

To identify desirable architectures for data mining systems, we examine how data mining may be integrated with a database/data warehouse system, using one of the following coupling schemes:

a) No-coupling

b) Loose coupling

c) Semitight coupling

d) Tight-coupling

a) No-coupling – The data mining system will not utilize any function of a database or data warehouse system. It fetches data from a particular source such as a file, processes the data using some data mining algorithms and then stores the mining results in another file.

This scheme has some disadvantages:

1) A database system provides a great deal of flexibility and efficiency in storing, organizing, accessing and processing data. Without this, a data mining system working on flat files may spend a considerable amount of time finding, collecting, cleaning and transforming data.

2) There are many tested, scalable algorithms and data structures implemented in database and data warehousing systems. It is feasible to realize efficient, scalable implementations using such systems. Most data have been or will be stored in database or data warehouse systems. Without any coupling to such systems, a data mining system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Hence, no coupling represents a poor design.

b) Loose coupling - It means that the DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining and then storing the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling since it can fetch any portion of data stored in databases or data warehouses by using query processing, indexing and other system facilities. A typical characteristic is that many loosely coupled mining systems are main-memory based. Since mining itself does not explore the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.

c) Semitight coupling - It means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system. These



primitives can include sorting, indexing, aggregation, histogram analysis, multiway join and precomputation of some essential statistical measures, such as sum, count, max, min, standard deviation and so on. Some frequently used intermediate mining results can be precomputed and stored efficiently; this design will enhance the performance of a DM system.

d) Tight coupling - It means that a DM system is smoothly integrated into a DB/DW system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes and query processing methods of a DB or DW system. This approach is highly desirable since it facilitates efficient implementation of data mining functions, high system performance and an integrated information processing environment.

10.6 EXERCISE

I. FILL UP THE BLANKS

1. For successful data mining operations the data must be ———— and ————.

2. The first stage of data preparation is ————

3. Data smoothing process comes in the stage of ————

4. —— helps in effective knowledge discovery from the data mining systems.

5. There are two types of users in data mining systems ——— and ———

6. In ——— coupling, the DM system will not utilize any functions of a database or data warehousing system.

7. In —— coupling, the DM system is smoothly integrated into a DB/DW system.

ANSWERS FOR FILL UP THE BLANKS

1. Consistent, reliable

2. data selection

3. Data cleaning

4. Data mining language

5. business analysts, business executives

6. no

7. tight



II. ANSWER THE FOLLOWING QUESTIONS

1. Explain the steps of data cleaning.

2. Explain the steps of new data construction.

3. Explain the data mining primitives.

4. In detail explain the data mining querying system



Chapter 11

Data Mining Techniques

11.0 INTRODUCTION

The discovery stage of the KDD process is fascinating. Here we discuss some of the important methods and in this way get an idea of the opportunities that are available as well as some of the problems that occur during the discovery stage. We shall see that some learning algorithms do well on one part of the data set where others fail, and this clearly indicates the need for hybrid learning.

Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. Any technique that helps extract more out of our data is useful, so data mining techniques form quite a heterogeneous group. In this chapter we discuss some of the techniques.

11.1 ASSOCIATIONS

Given a collection of items and a set of records, each of which contain some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

The market-basket problem assumes we have some large number of items, e.g., "bread," "milk." Customers fill their market baskets with some subset of the items, and we get to know what items people buy together, even if we don't know who they are. Marketers use this information to position items, and control the way a typical customer traverses the store.



In addition to the marketing application, the same sort of question has the following uses:

1. Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts, and can be used for intelligence gathering.

2. Baskets = sentences, items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

In the present context, association rules are useful in data mining if we already have a rough idea of what it is we are looking for. This illustrates the fact that there is no algorithm that will automatically give us everything that is of interest in the database. An algorithm that finds a lot of rules will probably also find a lot of useless rules, while an algorithm that finds only a limited number of associations, without fine tuning, will probably also miss a lot of interesting information.

11.1.1 Data Mining with Apriori Algorithm

The Apriori algorithm discovers items that are frequently associated together.

Let us look at the example of a store that sells DVDs, Videos, CDs, Books and Games. The store owner might want to discover which of these items customers are likely to buy together. This can be used to increase the store's cross-sell and upsell ratios. Customers in this particular store may like buying a DVD and a Game in 10 out of every 100 transactions, or the sale of Videos may hardly ever be associated with a sale of a DVD.

With the information above, the store could strive for more optimum placement of DVDs and Games, as the sale of one of them may improve the chances of the sale of the other frequently associated item. On the other hand, mailing campaigns may be fine tuned to reflect the fact that offering discount coupons on Videos may even negatively impact the sales of DVDs offered in the same campaign. A better decision could be not to offer both DVDs and Videos in a campaign.

To arrive at these decisions, the store may have had to analyze 10,000 past transactions of customers using calculations that separate frequent and consequently important associations from weak and unimportant associations.

These frequently occurring associations are defined by measures known as Support Count and Confidence. The support and confidence measures are defined so that all associations can be weighed and only significant associations analyzed. The measures of an association rule are thus its support and confidence.

The Support Count is the number of transactions or percentage of transactions that feature the association of a set of items.

Assume that the dataset of 9 transactions shown in Table 11.1 below is selected randomly from a universe of 100,000 transactions:





Table 11.1: Use a minimum support percentage of 0.4% and a minimum confidence percentage of 50% or 100%

Transaction ID    List of Items Purchased
1                 Books, CD, Video
2                 CD, Games
3                 CD, DVD
4                 Books, CD, Games
5                 Books, DVD
6                 CD, DVD
7                 Books, DVD
8                 Books, CD, DVD, Video
9                 Books, CD, DVD

The Apriori data mining analysis of the 9 transactions above is known as Market Basket Analysis, as it is designed to discover which items in a series of transactions are frequently associated together.

11.1.2 Implementation Steps

1. The Apriori algorithm would analyze all the transactions in a dataset for each item's support count. Any item that has a support count less than the minimum support count required is removed from the pool of candidate items.

2. Initially, each of the items is a member of the set of candidate 1-itemsets. The support count of each candidate item in the itemset is calculated, and items with a support count less than the minimum required support count are removed as candidates. The remaining candidate items in the itemset are joined to create candidate 2-itemsets that each comprise two items or members.

3. The support count of each two-member itemset is calculated from the database of transactions, and 2-itemsets that occur with a support count greater than or equal to the minimum support count are used to create candidate 3-itemsets. The process in the previous steps is repeated, generating candidate 4- and 5-itemsets, until the support counts of all the candidate itemsets are lower than the minimum required support count.

4. All the candidate itemsets generated with a support count greater than the minimum support count form a set of Frequent Itemsets. These frequent itemsets are then used to generate association rules with a confidence greater than or equal to the Minimum Confidence.




5. Apriori recursively generates all the subsets of each frequent itemset and creates association rules based on subsets with a confidence greater than the minimum confidence.
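The candidate-generation and counting loop described above can be sketched in a few lines of Python. This is a simplified, illustrative sketch only (it omits the subset-pruning refinement of the join step); the transactions list mirrors Table 11.1, and the absolute count MIN_SUPPORT = 2 is an assumed stand-in for the minimum support percentage:

transactions = [
    {"Books", "CD", "Video"}, {"CD", "Games"}, {"CD", "DVD"},
    {"Books", "CD", "Games"}, {"Books", "DVD"}, {"CD", "DVD"},
    {"Books", "DVD"}, {"Books", "CD", "DVD", "Video"}, {"Books", "CD", "DVD"},
]
MIN_SUPPORT = 2  # assumed minimum support count for this toy example

def support(itemset):
    # Number of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT}]

k = 2
while frequent[-1]:
    # Join step: candidate k-itemsets built from pairs of frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    # Count step: keep only candidates that meet the minimum support count.
    frequent.append({c for c in candidates if support(c) >= MIN_SUPPORT})
    k += 1

for level in frequent:
    for itemset in level:
        print(sorted(itemset), "support =", support(itemset))

Generating rules from the frequent itemsets (step 5) then amounts to checking, for each split of a frequent itemset into an antecedent and a consequent, whether support(itemset) / support(antecedent) meets the minimum confidence.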

11.1.3 Improving the Efficiency of Apriori

Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm.

Hash-based technique (hashing itemset counts): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash them into the different buckets of a hash table structure and increase the corresponding bucket counts. A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined.
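A rough sketch of this bucket-counting idea, reusing the transactions list and MIN_SUPPORT from the earlier sketch (the bucket count of 7 and the hash function are illustrative assumptions):

from itertools import combinations

NUM_BUCKETS = 7
bucket_count = [0] * NUM_BUCKETS

def bucket(pair):
    # Hash an item pair into one of NUM_BUCKETS buckets.
    return hash(frozenset(pair)) % NUM_BUCKETS

# While scanning transactions for 1-itemset counts, also hash every 2-itemset.
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_count[bucket(pair)] += 1

def may_be_frequent(pair):
    # A 2-itemset whose bucket count is below the support threshold
    # cannot be frequent and can be dropped from the candidate set.
    return bucket_count[bucket(pair)] >= MIN_SUPPORT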

Transaction reduction: A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration, since subsequent scans of the database for j-itemsets, where j > k, will not require it.

Sampling: The basic idea of the sampling approach is to pick a random sample S of the given data D and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy against efficiency.

Dynamic itemset counting: A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike Apriori, which determines new candidate itemsets only immediately prior to each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent. The resulting algorithm requires fewer database scans than Apriori.

11.2 DATA MINING WITH DECISION TREES

Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that we humans can understand them, or in a database access language like SQL so that records falling into a particular category may be retrieved.

In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, they may not care how or why the model works. In other



situations, the ability to explain the reason for a decision is crucial. In health insurance underwriting, for example, there are legal prohibitions against discrimination based on certain variables. An insurance company could find itself in the position of having to demonstrate to the satisfaction of a court of law that it has not used illegal discriminatory practices in granting or denying coverage. There are a variety of algorithms for building decision trees that share the desirable trait of explicability.

11.2.1 Decision Tree Working Concept

A decision tree is a classifier in the form of a tree structure where each node is either:

a leaf node, indicating a class of instances, or

a decision node that specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.

A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

Example: Decision making in the Bombay stock market is shown in Figure 11.3.

Suppose that the major factors affecting the Bombay stock market are:

what it did yesterday;

what the New Delhi market is doing today;

bank interest rate;

unemployment rate;

India's prospects at cricket.

The table shown in Figure 11.2 is a small illustrative dataset of six days of the Bombay stock market. The first five columns contain the data of each day according to the five questions, and the last column shows the observed result (Yes (Y) or No (N) for "It rises today"). Figure 11.3 illustrates a typical decision tree learned from this data.

Figure 11.2: Decision table

Instance No.  It rose yesterday  New Delhi rises today  Bank rate high  Unemployment high  India is losing  It rises today
1             Y                  Y                      N               N                  Y                Y
2             Y                  N                      Y               Y                  Y                Y
3             N                  N                      N               Y                  Y                Y
4             Y                  N                      Y               N                  Y                N
5             N                  N                      N               N                  Y                N
6             N                  N                      Y               N                  Y                N




Figure 11.3: A decision tree for the Bombay stock market

The process of predicting an instance by this decision tree can also be expressed by answering the questions in the following order:

Is unemployment high?

YES: The Bombay market will rise today

NO: Is the New Delhi market rising today?

YES: The Bombay market will rise today

NO: The Bombay market will not rise today.
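As a quick illustration, the same tree can be written directly as code. This is only a sketch; the function and argument names are invented for the example:

def bombay_market_rises(unemployment_high, new_delhi_rising):
    # Root test: Is unemployment high?
    if unemployment_high:
        return True          # the Bombay market will rise today
    # Second test: Is the New Delhi market rising today?
    return new_delhi_rising  # rises if yes, otherwise it will not rise

# Instance 1 of Figure 11.2: unemployment N, New Delhi rising Y -> rises today
print(bombay_market_rises(unemployment_high=False, new_delhi_rising=True))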

Decision tree induction is a typical inductive approach to learn knowledge on classification. The key requirements to do mining with decision trees are:

Attribute-value description: object or case must be expressible in terms of a fixed collection of properties or attributes.

Predefined classes: The categories to which cases are to be assigned must have been established beforehand (supervised data).

Discrete classes: A case does or does not belong to a particular class, and there must be far more cases than classes.




Sufficient data: Usually hundreds or even thousands of training cases.

"Logical" classification model: Classifiers that can be expressed only as decision trees or sets of production rules.

11.2.2 Other Classification Methods

Case-Based Reasoning

Case-based reasoning (CBR) classifiers are instance-based. The samples or "cases" stored by CBR are complex symbolic descriptions. Business applications of CBR include problem resolution for customer service help desks, for example, where cases describe product-related diagnostic problems. CBR has also been applied to areas such as engineering and law, where cases are either technical designs or legal rulings, respectively.

When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the case-based reasoner will search for training cases having components that are similar to those of the new case. Conceptually, these training cases may be considered as neighbors of the new case. If cases are represented as graphs, this involves searching for subgraphs that are similar to subgraphs within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new case. If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary. The case-based reasoner may employ background knowledge and problem-solving strategies in order to propose a feasible combined solution.

Rough Set Approach

Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued attributes. Continuous-valued attributes must therefore be discretized prior to its use.

Rough set theory is based on the establishment of equivalence classes within the given training data. All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes describing the data. Given real-world data, it is common that some classes cannot be distinguished in terms of the available attributes. Rough sets can be used to approximately or "roughly" define such classes. A rough set definition for a given class C is approximated by two sets – a lower approximation of C and an upper approximation of C. The lower approximation of C consists of all of the data samples that, based on the knowledge of the attributes, certainly belong to C, while the upper approximation consists of the samples that cannot be described as not belonging to C. The lower and upper approximations for a class C are shown in Figure 11.4, where each rectangular region represents an equivalence class. Decision rules can be generated for each class. Typically, a decision table is used to represent the rules.



Fig 11.4: A rough set approximation of the set of samples of class C using the lower and upper approximation sets of C. The rectangular regions represent equivalence classes

Rough sets can also be used for feature reduction (where attributes that do not contribute towards the classification of the given training data can be identified and removed) and relevance analysis (where the contribution or significance of each attribute is assessed with respect to the classification task). The problem of finding the minimal subsets (reducts) of attributes that can describe all of the concepts in the given data set is NP-hard. However, algorithms to reduce the computation intensity have been proposed. In one method, for example, a discernibility matrix is used that stores the differences between attribute values for each pair of data samples. Rather than searching on the entire training set, the matrix is instead searched to detect redundant attributes.

Fig 11.5: Fuzzy values for income



Fuzzy Set Approaches

Rule-based systems for classification have the disadvantage that they involve sharp cutoffs for continuous attributes. For example, consider the following rule for customer credit application approval. The rule essentially says that applications for customers who have had a job for two or more years and who have a high income (i.e. of at least Rs 50K) are approved:

IF (years_employed >= 2) ∧ (income >= 50K) THEN credit = "approved"

Figure 11.5 shows how values for the continuous attribute income are mapped into the discrete categories (low, medium, high), as well as how the fuzzy membership or truth values are calculated. Fuzzy logic systems typically provide tools to assist users.
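To make the idea of graded membership concrete, here is a small illustrative sketch; the break-points of 30K, 50K and 70K are assumptions for the example, not values taken from Figure 11.5:

def fuzzy_income(income_k):
    # Return degrees of membership of an income (in thousands) in the
    # fuzzy sets low, medium and high.
    def ramp(x, lo, hi, rising=True):
        # Linear transition between 0 and 1 over the interval [lo, hi].
        if x <= lo:
            t = 0.0
        elif x >= hi:
            t = 1.0
        else:
            t = (x - lo) / (hi - lo)
        return t if rising else 1.0 - t

    return {
        "low": ramp(income_k, 30, 50, rising=False),
        "medium": min(ramp(income_k, 30, 50), ramp(income_k, 50, 70, rising=False)),
        "high": ramp(income_k, 50, 70),
    }

print(fuzzy_income(49))  # mostly medium, slightly low, not yet high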

11.2.3 Prediction

The prediction of continuous values can be modeled by statistical techniques of regression. For example, we may like to develop a model to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price. Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem can be converted to a linear one.

Linear and Multiple Regression

In linear regression, data are modeled using a straight line. Linear regression is the simplest form of regression. Bivariate linear regression models a random variable, Y (called a response variable), as a linear function of another random variable, X (called a predictor variable), that is

Y = a + bX,

where the variance of Y is assumed to be constant, and a and b are regression coefficients specifying the Y-intercept and slope of the line, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line. Given s samples or data points of the form (x1, y1), (x2, y2), ..., (xs, ys), the regression coefficients can be estimated using this method with the following equations:

b = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²

a = ȳ − b x̄



where x̄ is the average of x1, x2, ..., xs, and ȳ is the average of y1, y2, ..., ys. The coefficients a and b often provide good approximations to otherwise complicated regression equations.

Table 11.2 Salary Data

Linear regression using the method of least squares: Table 11.2 shows a set of paired data where X is the number of years of work experience of a college graduate and Y is the corresponding salary of the graduate. A plot of the data is shown in Figure 11.6, suggesting a linear relationship between the two variables, X and Y. We model the relationship that salary may be related to the number of years of work experience with the equation Y = a + bX.

Given the above data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these values into the above equations, we get

b = [(3 − 9.1)(30 − 55.4) + (8 − 9.1)(57 − 55.4) + ... + (16 − 9.1)(83 − 55.4)] / [(3 − 9.1)² + (8 − 9.1)² + ... + (16 − 9.1)²] = 3.5

a = 55.4 − (3.5)(9.1) = 23.6

Thus, the equation of the least squares line is estimated by Y = 23.6 + 3.5X. Using this equation, we can predict that the salary of a college graduate with, say, 10 years of experience is Rs. 58.6K.
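The same calculation is easy to script. The sketch below is illustrative only: since Table 11.2 is not reproduced in full here, the (experience, salary) pairs are made-up stand-ins, so the printed coefficients will differ somewhat from 23.6 and 3.5:

data = [(3, 30), (8, 57), (9, 64), (13, 72), (16, 83)]  # (X years, Y salary in K), assumed

x_bar = sum(x for x, _ in data) / len(data)
y_bar = sum(y for _, y in data) / len(data)

# b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2),  a = y_bar - b * x_bar
b = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum((x - x_bar) ** 2 for x, _ in data)
a = y_bar - b * x_bar

print(f"Y = {a:.1f} + {b:.1f} X")
print("predicted salary for 10 years of experience:", round(a + b * 10, 1))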

Multiple regression is an extension of linear regression involving more than one predictor variable. It allows response variable Y to be modeled as a linear function of a multidimensional feature vector. An example of a multiple regression model based on two predictor attributes or variables, X1 and X2, is

Y = a + b1X1 + b2X2

The method of least squares can also be applied here to solve for a, b1, and b2.



Figure 11.6: Plot of the data shown in Table 11.2

11.2.4 Nonlinear Regression

Polynomial regression can be modeled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.

Transformation of a polynomial regression model to a linear regression model: consider a cubic polynomial relationship given by

Y = a + b1X + b2X² + b3X³

To convert this equation to linear form, we define new variables:

X1 = X,  X2 = X²,  X3 = X³

The above equation can then be converted to linear form by applying these assignments, resulting in the equation Y = a + b1X1 + b2X2 + b3X3, which is solvable by the method of least squares.
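A brief numpy sketch of this transformation, fitting the cubic as a linear model over the constructed columns X1, X2 and X3 (the data values below are invented for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 9.8, 29.5, 66.0, 126.2, 215.9])  # assumed observations

# Design matrix with columns [1, X, X^2, X^3]: the transformed linear problem.
A = np.column_stack([np.ones_like(X), X, X ** 2, X ** 3])
(a, b1, b2, b3), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"Y = {a:.2f} + {b1:.2f} X + {b2:.2f} X^2 + {b3:.2f} X^3")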

Some models are intractably nonlinear (such as the sum of exponential terms, for example) and cannot be converted to a linear model. For such cases, it may be possible to obtain least squares estimates through extensive calculations on more complex formulae.



11.2.5 Other Regression Models

Linear regression is used to model continuous-valued functions. It is widely used, owing largely to its simplicity. Generalized linear models represent the theoretical foundation on which linear regression can be applied to the modeling of categorical response variables. In generalized linear models, the variance of the response variable Y is a function of the mean value of Y, unlike in linear regression, where the variance of Y is constant. Common types of generalized linear models include logistic regression and Poisson regression. Logistic regression models the probability of some event occurring as a linear function of a set of predictor variables. Count data frequently exhibit a Poisson distribution and are commonly modeled using Poisson regression.

Log-linear models approximate discrete multidimensional probability distributions. They may be used to estimate the probability value associated with data cube cells. For example, suppose we are given data for the attributes city, item, year, and sales. In the log-linear method, all attributes must be categorical, hence continuous-valued attributes must first be discretized. The method can then be used to estimate the probability of each cell in the 4-D base cuboid for the given attributes, based on the 2-D cuboids for city and item, city and year, city and sales, and the 3-D cuboid for item, year, and sales. In this way, an iterative technique can be used to build higher-order data cubes from lower-order ones. Aside from prediction, the log-linear model is useful for data compression (since the smaller-order cuboids together typically occupy less space than the base cuboid) and data smoothing (since cell estimates in the smaller-order cuboids are less subject to sampling variations than cell estimates in the base cuboid).

11.3 CLASSIFIER ACCURACY

Estimating classifier accuracy is important in that it allows one to evaluate how accurately a given classifier will label future data, that is, data on which the classifier has not been trained. For example, if data from previous sales are used to train a classifier to predict customer purchasing behavior, we would like some estimate of how accurately the classifier can predict the purchasing behavior of future customers. Accuracy estimates also help in the comparison of different classifiers.

Figure 11.7: Estimating classifier accuracy with the holdout method.



11.3.1 Estimating Classifier Accuracy

Using training data to derive a classifier and then to estimate the accuracy of the classifier can result in misleading overoptimistic estimates due to overspecialization of the learning algorithm (or model) to the data. Holdout and cross-validation are two common techniques for assessing classifier accuracy, based on randomly sampled partitions of the given data.

In the holdout method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two thirds of the data are allocated to the training set, and the remaining one third is allocated to the test set. The training set is used to derive the classifier, whose accuracy is estimated with the test set. The estimate is pessimistic since only a portion of the initial data is used to derive the classifier. Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds," S1, S2, ..., Sk, each of approximately equal size. Training and testing are performed k times. In iteration i, the subset Si is reserved as the test set, and the remaining subsets are collectively used to train the classifier. That is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1, and so on. The accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of samples in the initial data. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that in the initial data.
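A compact sketch of the k-fold procedure follows; the train and classify callables are placeholders the reader would supply, and the shuffling seed is an arbitrary choice:

import random

def k_fold_accuracy(samples, labels, train, classify, k=10, seed=0):
    # Shuffle the indices once, then slice them into k roughly equal folds.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]

    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in idx if j not in test_idx]
        model = train([samples[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        # Count correct classifications on the held-out fold.
        correct += sum(classify(model, samples[j]) == labels[j] for j in folds[i])
    # Overall correct classifications divided by the total number of samples.
    return correct / len(samples)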

Other methods of estimating classifier accuracy include bootstrapping, which samples the given training instances uniformly with replacement, and leave-one-out, which is k-fold cross-validation with k set to s, the number of initial samples.

In general, stratified 10-fold cross-validation is recommended for estimating classifier accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.

The use of such techniques to estimate classifier accuracy increases the overall computation time, yet is useful for selecting among several classifiers.

11.4 BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

Bayesian classification is based on Bayes theorem, described below. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable



in performance with decision tree and neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naïve". Bayesian belief networks are graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

11.4.1 Bayes Theorem

Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know that it is true that X is an apple. P(X) is the prior probability of X.

P(X), P(H), and P(X|H) may be estimated from the given data, as we shall see below. Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is

P(H|X) = P(X|H) P(H) / P(X)
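A quick numeric illustration of the theorem; the three probabilities below are assumed values for the fruit example, not figures from the text:

p_h = 0.10           # assumed prior: 10% of all samples are apples
p_x = 0.40           # assumed: 40% of all samples are red and round
p_x_given_h = 0.90   # assumed: 90% of apples are red and round

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes theorem
print(p_h_given_x)                       # 0.225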

11.4.2 Naive Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, ..., xn), depicting n measurements made on the sample from n attributes, respectively, A1, A2, ..., An.

2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample, X (i.e.,



having no class label), the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier assigns an unknown sample X to the class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = si/s, where si is the number of training samples of class Ci, and s is the total number of training samples.

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample, that is, there are no dependence relationships among the attributes. Thus,

P(X|Ci) = Πk=1..n P(xk|Ci)

The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training samples, where

a) If Ak is categorical, then P(xk|Ci) = sik / si, where sik is the number of training samples of class Ci having the value xk for Ak, and si is the number of training samples belonging to Ci.

b) If Ak is continuous-valued, then the attribute is typically assumed to have a Gaussian distribution, so that

P(xk|Ci) = g(xk, μCi, σCi)

where g(xk, μCi, σCi) is the Gaussian (normal) density function for attribute Ak, while μCi and σCi are the mean and standard deviation, respectively, given the values for attribute Ak



for training samples of class Ci.

5. In order to classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci. Sample X is then assigned to the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i.

In other words, it is assigned to the class Ci for which P(X|Ci) P(Ci) is the maximum.
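The steps above translate almost directly into code for categorical attributes. The sketch below is illustrative: the data structures, and the simple add-one adjustment used to avoid zero probabilities, are choices made for the example rather than part of the text:

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # samples: list of tuples of categorical attribute values; labels: class labels.
    class_count = Counter(labels)       # si for each class Ci
    value_count = defaultdict(int)      # sik for each (class, attribute index, value)
    for x, c in zip(samples, labels):
        for k, v in enumerate(x):
            value_count[(c, k, v)] += 1
    return class_count, value_count, len(labels)

def classify(model, x):
    class_count, value_count, s = model
    best_class, best_score = None, -1.0
    for c, s_i in class_count.items():
        score = s_i / s                 # prior P(Ci) = si / s
        for k, v in enumerate(x):
            # P(xk|Ci) = sik / si, with a small add-one adjustment against zeros
            score *= (value_count[(c, k, v)] + 1) / (s_i + 1)
        if score > best_score:
            best_class, best_score = c, score
    return best_class                   # the class maximizing P(X|Ci) P(Ci)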

In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. However, in practice this is not always the case owing to inaccuracies in the assumptions made for its use, such as class conditional independence and the lack of available probability data. However, various empirical studies of this classifier in comparison to decision tree and neural network classifiers have found it to be comparable in some domains.

Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes theorem. For example, under certain assumptions, it can be shown that many neural networks and curve-fitting algorithms output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.

11.4.3 Bayesian Belief Networks

The naïve Bayesian classifier makes the assumption of class conditional independence, that is, given the class label of a sample, the values of the attributes are conditionally independent of one another.

Figure 11.8: A simple Bayesian belief network and conditional probability table for the variable Lung Cancer



This assumption simplifies computation. When the assumption holds true, then the naive Bayesian classifier is the most accurate in comparison with all other classifiers. In practice, however, dependencies can exist between variables. Bayesian belief networks specify joint conditional probability distributions. They allow class conditional independencies to be defined between subsets of variables. They provide a graphical model of causal relationships, on which learning can be performed. These networks are also known as belief networks, Bayesian networks, and probabilistic networks. For brevity, we will refer to them as belief networks.

A belief network is defined by two components. The first is a directed acyclic graph, where each node represents a random variable and each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its nondescendants in the graph, given its parents. The variables may be discrete or continuous-valued. They may correspond to actual attributes given in the data or to "hidden variables" believed to form a relationship (such as medical syndromes in the case of medical data).

Figure 11.8 shows a simple belief network for six Boolean variables. The arcs allow a representation of causal knowledge. For example, having lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. Furthermore, the arcs also show that the variable Lung Cancer is conditionally independent of Emphysema, given its parents, Family History and Smoker. This means that once the values of Family History and Smoker are known, then the variable Emphysema does not provide any additional information regarding Lung Cancer.

The second component defining a belief network consists of one conditional probability table (CPT) for each variable. The CPT for a variable Z specifies the conditional distribution P(z | Parents(Z)), where Parents(Z) are the parents of Z. The table in Figure 11.8 shows a CPT for Lung Cancer. The conditional probability for each value of Lung Cancer is given for each possible combination of values of its parents. For instance, from the upper leftmost and bottom rightmost entries, respectively, we see that

P(Lung Cancer = “yes” | Family History = “yes”, Smoker = “yes”) = 0.8

P(Lung Cancer = “no” | Family History = “no”, Smoker = “no”) = 0.9

The joint probability of any tuple (z1, ..., zn) corresponding to the variables or attributes Z1, ..., Zn is computed by

P(z1, ..., zn) = Πi=1..n P(zi | Parents(Zi))

where the values for P(zi | Parents(Zi)) correspond to the entries in the CPT for Zi. A node within the network can be selected as an "output" node, representing a class label attribute. There may be more



than one output node. Inference algorithms for learning can be applied on the network. The classification process, rather than returning a single class label, can return a probability distribution for the class label attribute, that is, predicting the probability of each class.
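As a small illustration of the product formula above, the probability of the partial assignment FamilyHistory = yes, Smoker = yes, LungCancer = yes can be computed from the CPT entries; note that only the 0.8 comes from the Figure 11.8 table, while the two priors below are assumed numbers for the example:

p_fh = 0.3            # P(FamilyHistory = yes), assumed
p_smoker = 0.4        # P(Smoker = yes), assumed
p_lc_given_fh_s = 0.8 # P(LungCancer = yes | FH = yes, S = yes), from the CPT

# Chain-rule product over the three variables; the downstream variables sum out.
p_joint = p_fh * p_smoker * p_lc_given_fh_s
print(p_joint)        # 0.096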

11.4.4 Training Bayesian Belief Networks

In the learning or training of a belief network, a number of scenarios are possible. The network structure may be given in advance or inferred from the data. The network variables may be observable or hidden in all or some of the training samples. The case of hidden data is also referred to as missing values or incomplete data.

If the network structure is known and the variables are observable, then training the network is straightforward. It consists of computing the CPT entries, as is similarly done when computing the probabilities involved in naïve Bayesian classification. When the network structure is given and some of the variables are hidden, then a method of gradient descent can be used to train the belief network. The objective is to learn the values for the CPT entries. Let S be a set of s training samples, X1, X2, ..., Xs. Let wijk be a CPT entry for the variable Yi = yij having the parents Ui = uik. For example, if wijk is the upper leftmost CPT entry of the table in Figure 11.8, then Yi is Lung Cancer; yij is its value, "yes"; Ui lists the parent nodes of Yi, namely, {Family History, Smoker}; and uik lists the values of the parent nodes, namely, {"yes", "yes"}. The wijk are viewed as weights, analogous to the weights in hidden units of neural networks. The set of weights is collectively referred to as w. The weights are initialized to random probability values. The gradient descent strategy performs greedy hill-climbing. At each iteration, the weights are updated and will eventually converge to a local optimum solution.

1. Compute the gradients: For each i, j, k compute

∂ ln Pw(S) / ∂ wijk = Σd=1..s P(Yi = yij, Ui = uik | Xd) / wijk

The probability in the right-hand side of this equation is to be calculated for each training sample Xd in S. For brevity, let's refer to this probability simply as P. When the variables represented by Yi and Ui are hidden for some Xd, then the corresponding probability P can be computed from the observed variables of the sample using standard algorithms for Bayesian network inference.

2. Take a small step in the direction of the gradient: The weights are updated by

wijk = wijk + (l) ∂ ln Pw(S) / ∂ wijk



where l is the learning rate representing the step size, and ∂ ln Pw(S) / ∂ wijk is computed from the previous equation. The learning rate is set to a small constant.

3. Renormalize the weights: Because the weights wijk are probability values, they must be between 0.0 and 1.0, and Σj wijk must equal 1 for all i, k. These criteria are achieved by renormalizing the weights after they have been updated by the previous equation.

Several algorithms exist for learning the network structure from the training data given observable variables. The problem is one of discrete optimization.

11.5 NEURAL NETWORKS FOR DATA MINING

A neural processing element receives inputs from other connected processing elements. These input signals or values pass through weighted connections, which either amplify or diminish the signals. Inside the neural processing element, all of these input signals are summed together to give the total input to the unit. This total input value is then passed through a mathematical function to produce an output or decision value ranging from 0 to 1. Notice that this is a real valued (analog) output, not a digital 0/1 output. If the input signal matches the connection weights exactly, then the output is close to 1. If the input signal totally mismatches the connection weights then the output is close to 0. Varying degrees of similarity are represented by the intermediate values. Now, of course, we can force the neural processing element to make a binary (1/0) decision, but by using analog values ranging between 0.0 and 1.0 as the outputs, we are retaining more information to pass on to the next layer of neural processing units. In a very real sense, neural networks are analog computers.
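A single processing element of this kind can be sketched in a few lines of Python; the weights, inputs and threshold below are made-up numbers for illustration:

import math

def neuron(inputs, weights, threshold):
    # Weighted sum of the inputs plus the threshold term gives the net input.
    total = sum(i * w for i, w in zip(inputs, weights)) + threshold
    # Squash the net input to an analog output between 0 and 1.
    return 1.0 / (1.0 + math.exp(-total))

print(neuron([0.9, 0.1, 0.8], [0.5, -0.3, 0.7], threshold=-0.2))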

Each neural processing element acts as a simple pattern recognition machine. It checks the input signals against its memory traces (connection weights) and produces an output signal that corresponds to the degree of match between those patterns. In typical neural networks, there are hundreds of neural processing elements whose pattern recognition and decision making abilities are harnessed together to solve problems.

11.5.1 Neural Network Topologies

The arrangement of neural processing units and their interconnections can have a profound impact on the processing capabilities of the neural networks. In general, all neural networks have some set of processing units that receive inputs from the outside world, which we refer to appropriately as the "input



units." Many neural networks also have one or more layers of "hidden" processing units that receive inputs only from other processing units. A layer or "slab" of processing units receives a vector of data or the outputs of a previous layer of units and processes them in parallel. The set of processing units that represents the final result of the neural network computation is designated as the "output units". There are three major connection topologies that define how data flows between the input, hidden, and output processing units. These main categories (feed-forward, limited recurrent, and fully recurrent networks) are described in detail in the next sections.

11.5.2 Feed-Forward Networks

Feed-forward networks are used in situations when we can bring all of the information to bear on a problem at once, and we can present it to the neural network. It is like a pop quiz, where the teacher walks in, writes a set of facts on the board, and says, "OK, tell me the answer." You must take the data, process it, and "jump to a conclusion." In this type of neural network, the data flows through the network in one direction, and the answer is based solely on the current set of inputs.

In Figure 11.9 we see a typical feed-forward neural network topology. Data enters the neural network through the input units on the left. The input values are assigned to the input units as the unit activation values. The output values of the units are modulated by the connection weights, either being magnified if the connection weight is positive and greater than 1.0, or being diminished if the connection weight is between 0.0 and 1.0. If the connection weight is negative, the signal is magnified or diminished in the opposite direction.

Figure 11.9: Feed-forward neural networks.

Each processing unit combines all of the input signals coming into the unit along with a threshold value. This total input signal is then passed through an activation function to determine the actual output of the processing unit, which in turn becomes the input to another layer of units in a multi-layer network. The most typical activation function used in neural networks is the S-shaped or sigmoid (also called the logistic) function. This function converts an input value to an output ranging from 0 to 1. The effect of the threshold weights is to shift the curve right or left, thereby making the output value higher or lower, depending on the sign of the threshold weight. As shown in Figure 11.9, the data flows from the input layer through zero,




one, or more succeeding hidden layers and then to the output layer. In most networks, the units from one layer are fully connected to the units in the next layer. However, this is not a requirement of feed-forward neural networks. In some cases, especially when the neural network connections and weights are constructed from a rule or predicate form, there could be fewer connection weights than in a fully connected network. There are also techniques for pruning unnecessary weights from a neural network after it is trained. In general, the fewer weights there are, the faster the network will be able to process data and the better it will generalize to unseen inputs. It is important to remember that "feed-forward" is a definition of connection topology and data flow. It does not imply any specific type of activation function or training paradigm.

11.5.3 Classification by Backpropagation

Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Roughly speaking, a neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units.

Neural networks involve long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.

Advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute towards the usefulness of neural networks for classification in data mining.

The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s.

11.5.4 Backpropagation

Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer down to the first hidden layer (hence the



name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarized in the listing below, and each step is described in turn.

Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers.

Each training sample, X, is processed by the following steps.

Propagate the inputs forward: In this step, the net input and output of each unit in the hidden and output layers are computed. First, the training sample is fed to the input layer of the network. Note that for unit j in the input layer, its output is equal to its input, that is, Oj = Ij for input unit j. The net input to each unit in the hidden and output layers is computed as a linear combination of its inputs. To help illustrate this, a hidden layer or output layer unit is shown in Figure 11.10. The inputs to the unit are, in fact, the outputs of the units connected to it in the previous layer. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden or output layer, the net input, Ij, to unit j is

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.

Algorithm: Backpropagation: Neural network learning for classification, using the backpropagation algorithm.

Input: The training samples, samples; the learning rate, l; a multilayer feed-forward network, network.

Output: A neural network trained to classify the samples.

Method:

(1) Initialize all weights and biases in network;
(2) while terminating condition is not satisfied {
(3)    for each training sample X in samples {
(4)       // Propagate the inputs forward:
(5)       for each hidden or output layer unit j {
(6)          Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
(7)          Oj = 1 / (1 + e^(-Ij)); } // compute the output of each unit j
(8)       // Backpropagate the errors:
(9)       for each unit j in the output layer
(10)         Errj = Oj (1 - Oj)(Tj - Oj); // compute the error
(11)      for each unit j in the hidden layers, from the last to the first hidden layer
(12)         Errj = Oj (1 - Oj) Σk Errk wjk; // compute the error with respect to the next higher layer, k
(13)      for each weight wij in network {
(14)         Δwij = (l) Errj Oi; // weight increment
(15)         wij = wij + Δwij; } // weight update
(16)      for each bias θj in network {
(17)         Δθj = (l) Errj; // bias increment
(18)         θj = θj + Δθj; } // bias update
(19) }}

Each unit in the hidden and output layers takes its net input and then applies an activation function to it, as illustrated in Figure 11.10. The function symbolizes the activation of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given the net input Ij to unit j, then Oj, the output of unit j, is computed as

Oj = 1 / (1 + e^(-Ij))

This function is also referred to as a squashing function, since it maps a large input domain onto the smaller range of 0 to 1. The logistic function is non-linear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.

Figure 11.10: Neural network shows input Layer, Activation Segment and output Weights


Backpropagate the error: The error is propagated backwards by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 - Oj)(Tj - Oj)

where Oj is the actual output of unit j, and Tj is the true output, based on the known class label of the given training sample. Note that Oj (1 - Oj) is the derivative of the logistic function.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is

Errj = Oj (1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.

The weights and biases are updated to reflect the propagated errors. Weights are updated by the following equations, where Δwij is the change in weight wij:

Δwij = (l) Errj Oi

wij = wij + Δwij

The variable l denotes the learning rate, a constant typically having a value between 0.0 and 1.0. Backpropagation learns using a method of gradient descent to search for a set of weights that can model the given classification problem so as to minimize the mean squared distance between the network's class prediction and the actual class label of the samples. The learning rate helps to avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum solution) and encourages finding the global minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.

Biases are updated by the equations below, where Δθj is the change in bias θj:

Δθj = (l) Errj

θj = θj + Δθj

Note that here we are updating the weights and biases after the presentation of each sample. This is referred to as case updating. Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all of the samples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet in practice, case updating is more common since it tends to yield more accurate results.


Terminating condition: Training stops when

all Δwij in the previous epoch were so small as to be below some specified threshold, or

the percentage of samples misclassified in the previous epoch is below some threshold, or

a prespecified number of epochs has expired.

In practice, several hundreds of thousands of epochs may be required before the weights will converge.
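
Below is a minimal Python/NumPy sketch of this training procedure, assuming a single hidden layer, sigmoid activations, case updating, and the simple terminating conditions described above. The function name train_backprop and its parameters are illustrative, not part of the original algorithm listing.

import numpy as np

def sigmoid(x):
    # Logistic (squashing) function: maps any net input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(samples, labels, n_hidden=4, l=0.1, epochs=1000, threshold=1e-4):
    # samples: (n_samples, n_inputs); labels: (n_samples, n_outputs), e.g. one-hot class labels
    rng = np.random.default_rng(0)
    n_in, n_out = samples.shape[1], labels.shape[1]
    # Initialize weights and biases to small random numbers
    W_h = rng.uniform(-0.5, 0.5, (n_in, n_hidden)); b_h = rng.uniform(-0.5, 0.5, n_hidden)
    W_o = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b_o = rng.uniform(-0.5, 0.5, n_out)
    for epoch in range(epochs):
        max_delta = 0.0
        for x, t in zip(samples, labels):
            # Propagate the inputs forward
            O_h = sigmoid(x @ W_h + b_h)             # hidden layer outputs
            O_o = sigmoid(O_h @ W_o + b_o)           # output layer outputs
            # Backpropagate the errors
            err_o = O_o * (1 - O_o) * (t - O_o)      # output layer error
            err_h = O_h * (1 - O_h) * (W_o @ err_o)  # hidden layer error
            # Case updating: weights and biases are adjusted after each sample
            dW_o = l * np.outer(O_h, err_o); dW_h = l * np.outer(x, err_h)
            W_o += dW_o; b_o += l * err_o
            W_h += dW_h; b_h += l * err_h
            max_delta = max(max_delta, np.abs(dW_o).max(), np.abs(dW_h).max())
        # Terminating condition: all weight changes in this epoch were below the threshold
        if max_delta < threshold:
            break
    return W_h, b_h, W_o, b_o

Calling train_backprop on a small numeric data set returns the trained weights and biases; new samples can then be classified by propagating them forward through the same sigmoid computations.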

11.5.5 Backpropagation and Interpretability

A major disadvantage of neural networks lies in their knowledge representation. Acquired knowledge in the form of a network of units connected by weighted links is difficult for humans to interpret. This factor has motivated research in extracting the knowledge embedded in trained neural networks and in representing that knowledge symbolically. Methods include extracting rules from networks and sensitivity analysis.

Various algorithms for the extraction of rules have been proposed. The methods typically impose restrictions regarding procedures used in training the given neural network, the network topology, and the discretization of input values.

Fully connected networks are difficult to articulate. Hence, often the first step towards extracting rules from neural networks is network pruning. This consists of removing weighted links that do not result in a decrease in the classification accuracy of the given network.

Once the trained network has been pruned, some approaches will then perform link, unit, or activation value clustering. In one method, for example, clustering is used to find the set of common activation values for each hidden unit in a given trained two-layer neural network. The combinations of these activation values for each hidden unit are analyzed. Rules are derived relating combinations of activation values with corresponding output unit values. Similarly, the sets of input values and activation values are studied to derive rules describing the relationship between the input and hidden unit layers. Finally, the two sets of rules may be combined to form IF-THEN rules. Other algorithms may derive rules of other forms, including M-of-N rules (where M out of a given N conditions in the rule antecedent must be true in order for the rule consequent to be applied), decision trees with M-of-N tests, fuzzy rules, and finite automata.

Sensitivity analysis is used to assess the impact that a given input variable has on a network output. The value of the input variable is varied while the remaining input variables are fixed at some value. Meanwhile, changes in the network output are monitored. The knowledge gained from this form of analysis can be represented in rules such as "IF X decreases 5% THEN Y increases 8%".
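
As a rough illustration, the following sketch varies one input variable while holding the others fixed and records the change in the network output; the predict function stands for any trained network (hypothetical here), and the 5% perturbation mirrors the rule quoted above.

import numpy as np

def sensitivity(predict, x, i, delta=0.05):
    # Perturb input variable i by +/- delta (e.g. 5%) with the remaining inputs fixed,
    # and report how the monitored network output changes
    x_up, x_down = np.array(x, dtype=float), np.array(x, dtype=float)
    x_up[i] *= (1 + delta)
    x_down[i] *= (1 - delta)
    base = predict(x)
    return {"increase": predict(x_up) - base, "decrease": predict(x_down) - base}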


11.6 CLUSTERING IN DATA MINING

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification. It represents many data objects by few clusters, and hence, it models data by its clusters.

Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining deals with large databases that impose on clustering analysis additional severe computational requirements.

Clustering techniques fall into a group of undirected data mining tools. The goal of undirected data mining is to discover structure in the data as a whole. There is no target variable to be predicted, thus no distinction is being made between independent and dependent variables.

Clustering techniques are used for combining observed examples into clusters (groups) which satisfy two main criteria:

Each group or cluster is homogeneous; examples that belong to the same group are similar to each other.

Each group or cluster should be different from other clusters, that is, examples that belong to one cluster should be different from the examples of other clusters.

Depending on the clustering technique, clusters can be expressed in different ways:

Identified clusters may be exclusive, so that any example belongs to only one cluster.

They may be overlapping; an example may belong to several clusters.

They may be probabilistic, whereby an example belongs to each cluster with a certain probability.

Clusters might have a hierarchical structure, with a crude division of examples at the highest level of the hierarchy, which is then refined to sub-clusters at lower levels.

11.6.1 Requirements for Clustering

Clustering is a challenging and interesting field; potential applications pose their own special requirements.

The following are typical requirements of clustering in data mining.

Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects. However, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to Deal with Different Types of Attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of Clusters with Arbitrary Shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

Minimal Requirements for Domain Knowledge to Determine Input Parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often hard to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

Ability to Deal with Noisy Data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

Insensitivity to the Order of Input Records: Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, when presented with different orderings to such an algorithm, may generate dramatically different clusters. It is important to develop algorithms that are insensitive to the order of input.

High Dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. It is challenging to cluster data objects in high-dimensional space, especially considering that such data can be very sparse and highly skewed.

Constraint-Based Clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic cash-dispensing machines (i.e., ATMs) in a city. To decide upon this, we may cluster households while considering constraints such as the city's rivers and highway networks and customer requirements per region. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.


Interpretability and Usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied up with specific semantic interpretations and applications. It is important to study how an application's goal may influence the selection of clustering methods.

11.6.2 Type of Data in Cluster Analysis

We study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.

Data Matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, race, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p variables):

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the above matrix.

The data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix, since the rows and columns of the former represent different entities, while those of the latter represent the same entity. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying such clustering algorithms.

11.6.3 Interval-Scaled Variables

Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.

The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that variable and thus a larger effect on the resulting clustering structure. To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is particularly useful when given no prior knowledge of the data. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others. For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.

To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.

1. Calculate the mean absolute deviation, sf:

   sf = (1/n) (|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)

where x1f, . . ., xnf are n measurements of f, and mf is the mean value of f, that is,

   mf = (1/n) (x1f + x2f + ... + xnf)

2. Calculate the standardized measurement, or z-score:

   zif = (xif - mf) / sf


The mean absolute deviation, sf, is more robust to outliers than the standard deviation. When computing the mean absolute deviation, the deviations from the mean (i.e., |xif - mf|) are not squared; hence, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as the median absolute deviation. However, the advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small; hence, the outliers remain detectable.

Standardization may or may not be useful in a particular application. Thus the choice of whether and how to perform standardization should be left to the user.
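
A small Python sketch of the two standardization steps above, using the mean absolute deviation rather than the standard deviation, is given below; the variable names and the sample measurements are illustrative.

import numpy as np

def standardize(x):
    # x holds the n measurements x1f, ..., xnf of one variable f
    m_f = x.mean()                    # mean value of f
    s_f = np.abs(x - m_f).mean()      # mean absolute deviation (deviations are not squared)
    return (x - m_f) / s_f            # z-scores

heights = np.array([150.0, 160.0, 170.0, 180.0, 250.0])  # the last value is an outlier
print(standardize(heights))           # the outlier's z-score stays large, so it remains detectable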

After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

   d(i, j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)

where i = (xi1, xi2, . . ., xip) and j = (xj1, xj2, . . ., xjp) are two p-dimensional data objects.

Another well-known metric is Manhattan (or city block) distance, defined as

   d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

Both the Euclidean distance and Manhattan distance satisfy the following mathematical requirements of a distance function:

1. d(i, j) ≥ 0: Distance is a nonnegative number.

2. d(i, i) = 0: The distance of an object to itself is 0.

3. d(i, j) = d(j, i): Distance is a symmetric function.

4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangular inequality).

Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as

   d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q)

where q is a positive integer. It represents the Manhattan distance when q = 1, and Euclidean distance when q = 2.


If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

   d(i, j) = sqrt(w1 |xi1 - xj1|^2 + w2 |xi2 - xj2|^2 + ... + wp |xip - xjp|^2)

Weighting can also be applied to the Manhattan and Minkowski distances.
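
The sketch below computes these distances with one Python function, since Manhattan, Euclidean and the weighted variants are all special cases of the Minkowski distance; the function name and the toy vectors are illustrative.

import numpy as np

def minkowski(xi, xj, q=2, w=None):
    # q = 1 gives Manhattan distance, q = 2 gives Euclidean distance;
    # optional nonnegative weights w reflect the perceived importance of each variable
    diff = np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)) ** q
    if w is not None:
        diff = np.asarray(w, dtype=float) * diff
    return diff.sum() ** (1.0 / q)

i, j = [1.0, 3.0, 5.0], [4.0, 1.0, 5.0]
print(minkowski(i, j, q=1))   # Manhattan: |1-4| + |3-1| + |5-5| = 5
print(minkowski(i, j, q=2))   # Euclidean: sqrt(9 + 4 + 0) = 3.6055...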

11.6.4 Binary Variables

A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary variables as if they are interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for computing dissimilarities.

To compute the dissimilarity between two binary variables, one approach involves computing a dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table of Table 3.3, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.

Table 3.3 A contingency table for binary variables

A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having the states male and female. Similarity that is based on symmetric binary variables is called invariant similarity in that the result does not change when some or all of the binary variables are coded differently. For invariant similarities, the most well-known coefficient for assessing the dissimilarity between objects i and j is the simple matching coefficient, defined as

   d(i, j) = (r + s) / (q + r + s + t)


A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we code the most important outcome, which is usually the rarer one, by 1 and the other by 0. For asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary variables are often considered "monary" (as if having one state). The similarity based on such variables is called noninvariant similarity. For noninvariant similarities, the most well-known coefficient is the Jaccard coefficient, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation:

   d(i, j) = (r + s) / (q + r + s)

Example: Dissimilarity between binary variables

Suppose that a patient record table (Table 3.4) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object-id, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to the Jaccard coefficient formula, the distance between each pair of the three patients, Ram, Sita, and Laxman, should be

Table 3.4 A relational table containing mostly binary attributes.


These measurements suggest that Laxman and Sita are unlikely to have a similar disease since they have the highest dissimilarity value among the three pairs. Of the three patients, Ram and Sita are the most likely to have a similar disease.
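
The following sketch computes both coefficients from the counts q, r, s and t defined above. The two patient vectors are hypothetical stand-ins (Table 3.4 itself is not reproduced here), so the printed value only illustrates the computation.

def binary_dissimilarity(a, b, symmetric=True):
    # Count the four cells of the 2-by-2 contingency table for two binary vectors
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    if symmetric:
        return (r + s) / (q + r + s + t)   # simple matching coefficient
    return (r + s) / (q + r + s)           # Jaccard coefficient (negative matches t ignored)

ram  = [1, 0, 1, 0, 0, 0]   # hypothetical asymmetric test results (1 = yes/positive)
sita = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(ram, sita, symmetric=False))   # 1/3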

11.6.5 Nominal, Ordinal and Ratio-Scaled Variables

Nominal Variables

A nominal variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a nominal variable that may have, say, five states: red, yellow, green, pink and blue.

Let the number of states of a nominal variable be M. The states can be denoted by letters, symbols or a set of integers, such as 1, 2, . . ., M. Notice that such integers are used just for data handling and do not represent any specific ordering.

The dissimilarity between two objects i and j can be computed using the simple matching approach:

   d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.

Nominal variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, for the nominal variable map_color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.

Ordinal Variables

A discrete ordinal variable resembles a nominal variable, except that the M states of the ordinal variable are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite


number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has Mf states. These ordered states define the ranking 1, . . ., Mf.

The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:

1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1, . . ., Mf. Replace each xif by its corresponding rank, rif ∈ {1, . . ., Mf}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank rif of the ith object in the fth variable by

   zif = (rif - 1) / (Mf - 1)

3. Dissimilarity can then be computed using any of the distance measures described in Section 11.6.3 for interval-scaled variables, using zif to represent the f value for the ith object, as the sketch below illustrates.
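
A minimal sketch of this rank normalization, assuming the ordered states of the variable are known, is given below; the function name and the rank example are illustrative.

def ordinal_to_interval(values, ordered_states):
    # Map each ordinal value to its rank r_if in {1, ..., M_f}, then rescale to z_if in [0.0, 1.0]
    M_f = len(ordered_states)
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M_f - 1) for v in values]

print(ordinal_to_interval(["assistant", "full", "associate"],
                          ["assistant", "associate", "full"]))   # [0.0, 1.0, 0.5]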

Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

A e^(Bt) or A e^(-Bt),

where A and B are positive constants. Typical examples include the growth of a bacteria population, or the decay of a radioactive element.

There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.

Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice since it is likely that the scale may be distorted.

Apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the formula yif = log(xif). The yif values can be treated as interval-valued, as illustrated in the sketch after this list. Note that for some ratio-scaled variables, log-log or other transformations may be applied, depending on the definition and application.

Treat xif as continuous ordinal data and treat their ranks as interval-valued.
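
A small sketch of the logarithmic transformation mentioned in the second method is shown below; the bacterial counts are made up purely for illustration.

import math

def ratio_to_interval(x_f):
    # y_if = log(x_if): the transformed values can then be treated as interval-scaled
    return [math.log(x) for x in x_f]

print(ratio_to_interval([10.0, 100.0, 1000.0]))   # roughly linear spacing after the transform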


11.6.6 Variables of Mixed Types

In many real databases, objects are described by a mixture of variable types. One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate cluster analysis per variable type will generate compatible results.

A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0].

Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

   d(i, j) = ( Σf δij(f) dij(f) ) / ( Σf δij(f) )

where the indicator δij(f) = 0 if either (1) xif or xjf is missing (i.e., there is no measurement of variable f for object i or object j), or (2) xif = xjf = 0 and variable f is asymmetric binary; otherwise δij(f) = 1. The contribution of variable f to the dissimilarity between i and j, dij(f), is computed dependent on its type:

If f is binary or nominal: dij(f) = 0 if xif = xjf; otherwise dij(f) = 1.

If f is interval-based: dij(f) = |xif - xjf| / (maxh xhf - minh xhf), where h runs over all nonmissing objects for variable f.

If f is ordinal or ratio-scaled: compute the ranks rif and zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled.

Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.
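
The sketch below is one way to implement the combined dissimilarity, assuming that missing measurements are given as None, that ordinal and ratio-scaled variables have already been converted to ranks, and that the per-variable ranges and state counts are supplied by the caller; all parameter names are illustrative.

def mixed_dissimilarity(obj_i, obj_j, var_types, ranges=None, M=None):
    # var_types[f] is one of "interval", "binary", "nominal", "asym_binary", "ordinal";
    # ranges[f] = (min, max) over all objects for interval variables;
    # M[f] = number of ordered states for ordinal (or rank-converted ratio-scaled) variables
    num, den = 0.0, 0.0
    for f, (x_if, x_jf) in enumerate(zip(obj_i, obj_j)):
        if x_if is None or x_jf is None:
            continue                              # indicator = 0: missing measurement
        if var_types[f] == "asym_binary" and x_if == 0 and x_jf == 0:
            continue                              # indicator = 0: negative match ignored
        if var_types[f] in ("binary", "nominal", "asym_binary"):
            d_f = 0.0 if x_if == x_jf else 1.0
        elif var_types[f] == "interval":
            lo, hi = ranges[f]
            d_f = abs(x_if - x_jf) / (hi - lo)
        else:                                     # ordinal: ranks rescaled to [0.0, 1.0]
            z_i = (x_if - 1) / (M[f] - 1)
            z_j = (x_jf - 1) / (M[f] - 1)
            d_f = abs(z_i - z_j)
        num += d_f
        den += 1.0
    return num / den if den else 0.0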


11.7 A CATEGORIZATION OF MAJOR CLUSTERING METHODS

The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose.

In general, major clustering methods can be classified into the following categories.

Partitioning Methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different; various criteria exist for judging the quality of partitions.

To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended.

Hierarchical Methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices. However, a major problem of such techniques is that they cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in CURE and Chameleon, or (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm.


Density-Based Methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

DBSCAN is a typical density-based method that grows clusters according to a density threshold. OPTICS is a density-based method that computes an augmented cluster ordering for automatic and interactive cluster analysis.

Grid-Based Methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.

STING is a typical example of a grid-based method. CLIQUE and WaveCluster are two clustering algorithms that are both grid-based and density-based.

Model-Based Methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.

11.8 CLUSTERING ALGORITHM

We explain here the basics of the simplest of clustering methods: the k-means algorithm. There are many other methods, like self-organizing maps (Kohonen networks) or probabilistic clustering methods (the AutoClass algorithm), which are more sophisticated and proficient, but the k-means algorithm seemed the best choice for the illustration of the main principles.

11.8.1 K-Means Algorithm

This algorithm has as an input a predefined number of clusters, that is, the k from its name. Means stands for an average, an average location of all the members of a particular cluster. When dealing with clustering techniques, one has to adopt a notion of a high-dimensional space, or space in which orthogonal dimensions are all attributes from the table of data we are analyzing. The value of each attribute of an example represents a distance of the example from the origin along the attribute axes. Of course, in order to use this geometry efficiently, the values in the data set must all be numeric (categorical data must be transformed into numeric ones!) and should be normalized in order to allow fair computation of the overall distances in a multi-attribute space.

The k-means algorithm is a simple, iterative procedure, in which a crucial concept is that of the centroid. A centroid is an artificial point in the space of records which represents the average location of the particular cluster. The coordinates of this point are averages of attribute values of all examples that belong to the cluster. The steps of the k-means algorithm are given below.

1. Select randomly k points (they can also be examples) to be the seeds for the centroids of k clusters.

2. Assign each example to the centroid closest to the example, forming in this way k exclusive clusters of examples.

3. Calculate new centroids of the clusters. For that purpose, average all attribute values of the examples belonging to the same cluster (centroid).

4. Check if the cluster centroids have changed their "coordinates". If yes, start again from step 2. If not, cluster detection is finished and all examples have their cluster memberships defined.

Usually this iterative procedure of redefining centroids and reassigning the examples to clusters needs only a few iterations to converge.
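
A compact Python/NumPy sketch of these four steps is given below, assuming the examples are already numeric and normalized as required above; the function name and parameters are illustrative.

import numpy as np

def k_means(examples, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k examples at random as the seeds for the centroids
    centroids = examples[rng.choice(len(examples), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each example to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(examples[:, None, :] - centroids[None, :, :], axis=2)
        membership = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the average of its members
        new_centroids = np.array([examples[membership == c].mean(axis=0)
                                  if np.any(membership == c) else centroids[c]
                                  for c in range(k)])
        # Step 4: stop when the centroid "coordinates" no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, membership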

11.8.2 Important Issues in Automatic Cluster Detection

Most of the issues related to automatic cluster detection are connected to the kinds of questions we want to be answered in the data mining project, or to data preparation for their successful application.

Distance Measure

Most clustering techniques use the Euclidean distance formula for the distance measure (the square root of the sum of the squares of distances along each attribute axis).

Non-numeric variables must be transformed and scaled before the clustering can take place. Depending on these transformations, the categorical variables may dominate the clustering results or they may be even completely ignored.

Choice of the Right Number of Clusters

If the number of clusters k in the k-means method is not chosen so as to match the natural structure of the data, the results will not be good. The proper way to alleviate this is to experiment with different values for k. In principle, the best k value will exhibit the smallest intra-cluster distances and largest inter-cluster distances. More sophisticated techniques measure these qualities automatically, and optimize the number of clusters in a separate loop (AutoClass).

Cluster Interpretation

Once the clusters are discovered they have to be interpreted in order to have some value for the data mining project. There are different ways to utilize clustering results:

Cluster membership can be used as a label for a separate classification problem. Some descriptive data mining technique (like decision trees) can be used to find descriptions of clusters.

Clusters can be visualized using 2D and 3D scatter graphs or some other visualization technique.

Differences in attribute values among different clusters can be examined, one attribute at a time.

11.8.3 Application Issues

Clustering techniques are used when we expect natural groupings in examples of the data. Clusters should then represent groups of items (products, events, customers) that have a lot in common. Creating clusters prior to application of some other data mining technique (decision trees, neural networks) might reduce the complexity of the problem by dividing the space of examples. These space partitions can be mined separately, and such a two-step procedure might exhibit improved results (descriptive or predictive) as compared to data mining without using clustering.

11.9 GENETIC ALGORITHMS

There is a fascinating interaction between technology and nature, and by means of technical inventions we can learn to understand nature better. Conversely, nature is often a source of inspiration for technical breakthroughs. The same principle also applies to computer science, a most fertile area for exchange of views between biology and computer science being evolutionary computing. Evolutionary computing occupies itself with problem solving by the application of evolutionary mechanisms. At present, genetic algorithms are considered to be among the most successful machine learning techniques.

Genetic algorithms are inspired by Darwin's theory of evolution. A solution to a problem solved by genetic algorithms uses an evolutionary process (it is evolved).


The algorithm begins with a set of solutions (represented by chromosomes) called a population. Solutions from one population are taken and used to form a new population. This is motivated by the hope that the new population will be better than the old one. Solutions which are then selected to form new solutions (offspring) are selected according to their fitness: the more suitable they are, the more chances they have to reproduce.
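
A toy sketch of this evolutionary loop over bit-string chromosomes is given below; the fitness function, population size and mutation rate are arbitrary choices for illustration only.

import random

def genetic_algorithm(fitness, chrom_len, pop_size=20, generations=50, p_mut=0.02):
    # Start from a random population of bit-string chromosomes
    pop = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: fitter chromosomes get more chances to reproduce
        parents = random.choices(pop, weights=[fitness(c) for c in pop], k=pop_size)
        offspring = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randrange(1, chrom_len)          # single-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Mutation: flip each gene with a small probability
                offspring.append([1 - g if random.random() < p_mut else g for g in child])
        pop = offspring
    return max(pop, key=fitness)

print(genetic_algorithm(lambda c: sum(c) + 1, chrom_len=10))   # tends towards all 1s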

Genetic algorithms can be viewed as a kind of meta-learning strategy, which means that the genetic approach could be employed by individuals who at the moment use almost any other learning mechanism. The past few years have seen the development of many hybrid approaches, in which neural networks have been used to create input for genetic algorithms or, alternatively, genetic algorithms to optimize the output of neural networks. At present, genetic programming is widely used in financial markets and for insurance applications.

11.10 EXERCISE

I. FILL IN THE BLANKS

1. The ——— algorithm in data mining discovers items that are frequently associated together.

2. ————— are powerful and popular tools for classification and prediction.

3. Feed forward neural network contains 3 layers namely ——, —— and ———

4. ———— is a division of data into groups of similar objects.

5. K-means algorithm is a simple ——— procedure

6. Algorithm begins with a set of solutions (represented by chromosomes) called ——————

7. The ––––––––––– supervised learning algorithm is one of the popular techniques used in neural networks.

ANSWERS

1. APriori

2. Decision trees

3. Input, Hidden, output

4. Clustering

5. Iterative

6. Population

7. Backpropagation

Page 172: BSIT 53 New

II. ANSWER THE FOLLOWING QUESTIONS

1. Explain the Apriori algorithm in detail.

2. Explain the decision tree working concept of data mining.

3. Explain Bayesian classification.

4. Explain the neural network for data mining.

5. Explain the steps of Backpropagation.

6. Explain the requirements for clustering.

7. Explain the categorization of clustering methods.

8. Explain the K-means algorithm in detail.


Chapter 12

Guidelines of KDD Environment

12.0 INTRODUCTION

The goal of a KDD process is to obtain an ever-increasing and better understanding of the changing environment of the organization. A KDD environment supports the data mining process, but this process is so involved that it is neither realistic nor desirable to try to support it with just one

generic tool. Rather, one needs a suite of tools that is carefully selected and tuned specifically for each organization utilizing data mining. Strictly speaking, there exists no generic data mining tool: databases, pattern recognizers, machine learning, reporting tools, statistical analysis, everything can be of use at times. Still, it is clear that nevertheless one needs some guidance in how to set up a KDD environment.

12.1 GUIDELINES

It is customary in the computer industry to formulate rules of thumb that help information technology (IT) specialists to apply new developments. In setting up a reliable data mining environment we may follow these guidelines so that the KDD system works in the manner we desire.

1. Support Extremely Large Data Sets

Data mining deals with extremely large data sets consisting of billions of records, and without proper platforms to store and handle these volumes of data, no reliable data mining is possible. Parallel servers with databases optimized for decision support system oriented queries are useful. Fast and flexible access to large data sets is very important.

2. Support Hybrid Learning

Learning tasks can be divided into three areas


a. Classification tasks

b. Knowledge engineering tasks

c. Problem-solving tasks

Not all algorithms can perform well in all the above areas, as discussed in previous chapters. Depending on our requirement, one has to choose the appropriate one.

3. Establish a Data Warehouse

A data warehouse contains historic data and is subject oriented and static, that is, users do not update the data but it is created on a regular time-frame on the basis of the operational data of an organization. It is thus obvious that a data warehouse is an ideal starting point for a data mining process, since data mining depends heavily on the permanent availability of historic data, and in this sense a data warehouse could be regarded as indispensable.

4. Introduce Data Cleaning Facilities

Even when a data warehouse is in operation, the data is certain to contain all sorts of heterogeneous mixtures. Special tools for cleaning data are necessary and some advanced tools are available, especially in the field of de-duplication of client files. Other cleaning techniques are only just emerging from research laboratories.

5. Facilitate Working with Dynamic Coding

Creative coding is the heart of the knowledge discovery process. The environment should enable the user to experiment with different coding schemes, store partial results, make attributes discrete, create time series out of historic data, select random sub-samples, separate test sets and so on. A project management environment that keeps track of the genealogy of different samples and tables, as well as of the semantics and transformations of the different attributes, is vital.

6. Integrate with Decision Support System

Data mining looks for hidden data that cannot easily be found using normal query techniques. A knowledge discovery process always starts with traditional decision support system activities and from there we zoom in on interesting parts of the data set.

7. Choose Extendible Architecture

New techniques for pattern recognition and machine learning are under development and we also see many developments in the database area. It is advisable to choose an architecture that enables us to integrate new tools at later stages. Object-oriented technology typically helps this kind of flexibility.


8. Support Heterogeneous Databases

Not all the necessary data is necessarily to be found in the data warehouse. Sometimes we will need to enrich the data warehouse with information from unexpected sources, such as information brokers, or with operational data that is not stored in our regular data warehouse. In order to facilitate this, the data mining environment must support a variety of interfaces: hierarchical databases, flat files, various relational databases and object-oriented database systems.

9. Introduce Client/Server Architecture

A data mining environment needs extensive reporting facilities. Some developments, such as data landscapes, point in the direction of highly interactive graphic environments, but database servers are not very suitable for these tasks. Discovery jobs need to be processed by large data mining servers, while further refinement and reporting will take place on a client. Separating the data mining activities on the servers from the clients is vital for good performance. Client/server is a much more flexible system which moves the burden of visualization and graphical techniques from the servers to the local machine. We can then optimize our database server completely for data mining. Adequate parallelization of data mining algorithms on large servers is of vital importance in this respect.

10. Introduce Cache Optimization

Learning and pattern recognition algorithms that operate on databases often need very special and frequent access to the data. Usually it is either impossible or impractical to store the data in separate tables or to cache large portions in internal memory. The learning algorithms in a data mining environment should be optimized for this type of database access. A low-level integration with the database environment is desirable.

It is very important to note that the knowledge discovery process is not a one-off activity that we implement and then ignore; the successful organization of the future will have to keep permanently alert both to possible new sources of information and to the technologies available for opening up these sources. The major problem facing our information society is the enormous abundance of data. In the future every organization will have to find its way through this enormous amount of information, and data mining will play an active and crucially important role.

12.2 EXERCISE

I. FILL IN THE BLANKS

1. ———— servers with databases optimized for decision support system oriented queries are useful.

2. Learning tasks can be divided into three ——, —— and ——— areas.


3. A ———— contains historic data and is subject oriented and static.

4. ——— coding is the heart of the knowledge discovery process.

5. A knowledge discovery process always starts with traditional ——————— system.

ANSWERS

1. Parallel

2. Classification tasks, knowledge engineering tasks, problem-solving tasks

3. Data warehouse

4. Creative

5. Decision support

II. ANSWER THE FOLLOWING QUESTIONS

1. Explain the guidelines of KDD environment in detail.


Chapter 13

Data Mining Application

13.0 INTRODUCTION

In the previous chapters, we studied principles and methods for mining relational data and complex types of data. We know data mining is an interdisciplinary field and it is an upcoming field with wide and diverse applications; there is still a nontrivial gap between general principles of data mining and

domain-specific, effective data mining tools for particular applications. In this chapter we discuss some of the applications of data mining.

13.1 DATA MINING FOR BIOMEDICAL AND DNA DATA ANALYSIS

Recently, there has been explosive growth in the field of biomedical research, ranging from the development of new pharmaceuticals and advances in cancer therapies to the identification and study of the human genome by discovering large-scale sequencing patterns and gene functions. Since a great deal of biomedical research has focused on DNA data analysis, we study this application here.

Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention and treatment.

An important focus in genome research is the study of DNA sequences since such sequences form the foundation of the genetic codes of all living organisms. All DNA sequences are comprised of four basic building blocks (called nucleotides): adenine (A), cytosine (C), guanine (G) and thymine (T). These four nucleotides are combined to form long sequences or chains that resemble a twisted ladder.


Human beings have around 1,00,000 genes. A gene is usually comprised of hundreds of individual nucleotides arranged in a particular code. There are almost an unlimited number of ways that the nucleotides can be ordered and sequenced to form distinct genes. It is challenging to identify particular gene sequence patterns that play roles in various diseases. Since many interesting sequential pattern analysis and similarity search techniques have been developed in data mining, data mining has become a powerful tool and contributes substantially to DNA analysis in the following ways:

a) Semantic integration of heterogeneous, distributed genome databases

Due to the highly distributed, uncontrolled generation and use of a wide variety of DNA data, the semantic integration of such heterogeneous and widely distributed genome databases becomes an important task for systematic and coordinated analysis of DNA databases.

Data cleaning and data integration methods developed in data mining will help the integration of genetic data and the construction of data warehouses for genetic data analysis.

b) Similarity search and comparison among DNA sequences

One of the most important search problems in genetic analysis is similarity search and comparison among DNA sequences. Gene sequences isolated from diseased and healthy tissues can be compared to identify critical differences between the two classes of genes. Data transformation methods such as scaling, normalization and window stitching, which are popularly used in the analysis of time-series data, are ineffective for genetic data since such data are nonnumeric and the precise interconnection between different kinds of nucleotides plays an important role in their function. On the other hand, the analysis of frequent sequential patterns is important in the analysis of similarity and dissimilarity in genetic sequences.

c) Association analysis

Association analysis methods can be used to help determine the kinds of genes that are likely to co-occur in target samples. Such analysis would facilitate the discovery of groups of genes and the study of interactions and relationships between them.

d) Path analysis

While a group of genes may contribute to a disease process, different genes may become active at different stages of the disease. If the sequence of genetic activities across the different stages of disease development can be identified, it may be possible to develop medicines that target the different stages separately, therefore achieving more effective treatment of the disease.

e) Visualization tools and genetic data analysis

Complex structures and sequencing patterns of genes are most effectively presented in graphs, trees and chains by various kinds of visualization tools. Such visually appealing structures and


patterns facilitate pattern understanding, knowledge discovery and interactive data exploration. Visualization therefore plays an important role in biomedical data mining.

13.2 DATA MINING FOR FINANCIAL DATA ANALYSIS

Most banks and financial institutions offer a wide variety of banking services (for example saving, balance checking, individual transactions), credit (such as loans, mortgage) and investment services (mutual funds). Some also offer insurance services and stock investment services.

Financial data collected in the banking and financial industry are often relatively complete, reliable and of high quality, which facilitates systematic data analysis and data mining. The various issues are:

a) Design and construction of data warehouses for multidimensional data analysis and data mining

Data warehouses need to be constructed for banking and financial data. Multidimensional data analysis methods should be used to analyze the general properties of such data. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses and outlier analyses all play important roles in financial data analysis and mining.

b) Loan payment prediction and customer credit policy analysis

Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating. Data mining methods, such as feature selection and attribute relevance ranking, may help identify important factors and eliminate irrelevant ones. In some cases, analysis of the customer payment history may find that, say, payment-to-income ratio is a dominant factor, while education level and debt ratio are not. The bank may then decide to adjust its loan-granting policy so as to grant loans to those whose application was previously denied but whose profile shows relatively low risks according to the critical factor analysis.

c) Classification and clustering of customers for targeted marketing

Classification and clustering methods can be used for customer group identification and targeted marketing. Effective clustering and collaborative filtering methods can help identify customer groups, associate a new customer with an appropriate customer group and facilitate targeted marketing.

d) Detection of money laundering and other financial crimes

To detect money laundering and other financial crimes, it is important to integrate information from multiple databases, as long as they are potentially related to the study. Multiple data analysis tools can then be used to detect unusual patterns, such as large amounts of cash flow at


certain periods, by certain groups of people, and so on. Linkage analysis tools are used to identify links among different people and activities, classification tools are used to group different cases, outlier analysis tools are used to detect unusual amounts of fund transfer or other activities, and sequential pattern analysis tools are used to characterize unusual access sequences. These tools may identify important relationships and patterns of activities and help investigators focus on suspicious cases for further detailed examination.

13.3 DATA MINING FOR THE RETAIL INDUSTRY

The retail industry is a major application area for data mining since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service records and so on. The quantity of data collected continues to expand rapidly due to the web and e-commerce. Today, many stores also have web sites where customers can make purchases on-line.

Retail data mining can help identify customer buying behaviours, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies and reduce the cost of the business. The following are a few data mining activities carried out in the retail industry.

a) Design and construction of data warehouses based on the benefits of data mining

The first aspect is to design a data warehouse. This involves deciding which dimensions and levels to include and what preprocessing to perform in order to facilitate quality and efficient data mining.

b) Multidimensional analysis of sales, customers, products, time and region

The retail industry requires timely information regarding customer needs, product sales, trends and fashions, as well as the quality, cost, profit and service of commodities. It is therefore important to provide powerful multidimensional analysis and visualization tools, including the construction of sophisticated data cubes according to the needs of data analysis.

c) Analysis of the effectiveness of sales campaigns

The retail industry conducts sales campaigns using advertisements, coupons and various kinds of discounts and bonuses to promote products and attract customers. Careful analysis of the effectiveness of sales campaigns can help improve company profits. Multidimensional analysis can be used for this purpose by comparing the amount of sales and the number of transactions containing the sale items during the sales period versus those containing the same items before or after the campaign.
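A very small comparative analysis of this kind, assuming the sales figures for a promoted item are tagged with a hypothetical "period" column, can be written as follows.

    import pandas as pd

    # Hypothetical sales of a promoted item before and during a campaign.
    sales = pd.DataFrame({
        "period": ["before", "before", "before", "during", "during", "during"],
        "amount": [100, 120, 110, 180, 200, 190],
    })

    # Summarise total, count and average sales per period.
    summary = sales.groupby("period")["amount"].agg(["sum", "count", "mean"])
    print(summary)

    # Relative lift in average sales during the campaign versus before it.
    lift = summary.loc["during", "mean"] / summary.loc["before", "mean"] - 1
    print(f"Campaign lift: {lift:.0%}")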


d) Customer retention – analysis of customer loyalty

With customer loyalty card information, one can register sequences of purchases of particular customers. Customer loyalty and purchase trends can be analyzed in a systematic way. Goods purchased at different periods by the same customer can be grouped into sequences. Sequential pattern mining can then be used to investigate changes in customer consumption or loyalty and to suggest adjustments to the pricing and variety of goods in order to help retain customers and attract new ones.
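A simple sketch of the first step of such an analysis, assuming a hypothetical table of loyalty-card purchases, groups each customer's purchases into a time-ordered sequence and counts the most frequent consecutive item pairs; a full sequential pattern mining algorithm (such as GSP or PrefixSpan) would build on sequences of this form.

    from collections import Counter
    import pandas as pd

    # Hypothetical loyalty-card records: customer, purchase date and item.
    purchases = pd.DataFrame({
        "customer": ["c1", "c1", "c1", "c2", "c2", "c3", "c3"],
        "date":     ["2024-01-05", "2024-02-10", "2024-03-02",
                     "2024-01-20", "2024-02-15", "2024-01-08", "2024-03-01"],
        "item":     ["printer", "ink", "paper", "printer", "ink", "laptop", "mouse"],
    })

    # Build one time-ordered item sequence per customer.
    sequences = (purchases.sort_values("date")
                 .groupby("customer")["item"].apply(list))

    # Count consecutive item pairs across all customer sequences.
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    print(pair_counts.most_common(3))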

e) Purchase recommendations and cross-reference of items

Using association mining on sales records, one may discover that a customer who buys a particular brand of bread is likely to buy another set of items. Such information can be used to form purchase recommendations. Purchase recommendations can be advertised on the web, in weekly flyers or on sales receipts to help improve customer service, aid customers in selecting items and increase sales.
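The following sketch uses hypothetical market baskets and plain item-pair counting, rather than a full association rule miner, to illustrate how such cross-item recommendations might be derived.

    from collections import Counter
    from itertools import combinations

    # Hypothetical market baskets (one list of items per sales receipt).
    baskets = [
        ["bread", "milk", "butter"],
        ["bread", "butter"],
        ["bread", "milk"],
        ["milk", "cereal"],
        ["bread", "butter", "jam"],
    ]

    # Count how often each unordered pair of items is bought together.
    pair_counts = Counter()
    for basket in baskets:
        pair_counts.update(combinations(sorted(set(basket)), 2))

    # Recommend the items most frequently bought together with bread.
    recommendations = [(pair, count) for pair, count in pair_counts.items()
                       if "bread" in pair]
    recommendations.sort(key=lambda x: -x[1])
    print(recommendations)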

13.4 OTHER APPLICATIONS

As mentioned earlier, data mining is an interdisciplinary field. Data mining can be used in many areas. Some of the applications are mentioned below.

Data mining for the telecommunication industry

Data mining system products and research prototypes.

13.5 EXERCISE

1. Identify an application and explain the data mining techniques that can be incorporated to solve the problem in that application.