DATA MINING AND DATA WAREHOUSING


A TECHNICAL PAPER ON

DATA MINING AND DATA WAREHOUSING

WITH SPECIAL REFERENCE TO

PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

Gudlavalleru Engineering College

by

I. RAHUL
III/IV B.TECH CSE
email: [email protected]
Phone: 08674-247222

K. PRADEEP KUMAR
III/IV B.TECH CSE
email: [email protected]
Phone: 08674-240673


Contents

1. Abstract
2. Keywords
3. Introduction
4. Clustering
5. Partitional Algorithms
6. K-medoid Algorithms
   6.1 PAM
   6.2 CLARA
   6.3 CLARANS
7. Analysis
8. Conclusion


    PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

1. ABSTRACT

In the last few years there has been tremendous research interest in devising efficient data mining algorithms, and clustering is an essential component of data mining techniques. Interestingly, the special nature of data mining makes classical clustering algorithms unsuitable: the datasets are usually very large, they need not be numeric, and importance should therefore be given to efficient input and output operations rather than to algorithmic complexity alone. As a result, a number of clustering algorithms have been proposed for data mining in recent years. The present paper gives a brief overview of the partitional clustering algorithms used in data mining. The first part of the paper gives an overview of the clustering techniques used in data mining; the second part discusses the different partitional clustering algorithms used in mining of data.

2. KEYWORDS

Knowledge discovery in databases, data mining, clustering, partitional algorithms, PAM, CLARA, CLARANS.

3. INTRODUCTION

Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Knowledge discovery in databases (KDD) is a well-defined process consisting of several distinct steps, and data mining is the core step of that process, the one which results in the discovery of knowledge. Data mining is a high-level application technique used to present and analyze data for decision-makers. There is an enormous wealth of information embedded in the huge databases belonging to enterprises, and this has spurred tremendous interest in the areas of knowledge discovery and data mining.

The fundamental goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, while description focuses on finding patterns describing the data and on their subsequent presentation for user interpretation. There are several mining techniques for prediction and description, categorized as association, classification, sequential patterns, and clustering. The basic premise of association is to find all associations such that the presence of one set of items in a transaction implies the presence of other items. Classification develops profiles of different groups. Sequential pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets, or clusters.

4. CLUSTERING

Clustering is a useful technique for the discovery of data distribution and patterns in the underlying data. The goal of clustering is to discover the dense and sparse regions in a data set. Data clustering has been studied in the statistics, machine learning, and database communities with diverse emphases.


There are two main types of clustering techniques: partitional clustering techniques and hierarchical clustering techniques. Partitional clustering techniques construct a partition of the database into a predefined number of clusters. Hierarchical clustering techniques produce a sequence of partitions in which each partition is nested into the next partition in the sequence.

[Figure: the datasets before and after clustering]

5. PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of n objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function. There are approximately k^n/k! ways of partitioning a set of n data points into k subsets; for instance, there are already 9,330 ways to partition just 10 points into 3 subsets. An exhaustive enumeration method could therefore find the globally optimal partition, but it is practically infeasible unless n and k are very small. Partitional clustering algorithms usually adopt an iterative optimization paradigm: starting with an initial partition and using an iterative control strategy, the algorithm tries swapping data points to see whether such a swap improves the quality of the clustering. When no swap yields an improvement, a locally optimal partition has been found. The quality of the resulting clustering is very sensitive to the initially selected partition. There are two main categories of partitioning algorithms:

- k-means algorithms, where each cluster is represented by the center of gravity of the cluster;
- k-medoid algorithms, where each cluster is represented by one of the objects of the cluster, located near its center.

Most of the clustering algorithms specially designed for data mining are k-medoid algorithms. The principal k-medoid algorithms are PAM, CLARA, and CLARANS. The sketch below makes the difference between the two kinds of cluster representative concrete.
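To illustrate the distinction, here is a minimal Python sketch (not from the paper; the sample points are invented): the centroid is the cluster's center of gravity and need not be a data object, while the medoid is the member object with the smallest total dissimilarity to the rest of the cluster.

    from typing import List, Tuple

    Point = Tuple[float, float]

    def euclidean(a: Point, b: Point) -> float:
        # Euclidean distance between two 2-D points.
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def centroid(cluster: List[Point]) -> Point:
        # k-means representative: center of gravity, not necessarily a data object.
        n = len(cluster)
        return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

    def medoid(cluster: List[Point]) -> Point:
        # k-medoid representative: the member minimizing total distance to the others.
        return min(cluster, key=lambda c: sum(euclidean(c, p) for p in cluster))

    cluster = [(1.0, 1.0), (2.0, 1.0), (9.0, 8.0)]   # the third point is an outlier
    print(centroid(cluster))  # (4.0, 3.33...): pulled toward the outlier
    print(medoid(cluster))    # (2.0, 1.0): an actual member, robust to the outlier

Note how the outlier drags the centroid away from the dense region while the medoid stays inside it; this robustness is one reason the data mining algorithms discussed next are medoid-based.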

6. k-MEDOID ALGORITHMS

    6.1 PAM

PAM uses a k-medoid method to identify the clusters. It begins by selecting k objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. To calculate the effect of such a swap between Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected objects into the k clusters represented by the medoids. At this stage it is therefore necessary first to understand how the data objects are partitioned when a set of k medoids is given.


Partitioning

If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster represented by Oi if d(Oj, Oi) = min over all medoids Oe of d(Oj, Oe), where d(Oa, Ob) denotes the distance, or dissimilarity, between objects Oa and Ob. The dissimilarity matrix is known prior to the commencement of PAM. The quality of a clustering is measured by the average dissimilarity between an object and the medoid of the cluster to which the object belongs.
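As a minimal illustration, the following Python sketch (the helper names are our own) implements this partitioning rule and the quality measure; like PAM, it assumes the dissimilarity matrix d is known in advance, with d[a][b] = d(Oa, Ob).

    from typing import Dict, List, Sequence

    def assign_to_medoids(d: Sequence[Sequence[float]],
                          medoids: List[int]) -> Dict[int, List[int]]:
        # Each object Oj joins the cluster of the medoid nearest to it.
        clusters: Dict[int, List[int]] = {m: [] for m in medoids}
        for j in range(len(d)):
            nearest = min(medoids, key=lambda m: d[j][m])
            clusters[nearest].append(j)
        return clusters

    def average_dissimilarity(d: Sequence[Sequence[float]],
                              clusters: Dict[int, List[int]]) -> float:
        # Quality measure: mean distance from each object to its medoid.
        total = sum(d[j][m] for m, members in clusters.items() for j in members)
        count = sum(len(members) for members in clusters.values())
        return total / count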

Iterative Selection of Medoids

Let us assume that O1, O2, ..., Ok are the k medoids selected at any stage, and denote by C1, C2, ..., Ck the respective clusters. From the foregoing discussion, for a non-selected object Oj, if Oj belongs to Ch then d(Oj, Oh) = min over 1 <= i <= k of d(Oj, Oi).
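Reusing the two helpers above, here is a hedged sketch of PAM's swap-based selection of medoids. Instead of computing the incremental cost Cih directly, it re-evaluates the average dissimilarity after each candidate swap, which gives the same outcome but is less efficient than PAM proper; the function name pam and the choice of the first k objects as initial medoids are illustrative assumptions.

    def pam(d: Sequence[Sequence[float]], k: int) -> List[int]:
        # Start from an arbitrary set of k medoids (here: the first k objects).
        medoids = list(range(k))
        improved = True
        while improved:
            improved = False
            current = average_dissimilarity(d, assign_to_medoids(d, medoids))
            for i in range(k):                    # selected object Oi
                for h in range(len(d)):           # candidate non-selected object Oh
                    if h in medoids:
                        continue
                    trial = medoids[:i] + [h] + medoids[i + 1:]
                    cost = average_dissimilarity(d, assign_to_medoids(d, trial))
                    if cost < current:            # the swap improves the clustering
                        medoids, current, improved = trial, cost, True
        return medoids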


    ...
        Increment j := j + 1
      End do
      Compare the cost of the clustering with mincost:
      If current_cost < mincost then
        mincost := current_cost
        best_node := current
      Increment e := e + 1
    End do
    Return best_node.
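The pseudocode above is the tail of the CLARANS search loop. For context, the following is a hedged Python sketch of the whole CLARANS-style randomized search, reusing assign_to_medoids and average_dissimilarity from the PAM section; the defaults for numlocal and maxneighbour follow common descriptions of CLARANS and are assumptions, not values from this paper.

    import random

    def clarans(d: Sequence[Sequence[float]], k: int,
                numlocal: int = 2, maxneighbour: int = 250) -> List[int]:
        n = len(d)
        best_node, mincost = None, float("inf")
        for _ in range(numlocal):          # restart from several random nodes
            current = random.sample(range(n), k)   # a node is a set of k medoids
            current_cost = average_dissimilarity(d, assign_to_medoids(d, current))
            e = 0
            while e < maxneighbour:        # examine random neighbours of the node
                i = random.randrange(k)
                h = random.choice([x for x in range(n) if x not in current])
                neighbour = current[:i] + [h] + current[i + 1:]  # differs in one medoid
                cost = average_dissimilarity(d, assign_to_medoids(d, neighbour))
                if cost < current_cost:    # move to the better neighbour, reset counter
                    current, current_cost, e = neighbour, cost, 0
                else:
                    e += 1
            if current_cost < mincost:     # compare with mincost, as in the fragment
                mincost, best_node = current_cost, current
        return best_node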

7. ANALYSIS

PAM is very robust to the existence of outliers, and the clusters it finds do not depend on the order in which the objects are examined. However, it cannot handle very large datasets. CLARA draws a sample of the large dataset and applies PAM to this sample, so its result depends on the sample (see the sketch below). CLARANS applies randomized iterative optimization to determine the medoids and can be applied to large datasets as well. It is more efficient than the earlier medoid-based methods, but it suffers from two major drawbacks: it assumes that all objects fit in main memory, and its result is very sensitive to input order. In addition, it may not find a real local minimum, owing to the trimming of the search that is controlled by maxneighbour.
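As a hedged sketch of CLARA's sampling strategy, the following reuses the pam function and helpers above: PAM is run on each random sample, and the medoid set that scores best on the full dataset is kept. The number of samples and the sample size 40 + 2k echo the defaults commonly attributed to CLARA, not values given in this paper.

    def clara(d: Sequence[Sequence[float]], k: int, samples: int = 5) -> List[int]:
        n = len(d)
        size = min(n, 40 + 2 * k)      # commonly cited CLARA sample size (assumed)
        best, mincost = None, float("inf")
        for _ in range(samples):
            idx = random.sample(range(n), size)           # draw a random sample
            sub = [[d[a][b] for b in idx] for a in idx]   # its dissimilarity matrix
            meds = [idx[m] for m in pam(sub, k)]          # PAM on the sample, mapped back
            cost = average_dissimilarity(d, assign_to_medoids(d, meds))
            if cost < mincost:                            # keep the best-scoring medoids
                mincost, best = cost, meds
        return best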

8. CONCLUSION

The PAM algorithm is efficient and gives good results when the dataset is small, but it cannot be applied to large datasets. CLARA's effectiveness is determined by the sample of data taken in its sampling phase. CLARANS is efficient for large datasets. Since the datasets from which data is mined are typically large, CLARANS is the algorithm of choice, being the more efficient partitional algorithm compared to PAM and CLARA.
