DATAMINING AND DATAWARE HOUSING
8/14/2019 DATAMINING AND DATAWARE HOUSING
A TECHNICAL PAPER ON
DATAMINING AND DATAWARE HOUSING WITH SPECIAL REFERENCE
TO
PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING
Gudlavalleru Engineering College
by
I.RAHUL
III/IV B.TECH CSE
email:[email protected]
Phone:08674-247222
K.PRADEEP KUMAR
III/IV B.TECH CSE
email:[email protected]
Phone:08674-240673
Contents
1. Abstract
2. Keywords
3. Introduction
4. Clustering
5. Partitional Algorithms
6. K-medoid Algorithms
6.1 PAM
6.2 CLARA
6.3 CLARANS
7. Analysis
8. Conclusion
9. References
PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING
1. ABSTRACT
In the last few years there has been tremendous
research interest in devising efficient data mining
algorithms. Clustering is an essential component of
data mining techniques. Interestingly, the special
nature of data mining makes the classical clustering
algorithms unsuitable. The datasets are usually very
large and need not be numeric, so importance should
be given to efficient input and output operations
rather than to algorithmic complexity. As a result, a
number of clustering algorithms have been proposed
for data mining in the last few years. The present
paper gives a brief overview of partitional clustering
algorithms used in data mining. The first part of the
paper gives an overview of the clustering techniques
used in data mining. The second part discusses
different partitional clustering algorithms used in
mining of data.
2. KEYWORDS:
Knowledge discovery in databases, data mining,
clustering, partitional algorithms, PAM, CLARA,
CLARANS.
3. INTRODUCTION:
Data mining is the non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data.
Knowledge discovery in databases (KDD) is a
well-defined process consisting of several distinct
steps. Data mining is the core step in the process,
which results in the discovery of knowledge. Data
mining is a high-level application technique used to
present and analyze data for decision-makers.
There is an enormous wealth of information
embedded in huge databases belonging to
enterprises, and this has spurred tremendous interest
in the areas of knowledge discovery and data mining.
The fundamental goals of data mining are
prediction and description. Prediction makes use of
existing variables in the database in order to predict
unknown or future values of interest, and
description focuses on finding patterns describing
the data and their subsequent presentation for user
interpretation. There are several mining techniques
for prediction and description. These are
categorized as association, classification,
sequential patterns and clustering. The basic
premise of association is to find all associations
such that the presence of one set of items in a
transaction implies the presence of other items.
Classification develops profiles of different groups.
Sequential pattern mining identifies sequential
patterns subject to a user-specified minimum
constraint. Clustering segments a database into
subsets or clusters.
4. CLUSTERING
Clustering is a useful technique for discovering the
data distribution and patterns in the underlying
data. The goal of clustering is to discover dense
and sparse regions in a data set. Data clustering has
been studied in the statistics, machine learning, and
database communities with diverse emphases.
There are two main types of clustering techniques:
partitional and hierarchical. Partitional clustering
techniques construct a partition of the database into
a predefined number of clusters. Hierarchical
clustering techniques produce a sequence of
partitions in which each partition is nested into the
next partition in the sequence.
[Figure: datasets before clustering and datasets after clustering]
5. PARTITIONAL ALGORITHMS
Partitional algorithms construct a partition of a
database of n objects into a set of k clusters. The
construction involves determining the optimal
partition with respect to an objective function.
There are approximately k^n/k! ways of partitioning
a set of n data points into k subsets. An exhaustive
enumeration method can find the globally optimal
partition, but it is practically infeasible unless
n and k are very small. Partitional clustering
algorithms therefore usually adopt an iterative
optimization paradigm. The algorithm starts with an
initial partition and uses an iterative control
strategy: it tries swapping data points to see
whether such a swap improves the quality of the
clustering. When no swap yields an improvement, it
has found a locally optimal partition. The quality
of this clustering is very sensitive to the initially
selected partition. There are mainly two categories
of partitioning algorithms:
k-means algorithm, where each cluster is
represented by the center of gravity of the
cluster.
k-medoid algorithms where each cluster is
represented by one of the objects of the
cluster located near the center.
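The difference between the two representatives can be illustrated with a short Python sketch. The function names, the Euclidean distance, and the sample points are assumptions made here for illustration; they do not come from the paper.

```python
import math

def centroid(points):
    """Center of gravity: the coordinate-wise mean (the k-means representative)."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def medoid(points):
    """The cluster member with the smallest total distance to all members
    (the k-medoid representative). Euclidean distance is assumed here."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical cluster; the last point is an outlier.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (20.0, 20.0)]

print(centroid(cluster))  # (6.0, 6.0): pulled toward the outlier, not a data object
print(medoid(cluster))    # (3.0, 3.0): an actual object near the center
```

In this example the center of gravity is dragged toward the outlier and does not coincide with any object, while the medoid stays on an object in the dense part of the cluster.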
Most of the special clustering algorithms designed
for data mining are k-medoid algorithms. The
k-medoid algorithms discussed here are PAM, CLARA
and CLARANS.
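To give a feeling for why exhaustive enumeration is infeasible, the following Python sketch compares the exact number of ways to split n objects into k non-empty subsets (the Stirling number of the second kind) with the approximation k^n/k!. The function names are introduced here for illustration only.

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Exact number of partitions of n objects into k non-empty subsets."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of the k existing subsets or forms a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def approx_partitions(n, k):
    """The k^n / k! approximation, which is good when n is much larger than k."""
    return k ** n / factorial(k)

for n, k in [(10, 3), (20, 4), (50, 5)]:
    print(n, k, stirling2(n, k), approx_partitions(n, k))
```

Even for n = 50 and k = 5 the count has more than thirty digits, which is why the algorithms below search only a small part of the space.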
6. K-MEDOID ALGORITHMS
6.1 PAM
PAM uses a k-medoid method to identify the
clusters. PAM selects k objects arbitrarily from the
data as medoids. In each step, a swap between a
selected object Oi and a non-selected object Oh is
made as long as such a swap would result in an
improvement of the quality of clustering. To
calculate the effect of such a swap between Oi and
Oh, a cost Cih is computed, which is related to the
quality of partitioning the non-selected objects into
k clusters represented by the medoids. So, at this
stage it is necessary first to understand the method
of partitioning the data objects when a set of k
medoids is given.
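The swap search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes a precomputed dissimilarity matrix `dist`, takes the first k objects as the arbitrary initial medoids, uses the total (rather than average) dissimilarity as the cost, and applies the best improving swap in each pass. The names `total_cost` and `pam` are introduced here.

```python
def total_cost(dist, medoids):
    """Sum over all objects of the dissimilarity to the nearest medoid."""
    return sum(min(dist[j][m] for m in medoids) for j in range(len(dist)))

def pam(dist, k):
    """Swap-based k-medoid search on a dissimilarity matrix (a sketch)."""
    n = len(dist)
    medoids = list(range(k))          # assumption: first k objects as the start
    cost = total_cost(dist, medoids)
    while True:
        best_swap, best_cost = None, cost
        for i in range(k):            # position of the selected object Oi
            for h in range(n):        # candidate non-selected object Oh
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                c = total_cost(dist, candidate)
                # c - cost plays the role of the swap cost Cih.
                if c < best_cost:
                    best_swap, best_cost = candidate, c
        if best_swap is None:         # no swap improves the clustering
            return sorted(medoids), cost
        medoids, cost = best_swap, best_cost

# Hypothetical one-dimensional data with two obvious groups.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(pam(dist, 2))  # ([1, 4], 4): the objects 2 and 11 are chosen as medoids
```

Every pass evaluates k(n - k) candidate swaps and each evaluation touches all n objects, which is why this kind of search becomes expensive on large datasets.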
Partitioning
If Oj is a non-selected object and Oi is a medoid,
we say that Oj belongs to the cluster represented
by Oi if d(Oi, Oj) = min_{Oe} d(Oj, Oe), where the
minimum is taken over all medoids Oe and
d(Oa, Oh) denotes the distance or dissimilarity
between objects Oa and Oh. The dissimilarity
matrix is known prior to the commencement of
PAM. The quality of clustering is measured by the
average dissimilarity between an object and the
medoid of the cluster to which the object belongs.
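The partitioning rule and the quality measure can be written down directly. The Python sketch below assumes a precomputed dissimilarity matrix `dist` indexed by object number; the function names are introduced here for illustration. Medoids are assigned to their own clusters and contribute zero dissimilarity to the average.

```python
def assign(dist, medoids):
    """Map each medoid to the list of objects whose nearest medoid it is."""
    clusters = {m: [] for m in medoids}
    for j in range(len(dist)):
        # Oj joins the cluster of the medoid Oe that minimizes d(Oj, Oe).
        nearest = min(medoids, key=lambda m: dist[j][m])
        clusters[nearest].append(j)
    return clusters

def average_dissimilarity(dist, medoids):
    """Average dissimilarity between an object and the medoid of its cluster."""
    n = len(dist)
    return sum(min(dist[j][m] for m in medoids) for j in range(n)) / n

# Hypothetical one-dimensional data; objects 1 and 4 (values 2 and 11) are medoids.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(assign(dist, [1, 4]))                 # {1: [0, 1, 2], 4: [3, 4, 5]}
print(average_dissimilarity(dist, [1, 4]))  # 4 / 6
```

A lower average means a better clustering, so a swap is worthwhile exactly when it reduces this value.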
Iterative Selection of Medoids
Let us assume that O1, O2, ..., Ok are the k medoids
selected at any stage, and denote by C1, C2, ..., Ck
the respective clusters. From the foregoing
discussion, for a non-selected object Oj,
j ≠ 1, 2, ..., k, if Oj ∈ Ch then Min(1
Increment j ← j+1
End do
Compare the cost of clustering with mincost
If current_cost < mincost
o mincost ← current_cost
o best_node ← current
Increment e ← e+1
End do
Return best_node.
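The loop fragment above matches the closing steps of the randomized search that the analysis below attributes to CLARANS. The following Python sketch is written under that assumption and is not a reconstruction of the missing pseudocode. `maxneighbour` is the parameter named in the analysis; `numlocal` (the number of restarts counted by e), `seed`, the helper `total_cost`, and the definition of a neighbour as "one medoid swapped for one non-medoid" are assumptions introduced here. The sketch requires k to be smaller than the number of objects.

```python
import random

def total_cost(dist, medoids):
    """Sum over all objects of the dissimilarity to the nearest medoid."""
    return sum(min(dist[j][m] for m in medoids) for j in range(len(dist)))

def clarans(dist, k, numlocal=5, maxneighbour=20, seed=0):
    """Randomized k-medoid search (a sketch under the assumptions above)."""
    rng = random.Random(seed)
    n = len(dist)
    mincost, best_node = float("inf"), None
    for _ in range(numlocal):                  # outer loop: e = 1 .. numlocal
        current = rng.sample(range(n), k)      # arbitrary starting node
        current_cost = total_cost(dist, current)
        j = 1
        while j <= maxneighbour:               # inner loop over random neighbours
            i = rng.randrange(k)
            h = rng.choice([x for x in range(n) if x not in current])
            neighbour = current[:i] + [h] + current[i + 1:]
            neighbour_cost = total_cost(dist, neighbour)
            if neighbour_cost < current_cost:
                current, current_cost = neighbour, neighbour_cost
                j = 1                          # restart the neighbour count
            else:
                j += 1                         # "Increment j"
        if current_cost < mincost:             # "If current_cost < mincost"
            mincost, best_node = current_cost, current
    return sorted(best_node), mincost          # "Return best_node"

# Hypothetical one-dimensional data with two obvious groups.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(clarans(dist, 2))
```

Because only `maxneighbour` random neighbours are examined before a node is accepted, the returned node is not guaranteed to be a true local minimum.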
7. ANALYSIS
PAM is very robust to the existence of outliers.
The clusters found by this method do not depend
on the order in which the objects are examined.
However, it cannot handle very large datasets.
CLARA draws a sample from the large dataset and
applies PAM to this sample, so the result depends
on the sample. CLARANS applies randomized iterative
optimization to determine the medoids, and it can
also be applied to large datasets. It is more
efficient than the earlier medoid-based methods but
suffers from two major drawbacks: it assumes that
all objects fit in main memory, and the result is
very sensitive to input order. In addition, it may
not find a real local minimum because the search is
trimmed, which is controlled by maxneighbour.
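The sampling idea behind CLARA can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published algorithm in full. `pam_on` is a simplified first-improvement swap search standing in for PAM, and `sample_size`, `n_samples` and `seed` are parameter names introduced here. Each set of medoids found on a sample is scored against the whole dataset, and the best set is kept.

```python
import random

def total_cost(dist, medoids, objects):
    """Sum over `objects` of the dissimilarity to the nearest medoid."""
    return sum(min(dist[j][m] for m in medoids) for j in objects)

def pam_on(dist, objects, k):
    """Simplified swap search restricted to `objects` (a stand-in for PAM)."""
    medoids = list(objects[:k])
    cost = total_cost(dist, medoids, objects)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in objects:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                c = total_cost(dist, candidate, objects)
                if c < cost:                   # accept any improving swap
                    medoids, cost, improved = candidate, c, True
    return medoids

def clara(dist, k, sample_size, n_samples=5, seed=0):
    """Run the swap search on random samples and keep the best medoids."""
    rng = random.Random(seed)
    everything = list(range(len(dist)))
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(everything, sample_size)
        medoids = pam_on(dist, sample, k)
        cost = total_cost(dist, medoids, everything)   # score on the full data
        if cost < best_cost:
            best, best_cost = medoids, cost
    return sorted(best), best_cost

# Hypothetical one-dimensional data with two obvious groups.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(clara(dist, 2, sample_size=4))
```

If the best medoids never appear in any sample, the sketch cannot return them, which is the sense in which the result is based on the sample.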
8. CONCLUSION
The PAM algorithm is efficient and gives good results
when the dataset is small. However, it cannot be
applied to large datasets. The efficiency of CLARA is
determined by the sample of data taken at the
sampling phase. CLARANS is efficient for large
datasets. Since the datasets from which the required
data is mined are large, CLARANS is used, and it is
an efficient partitional algorithm compared to PAM
and CLARA.
9. REFERENCES:
Vasudha Bhatnagar, On Mining Of Data, IETE
Journal of research, 2001
Data mining and warehousing by Dunham
IEEE Papers
www.datawarehouse.com
www.itpapers.com