DATA MINING AND DATA WAREHOUSING


A TECHNICAL PAPER ON

DATA MINING AND DATA WAREHOUSING

WITH SPECIAL REFERENCE TO

PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

Gudlavalleru Engineering College

by

I. RAHUL
III/IV B.TECH CSE
email: [email protected]
Phone: 08674-247222

K. PRADEEP KUMAR
III/IV B.TECH CSE
email: [email protected]
Phone: 08674-240673


Contents

1. Abstract
2. Keywords
3. Introduction
4. Clustering
5. Partitional Algorithms
6. K-medoid Algorithms
   6.1 PAM
   6.2 CLARA
   6.3 CLARANS
7. Analysis
8. Conclusion


    PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

1. ABSTRACT

In the last few years there has been tremendous research interest in devising efficient data mining algorithms, and clustering is an essential component of data mining techniques. Interestingly, the special nature of data mining makes classical clustering algorithms unsuitable: the datasets are usually very large, they need not be numeric, and importance should therefore be given to efficient input and output operations rather than to algorithmic complexity alone. As a result, a number of clustering algorithms have been proposed for data mining in recent years. The present paper gives a brief overview of the partitional clustering algorithms used in data mining. The first part of the paper gives an overview of the clustering techniques used in data mining; the second part discusses the different partitional clustering algorithms used in mining of data.

2. KEYWORDS

Knowledge discovery in databases, data mining, clustering, partitional algorithms, PAM, CLARA, CLARANS.

3. INTRODUCTION

Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Knowledge discovery in databases (KDD) is a well-defined process consisting of several distinct steps, and data mining is the core step of that process, the one which results in the discovery of knowledge. Data mining is a high-level application technique used to present and analyze data for decision-makers. There is an enormous wealth of information embedded in the huge databases belonging to enterprises, and this has spurred tremendous interest in the areas of knowledge discovery and data mining.

The fundamental goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, while description focuses on finding patterns describing the data and on their subsequent presentation for user interpretation. There are several mining techniques for prediction and description, categorized as association, classification, sequential patterns, and clustering. The basic premise of association is to find all associations such that the presence of one set of items in a transaction implies the presence of other items. Classification develops profiles of different groups. Sequential pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets, or clusters.

4. CLUSTERING

Clustering is a useful technique for the discovery of data distribution and patterns in the underlying data. The goal of clustering is to discover the dense and sparse regions in a data set. Data clustering has been studied in the statistics, machine learning, and database communities with diverse emphases.


There are two main types of clustering techniques: partitional clustering techniques and hierarchical clustering techniques. Partitional clustering techniques construct a partition of the database into a predefined number of clusters. Hierarchical clustering techniques produce a sequence of partitions in which each partition is nested into the next partition in the sequence.

[Figure: the datasets before and after clustering]

5. PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of n objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function. There are approximately k^n/k! ways of partitioning a set of n data points into k subsets; for instance, there are already 9,330 ways to partition just 10 points into 3 subsets. An exhaustive enumeration method could therefore find the globally optimal partition, but it is practically infeasible unless n and k are very small. Partitional clustering algorithms usually adopt an iterative optimization paradigm: starting with an initial partition and using an iterative control strategy, the algorithm tries swapping data points to see whether such a swap improves the quality of the clustering. When no swap yields an improvement, a locally optimal partition has been found. The quality of the resulting clustering is very sensitive to the initially selected partition. There are two main categories of partitioning algorithms:

- k-means algorithms, where each cluster is represented by the center of gravity of the cluster;
- k-medoid algorithms, where each cluster is represented by one of the objects of the cluster, located near its center.

Most of the clustering algorithms specially designed for data mining are k-medoid algorithms. The principal k-medoid algorithms are PAM, CLARA, and CLARANS. The sketch below makes the difference between the two kinds of cluster representative concrete.
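To illustrate the distinction, here is a minimal Python sketch (not from the paper; the sample points are invented): the centroid is the cluster's center of gravity and need not be a data object, while the medoid is the member object with the smallest total dissimilarity to the rest of the cluster.

    from typing import List, Tuple

    Point = Tuple[float, float]

    def euclidean(a: Point, b: Point) -> float:
        # Euclidean distance between two 2-D points.
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def centroid(cluster: List[Point]) -> Point:
        # k-means representative: center of gravity, not necessarily a data object.
        n = len(cluster)
        return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

    def medoid(cluster: List[Point]) -> Point:
        # k-medoid representative: the member minimizing total distance to the others.
        return min(cluster, key=lambda c: sum(euclidean(c, p) for p in cluster))

    cluster = [(1.0, 1.0), (2.0, 1.0), (9.0, 8.0)]   # the third point is an outlier
    print(centroid(cluster))  # (4.0, 3.33...): pulled toward the outlier
    print(medoid(cluster))    # (2.0, 1.0): an actual member, robust to the outlier

Note how the outlier drags the centroid away from the dense region while the medoid stays inside it; this robustness is one reason the data mining algorithms discussed next are medoid-based.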

6. k-MEDOID ALGORITHMS

    6.1 PAM

PAM uses a k-medoid method to identify the clusters. It begins by selecting k objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. To calculate the effect of such a swap between Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected objects into the k clusters represented by the medoids. At this stage it is therefore necessary first to understand how the data objects are partitioned when a set of k medoids is given.


Partitioning

If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster represented by Oi if d(Oj, Oi) = min over all medoids Oe of d(Oj, Oe), where d(Oa, Ob) denotes the distance, or dissimilarity, between objects Oa and Ob. The dissimilarity matrix is known prior to the commencement of PAM. The quality of a clustering is measured by the average dissimilarity between an object and the medoid of the cluster to which the object belongs.
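As a minimal illustration, the following Python sketch (the helper names are our own) implements this partitioning rule and the quality measure; like PAM, it assumes the dissimilarity matrix d is known in advance, with d[a][b] = d(Oa, Ob).

    from typing import Dict, List, Sequence

    def assign_to_medoids(d: Sequence[Sequence[float]],
                          medoids: List[int]) -> Dict[int, List[int]]:
        # Each object Oj joins the cluster of the medoid nearest to it.
        clusters: Dict[int, List[int]] = {m: [] for m in medoids}
        for j in range(len(d)):
            nearest = min(medoids, key=lambda m: d[j][m])
            clusters[nearest].append(j)
        return clusters

    def average_dissimilarity(d: Sequence[Sequence[float]],
                              clusters: Dict[int, List[int]]) -> float:
        # Quality measure: mean distance from each object to its medoid.
        total = sum(d[j][m] for m, members in clusters.items() for j in members)
        count = sum(len(members) for members in clusters.values())
        return total / count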

Iterative Selection of Medoids

Let us assume that O1, O2, ..., Ok are the k medoids selected at any stage, and denote by C1, C2, ..., Ck the respective clusters. From the foregoing discussion, for a non-selected object Oj, if Oj belongs to Ch then d(Oj, Oh) = min over 1 <= i <= k of d(Oj, Oi).
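Reusing the two helpers above, here is a hedged sketch of PAM's swap-based selection of medoids. Instead of computing the incremental cost Cih directly, it re-evaluates the average dissimilarity after each candidate swap, which gives the same outcome but is less efficient than PAM proper; the function name pam and the choice of the first k objects as initial medoids are illustrative assumptions.

    def pam(d: Sequence[Sequence[float]], k: int) -> List[int]:
        # Start from an arbitrary set of k medoids (here: the first k objects).
        medoids = list(range(k))
        improved = True
        while improved:
            improved = False
            current = average_dissimilarity(d, assign_to_medoids(d, medoids))
            for i in range(k):                    # selected object Oi
                for h in range(len(d)):           # candidate non-selected object Oh
                    if h in medoids:
                        continue
                    trial = medoids[:i] + [h] + medoids[i + 1:]
                    cost = average_dissimilarity(d, assign_to_medoids(d, trial))
                    if cost < current:            # the swap improves the clustering
                        medoids, current, improved = trial, cost, True
        return medoids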


    ...
        Increment j := j + 1
      End do
      Compare the cost of the clustering with mincost:
      If current_cost < mincost then
        mincost := current_cost
        best_node := current
      Increment e := e + 1
    End do
    Return best_node.
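The pseudocode above is the tail of the CLARANS search loop. For context, the following is a hedged Python sketch of the whole CLARANS-style randomized search, reusing assign_to_medoids and average_dissimilarity from the PAM section; the defaults for numlocal and maxneighbour follow common descriptions of CLARANS and are assumptions, not values from this paper.

    import random

    def clarans(d: Sequence[Sequence[float]], k: int,
                numlocal: int = 2, maxneighbour: int = 250) -> List[int]:
        n = len(d)
        best_node, mincost = None, float("inf")
        for _ in range(numlocal):          # restart from several random nodes
            current = random.sample(range(n), k)   # a node is a set of k medoids
            current_cost = average_dissimilarity(d, assign_to_medoids(d, current))
            e = 0
            while e < maxneighbour:        # examine random neighbours of the node
                i = random.randrange(k)
                h = random.choice([x for x in range(n) if x not in current])
                neighbour = current[:i] + [h] + current[i + 1:]  # differs in one medoid
                cost = average_dissimilarity(d, assign_to_medoids(d, neighbour))
                if cost < current_cost:    # move to the better neighbour, reset counter
                    current, current_cost, e = neighbour, cost, 0
                else:
                    e += 1
            if current_cost < mincost:     # compare with mincost, as in the fragment
                mincost, best_node = current_cost, current
        return best_node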

7. ANALYSIS

PAM is very robust to the existence of outliers, and the clusters it finds do not depend on the order in which the objects are examined. However, it cannot handle very large datasets. CLARA draws a sample of the large dataset and applies PAM to this sample, so its result depends on the sample (see the sketch below). CLARANS applies randomized iterative optimization to determine the medoids and can be applied to large datasets as well. It is more efficient than the earlier medoid-based methods, but it suffers from two major drawbacks: it assumes that all objects fit in main memory, and its result is very sensitive to input order. In addition, it may not find a real local minimum, owing to the trimming of the search that is controlled by maxneighbour.
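As a hedged sketch of CLARA's sampling strategy, the following reuses the pam function and helpers above: PAM is run on each random sample, and the medoid set that scores best on the full dataset is kept. The number of samples and the sample size 40 + 2k echo the defaults commonly attributed to CLARA, not values given in this paper.

    def clara(d: Sequence[Sequence[float]], k: int, samples: int = 5) -> List[int]:
        n = len(d)
        size = min(n, 40 + 2 * k)      # commonly cited CLARA sample size (assumed)
        best, mincost = None, float("inf")
        for _ in range(samples):
            idx = random.sample(range(n), size)           # draw a random sample
            sub = [[d[a][b] for b in idx] for a in idx]   # its dissimilarity matrix
            meds = [idx[m] for m in pam(sub, k)]          # PAM on the sample, mapped back
            cost = average_dissimilarity(d, assign_to_medoids(d, meds))
            if cost < mincost:                            # keep the best-scoring medoids
                mincost, best = cost, meds
        return best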

8. CONCLUSION

The PAM algorithm is efficient and gives good results when the dataset is small, but it cannot be applied to large datasets. CLARA's effectiveness is determined by the sample of data taken in its sampling phase. CLARANS is efficient for large datasets. Since the datasets from which data is mined are typically large, CLARANS is the algorithm of choice, being the more efficient partitional algorithm compared to PAM and CLARA.
