Clustering of Data in the Cloud (Amazon

7
Clustering of Data in Clustering of Data in the Cloud (Amazon EC2) the Cloud (Amazon EC2) Guided By : Prof. D P Gaikwad

description

clustering of documents

Transcript of Clustering of Data in the Cloud (Amazon

  • Clustering of Data in the Cloud (Amazon EC2)Guided By : Prof. D P Gaikwad

  • AimChecking the performance of the K-means algorithm (most popular and easy algorithm) on large data sets (which is distributed).For that purpose used Apache Mahout (Machine Learning Library) on the top of Apache Hadoop (Framework for Distributed Computing ).Instead of creating Hadoop infrastructure by our own used Cloud Service (Amazon EC2)

  • IntroductionClustering refers to the process of grouping samples into different groups based on their similarities.These groups are called as Clusters.Unlike classification, clustering do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples

  • K-means ClusteringK-means clustering is a widely used partition algorithm.It partitions the samples into clusters by minimizing a measure between the samples and the centroid of the clusters. Euclidean distance is a similarity criterion extensively used for samples.

  • Steps K-means ClusteringThe algorithm for the standard K-means is given as follows :

    1. Choose a number of clusters k 2. Initialize centroid1 c1, ck3.For each data point, compute the centroid it is closest to (using some distance measure) and assign the data point to this cluster.4. Re-compute centroids (mean of data points in cluster)5. Stop when there are no new re-assignments.

  • Hadoop / MahoutHadoop is used for distributed computing. It has two major components.

    1. MapReduce 2. HDFS (Hadoop Distributed File System)Apache Mahout is a machine learning library that runs over a Hadoop system. It has a collection of algorithms to solve clustering, classification and prediction problems. Both are free and open source projects. Due to their inexpensive and scalable characteristics, these platforms can be a promising technology to solve data intensive problems which were not trivial in the past.