horizontal aggregation and hierarchical clustering
Department of Computer Science & Engineering, CS010 807 Main Project, Rajagiri School of Engineering and Technology, Rajagiri Valley
1. INTRODUCTION
Most algorithms require datasets in a horizontal layout as input, with one record per row and one variable or dimension per column. Managing large data sets without DBMS support can be a difficult task. Trying different subsets of data points and dimensions is more flexible, faster and easier to do inside a relational database with SQL queries than outside with alternative tools. Horizontal aggregation can be performed with an operator that can easily be implemented inside a query processor, much like select, project and join. The PIVOT operator, which transposes rows of tabular data into columns, enables data transformations useful in data modeling, data analysis, and data presentation.
There are many existing functions and operators for aggregation in Structured Query Language. The most commonly used aggregation is the sum of a column; other aggregation operators return the average, maximum, minimum or row count over groups of rows. All of these aggregation operations have limitations when building large data sets for data mining purposes. Database schemas are also highly normalized for On-Line Transaction Processing (OLTP) systems, where data sets are stored in a relational database or data warehouse. But data mining, statistical or machine learning algorithms generally require aggregated data in summarized form. Data mining algorithms require suitable input in cross-tabular (horizontal) form, and significant effort is required to compute aggregations for this purpose. Such effort is due to the amount and complexity of SQL code which needs to be written, optimized and tested.
To analyze data efficiently, data mining systems widely use datasets with columns in a horizontal tabular layout. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Conventional RDBMSs usually manage tables in vertical form. An aggregated column in a horizontal tabular layout returns a set of numbers instead of one number per row. The system uses one parent table and different child tables; operations are then performed on the data loaded from these tables. The PIVOT operator offered by the RDBMS is used to calculate the aggregate operations. The PIVOT method is much faster and offers much better scalability.
Partitioning the large data set obtained as the result of horizontal aggregation into homogeneous clusters is an important task in this system. The K-means algorithm using SQL is well suited for implementing this operation. To overcome the problems of the K-means algorithm, we optimize it by using hierarchical clustering.
1.1 Project Scope
The objective of this project is to find an efficient clustering algorithm that overcomes the defects of the commonly used K-means algorithm for partitioning the large data set, obtained as a result of horizontal aggregation, into homogeneous clusters.
Data mining, statistical or machine learning algorithms generally require aggregated data in summarized form. To analyze data efficiently, data mining systems widely use datasets with columns in a horizontal tabular layout. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Conventional RDBMSs usually manage tables in vertical form. An aggregated column in a horizontal tabular layout returns a set of numbers instead of one number per row. The system uses one parent table and different child tables; operations are then performed on the data loaded from multiple tables. Partitioning the large data set obtained from the result of horizontal aggregation into homogeneous clusters is an important task in data mining analysis.
The proposed implementation allows clustering large data sets in a relational DBMS, eliminating the need to export or access data outside the DBMS. Horizontal aggregations represent a template for generating SQL code from a data mining tool. This SQL code reduces manual work in the data preparation phase of a data mining project. The automatically generated code is more efficient than end-user-written SQL code, so datasets for data mining projects can be created in less time, entirely inside the DBMS.
1.2 Design and Implementation Constraints
Managing large data sets without DBMS support can be a difficult task.
All aggregation operations have limitations when building data sets for data mining purposes.
SQL code needs to be written, optimized and tested.
1.3 Assumptions and Dependencies
Data in numerical form is preferred.
Data should be applicable for performing horizontal aggregation.
It is assumed that there are no missing attributes.
The data sets obtained after horizontal aggregation should be suitable for performing clustering.
1.4 Advantages
Works well with outliers.
Produces good clusters with a single scan of the data set.
Minimizes running time.
Linearly scalable.
Less space complexity.
2. LITERATURE SURVEY
2.1 Background
Horizontal aggregation is a new class of function that returns aggregated columns in a horizontal layout. Most algorithms require datasets in a horizontal layout as input, with one record per row and one variable or dimension per column. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Horizontal aggregation can be performed by using the PIVOT operator. It can easily be implemented inside a query processor, much like select, project and join. The PIVOT operator, which transposes rows of tabular data into columns, enables data transformations useful in data modeling, data analysis, and data presentation. The PIVOT method is much faster and offers much better scalability.
Partitioning the large data set obtained as the result of horizontal aggregation into homogeneous clusters is an important task in this system. The K-means algorithm using SQL is well suited for implementing this operation. To overcome the problems of the K-means algorithm, we optimize it by using hierarchical clustering.
Similar IEEE papers:
(i) Horizontal Aggregations in SQL by Using Clustering as a Mixture of Gaussians
To analyze data efficiently, data mining systems widely use datasets with columns in a horizontal tabular layout. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Conventional RDBMSs usually manage tables in vertical form. An aggregated column in a horizontal tabular layout returns a set of numbers instead of one number per row [3]. The system uses one parent table and different child tables; operations are then performed on the data loaded from multiple tables. The PIVOT operator offered by the RDBMS is used to calculate
the aggregate operations. The PIVOT method is much faster and offers much better scalability. Partitioning the large data set obtained from the result of horizontal aggregation into homogeneous clusters is an important task in this system. A Gaussian mixture algorithm using SQL is best suited for implementing this operation.
(ii) Optimization of Horizontal Aggregation in SQL by Using K-Means
Clustering
Horizontal aggregation is a new class of function that returns aggregated columns in a horizontal layout. Horizontal aggregation can be performed with an operator that can easily be implemented inside a query processor, much like select, project and join. The PIVOT operator, which transposes rows of tabular data into columns, enables data transformations useful in data modeling, data analysis, and data presentation. The K-means clustering algorithm is used to cluster the attribute that results from the horizontal aggregation. Clustering using the K-Means algorithm requires the user to specify the number of clusters, which is a major disadvantage [4].
3. HARDWARE AND SOFTWARE SPECIFICATIONS
Minimum Software requirements:
IDE: Netbeans 7.01
SQL SERVER 2005
JDK 1.6
Minimum Hardware requirements:
1 GB RAM
3.1 Java Development Kit
A Java Development Kit (JDK) is a program development environment for writing
Java applets and applications. It consists of a runtime environment that "sits on top" of the
operating system layer as well as the tools and programming that developers need to compile,
debug, and run applets and applications written in the Java language.
3.2 SQL Server 2005
Microsoft SQL Server is a relational database management system developed by
Microsoft. As a database, it is a software product whose primary function is to store and
retrieve data as requested by other software applications, be it those on the same computer or
those running on another computer across a network (including the Internet). There are at
least a dozen different editions of Microsoft SQL Server aimed at different audiences and for
workloads ranging from small single-machine applications to large Internet-facing
applications with many concurrent users. Its primary query languages are T-SQL and ANSI SQL. SQL Server 2005
included native support for managing XML data, in addition to relational data. For this
purpose, it defined an xml data type that could be used either as a data type in database
columns or as literals in queries. XML columns can be associated with XSD schemas; XML
data being stored is verified against the schema. XML is converted to an internal binary data
type before being stored in the database. Specialized indexing methods were made available
for XML data. XML data is queried using XQuery; SQL Server 2005 added some extensions
to the T-SQL language to allow embedding XQuery queries in T-SQL. SQL Server 2005 has
also been enhanced with new indexing algorithms, syntax and better error recovery systems.
Data pages are checksummed for better error resiliency, and optimistic concurrency support
has been added for better performance. Permissions and access control have been made more
granular and the query processor handles concurrent execution of queries in a more efficient
way. Partitions on tables and indexes are supported natively, so scaling out a database onto a
cluster is easier. SQL CLR was introduced with SQL Server 2005 to let it integrate with
the .NET Framework. SQL Server 2005 introduced "MARS" (Multiple Active Results Sets),
a method of allowing usage of database connections for multiple purposes.
3.3 Net Beans 7.01
NetBeans is an integrated development environment (IDE) for developing primarily
with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an
application platform framework for Java desktop applications and others. The NetBeans IDE
is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms
supporting a compatible JVM. The NetBeans Platform allows applications to be developed
from a set of modular software components called modules.
Applications based on the NetBeans Platform (including the NetBeans IDE itself) can
be extended by third party developers. The NetBeans Platform is a reusable framework for
simplifying the development of Java Swing desktop applications. The NetBeans IDE bundle
for Java SE contains what is needed to start developing NetBeans plugins and NetBeans
Platform based applications; no additional SDK is required. NetBeans IDE is a free, open-source, cross-platform IDE with built-in support for the Java programming language.
4. SYSTEM ANALYSIS AND DESIGN
4.1 Existing System
The existing system is the system presently in use; it contains certain limitations which are researched so as to find a solution to them. The existing system performs horizontal aggregation using PIVOT and clustering using K-Means. The PIVOT operator, which transposes rows of tabular data into columns, enables data transformations useful in data modeling, data analysis, and data presentation. Data mining algorithms require suitable input in cross-tabular (horizontal) form, and significant effort is required to compute it.
K-Means is initialized from some random or approximate solution. Each step assigns each point to its nearest cluster, and points belonging to the same cluster are then averaged to get new cluster centroids. Each step successively improves the cluster centroids until they are stable. This is the standard version of the K-Means technique.
But this clustering technique has many disadvantages. First, the learning algorithm requires a priori specification of the number of cluster centers. If two clusters overlap heavily, K-means cannot resolve that there are two. Euclidean distance measures can unequally weight underlying factors. The learning algorithm finds only a local optimum of the squared-error function, and randomly chosen cluster centers need not lead to a fruitful result. It is applicable only when the mean is defined, so it fails for categorical data, and it is unable to handle noisy data, outliers and non-linear data sets.
4.2 Requirement Analysis
SQL Server 2008
SQL Server is an RDBMS developed by Microsoft, and it is the RDBMS we have used for the project. It supports the pivot operation. The databases are created in SQL SERVER; a connection is established with SQL Server, and the databases are then read, aggregated and clustered accordingly.
Database
Databases are created to manage large quantities of information by inputting, storing, retrieving and managing that information. The databases used for horizontal aggregation and clustering should contain numerical data.
4.3 Proposed System
The proposed system is one in which we apply our own ideas or make slight modifications to the existing system. Here we choose a more efficient algorithm for clustering, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH requires only a single scan of the database. In addition, BIRCH is recognized as the first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively.
This Clustering Algorithm has many advantages. It is local in that each clustering
decision is made without scanning all data points and currently existing clusters. It exploits
the observation that data space is not usually uniformly occupied and not every data point is
equally important. It makes full use of available memory to derive the finest possible sub-
clusters while minimizing I/O costs. It is also an incremental method that does not require the
whole data set in advance.
4.4. Module Division
1. Interface Design
2. Horizontal Aggregation
3. BIRCH Clustering
4. Comparison using K-Means.
4.4.1 Interface Design
The interface of our project is built with NetBeans 7.01. NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an application platform framework for Java desktop applications and others. The Desktop Application template of NetBeans has been used in our project to give it a professional style. It consists of a menu bar with a set of menu options. One of the menu options is "Optimising"; under it are menu items to display the required table and to perform Horizontal Aggregation, K-Means Clustering and BIRCH Clustering.
When the Dataset menu item is clicked, the user is prompted for the dataset name. On entering a valid dataset name, the corresponding dataset is displayed. If a valid name is not entered, no table is displayed and an error results.
When the Horizontal Aggregation menu item is clicked, the user is prompted for the METHOD (the aggregation function to be used for the horizontal aggregation using pivot), the COLUMN TO BE AGGREGATED (the column whose values the aggregation function is applied to) and the TRANSPOSE column (the column whose values are to be transposed into separate columns). Once valid entries are made, the horizontally aggregated output is displayed in table format. If illegal values are entered, no table is displayed.
When the optimizing menu item is clicked, the K-means clustering algorithm is executed using the previously entered values, and the output is displayed in table format. The K-Means clustering algorithm is performed for comparison purposes.
When the BIRCH clustering menu item is clicked, the BIRCH clustering code is executed. The output displays the number of data points within each cluster, shown as a bar chart.
Another option in this menu is for comparison. It contains a menu item to display the comparison chart. When this menu item is pressed, the comparison chart is displayed, showing the comparison of the time complexity and space complexity of the two clustering algorithms.
4.4.2 Horizontal Aggregation
The second module of this project performs horizontal aggregation. Horizontal aggregation is a new class of function that returns aggregated columns in a horizontal layout. Most algorithms require datasets in a horizontal layout as input, with one record per row and one variable or dimension per column. Managing large data sets without DBMS support can be a difficult task. Trying different subsets of data points and dimensions is more flexible, faster and easier to do inside a relational database with SQL queries than outside with alternative tools. Horizontal aggregation is done using the PIVOT operator. The PIVOT operator, which transposes rows of tabular data into columns, enables data transformations useful in data modeling, data analysis, and data presentation.
There are several advantages to horizontal aggregation. First, horizontal aggregations represent a template for generating SQL code from a data mining tool; this SQL code reduces manual work in the data preparation phase of a data mining project. Second, the automatically generated code is more efficient than end-user-written SQL code, so datasets for data mining projects can be created in less time. Third, the data sets can be created entirely inside the DBMS.
Table F:
K   D1   D2   A
1   3    X    9
2   2    Y    6
3   1    Y    10
4   1    Y    0
5   2    X    1
6   1    X    null
7   3    X    8
8   2    X    7

Result of horizontal aggregation (SUM of A grouped by D1, transposed by D2):
D1   D2X    D2Y
1    null   10
2    8      6
3    17     null

Figure 4.1: Horizontal aggregation of Table F
To perform horizontal aggregation in our project, the user is prompted for the METHOD (the aggregation function to be used for the horizontal aggregation using pivot), the COLUMN TO BE AGGREGATED (the column whose values the aggregation function is applied to) and the TRANSPOSE column (the column whose values are to be transposed into separate columns). Once valid entries are made, the horizontally aggregated output is displayed in table format. If illegal values are entered, no table is displayed.
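The transformation shown in Figure 4.1 can also be checked outside the DBMS. The following is a minimal Python sketch (illustrative only; the project itself runs the PIVOT query in the DBMS, and the function name and row representation here are assumptions) that simulates what PIVOT with SUM computes on table F:

```python
from collections import defaultdict

def horizontal_aggregation(rows, group_col, transpose_col, agg_col):
    """Simulate SQL PIVOT with SUM: one output row per grouping value,
    one output column per distinct transpose value."""
    transpose_values = sorted({r[transpose_col] for r in rows})
    out = defaultdict(lambda: {v: None for v in transpose_values})
    for r in rows:
        cell = out[r[group_col]]       # ensure the group appears in the output
        a = r[agg_col]
        if a is None:                  # SQL SUM ignores NULLs
            continue
        v = r[transpose_col]
        cell[v] = a if cell[v] is None else cell[v] + a
    return dict(out)

# Table F from Figure 4.1: columns K, D1, D2, A
F = [
    {"K": 1, "D1": 3, "D2": "X", "A": 9},
    {"K": 2, "D1": 2, "D2": "Y", "A": 6},
    {"K": 3, "D1": 1, "D2": "Y", "A": 10},
    {"K": 4, "D1": 1, "D2": "Y", "A": 0},
    {"K": 5, "D1": 2, "D2": "X", "A": 1},
    {"K": 6, "D1": 1, "D2": "X", "A": None},
    {"K": 7, "D1": 3, "D2": "X", "A": 8},
    {"K": 8, "D1": 2, "D2": "X", "A": 7},
]

FH = horizontal_aggregation(F, "D1", "D2", "A")
# FH reproduces the pivoted table in Figure 4.1:
# {1: {'X': None, 'Y': 10}, 2: {'X': 8, 'Y': 6}, 3: {'X': 17, 'Y': None}}
```

This reproduces exactly the D2X/D2Y columns of the figure, including the null cells where a group has no non-null values for a transpose value.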
4.4.3 BIRCH Clustering
BIRCH is a clustering algorithm which deals with large datasets by first generating a more compact summary that retains as much distribution information as possible, and then clustering the data summary instead of the dataset.
BIRCH summarizes a dataset into a set of sub-clusters, using the concept of a clustering feature, to reduce the scale of the clustering problem.
Clustering feature: a clustering feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points.
CF-tree: A CF-tree is a height-balanced tree with two parameters: the branching factor (B for non-leaf nodes and L for leaf nodes) and the threshold T. Each non-leaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, ..., B, 'childi' is a pointer to its i-th child node, and CFi is the CF entry of the sub-cluster represented by that child. So a non-leaf node represents a sub-cluster made up of all the sub-clusters represented by its entries. In addition, each leaf node has two pointers, 'prev' and 'next', which are used to chain all leaf nodes together for efficient scans. All entries in a leaf node must satisfy a threshold requirement with respect to the threshold value T: the diameter (or, alternatively, the radius) of each leaf entry has to be less than T.
The tree size is a function of T. The larger T is, the smaller the tree is. A node is
required to fit in a page of size P, where P is a parameter of BIRCH. Once the dimension d of
the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are
determined by P. So P can be varied for performance tuning.
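The CF triple described above is usually taken to be (N, LS, SS): the number of points, their per-dimension linear sum, and the sum of their squared norms. A small Python sketch (illustrative only, not the project's Java implementation) shows the additivity that makes merging sub-clusters cheap, and how the centroid and radius fall out of the triple:

```python
import math

class CF:
    """Clustering Feature (N, LS, SS) of a sub-cluster of d-dimensional
    points: N = point count, LS = per-dimension linear sum,
    SS = sum of squared norms of the points."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = sum(x * x for x in point)

    def merge(self, other):
        # CF additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # average distance from member points to the centroid
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

# Merging the CFs of (0,0) and (2,0): centroid (1,0), radius 1
a = CF((0.0, 0.0))
a.merge(CF((2.0, 0.0)))
```

Because only these three quantities are stored per entry, a leaf entry's threshold test against T never needs the original points.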
Phases of the BIRCH algorithm:
The four phases of the BIRCH algorithm are:
1. Loading:
The main task of Phase 1 is to scan all data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk.
2. Optional Condensing:
This phase serves as a cushion between Phase 1 and Phase 3. We scan the leaf entries in the initial CF-tree to rebuild a smaller CF-tree while removing more outliers and grouping crowded sub-clusters into larger ones.
3. Global Clustering:
We use an existing global or semi global algorithm to cluster all the leaf
entries across the boundaries of different nodes. After this phase we obtain a
set of clusters that captures the major distribution of patterns in the data.
4. Cluster refining:
This phase uses the centroids of the clusters produced by the previous phase as seeds and redistributes the data points to their closest seeds to obtain a set of new clusters. It also ensures that all copies of a given data point go to the same cluster.
4.4.4 Comparison Using K-Means
The performance of the BIRCH clustering algorithm is compared with that of another clustering algorithm, K-Means. The comparison is done by comparing the time complexities and space complexities of the algorithms. The time complexities are calculated by taking the difference between the starting and ending times while running both algorithms.
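The timing measurement described above amounts to wrapping each clustering run with clock reads. A sketch in Python for illustration (the project itself takes these readings in Java; the helper name `timed` is an assumption):

```python
import time

def timed(fn, *args):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Example: time a cheap computation
result, elapsed = timed(sum, range(1000))
```

Calling such a helper around each clustering routine yields the elapsed-time figures plotted in the comparison chart.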
K-means is initialized from some random or approximate solution. Each step assigns each point to its nearest cluster, and points belonging to the same cluster are then averaged to get new cluster centroids. Each step successively improves the cluster centroids until they are stable. This is the standard version of the K-Means technique. The algorithm is designed as follows:
The K-means algorithm, based on a classification technique, uses the horizontal aggregation as input. The pivot operator is used to calculate the aggregation of particular data values from distinct fact tables, and optimization of PIVOT is provided for large numbers of fact tables. The first step in this system is database connectivity and choosing the different tables with the .mdb extension.
Horizontal aggregation is evaluated by choosing the transpose column and the aggregate operation. The pivot operator automatically transforms the table to a horizontal layout; this is the main advantage of this particular algorithm.
The K-means algorithm is the best-known squared-error-based clustering algorithm, here taking the horizontal aggregation as input. The algorithm consists of four main steps: 1) selection of the initial k means for the k clusters from the attributes of the data sets obtained from the horizontal aggregation operation; 2) calculation of the dissimilarity between an object and the mean of a cluster; 3) allocation of an object to the cluster whose mean is nearest to it; 4) recalculation of the mean of a cluster from the objects allocated to it, so that the intra-cluster dissimilarity is minimized.
4.5 System Overview
4.5.1 Use Case Description
Use Case 1:
Name: Horizontal Aggregation Module
Description: To horizontally aggregate a specified dataset so as to use it as input for the
clustering algorithms.
Flow of events:
a) Main flow
1. User logs in to SQLSERVER.
2. Connection is established.
3. The necessary datasets are loaded into SQLSERVER.
4. User is prompted for the required aggregating function, the aggregated column
and the transpose column.
5. The code for horizontal aggregation is executed.
6. Output is obtained and displayed in table format.
Activity diagram:
Figure 4.2: Activity Diagram (the user logs in to the DB, loads the tables, and is prompted for the method, transpose and aggregate columns; the query is executed, the horizontally aggregated table is obtained, and the user views the result)
Use Case Diagram:
Figure 4.3: Use Case Diagram for Horizontal Aggregation (use cases: enter the table name, the transpose column, the aggregate function, and the column to be aggregated)
Specific Requirements
SQL SERVER should contain the required datasets.
The aggregate function should be a valid one.
The column to be aggregated should contain only numerical data.
Preconditions
The required dataset should be loaded into SQL SERVER.
A connection should be established between SQL SERVER and the application.
Postconditions
The horizontally aggregated output of the dataset is displayed in table format.
Use Case 2:
Name: BIRCH Clustering Module
Description: To perform clustering using the BIRCH algorithm; its performance is compared with the K-Means algorithm.
Flow of events
a) Main flow:
1. Scan the horizontally aggregated data.
2. Create Clustering Features for each Data Point and insert it into the CF Tree.
3. Condense the CF Tree
4. Perform global clustering using any clustering algorithm such as K-Means.
5. Refine the output obtained
6. Check whether more clustering can be done on the data points
7. If yes, repeat the above steps from step 4 to 7
b) Alternative flow:
8.a: If no, then exit and display the clusters using a Bar Graph.
Activity Diagram:
Figure 4.4: BIRCH Clustering (the user logs in to the DB and obtains the aggregated data; initial scanning, CF-tree creation, condensing, global clustering and tree refinement follow, looping back while more clustering is possible)
Use Case Diagram:
Figure 4.5: Use case diagram for the BIRCH algorithm (use cases: create the CF tree, condense the clustering feature, perform global clustering, refine the tree)
Post conditions
1. View and analyze the clustered results
4.6 Algorithm
Connection to the SQL Server
Step 1: Start.
Step 2: Connect the SQLSERVER.
Step 3: Load the tables which are to be clustered into the database.
Step 4: Establish connection between the database and this application.
Step 5: Stop.
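Since the project uses the JDK with SQL Server, Step 2 presumably goes through a JDBC connection. As a sketch only, with hypothetical host and database names (the real server and database names are whatever the project configured), the connection URL could be assembled like this:

```python
def jdbc_url(host, port, database):
    """Build a SQL Server JDBC connection URL.
    Host, port and database names here are hypothetical."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database}"

url = jdbc_url("localhost", 1433, "ProjectDB")
# "jdbc:sqlserver://localhost:1433;databaseName=ProjectDB"
```

Port 1433 is the conventional SQL Server default; an actual deployment would substitute its own values.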
Horizontal Aggregation Using Pivot
Step 1: Start.
Step 2: Read a valid table name and read the transpose column name.
Step 3: Read the column name which is to be aggregated.
Step 4: Read the name of the aggregate function which is to be applied on the column.
Step 5: Display the datasets in table format.
Step 6: Execute the query of PIVOT operator.
Step 7: Display the horizontally aggregated output in table format.
Step 8: Stop.
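Steps 2 to 6 amount to assembling and executing a T-SQL PIVOT statement from the user's inputs. A sketch of the query construction, in Python for illustration (the function name is an assumption; in practice the list of transpose values would first be read from the table):

```python
def pivot_query(table, group_col, transpose_col, agg_col, method, values):
    """Assemble a T-SQL PIVOT statement from the user's choices.
    `values` are the distinct transpose-column values to become columns."""
    in_list = ", ".join(f"[{v}]" for v in values)
    return (
        f"SELECT {group_col}, {in_list} "
        f"FROM (SELECT {group_col}, {transpose_col}, {agg_col} "
        f"FROM {table}) AS src "
        f"PIVOT ({method}({agg_col}) FOR {transpose_col} "
        f"IN ({in_list})) AS pvt"
    )

# For table F of Figure 4.1: SUM of A, grouped by D1, transposed by D2
q = pivot_query("F", "D1", "D2", "A", "SUM", ["X", "Y"])
```

Executing the resulting statement in SQL Server (Step 6) returns the horizontally aggregated table displayed in Step 7.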
K-Means Algorithm
Step 1: Start.
Step 2: Initialize from some random or approximate solution.
Step 3: Assign each point to its nearest cluster.
Step 4: Average the points belonging to the same cluster to get new cluster centroids.
Step 5: Repeat steps 3 and 4, successively improving the cluster centroids until they are stable.
Step 6: Stop.
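The steps above can be sketched in Python (illustrative only; the project runs K-Means against the aggregated table from Java, and the function signature here is an assumption):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Step 2: random initial centroids; Step 3: assign each point to its
    nearest centroid; Step 4: average each cluster; Step 5: repeat until
    the centroids are stable."""
    rng = random.Random(seed)
    centroids = [tuple(map(float, p)) for p in rng.sample(points, k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # stable: stop
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups converge to centroids (0, 0.5) and (10, 10.5)
cents, clus = kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)
```

Note the a priori choice of k, the disadvantage cited in the literature survey; BIRCH avoids it by summarizing the data first.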
BIRCH Algorithm
Step 1: Start.
Step 2: Scan all data and build an initial in-memory CF-tree using the given amount of
memory and recycling space on disk.
Step 2.1: Identify the appropriate leaf: starting from the root, recursively descend the CF-tree by choosing the closest child node according to a chosen distance metric D0.
Step 2.2: Modify the leaf: on reaching a leaf node, find the closest leaf entry, say Li, and test whether Li can "absorb" the new entry "Ent" without violating the threshold condition. If so, the CF vector for Li is updated to reflect this. If not, a new entry for "Ent" is added to the leaf. If there is space on the leaf for this new entry, we are done; otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries.
Step 2.3: Modify the path to the leaf: after inserting "Ent" into a leaf, update the CF information on the path from the root to the leaf.
Step 3: Scan the leaf entries in the initial CF tree to rebuild a small CF Tree while removing
more outliers and grouping more crowded sub clusters into larger ones.
Step 4: Use an existing global or semi-global algorithm to cluster all the leaf entries across
the boundaries of different nodes.
Step 5: Use the centroids of the clusters produced by the previous phase as seeds, and
redistribute the data points to their closest seeds to obtain a set of new clusters.
Step 6: Stop.
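At the heart of Step 2.2 is the clustering feature (CF) vector, a triple (N, LS, SS) of the point count, linear sum, and squared sum of a sub-cluster. A one-dimensional sketch of the "absorb" test, assuming the threshold condition is an upper bound on the merged cluster's radius:

```python
import math

def radius(n, ls, ss):
    """Radius of a cluster summarized by the CF vector (n, ls, ss)."""
    mean = ls / n
    return math.sqrt(max(ss / n - mean * mean, 0.0))

def try_absorb(cf, point, threshold):
    """Absorb `point` into CF entry `cf` = (n, ls, ss) if the merged
    cluster's radius stays within `threshold`; otherwise leave cf alone."""
    n, ls, ss = cf
    merged = (n + 1, ls + point, ss + point * point)
    if radius(*merged) <= threshold:
        return merged, True   # Li absorbs Ent: update the CF vector
    return cf, False          # a new leaf entry is needed for Ent

# CF vector summarizing the points {1.0, 1.2}
cf = (2, 2.2, 2.44)
cf, ok = try_absorb(cf, 1.1, threshold=0.5)   # nearby point: absorbed
print(ok, cf)
cf, ok = try_absorb(cf, 5.0, threshold=0.5)   # distant point: rejected
print(ok, cf)
```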
4.7. System Architecture
[System architecture diagram: the datasets in the database undergo initial scanning and tree condensing, followed by horizontal aggregation using PIVOT.]
Figure 4.6: System architecture
5. IMPLEMENTATION
Implementation is an activity that continues throughout the development phase. It is
the process of bringing a developed system into operational use and turning it over to the user.
The new system and its components are to be tested in a structured and planned manner. A
successful system should be delivered, and users should have confidence that the system
will work efficiently and effectively. The more complex the system being implemented, the
more involved the system analysis and design effort required for implementation will be.
Implementation is the stage of the project when the theoretical design is turned into
the working system. It is the most crucial stage in achieving a successful new system and
giving the users confidence that the new system will work and be effective. The coding
process performs this task. Coding is viewed as a natural consequence of design.
5.1. Top down Design
A top-down approach is essentially the breaking down of a system to gain insight into
its compositional sub-systems. In a top-down approach an overview of the system is first
formulated, specifying but not detailing any first-level subsystems. Each subsystem is then
refined in yet greater detail, sometimes in many additional subsystem levels, until the entire
specification is reduced to base elements. A top-down model is often specified with the
assistance of "black boxes", which make it easier to manipulate. However, black boxes may
fail to elucidate elementary mechanisms, or may not be detailed enough to realistically validate the
model.
Top-down approaches emphasize planning and a complete understanding of the
system. It is inherent that no coding can begin until a sufficient level of detail has been
reached in the design of at least some part of the system. In the top-down approach, testing is done by
attaching stubs in place of modules that are not yet implemented. This, however, delays testing of the ultimate
functional units of a system until significant design is complete. Bottom-up emphasizes
coding and early testing, which can begin as soon as the first module has been specified. This
approach, however, runs the risk that modules may be coded without having a clear idea of
how they link to other parts of the system, and that such linking may not be as easy as first
thought. Re-usability of code is one of the main benefits of the bottom-up approach.
Top-down is a programming style, the mainstay of traditional procedural languages,
in which design begins by specifying complex pieces and then dividing them into
successively smaller pieces. Eventually, the components are specific enough to be coded and
the program is written. This is the exact opposite of the bottom-up programming approach
which is common in object-oriented languages such as C++ or Java.
The technique for writing a program using top-down methods is to write a main
procedure that names all the major functions it will need. Later, the programming team looks
at the requirements of each of those functions and the process is repeated. These
compartmentalized sub-routines eventually will perform actions so simple they can be easily
and concisely coded. When all the various sub-routines have been coded the program is done.
By defining how the application comes together at a high level, lower level work can
be self-contained. By defining how the lower level abstractions are expected to integrate into
higher level ones, interfaces become clearly defined.
Practicing top-down programming has several advantages:
Separating the low-level work from the higher-level abstractions leads to a
modular design.
Modular design means development can be self-contained.
Having "skeleton" code illustrates clearly how low-level modules integrate.
Fewer errors, because each module is processed separately, so programmers
can devote more time to each one.
Much less time consuming, because each programmer is only involved in a part of the
big project.
A very optimized way of processing, because each programmer applies their own
knowledge and experience to their own modules.
Easy to maintain: if an error occurs in the output, it is easy to identify which
module of the program generated it.
5.2. Bottom up Approach
A bottom-up approach is the piecing together of systems to give rise to grander
systems, thus making the original systems sub-systems of the emergent system. In a bottom-
up approach the individual base elements of the system are first specified in great detail.
These elements are then linked together to form larger subsystems, which then in turn are
linked, sometimes in many levels, until a complete top-level system is formed. This strategy
often resembles a "seed" model, whereby the beginnings are small but eventually grow in
complexity and completeness. However, "organic strategies" may result in a tangle of
elements and subsystems, developed in isolation and subject to local optimization as opposed
to meeting a global purpose.
Object-oriented programming (OOP) is a programming paradigm that uses "objects"
to design applications and computer programs. This bottom-up approach has one weakness:
a lot of intuition is needed to decide the functionality each
module should provide. If a system is to be built from an existing system, this approach is more suitable, as it
starts from existing modules.
5.3. Software Development Process Model
A software development process, also known as a software development life cycle
(SDLC), is a structure imposed on the development of a software product. Similar terms
include software life cycle and software process. It is often considered a subset of systems
development life cycle. There are several models for such processes, each describing
approaches to a variety of tasks or activities that take place during the process. Some people
consider a life-cycle model a more general term and a software development process a more
specific term. For example, there are many specific software development processes that 'fit'
the spiral life-cycle model. ISO/IEC 12207 is an international standard for software life-cycle
processes. It aims to be the standard that defines all the tasks required for developing and
maintaining software.
There are many different models and methodologies, but each generally consists of a series of
defined steps or stages. A few popular models are:
Waterfall Model
V-Model
Spiral Model
Prototyping
Iterative (Incremental Delivery)
Rapid Application Development
The engineering process that we used to complete our project was the
V-model, or Validation model. This is an elaboration of the Waterfall model and emphasizes
validation activities corresponding to each of the activities that create the products of the
project. It can be seen as expanding the testing activity in the Waterfall model. Each step has a
matching validation process that can give feedback (in case defects are found) to the
corresponding development stage and a reworking of the succeeding steps. For example, the
system designer might have written that a calculation be carried out in a certain way.
The person who has written the program design might have misunderstood what was
required and written the pseudo code in a different way. At system testing stage, the system
designer would carry out checks to ensure that the software is doing what was specified in the
design document and would discover the program designer’s misreading of that document.
So all the steps from program design onward have to be repeated to correct the defect.
Figure 5.1: V- Model
6. TESTING STRATEGIES
6.1 Black Box Testing
Black-box and white-box are test design methods. Black-box test design treats the
system as a "black box", so it does not explicitly use knowledge of the internal structure.
Black-box test design is usually described as focusing on testing functional requirements.
Synonyms for black-box include: behavioral, functional, opaque-box, and closed-box. Black-box
testing is also known as Opaque Box or Closed Box Testing.
Black-box testing takes an external perspective of the test object to derive test cases. These
tests can be functional or non-functional, though usually functional. The test designer selects
valid and invalid inputs and determines the correct output. There is no knowledge of the test
object's internal structure. The higher the level, and hence the bigger and more complex the
box, the more one is forced to use black box testing to simplify. While this method can
uncover unimplemented parts of the specification, one cannot be sure that all existent paths
are tested. Typical black box test design techniques include:
Decision table testing
Pairwise testing
State transition tables
Use case testing
Cross-functional testing
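As a small illustration of black-box test design, consider a hypothetical validator whose specification says it must accept percentages from 0 to 100. The test cases below are chosen from the specification alone, exercising valid and invalid inputs including the boundaries, without any reference to the function's internals:

```python
def is_valid_percentage(value):
    """Hypothetical unit under test: accept numbers in the range 0..100."""
    return isinstance(value, (int, float)) and 0 <= value <= 100

# Black-box test cases: derived purely from the specification
valid_inputs = [0, 50, 100]          # boundaries and a typical value
invalid_inputs = [-1, 101, "fifty"]  # out-of-range and wrong-type values

for v in valid_inputs:
    assert is_valid_percentage(v)
for v in invalid_inputs:
    assert not is_valid_percentage(v)
print("all black-box cases passed")
```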
6.2. White Box Testing
White-box test design allows one to peek inside the "box", and it focuses specifically
on using internal knowledge of the software to guide the selection of test data. Synonyms for
white-box include: structural, glass-box and clear-box. While black-box and white-box are
terms that are still in popular use, many people prefer the terms "behavioral" and "structural".
Behavioral test design is slightly different from black-box test design because the use of
internal knowledge isn't strictly forbidden, but it's still discouraged. In practice, it hasn't
proven useful to use a single test design method. One has to use a mixture of different
methods so that they aren't hindered by the limitations of a particular one. Some call this
"gray-box" or "translucent-box" test design, but others wish we'd stop talking about boxes
altogether. It is important to understand that these methods are used during the test design
phase, and their influence is hard to see in the tests once they're implemented. Note that any
level of testing (unit testing, system testing, etc.) can use any test design methods. Unit
testing is usually associated with structural test design, but this is because testers usually don't
have well-defined requirements at the unit level to validate. White-box testing is also known as Clear Box,
Glass Box, or Open Box Testing. It is used to examine internal structures of the software,
such as conditional loops and statement coverage, and is mainly done by the developers.
White box testing (clear box testing, glass box testing, and transparent box testing, or
structural testing) uses an internal perspective of the system to design test cases based on
internal structure. It requires programming skills to identify all paths through the software.
The tester chooses test case inputs to exercise paths through the code and determines the
appropriate outputs. In electrical hardware testing, every node in a circuit may be probed and
measured; an example is in-circuit testing (ICT).
Since the tests are based on the actual implementation, if the implementation changes,
the tests probably will need to change, too. For example ICT needs updates if component
values change, and needs modified/new fixture if the circuit changes. This adds financial
resistance to the change process, thus buggy products may stay buggy.
Automated optical inspection (AOI) offers similar component-level
correctness checking without the cost of ICT fixtures; however, changes still require test
updates. While white box testing is applicable at the unit, integration, and system levels of the software testing
process, it is typically applied at the unit level. While it normally tests paths within a unit, it can
also test paths between units during integration, and between
subsystems during a system-level test. Though this method of test design can uncover an
overwhelming number of test cases, it might not detect unimplemented parts of the
specification or missing requirements, but one can be sure that all paths through the test
object are executed.
Typical white box test design techniques include:
Control flow testing
Data flow testing
Branch testing
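A small illustration of branch testing, the last of the techniques listed above: the inputs below are chosen by reading the code of a hypothetical function so that every branch is executed at least once.

```python
def classify(n):
    """Hypothetical unit under test with three branches."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# White-box (branch) test cases: one input per branch in the source
assert classify(-5) == "negative"   # covers the n < 0 branch
assert classify(0) == "zero"        # covers the n == 0 branch
assert classify(7) == "positive"    # covers the else branch
print("all three branches covered")
```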
6.2.1 Module Testing
In computer programming, module testing is a software verification and validation
method in which a programmer tests if individual units of source code are fit for use. A unit
is the smallest testable part of an application. In procedural programming a unit may be an
individual function or procedure. Unit tests are typically written and run by software
developers to ensure that code meets its design and behaves as intended. Its implementation
can vary from being very manual (pencil and paper) to being formalized as part of build
automation. The primary goal of module testing is to take the smallest piece of testable
software in the application, isolate it from the remainder of the code, and determine whether
it behaves exactly as you expect. Each unit is tested separately before integrating them into
modules to test the interfaces between modules. Unit testing has proven its value in that a
large percentage of defects are identified during its use.
The most common approach to unit testing requires drivers and stubs to be written.
The driver simulates a calling unit and the stub simulates a called unit. The investment of
developer time in this activity sometimes results in demoting unit testing to a lower level of
priority and that is almost always a mistake. Even though the drivers and stubs cost time and
money, unit testing provides some undeniable advantages. It allows for automation of the
testing process, reduces difficulties of discovering errors contained in more complex pieces
of the application, and test coverage is often enhanced because attention is given to each unit.
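The driver/stub idea above can be sketched as follows; `summarize`, `stub_fetch`, and the driver are hypothetical names used only for illustration:

```python
# Unit under test: summarizes whatever its collaborator fetches
def summarize(fetch):
    values = fetch()
    return sum(values) / len(values)

# Stub: simulates the called unit (e.g. a database read) with fixed data
def stub_fetch():
    return [2, 4, 6]

# Driver: simulates the calling unit by invoking the unit under test
def driver():
    result = summarize(stub_fetch)
    assert result == 4.0  # isolated check against the stubbed data
    return result

print(driver())
```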
6.2.2 Integration Testing
Integrated design is a collaborative method for designing buildings which emphasizes
the development of a holistic design. Conventional building design usually involves a series
of hand-offs from owner to architect to builder to occupant. This path does not invite all
affected parties into the planning process, and therefore does not take into account their
needs, areas of expertise or insights. In some cases, using the conventional method,
incompatible elements of the design are not discovered until late in the process when it is
expensive to make changes. In addition to extensive collaboration, integrated design involves
a “whole building design” approach. A building is viewed as an interdependent system, as
opposed to an accumulation of its separate components (site, structure, systems and use). The
goal of looking at all the systems together is to make sure they work in harmony rather than
against each other.
6.3. Testing Strategies
Test Case 1:
Enter the table name PIVOTCONTENT
Enter the method, column to be aggregated and transpose column
6.4 Summary of Test Results
Case 1: PIVOTCONTENT is entered as table name. The dataset is displayed. Method
entered was sum, Column to be aggregated was A and transpose column was D2.
The aggregated output was displayed. The K-Means algorithm is executed and
displayed in table format and BIRCH clustering algorithm is displayed as a Bar
chart.
Figure 6.1: Optimising options
Figure 6.2: Comparison Options
Figure 6.3: PIVOTCONTENT
D1 D2 A
3 x 9
2 y 6
1 y 10
1 y 0
2 x 1
1 x -1
3 x 8
2 x 7
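The aggregated output of Case 1 can be checked by hand from the PIVOTCONTENT rows above. A short Python sketch that reproduces the same aggregation (method sum, aggregated column A, transpose column D2, grouped by D1) without a database:

```python
# The PIVOTCONTENT rows from the table above: (D1, D2, A)
rows = [
    (3, "x", 9), (2, "y", 6), (1, "y", 10), (1, "y", 0),
    (2, "x", 1), (1, "x", -1), (3, "x", 8), (2, "x", 7),
]

# Horizontal aggregation: one output row per D1, one column per D2 value
result = {}
for d1, d2, a in rows:
    result.setdefault(d1, {}).setdefault(d2, 0)
    result[d1][d2] += a

print(result)  # {3: {'x': 17}, 2: {'y': 6, 'x': 8}, 1: {'y': 10, 'x': -1}}
```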
7. RESULTS AND DISCUSSION
The Application horizontally aggregated the datasets and produced the clustered output.
1. Results for the dataset PIVOTCONTENT :
Figure 7.1: Loading Table to be Aggregated
Figure 7.2: The Loaded Table
Figure 7.3: Method of Aggregation
Figure 7.4: Column to be Aggregated
Figure 7.5: Transpose Column
Figure 7.6: After Horizontal Aggregation
Figure 7.7: After K-Means Clustering
Figure 7.8: After Hierarchical BIRCH Clustering
Figure 7.9: Comparison of Hierarchical and BIRCH Clustering
Comparing the results of horizontal aggregation with and without the PIVOT
operator proved a tedious task; hence, a comparison chart for clustering
with K-Means and BIRCH was implemented instead.
The output of BIRCH clustering was to be shown on a dot-matrix graph.
Since that was difficult to do, a bar graph of the BIRCH clustering output was
produced instead.
We implemented horizontal aggregation and clustering on simple databases,
since the PIVOT operator works only for simple ones.
8. CONCLUSION
This project has a great chance of creating a revolution in the field of data mining.
Most data mining applications require data in summarized form, so the horizontal
aggregation of the data returns aggregated columns in a horizontal layout. Most algorithms
require datasets with a horizontal layout as input, with several records and one variable or
dimension per column. Horizontal aggregation can be performed using the PIVOT operator
on tabular data; by exchanging rows and columns, it enables data transformations useful in data modelling, data
analysis, and data presentation.
The clustering of the data was done using the BIRCH algorithm, which can handle
large amounts of data and produces good clusters. It also removes the outliers and
hence minimizes the running time. Data clustering has wide applications in fields
such as image processing, bioinformatics, and market analysis.
We implemented horizontal aggregation using the PIVOT operator on
databases, and the aggregated output was clustered using the BIRCH algorithm. The clustered
results are displayed on a bar chart. A comparison between clustering with
K-Means and with BIRCH was performed; the result was that the time and
space complexity of BIRCH clustering were much lower than those of K-Means, proving it
an efficient method for large datasets.
9. FUTURE ENHANCEMENT
Horizontal aggregation for the methods min, max, and count can be implemented in
future.
In future, the input to the system can be given as multiple tables rather than a single
table.
This project can be extended to develop a more formal model of evaluation methods
to achieve better results. More complete, cost-effective methods can then be
developed.
10. REFERENCES
Web references:
[1] www.hiv.lang.gov/content/sequence/HEATMAP/heatmap.html
[2] www.ines-conf.org/ines-conf/58-INES2004.pdf
[3] www.ijert.org/view.php?id=2378&title=horizontal-aggregations-in-sql-by-using-clustering-as-a-mixture-gaussians
[4] www.ijarcsse.com/docs/papers/May2012/Volum2_issue5/V2I500441.pdf
Book references:
[1] Reda Alhajj: Advanced Data Mining and Applications, 2007
[2] Richard T. Watson: Data Management, Databases and Organizations, John Wiley & Sons, Inc., Canada, 2002
[3] S. Sumathi, S. Esakkirajan: Fundamentals of Relational Database Management Systems, Springer-Verlag Berlin Heidelberg, 2007
[4] Charu C. Aggarwal, Chandan K. Reddy: Data Clustering: Algorithms and Applications, CRC Press, 2014
[5] Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques, Academic Press, 2001
APPENDIX A
UML Diagrams
Activity Diagrams
An activity diagram describes how activities are coordinated. It is particularly useful when
you know that an operation has to achieve a number of different things, and you want to
model the essential dependencies between them before you decide in what order to do them.
It records the dependencies between activities, such as which things can happen in parallel
and what must be finished before something else can start, and represents the workflow of
the process.
Activity diagrams commonly contain:
Activity states and action states
Transitions
Objects
Action states are atomic and cannot be decomposed; the work of an action state is not
interrupted. Activity states can be further decomposed, their activity being represented by
other activity diagrams, and they may be interrupted. When the action or activity of a state
completes, flow of control passes immediately to the next action or activity state.
A flow of control has to start and end someplace: the initial state is shown as a solid ball,
and the stop state as a solid ball inside a circle. A branch specifies alternate paths taken
based on some Boolean expression. A branch may have one incoming transition and two or
more outgoing ones.
Use Case Diagrams
Use case diagrams overview the usage requirements for a system. They are useful for
presentations to management and/or project stakeholders, but for actual development you will
find that use cases provide significantly more value, because they describe "the meat" of the actual
requirements. Use case diagrams depict:
Use cases: A use case describes a sequence of actions that provide something
of measurable value to an actor and is drawn as a horizontal ellipse.
Actors: An actor is a person, organization, or external system that plays a role
in one or more interactions with your system. Actors are drawn as stick
figures.
Associations: Associations between actors and use cases are indicated in use
case diagrams by solid lines. An association exists whenever an actor is
involved with an interaction described by a use case.
System boundary boxes (optional): You can draw a rectangle around the use
cases, called the system boundary box, to indicate the scope of your system.
Anything within the box represents functionality that is in scope and anything
outside the box is not.
Packages (optional): Packages are UML constructs that enable you to
organize model elements (such as use cases) into groups. Packages are
depicted as file folders and can be used on any of the UML diagrams,
including both use case diagrams and class diagrams.