horizontal aggregation and hierarchical clustering

Department of Computer Science & EngineeringCS010 807 Main Project Rajagiri School of Engineering and Technology, Rajagiri Valley

1. INTRODUCTION

Most algorithms require datasets with horizontal layout as input with several records

and one variable or dimensions per columns. Managing large data sets without DBMS

support can be a difficult task. Trying different subsets of data points and dimensions is more

flexible, faster and easier to do inside a relational database with SQL queries than outside

with alternative tool. Horizontal aggregation can be performing by using operator, it can

easily be implemented inside a query processor, much like a select, project and join. PIVOT

operator on tabular data that exchange rows, enable data transformations useful in data

modeling, data analysis, and data presentation.

There are many existing functions and operators for aggregation in Structured Query

Language. The most commonly used aggregation is the sum of a column and other

aggregation operators return the average, maximum, minimum or row count over groups of

rows. All operations for aggregation have many limitations to build large data sets for data

mining purposes. Database schemas are also highly normalized for On-Line Transaction

Processing (OLTP) systems where data sets that are stored in a relational database or data

warehouse. But data mining, statistical or machine learning algorithms generally require

aggregated data in summarized form. Data mining algorithm requires suitable input in the

form of cross tabular (horizontal) form; significant effort is required to compute aggregations

for this purpose. Such effort is due to the amount and complexity of SQL code which needs

to be written, optimized and tested.

To analyze data efficiently, Data mining systems are widely using datasets with

columns in horizontal tabular layout. Preparing a data set is more complex task in a data

mining project, requires many SQL queries, joining tables and aggregating columns.

Conventional RDBMS usually manage tables with vertical form. Aggregated columns in a

horizontal tabular layout returns set of numbers, instead of one number per row. The system

uses one parent table and different child tables, operations are then performed on the data

loaded from multiple tables. PIVOT operator, offered by RDBMS is used to calculate

aggregate operations. PIVOT method is much faster method and offers much scalability.

1 HORIZONTAL AGGREGATION & HIERARCHICAL CLUSTERING


Partitioning large set of data, obtained from the result of horizontal aggregation, in to

homogeneous cluster is important task in this system. K-means algorithm using SQL is best

suited for implementing this operation. To overcome the problems in K-means algorithm, we

are optimizing it by using hierarchical clustering.

1.1 Project Scope

The objective of this project is to find an efficient clustering algorithm that overcomes

the defects of commonly used K-means algorithm for portioning large set of data, obtained

as a result of horizontal aggregation into homogeneous clusters.

Data mining, statistical or machine learning algorithms generally require aggregated

data in summarized form. To analyze data efficiently, Data mining systems are widely using

datasets with columns in horizontal tabular layout. Preparing a data set is more complex task

in a data mining project, requires many SQL queries, joining tables and aggregating columns.


horizontal tabular layout returns set of numbers, instead of one number per row. The system

uses one parent table and different child tables, operations are then performed on the data

loaded from multiple tables. Partitioning large set of data, obtained from the result of

horizontal aggregation, in to homogeneous cluster is important task in data mining analysis.

The proposed implementations allow clustering large data sets in a relational DBMS

eliminating the need to ex-port or access data outside the DBMS. Horizontal aggregation

represents a template to generate SQL code from a data mining tool. This SQL code reduces

manual work in the data preparation phase in data mining related project.

Automatically generated code, which is more efficient than end user written SQL

code. Thus datasets for the data mining projects can be created in less time. Data sets can be

created entirely inside the DBMS .



1.2 Design and Implementation Constraints Managing large data sets without DBMS support can be a difficult task.

All aggregations operation has many limitations to build data sets for data mining

purposes

SQL code needs to be written, optimized and tested.

1.3 Assumptions and Dependencies Data in numerical form is preferred.

Data should be applicable for performing horizontal aggregation.

It is assumed that there is no missing attributes.

The obtained data sets after horizontal aggregation should be suitable for performing

clustering.

The data is considered to be in numerical form.

1.4 Advantages Works well with outliers

Produces good clusters with a single scan of data set

Minimizes running time

Linearly scalable

Less space complexity



2. LITERATURE SURVEY

2.1 Background

Horizontal aggregation is new class of function to return aggregated columns in a

horizontal layout. Most algorithms require datasets with horizontal layout as input with

several records and one variable or dimensions per columns. Preparing a data set is more

complex task in a data mining project, requires many SQL queries, joining tables and

aggregating columns. Horizontal aggregation can be performed by using the PIVOT operator.

It can easily be implemented inside a query processor, much like a select, project and join.

PIVOT operator on tabular data that exchange rows, enable data transformations useful in

data modeling, data analysis, and data presentation. PIVOT method is much faster method

and offers much scalability.


homogeneous cluster is important task in this system. K-means algorithm using SQL is best

suited for implementing this operation. To overcome the problems in K-means algorithm, we

are optimizing it by using hierarchical clustering.

Similar IEEE papers:

(i) Horizontal Aggregations in SQL by Using Clustering as a Mixture

Gaussians

To analyze data efficiently, Data mining systems are widely using datasets with

columns in horizontal tabular layout. Preparing a data set is more complex task in a data

mining project, requires many SQL quires, joining tables and aggregating columns.


horizontal tabular layout returns set of numbers, instead of one number per row [3].The

system uses one parent table and different child tables, operations are then performed on the

data loaded from multiple tables. PIVOT operator, offered by RDBMS is used to calculate



aggregate operations. PIVOT method is much faster method and offers much scalability.


homogeneous cluster is important task in this system. Gaussians algorithm using SQL is best

suited for implementing

(ii) Optimization of Horizontal Aggregation in SQL by Using K-Means

Clustering

Horizontal aggregation is new class of function to return aggregated columns in a

horizontal layout. Horizontal aggregation can be performing by using operator, it can easily

be implemented inside a query processor, much like a select, project and join. PIVOT

operator on tabular data that exchange rows, enable data transformations useful in data

modeling, data analysis, and data presentation. K-means clustering algorithms are used to

cluster the attribute, that attribute is the result of horizontal aggregation. Clustering using the

K-Means algorithm requires the user to specify the number of clusters, which is a major

disadvantage[4].



3. HARDWARE AND SOFTWARE SPECIFICATIONS

Minimum Software requirements:

IDE: Netbeans 7.01

SQL SERVER 2005

JDK 1.6

Minimum Hardware requirements:

1 GB RAM

3.1 Java Development Kit

A Java Development Kit (JDK) is a program development environment for writing

Java applets and applications. It consists of a runtime environment that "sits on top" of the

operating system layer as well as the tools and programming that developers need to compile,

debug, and run applets and applications written in the Java language.

3.2 SQL Server 2005

Microsoft SQL Server is a relational database management system developed by

Microsoft. As a database, it is a software product whose primary function is to store and

retrieve data as requested by other software applications, be it those on the same computer or

those running on another computer across a network (including the Internet). There are at

least a dozen different editions of Microsoft SQL Server aimed at different audiences and for

workloads ranging from small single-machine applications to large Internet-facing

applications with many concurrent users. Its primary query languages are T-SQL and Slit

included native support for managing XML data, in addition to relational data. For this

purpose, it defined an xml data type that could be used either as a data type in database

columns or as literals in queries. XML columns can be associated with XSD schemas; XML

data being stored is verified against the schema. XML is converted to an internal binary data


http://en.wikipedia.org/wiki/XSD

http://en.wikipedia.org/wiki/Literal_(computer_science)

http://en.wikipedia.org/wiki/Data_type

http://en.wikipedia.org/wiki/Relational_database

http://en.wikipedia.org/wiki/XML

http://en.wikipedia.org/wiki/Transact-SQL

http://en.wikipedia.org/wiki/Query_language

http://en.wikipedia.org/wiki/Microsoft

http://en.wikipedia.org/wiki/Relational_database_management_system

http://searchcio-midmarket.techtarget.com/definition/operating-system

http://searchsoa.techtarget.com/definition/applet

http://searchsoa.techtarget.com/definition/Java


type before being stored in the database. Specialized indexing methods were made available

for XML data. XML data is queried using XQuery; SQL Server 2005 added some extensions

to the T-SQL language to allow embedding XQuery queries in T-SQL. SQL Server 2005 has

also been enhanced with new indexing algorithms, syntax and better error recovery systems.

Data pages are check summed for better error resiliency, and optimistic concurrency support

has been added for better performance. Permissions and access control have been made more

granular and the query processor handles concurrent execution of queries in a more efficient

way. Partitions on tables and indexes are supported natively, so scaling out a database onto a

cluster is easier. SQL CLR was introduced with SQL Server 2005 to let it integrate with

the .NET Framework. SQL Server 2005 introduced "MARS" (Multiple Active Results Sets),

a method of allowing usage of database connections for multiple purposes.

3.3 Net Beans 7.01

NetBeans is an integrated development environment (IDE) for developing primarily

with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an

application platform framework for Java desktop applications and others. The NetBeans IDE

is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms

supporting a compatible JVM. The NetBeans Platform allows applications to be developed

from a set of modular software components called modules.

Applications based on the NetBeans Platform (including the NetBeans IDE itself) can

be extended by third party developers. The NetBeans Platform is a reusable framework for

simplifying the development of Java Swing desktop applications. The NetBeans IDE bundle

for Java SE contains what is needed to start developing NetBeans plugins and NetBeans

Platform based applications; no additional SDK is required. NetBeans IDE is a free, open-

source, cross-platform IDE with built-in-support for Java Programming Language.

The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris

and other platforms supporting a compatible JVM. The NetBeans Platform allows

applications to be developed from a set of modular software components called modules.


http://en.wikipedia.org/wiki/Software_component

http://en.wikipedia.org/wiki/Java_Virtual_Machine

http://en.wikipedia.org/wiki/Java_Swing

http://en.wikipedia.org/wiki/Software_framework

http://en.wikipedia.org/wiki/Third_party_developer

http://en.wikipedia.org/wiki/Software_component

http://en.wikipedia.org/wiki/Java_Virtual_Machine

http://en.wikipedia.org/wiki/Platform_(computing)

http://en.wikipedia.org/wiki/HTML5

http://en.wikipedia.org/wiki/C%2B%2B

http://en.wikipedia.org/wiki/C_(programming_language)

http://en.wikipedia.org/wiki/PHP

http://en.wikipedia.org/wiki/Java_(programming_language)

http://en.wikipedia.org/wiki/Integrated_development_environment

http://en.wikipedia.org/wiki/Cluster_(computing)

http://en.wikipedia.org/wiki/Checksum

http://en.wikipedia.org/wiki/T-SQL

http://en.wikipedia.org/wiki/XQuery


NetBeans Platform (including the NetBeans IDE itself) can be extended by third party

developers.

4. SYSTEM ANALYSIS AND DESIGN

4.1 Existing System

Existing system refers to a system which presently we are using or provided and

contains certain limitations which are researched so as to find a solution to it. The existing

system consists of performing horizontal aggregation using pivot and clustering using K-

Means. PIVOT operator on tabular data that exchange rows, enable data transformations

useful in data modeling, data analysis, and data presentation. Data mining algorithm requires

suitable input in the form of cross tabular (horizontal) form; significant effort is required to

compute.

K- Means is initialized from some random or approximate solution. Each step assigns

each point to its nearest cluster and then points belonging to the same cluster are averaged to

get new cluster centroids. Each step successively improves cluster centroids until they are

stable. This is the standard version of K-Means technique used.

But this clustering Technique has many disadvantages. Firstly the learning

algorithm requires a priori specification of the number of cluster centers. If there are two

highly overlapping data then k-means will not be able to resolve that there are two clusters.

Euclidean distance measures can unequally weight underlying factors. The learning algorithm

provides the local optima of the squared error function. Randomly choosing of the cluster

center cannot lead us to the fruitful result. It is applicable only when mean is defined i.e. fails

for categorical data. It is unable to handle noisy data and outliers and fails for non-linear data

set.

4.2 Requirement Analysis

SQL Server 2008





SQL server is a RDBMS developed by Microsoft. This is the RDBMS that we have

used for the project. It supports pivot operation. The databases are created in SQL SERVER.

Connection is established with SQL server and then databases are read, aggregated and

clustered accordingly.

Database

Databases are created to operate large quantities of information by inputting, storing,

retrieving and managing that information. The databases used for horizontal aggregation and

clustering should contain numerical data.

4.3 Proposed System

Proposed system refers to a system where we apply our ideas or do slight

modifications to the Existing System. Here we are choosing a more efficient algorithm for

clustering i.e. BIRCH (Balanced Iterative reducing and Clustering using Hierarchies), which

is an unsupervised data mining algorithm used to perform hierarchical clustering over

particularly large data-sets. An advantage of BIRCH is its ability to incrementally and

dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce

the best quality clustering for a given set of resources (memory and time constraints). In most

cases, BIRCH only requires a single scan of the database. In addition, BIRCH is recognized

as the first clustering algorithm proposed in the database area to handle 'noise' (data points

that are not part of the underlying pattern) effectively.

This Clustering Algorithm has many advantages. It is local in that each clustering

decision is made without scanning all data points and currently existing clusters. It exploits

the observation that data space is not usually uniformly occupied and not every data point is

equally important. It makes full use of available memory to derive the finest possible sub-

clusters while minimizing I/O costs. It is also an incremental method that does not require the

whole data set in advance.

4.4. Module Division

1. Interface Design

2. Horizontal Aggregation


http://en.wikipedia.org/wiki/Time_constraint

http://en.wikipedia.org/wiki/Data_point

http://en.wikipedia.org/wiki/Data_clustering

http://en.wikipedia.org/wiki/Data_mining


3. BIRCH Clustering

4. Comparison using K-Means.

4.4.1 Interface Design

The Interface of our project is basically made with NETBEANS 7.01. NetBeans is an

integrated development environment (IDE) for developing primarily with Java, but also with

other languages, in particular PHP, C/C++, and HTML5. It is also an application platform

framework for Java desktop applications and others. Desktop Application of Netbeans has

been used for our project to give a professional style. It consists of a Menubar with a set of

menu options. One of the menu option is “Optimising” in this we have different menu items

to display the required table, to perform Horizontal Aggregation, K-Means Clustering and

BIRCH Clustering.

When the Dataset menuItem is clicked, the user is prompted for the dataset name. On

entering a valid Dataset name, the corresponding Dataset is displayed. If a valid name is not

entered, no Table will be displayed and is an error.

When the Horizontal Aggregation menuItem is clicked, the user is prompted for the

METHOD i.e. what aggregation function is to be used for the Horizontal Aggregation using

pivot , COLUMN TO BE AGGREGATED i.e. which on column values should the

aggregation function be applied and then the TRANSPOSE i.e. which columns values has to

be transposed into separated columns. Once valid entries are made, the horizontally

aggregated output will be displayed in a table format. If illegal values are entered, no table

will be displayed.

When the optimizing menuItem is clicked, using the previously entered values the K-

means clustering algorithm is executed and the output is displayed in table format. The K-

Means clustering algorithm is performed for comparison purpose.

When the BIRCH clustering menuItem is clicked, the BIRCH clustering code is

executed. The output consists of displaying the number of data points within a specified

cluster. This output is displayed as a Bar chart.


http://en.wikipedia.org/wiki/Platform_(computing)

http://en.wikipedia.org/wiki/HTML5

http://en.wikipedia.org/wiki/C%2B%2B

http://en.wikipedia.org/wiki/C_(programming_language)

http://en.wikipedia.org/wiki/PHP

http://en.wikipedia.org/wiki/Java_(programming_language)

http://en.wikipedia.org/wiki/Integrated_development_environment


Another menu Option in this menu is for comparison. It contains a menu item to

display the comparison chart. When this menu Item is pressed, the comparison chart is

displayed. This contains the comparison of the time complexity and space complexity of both

the clustering algorithms.

4.4.2 Horizontal Aggregation

The second module of this project is performing horizontal Aggregation. Horizontal

aggregation is new class of function to return aggregated columns in a horizontal layout.

Most algorithms require datasets with horizontal layout as input with several records and one

variable or dimensions per columns. Managing large data sets without DBMS support can be

a difficult task. Trying different subsets of data points and dimensions is more flexible, faster

and easier to do inside a relational database with SQL queries than outside with alternative

tool. Horizontal aggregation is done using PIVOT operator. The PIVOT operator on tabular

data exchange rows, enable data transformations useful in data modeling, data analysis, and

data presentation.

There are several advantages for horizontal aggregation. First one is horizontal

aggregation represent a template to generate SQL code from a data mining tool. This SQL

code reduces manual work in the data preparation phase in data mining related project.

Second is automatically generated code, which is more efficient than end user written SQL

code. Thus datasets for the data mining projects can be created in less time. Third advantage

is the data sets can be created entirely inside the DBMS.

K D1 D2 A

1 3 X 9

2 2 Y 6

3 1 Y 10

4 1 Y 0

5 2 X 1

6 1 X Null

7 3 X 8

8 2 X 711 HORIZONTAL AGGREGATION & HIERARCHICAL CLUSTERING

D1 D2X D2Y

1 null 10

2 8 6

3 17 Null


Figure 4.1: Horizontal aggregation of Table F

To perform horizontal Aggregation in our project, the user is prompted for the

METHOD i.e. what aggregation function is to be used for Horizontal Aggregation using

pivot , COLUMN TO BE AGGREGATED i.e. which on column values should the

aggregation function be applied and then the TRANSPOSE i.e. which columns values has to

be transposed into separated columns. Once valid entries are made, the horizontally

aggregated output will be displayed in a table format. If illegal values are entered, no table

will be displayed.

4.4.3 BIRCH Clustering

BIRCH is a clustering algorithm which deals with large datasets by first generating a

more compact summary that retains as much distribution information as possible, and then

clustering the data summary instead of the dataset.

BIRCH summarizes a dataset into a set of sub clusters to reduce the scale of the

clustering problem using the concept of clustering feature.

Clustering feature: Clustering feature (CF) entry is a triple summarizing the

information that we maintain about a sub cluster of a data points.

CF-tree: A CF-tree is a height-balanced tree with two parameters: branching factor

( B for non leaf node and L for leaf node) and threshold T. Each nonleaf node contains atmost

B entries of the form [CFi, childi], where i=1,2,…,B, ’childi’ is a pointer to its i-th child node,

and CFi is the CF entry of the sub cluster represented by this child. So a nonleaf node

represents a sub cluster made up of all the sub clusters represented by its entries. In addition,

each leaf node has two pointers, ‘prev’ and ‘next’, which are used to chain all leaf nodes

together for efficient scans. All entries in a leaf node must satisfy a threshold requirements,

with respect to a threshold value T: the diameter( alternatively , the radius) of each leaf entry

has to be less than T.



The tree size is a function of T. The larger T is, the smaller the tree is. A node is

required to fit in a page of size P, where P is a parameter of BIRCH. Once the dimension d of

the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are

determined by P. So P can be varied for performance tuning.

Phases of BIRCH algorithm:

The four phases of BIRCH algorithm are:

1. Loading :

The main task of Phase 1 is to scan all data and build an initial in-memory CF-

tree using the given amount of memory and recycling space on disk.

2. Optional Condensing:

This phase serves as a cushion between phase one loading and phase three. We

scan the leaf entries in the initial CF tree to rebuild a small CF Tree while

removing more outliers and grouping more crowded sub clusters into larger

ones.

3. Global Clustering:

We use an existing global or semi global algorithm to cluster all the leaf

entries across the boundaries of different nodes. After this phase we obtain a

set of clusters that captures the major distribution of patterns in the data.

4. Cluster refining:

It uses the centroids of the clusters produced by the previous phase as seeds

and redistributes the data points to its closest seed to obtain a set of new

clusters. It also ensures that all the copies of the given data point goes to the

same cluster.

4.4.4 Comparison Using K-Means

Comparison of the performance of BIRCH clustering algorithm is done with another

clustering algorithm such as K-Means algorithm. The comparison is done by comparing the

time complexities and space complexities of the algorithms. The time complexities are



calculated by obtaining the difference between the starting and ending times while running

both the algorithms.

K-means is initialized from some random or approximate solution. Each step assigns

each point to its nearest cluster and then points belonging to the same cluster are averaged to

get new cluster centroids. Each step successively improves cluster centroids until they are

stable. This is the standard version of K-Means technique used. The algorithm is designed as

follows:

K-means algorithm based on classification technique uses horizontal aggregation as

input. Pivot operator is used to calculate the aggregation of particular data values from

distinct fact tables. Optimization provides for PIVOT for large number of fact table. The

database connectivity and choosing different tables with .mdb extension is the first step in

this system.

Horizontal aggregation can be evaluated by choosing transpose column and aggregate

operation .Pivot operator automatically transforms table to horizontal layout. This is the main

advantage of this particular algorithm.

The k-means algorithm is the best-known squared error based clustering algorithm

with input as horizontal aggregation The algorithm consist of mainly four steps.1) Selection

of the initial k means for k clusters from attribute of datasets obtained from horizontal

aggregation operation.2) Calculation of the dissimilarity between an object and the Mean of a

cluster.3) Allocation of an object to the cluster whose mean is nearest to the object.4) Re-

calculation of the mean of a cluster from the objects allocated to it so that the intra cluster

dissimilarity.

4.5 System Overview4.5.1 Use Case Description

Use Case 1:

Name: Horizontal Aggregation Module



Description: To horizontally aggregate a specified dataset so as to use it as input for the

clustering algorithms.

Flow of events:

a) Main flow

1. User logs in to SQLSERVER.

2. Connection is established.

3. The necessary datasets are loaded into SQLSERVER.

4. User is prompted for the required aggregating function, the aggregated column

and the transpose column.

5. The code for horizontal aggregation is executed.

6. Output is obtained and displayed in table format.

Activity diagram:

Figure 4.2:Activity Diagram


User views the result

Query is executed and horizontally aggregated table obtained

User is prompted for method, transpose and aggregate column

Loads tables into DB

User logs in to DB


Use Case Diagram :

User

Figure 4.3: Use Case Diagram for Horizontal Aggregation

Specific Requirements

SQL SERVER should contain the required datasets

Aggregate functions should be a valid one

Column to be aggregated should contain only numerical data

Pre conditions

The required dataset should be loaded into the SQLSERVER

Connection should be established between SQLSERVER and the Application.

Post conditions

The Horizontally aggregated output of the dataset is displayed in table format

Use Case 2:

Name: BIRCH Clustering Module

Description: To perform Clustering using BIRCH algorithm. Its performance is compared

with K-Means Algorithm.

Flow of events

a) Main flow:

1. Scan the horizontally aggregated data.

2. Create Clustering Features for each Data Point and insert it into the CF Tree.


Enter the column to be aggregated

Enter the aggregate function

Enter the transpose column

Enter the table name


3. Condense the CF Tree

4. Perform global clustering using any clustering algorithm such as K-Means.

5. Refine the output obtained

6. Check whether more clustering can be done on the data points

7. If yes, repeat the above steps from step 4 to 7

b) Alternative flow:

8.a: If no, then exit and display the clusters using a Bar Graph.

Activity Diagram:

17 HORIZONTAL AGGREGATION & HIERARCHICAL CLUSTERING Cluster more?

User logs in to DB

User obtains aggregated data

Initial Scanning

Create CF tree

Refine the tree

Perform Global Clustering

Condense the Clustering Feature Tree

yes


no

Figure 4.4: BIRCH Clustering

Use Case Diagram:

User

Figure 4.5: Use case diagram for BIRCH ALGORITHM

Post conditions

1. View and analyze the clustered results

4.6 Algorithm18 HORIZONTAL AGGREGATION & HIERARCHICAL CLUSTERING

Refine the tree

Perform Global Clustering

Condense the clustering feature

Create CF tree


Connection to the SQL Server

Step 1: Start.

Step 2: Connect the SQLSERVER.

Step 3: Load tables which is to be clustered into the database.

Step 4: Establish connection between the database and this application.

Step 5: Stop

Horizontal Aggregation Using Pivot

Step 1: Start.

Step 2: Read a valid table name and read the transpose column name.

Step 3: Read the column name which is to be aggregated.

Step 4: Read the name of the aggregate function which is to be applied on the column.

Step 5: Display the datasets in table format.

Step 6: Execute the query of PIVOT operator.

Step 7: Display the horizontally aggregated output in table format.

Step 8: Stop.

K-Means Algorithm

Step 1: Start.

Step 2: Initialize from some random or approximate solution.

Step 3: Assigns each point to its nearest cluster.

Step 4: Points belonging to the same cluster are averaged to get new cluster centroids.

Step 5: Repeat steps 2 and 3 successively which improves cluster centroids until they are stable.

Step 6: Stop.

BIRCH Algorithm

Step 1: Start.

Step 2: Scan all data and build an initial in-memory CF-tree using the given amount of

memory and recycling space on disk.



Step 2.1: Identifying the appropriate leaf: Starting from the root,, it recursively

descends the CF tree by choosing tile closest child node according to a chosen

distance metric D0.

Step 2.2: Modifying the leaf: When it reaches a leaf node, it finds the closest leaf

entry, say Li, and then tests whether Li can “absorb” “ Ent” without violating the

threshold conditions. If SO, the CF vector for Li is updated to reflect this, If not,, a

new entry for “Ent,” is added to the leaf. If there is space on the leaf for this new

entry, we are done, otherwise we must, .split the leaf node. Node splitting is done by

choosing the farthest pair of entries as seeds, and redistributing the remaining entries.

Step 2.3: Modifying the path to the leaf: After inserting “Ent” into a leaf, update the

CF information.

Step 3: Scan the leaf entries in the initial CF tree to rebuild a small CF Tree while removing

more outliers and grouping more crowded sub clusters into larger ones.

Step 4: Use an existing global or semi global algorithm to cluster all the leaf entries across

the boundaries of different nodes.

Step 5: Use the centroids of the clusters produced by the previous phase as seeds and

redistributes the data points to its closest seed to obtain a set of new clusters.

Step 6: Stop.

4.7. System Architecture


DATABASE

DATA SET 1

DATA SET 1

TREE CONDENSING

INITIAL SCANNING

HORIZONTAL AGGREGATION

USING PIVOT


Figure 4.6: System architecture

5. IMPLEMENTATIONImplementation is an activity that is contained throughout the development phase. It is

a process of bringing a developed system into operational use and turning it over to the user.

The new system and its components are to test in a structured and planned manner. A

successful system should be delivered and users should have confidence that the system

would work efficiently and effectively. The more complex the system being implemented the

more involved will be the system analysis and design effort required for implementation.

Implementation is the stage of the project when the theoretical design is turned into

the working system. It is the most crucial stage in achieving a successful new system and

giving the users confidence that the new system will work and be effective. The coding

process performs this task. Coding is viewed as a natural consequence of design.

5.1. Top down Design

A top-down approach is essentially the breaking down of a system to gain insight into

its compositional sub-systems. In a top-down approach an overview of the system is first

formulated, specifying but not detailing any first-level subsystems. Each subsystem is then

refined in yet greater detail, sometimes in many additional subsystem levels, until the entire

specification is reduced to base elements. A top-down model is often specified with the

assistance of "black boxes", these make it easier to manipulate. However, black boxes may



fail to elucidate elementary mechanisms or be detailed enough to realistically validate the

model.

Top-down approaches emphasize planning and a complete understanding of the

system. It is inherent that no coding can begin until a sufficient level of detail has been

reached in the design of at least some part of the system. The Top-Down Approach is done by

attaching the stubs in place of the module. This, however, delays testing of the ultimate

functional units of a system until significant design is complete. Bottom-up emphasizes

coding and early testing, which can begin as soon as the first module has been specified. This

approach, however, runs the risk that modules may be coded without having a clear idea of

how they link to other parts of the system, and that such linking may not be as easy as first

thought. Re-usability of code is one of the main benefits of the bottom-up approach.

Top-down is a programming style, the mainstay of traditional procedural languages,

in which design begins by specifying complex pieces and then dividing them into

successively smaller pieces. Eventually, the components are specific enough to be coded and

the program is written. This is the exact opposite of the bottom-up programming approach

which is common in object-oriented languages such as C++ or Java.

The technique for writing a program using top-down methods is to write a main

procedure that names all the major functions it will need. Later, the programming team looks

at the requirements of each of those functions and the process is repeated. These

compartmentalized sub-routines eventually will perform actions so simple they can be easily

and concisely coded. When all the various sub-routines have been coded the program is done.

By defining how the application comes together at a high level, lower level work can

be self-contained. By defining how the lower level abstractions are expected to integrate into

higher level ones, interfaces become clearly defined.

Practicing top-down programming has several advantages:

Separating the low level work from the higher level abstractions leads to a

modular design.



Modular design means development can be self-contained.

Having "skeleton" code illustrates clearly how low level modules integrate.

Fewer operations errors (to reduce errors, because each module has to be

processed separately, so programmers get large amount of time for

processing).

Much less time consuming (each programmer is only involved in a part of the

big project).

Very optimized way of processing (each programmer has to apply their own

knowledge and experience to their parts (modules), so the project will become

an optimized one).

Easy to maintain (if an error occurs in the output, it is easy to identify the

errors generated from which module of the entire program).

5.2. Bottom up Approach

A bottom-up approach is the piecing together of systems to give rise to grander

systems, thus making the original systems sub-systems of the emergent system. In a bottom-

up approach the individual base elements of the system are first specified in great detail.

These elements are then linked together to form larger subsystems, which then in turn are

linked, sometimes in many levels, until a complete top-level system is formed. This strategy

often resembles a "seed" model, whereby the beginnings are small but eventually grow in

complexity and completeness. However, "organic strategies" may result in a tangle of

elements and subsystems, developed in isolation and subject to local optimization as opposed

to meeting a global purpose

In a bottom-up approach the individual base elements of the system are first specified

in great detail. These elements are then linked together to form larger subsystems, which then

in turn are linked, sometimes in many levels, until a complete top-level system is formed.

This strategy often resembles a "seed" model, whereby the beginnings are small, but

eventually grow in complexity and completeness.



Object-oriented programming (OOP) is a programming paradigm that uses "objects"

to design applications and computer programs. This bottom-up approach has one weakness.

We need to use a lot of intuition to decide the functionality that is to be provided by the

module. If a system is to be built from existing system, this approach is more suitable as it

starts from some existing modules.

5.3. Software Development Process Model

A software development process, also known as a software development life cycle

(SDLC), is a structure imposed on the development of a software product. Similar terms

include software life cycle and software process. It is often considered a subset of systems

development life cycle. There are several models for such processes, each describing

approaches to a variety of tasks or activities that take place during the process. Some people

consider a life-cycle model a more general term and a software development process a more

specific term. For example, there are many specific software development processes that 'fit'

the spiral life-cycle model. ISO/IEC 12207 is an international standard for software life-cycle

processes. It aims to be the standard that defines all the tasks required for developing and

maintaining software.

There are many different models and methodologies, but each generally consists of a series of

defined steps or stages. Few popular models are:

Waterfall Model

V-Model

Spiral Model

Prototyping

Iterative (Incremental Delivery)

Rapid Application Development

The engineering process that we used to complete our project was the

V- model or the Validation model. This is an elaboration of Waterfall model and emphasizes

on validation activities corresponding to each of the activities that create the products of the

project. It can be seen as expanding the testing activity in the waterfall model. Each step has a

matching validation process that can give feedback (in case defects are found) to the



corresponding development stage and a reworking of the succeeding steps. For example, the

system designer might have written that a calculation be carried out in a certain way.

The person who has written the program design might have misunderstood what was

required and written the pseudo code in a different way. At system testing stage, the system

designer would carry out checks to ensure that the software is doing what was specified in the

design document and would discover the program designer’s misreading of that document.

So, all the steps from program design has to be repeated for correcting Defect.



Figure 5.1: V- Model

6. TESTING STRATEGIES



6.1 Black Box TestingBlack-box and white-box are test design methods. Black-box test design treats the

system as a "black-box", so it doesn't explicitly use knowledge of the internal structure.

Black-box test design is usually described as focusing on testing functional requirements.

Synonyms for black-box include: behavioral, functional, opaque-box, and closed-box. Black

Box Testing: It is also known as Skin Box was testing or Closed Box Testing.

Black testing takes an external perspective of the test object to derive test cases. These

tests can be functional or non-functional, though usually functional. The test designer selects

valid and invalid inputs and determines the correct output. There is no knowledge of the test

object's internal structure. The higher the level, and hence the bigger and more complex the

box, the more one is forced to use black box testing to simplify. While this method can

uncover unimplemented parts of the specification, one cannot be sure that all existent paths

are tested. Typical black box test design techniques include:

Decision table testing

Pairwise testing

State transition tables

Use case testing

Cross-functional testing

Black testing takes an external perspective of the test object to derive test cases. The

test designer selects valid and invalid inputs and determines the correct output. There is no

knowledge of the test object's internal structure. The higher the level, and hence the bigger

and more complex the box, the more one is forced to use black box testing to simplify. While

this method can uncover unimplemented parts of the specification, one cannot be sure that all

existent paths are tested.

6.2. White Box TestingWhite-box test design allows one to peek inside the "box", and it focuses specifically

on using internal knowledge of the software to guide the selection of test data. Synonyms for 27 HORIZONTAL AGGREGATION & HIERARCHICAL CLUSTERING


white-box include: structural, glass-box and clear-box. While black-box and white-box are

terms that are still in popular use, many people prefer the terms "behavioral" and "structural".

Behavioral test design is slightly different from black-box test design because the use of

internal knowledge isn't strictly forbidden, but it's still discouraged. In practice, it hasn't

proven useful to use a single test design method. One has to use a mixture of different

methods so that they aren't hindered by the limitations of a particular one. Some call this

"gray-box" or "translucent-box" test design, but others wish we'd stop talking about boxes

altogether. It is important to understand that these methods are used during the test design

phase, and their influence is hard to see in the tests once they're implemented. Note that any

level of testing (unit testing, system testing, etc.) can use any test design methods. Unit

testing is usually associated with structural test design, but this is because testers usually don't

have well-defined requirements at the unit level to validate. It is also known as Clear Box

Testing or Glass Box or Open Box. This is used to find internal functionalities of software

like conditional loops, statement coverage. It is mainly done by the Developers.

White box testing (clear box testing, glass box testing, and transparent box testing, or

structural testing) uses an internal perspective of the system to design test cases based on

internal structure. It requires programming skills to identify all paths through the software.

The tester chooses test case inputs to exercise paths through the code and determines the

appropriate outputs. In electrical hardware testing, every node in a circuit may be probed and

measured; an example Is in circuit testing(ICT).

Since the tests are based on the actual implementation, if the implementation changes,

the tests probably will need to change, too. For example ICT needs updates if component

values change, and needs modified/new fixture if the circuit changes. This adds financial

resistance to the change process, thus buggy products may stay buggy.

AUTOMATED OPTICAL INSPECTION (AOI) offers similar component level

correctness checking without the cost of ICT fixtures, however changes still require test

updates. While white box testing is applicable at the unit levels of the software testing

process, it is typically applied to the unit. While it normally tests paths within a unit, it can

also test paths between unit integration and testing. During integration, and between



subsystems during a system level test. Though this method of test design can uncover an

overwhelming number of test cases, it might not detect unimplemented parts of the

specification or missing requirements, but one can be sure that all paths through the test

object are executed.

Typical white box test design techniques include:

Control flow testing

Data flow testing

Branch testing

6.2.1 Module Testing

In computer programming, module testing is a software verification and validation

method in which a programmer tests if individual units of source code are fit for use. A unit

is the smallest testable part of an application. In procedural programming a unit may be an

individual function or procedure. Unit tests are typically written and run by software

developers to ensure that code meets its design and behaves as intended. Its implementation

can vary from being very manual (pencil and paper) to being formalized as part of build

automation The primary goal of module testing is to take the smallest piece of testable

software in the application, isolate it from the remainder of the code, and determine whether

it behaves exactly as you expect. Each unit is tested separately before integrating them into

modules to test the interfaces between modules. Unit testing has proven its value in that a

large percentage of defects are identified during its use.

The most common approach to unit testing requires drivers and stubs to be written.

The driver simulates a calling unit and the stub simulates a called unit. The investment of

developer time in this activity sometimes results in demoting unit testing to a lower level of

priority and that is almost always a mistake. Even though the drivers and stubs cost time and

money, unit testing provides some undeniable advantages. It allows for automation of the

testing process, reduces difficulties of discovering errors contained in more complex pieces

of the application, and test coverage is often enhanced because attention is given to each unit.

6.2.2 Integration Testing


http://en.wikipedia.org/wiki/Build_automation

http://en.wikipedia.org/wiki/Build_automation

http://en.wikipedia.org/wiki/Software_developer

http://en.wikipedia.org/wiki/Software_developer

http://en.wikipedia.org/wiki/Procedural_programming

http://en.wikipedia.org/wiki/Source_code

http://en.wikipedia.org/wiki/Computer_programming


Integrated design is a collaborative method for designing buildings which emphasizes

the development of a holistic design. Conventional building design usually involves a series

of hand-offs from owner to architect to builder to occupant. This path does not invite all

affected parties into the planning process, and therefore does not take into account their

needs, areas of expertise or insights. In some cases, using the conventional method,

incompatible elements of the design are not discovered until late in the process when it is

expensive to make changes. In addition to extensive collaboration, integrated design involves

a “whole building design” approach. A building is viewed as an interdependent system, as

opposed to an accumulation of its separate components (site, structure, systems and use). The

goal of looking at all the systems together to is make sure they work in harmony rather than

against each other.

6.3. Testing Strategies

Test Case 1:

Enter the table name PIVOT CONTENT

Enter the method, column to be aggregated and transpose column

6.4 Summary of Test Results

Case 1: PIVOTCONTENT is entered as table name. The dataset is displayed. Method

entered was sum, Column to be aggregated was A and transpose column was D2.

The aggregated output was displayed. The K-Means algorithm is executed and

displayed in table format and BIRCH clustering algorithm is displayed as a Bar

chart.



Figure 6.1: Optimising options

Figure 6.2: Comparison Options



Figure 6.3: PIVOTCONTENT


D1 D2 A

3 x 9

2 y 6

1 y 10

1 y 0

2 x 1

1 x -1

3 x 8

2 x 7


7. RESULTS AND DISCUSSION

The Application horizontally aggregated the datasets and produced the clustered output.

1. Results for the dataset PIVOTCONTENT :

Figure 7.1: Loading Table to be Aggregated



Figure 7.2: The Loaded Table

Figure 7.3: Method of Aggregation

Figure 7.4: Column to be Aggregated



Figure 7.5: Transpose Column

Figure 7.6: After Horizontal Aggregation



Figure 7.7: After K-Means Clustering

Figure 7.8: After Hierarchical BIRCH Clustering



Figure 7.9: Comparison of Hierarchical and BIRCH Clustering

The comparison between the results of the horizontal aggregation with pivot

operator and without the pivot operator was a tedious task hence comparison chart

for clustering with K-means and BIRCH was implemented.

The output after the BIRCH clustering was to be shown on an dot matrix graph.

Since it was difficult to be done ,a bar graph for BIRCH clustering output was

performed.

We had implemented horizontal aggregation and clustering on simple databases

since pivot operator works for simple ones



8. CONCLUSION

This project has a great chance of creating a revolution in the field of Data Mining.

Most of the data mining applications require data in summarized form.so the horizontal

aggregation of the data return aggregated columns in a horizontal layout. Most algorithms

require datasets with horizontal layout as input with several records and one variable or

dimensions per columns. Horizontal aggregation can be performing by using PIVOT operator

on tabular data that exchange rows, enable data transformations useful in data modelling, data

analysis, and data presentation.

The clustering of the data was done using the BIRCH algorithm which can handle

large amount of datasets and produces good clusters. It also removes all the outliers and

hence minimizes the running time. Thus the data clustering has wide applications in the field

of image processing, bioinformatics, market analysis etc.

We have implemented the horizontal aggregation using the PIVOT operator on

databases and the aggregated output was clustered using the BIRCH algorithm. The clustered

output results are displayed on a bar chart .The comparison between the clustering using

KMEANS and the BIRCH were performed .The results of the comparison was that ,time and

the space complexity of BIRCH clustering was very less compared to KMEANS. It proved to

be an efficient method for large datasets.



9. FUTURE ENHANCEMENT

The horizontal aggregation for methods min ,max ,count can be implemented in

future.

In future, the input to the system can be given as multiple tables rather than single

table.

This project can be extended to develop a more formal model of evaluation methods

to achieve better results. Also then we can be developing more complete cost effective

methods.



10. REFERENCES

Web references:

[1] www.hiv.lang.gov/content/sequence/HEATMAP/heatmap.html

[2] www.ines-conf.org/ines-conf/58-INES2004.pdf

[3]www.ijert.org/view.php?id=2378&title=horizontal-aggregations-in-sql-by-using-

clustering-as-a-mixture-gaussians

[4] www.ijarcsse.com/docs/papers/May2012/Volum2_issue5/V2I500441.pdf

Book references:

[1] Reda Alhajj : Advanced Data Mining and Applications , Copyright 2007

[2]Richard.T.Watson :Data Managements,Databases and Organizations,John Wiley& Sons

Publishing, Inc,Canada, Copyright 2002

[3]S.Sumathi.S.Esakkirajan: Fundamentals of Relational Database Management

Systems,Springer-Verlag Berlin Heidelberg,2007

[4] Charu C.Aggarwal,chandan K.Reddy: Data Clustering:Algorithms and Applications,CRC

press,2014.

[5] Jiawei Han,Micheline Kamber:Data Mining:Concepts and Techniques,Academic

press,2001

[6] Reda Alhajj : Advanced Data Mining and Applications , Copyright 2007


http://www.hiv.lang.gov/content/sequence/HEATMAP/heatmap.html


APPENDIX A

Uml Diagram

Activity Diagrams

Describes how activities are coordinated. Is particularly useful when you know that an operation has to achieve a

number of different things, and you want to model what the essential

dependencies between them are, before you decide in what order to do them. Records the dependencies between activities, such as which things can happen

in parallel and what must be finished before something else can start. Represents the workflow of the process.

Activity diagrams commonly contain

Activity states and action states Transitions Objects

Action states are atomic and cannot be decomposed. Work of the action state is not

interrupted

Activity states can be further decomposed

Their activity being represented by other activity diagrams

They may be interrupted when the action or activity of a state completes, flow of control

passes immediately to the next action or activity state

A flow of control has to start and end someplace initial state -- a solid ball stop state -- a solid ball inside a circle

A branch specifies alternate paths taken based on some Boolean expression.A branch

may have one incoming transition and two or more outgoing ones



Use Case Diagrams

Use case diagrams overview the usage requirements for a system. They are useful for

presentations to management and/or project stakeholders, but for actual development you will

find that provide significantly more value because they describe "the meat" of the actual

requirements. Use case diagrams depict:

Use cases: A use case describes a sequence of actions that provide something

of measurable value to an actor and is drawn as a horizontal ellipse.

Actors: An actor is a person, organization, or external system that plays a role

in one or more interactions with your system. Actors are drawn as stick

figures.

Associations: Associations between actors and use cases are indicated in use

case diagrams by solid lines. An association exists whenever an actor is

involved with an interaction described by a use case.

System boundary boxes (optional): You can draw a rectangle around the use

cases, called the system boundary box, to indicate the scope of your system.

Anything within the box represents functionality that is in scope and anything

outside the box is not.

Packages (optional): Packages are UML constructs that enable you to

organize model elements (such as use cases) into groups. Packages are

depicted as file folders and can be used on any of the UML diagrams,

including both use case diagrams and class diagrams.


horizontal aggregation and hierarchical clustering

Documents

Transcript of horizontal aggregation and hierarchical clustering