Page 1
Graphical Visualisation of a Website's Structure
Author: Greg J Preece
Project Period: Sept 2007 – May 2008
Department: School of Computing
Project Supervisor: Dr Roy Ruddle
The candidate confirms that the work submitted is their own and the appropriate credit has been given
where reference has been made to the work of others.
I understand that failure to attribute material which is obtained from another source may be
considered as plagiarism.
(Signature of Student)_______________________________________________
Page 2
Contents
1. Research and Investigation Page 4
1.1 Aim Page 5
1.2 Objectives Page 5
1.3 Minimum Requirements Page 5
1.4 Possible Extensions Page 6
1.5 Project Schedule Page 6
1.5.1 Milestones Page 6
1.6 Background Research Page 7
1.7 2D Visualisations Page 7
1.7.1 Vertical Trees Page 7
1.7.2 Horizontal Trees Page 9
1.7.3 Expandable Trees Page 9
1.7.4 Circular Trees Page 9
1.8 3D Visualisations Page 10
1.8.1 3D “Molecular” Trees Page 10
1.8.2 3D Cone Trees Page 11
1.9 Visualisation Techniques Page 11
1.9.1 Picture-in-picture Page 11
1.9.2 Fish-eye diagrams Page 11
1.10 Chosen Method Page 12
2. Software Implementation Page 13
2.1 Requirements of System Page 14
2.1.1 Technical Requirements Page 14
2.1.2 Empirical Requirements Page 14
2.2 Programming Language/Platform Page 15
2.3 Visualisation Methodologies Page 16
2.3.1 Manual Creation Page 16
2.3.2 The Java Universal Network/Graph Framework Page 16
2.3.3 The Prefuse Visualisation Toolkit Page 17
2.4 Design Methodology Page 17
2.4.1 First Iteration – Prototype Page 18
2.4.2 Second Iteration – Production Page 18
Page 3
2.4.3 Third Iteration – Extension Page 18
3. First Iteration – Development of Prototype Page 20
3.1 Aim Page 20
3.2 Key Design Elements Page 20
3.2.1 Use of the Factory Pattern in FileParserFactory Page 20
3.3 Utility Libraries Page 21
3.3.1 Log4J Page 21
3.3.2 JUnit Page 22
3.3.3 jMimeMagic Page 22
3.4 Evaluation Page 22
3.5 Tagging Page 23
4. Second Iteration: Development of Production System Page 24
4.1 Aim Page 24
4.2 Key Design Changes Page 24
4.2.1 Use of BalloonTreeLayout Page 24
4.2.2 Overview Window Page 24
4.2.3 Parse Status Observers/Multi-threading Page 25
4.3 Evaluation Page 26
5. Evaluation Page 27
5.1 Technical Evaluation Page 28
5.2 Empirical Evaluation Page 32
6. Bibliography and References Page 34
7. Appendix A: Personal Reflection Page 36
Page 5
1. Introduction
1.1 Aim
The aim of this project is to create a piece of software that is capable of clearly and accurately
visualising the structure of a large website. A website's structure can be defined as the layout of the
hyperlinks that connect all its pages together into a cohesive entity. These links can be both
unidirectional and bi-directional, and link structures often loop back on themselves. Also, these links
can sometimes be grouped according to purpose or subject matter by analysing the filenames and/or
content of the pages they link. Once these links have been catalogued and categorised accordingly,
they form an interconnecting network-like structure, similar to that of a computer network. My goal
is to create an effective visualisation of this structure that is both useful in the information it provides,
as well as being easy to view and navigate.
1.2 Objectives
The objectives of the project are to:
1. Develop a technique for accurately and effectively graphically visualising the link structure of
a website, such that it can be navigated by a user.
2. Build a piece of software based on this technique that can read in the link structure of a
website from a data source, and render an appropriate visualisation of that structure.
(Please note that the objectives/minimum requirements outlined here will have changed from the
initial submission, as the original version assumed the use of a three-dimensional interface, along with
OpenGL, which was an error. The type of interface used, as well as the languages/libraries with
which it will be constructed, are to be decided during the course of the project.)
1.3 Minimum Requirements
The minimum requirements of this project are:
1. To find an effective technique for visualising a website's link structure.
2. To develop a piece of software that can read in link data for a website and create a
visualisation for it.
Page 6
3. To accurately assess the effectiveness of my developed software.
1.4 Extensions
Possible extensions for this project include:
1. The creation of an XML spider/parser that can take a Uniform Resource Identifier (URI), go to the
website at that URI, and fetch the link data itself, which will then be read in by the second
stage software.
2. User controls to allow different types of user to display different information according to
their interest. For example, a developer might want page load times, number of inbound
links, etc, to be displayed, whereas a browsing user might simply want to see the structure and
use external linking to visit the pages on the web.
3. Page preview functionality, where the contents of a page within the visualisation are
represented alongside its entry within the visualisation.
1.5 Project Schedule
Task Task Dates Time Elapsed
Preliminary Investigation/Research 22/10/2007 - 22/11/2007 4 Weeks
Write Initial Project Report 16/11/2007 - 22/11/2007 1 Week
Write Final Mid-Project Report 22/11/2007 – 06/12/2007 2 Weeks
Personal Training for Software Development 22/10/2007 – 26/11/2007 5 Weeks
Planning for Prototype Development 26/11/2007 – 30/11/2007 5 Days
Prototype Development (1st Iteration) 01/12/2007 – 14/12/2007 2 Weeks
Prototype Testing and Evaluation 15/12/2007 – 07/01/2008 3 ½ Weeks
2nd Iteration Planning 19/01/2008 – 01/02/2008 2 Weeks
2nd Iteration Development 01/02/2008 – 01/03/2008 1 Month
2nd Iteration Evaluation 01/03/2008 – 14/03/2008 2 Weeks
Final Write-Up 14/03/2008 - 14/04/2008 1 Month
1.5.1 Milestones
Begin Development of First Iteration: 1st December 2007
Mid Project Report Submitted: 7th December 2007
Finish Prototype: 14th December 2007
Page 7
Begin Development of Second Iteration: 1st February 2008
Finish Second Iteration: 1st March 2008
Finish Write-up: 14th April 2008
1.6 Background Research
1.6.1 Basis for Comparison
In order to effectively judge different visualisation techniques by their strengths and weaknesses, it is
necessary to first note the criteria on which these judgements will be made. There are several
important aims that a visualisation must strive towards in order to be effective, and these are noted
below:
1. The visualisation must strive for clarity from all perspectives as often as possible. Where
there are multiple nodes/symbols in a 2D or 2.5D visualisation, they should not overlap each
other, especially if there is associated text. In 3D diagrams, it should be possible at all times
for an observer to discern one node, symbol or link from another.
2. If the visualisation is interactive, it must be easy to navigate. This is especially difficult in 3D
spaces, as a lack of outside points of reference can make orienting a viewpoint difficult, and
returning to a particular spot is not always easy.
3. The visualisation should preferably show some clear definition between different types of
symbol, link etc. Colour coding is often a good way to achieve this. A user should be
immediately able to discern between internal and external webpages, for example (assuming
that this separation has been made).
With these loose criteria in mind, I began to investigate previous attempts to solve the problem.
1.7 2D Visualisations
1.7.1 Vertical Trees
Vertical tree systems are a once-common method of visualising website structures. By organising the
Diagram 1: An example of a Balloon Tree graph layout, from [12]
Page 8
web pages in this hierarchical fashion, it is possible to quickly see the parent pages and their child
links. This form of tree is normally good for navigational work, as the quickest path to a child page
can be easily discerned.
However, this form of tree has its problems. While the conceptual structure of many websites is
hierarchical, with the index page leading out to child pages, and so on, the actual link structure does
not mirror this, with navigational controls causing many links to become bi-directional. If bi-
directional links are represented on this kind of visualisation, they are often not immediately apparent.
If they are not represented, then that structural information is lost.
A more fundamental and obvious problem, however, arises when a page “layer” becomes too wide
for the available space. As noted in [1], as websites grow in size, and the number of 2nd-level, 3rd-
level etc pages increases, the page icons can no longer fit alongside each other in the space provided.
There are also many more pages in each layer than there are layers, so the tree tends to grow
horizontally, the direction in which it already suffers.
This problem is compounded if the icons feature page names, as they often do, because the text
greatly increases the width of the icons, and they inevitably end up overlaying each other. This leads
to these trees often being rotated into horizontal trees, as detailed below. (In scripts written
vertically, such as Japanese Katakana/Hiragana, the opposite applies.)
As a tool for finding navigational routes, a tree such as this has its uses, but is almost always better
represented as a horizontal tree.
Diagram 2: An example of a vertical tree, taken from [12]
Page 9
1.7.2 Horizontal Trees
These operate in the same way as vertical trees, but are shown from left-to-right. This allows a
greater number of pages to be represented in each layer, as they occupy less space horizontally.
They still suffer from the problem of there being far more pages in each layer than there are layers,
so vertical scrolling is often a feature of these trees.
Again, while they are useful for tracing routes, they are too cluttered and vertically expansive for
effective navigation of the visualisation.
1.7.3 Expandable Trees
In an expandable tree, the horizontal tree is modified so that a given link subset is only expanded at
the user's request. This reduces both the horizontal and the vertical scrolling over the standard tree
structure. This greatly aids navigation of the tree, and textual labels can usually be displayed in full,
so the clarity of the tree is good.
However, bi-directional links are still difficult to represent, and if the user is deep within the tree,
they may lose orientation within the structure, as they cannot see it all at once. Also, links between
pages in different subsections of the tree cannot be represented, where they might be on an expanded,
interconnected tree. As a result, the clarity of the structure comes at the cost of some structural data.
1.7.4 Circular Trees
Diagram 4: An example of an expandable tree, taken from the Explorer program, in my own installation of Microsoft Windows XP.
Diagram 3: An example of a 3D cone tree, taken from [13]
Page 10
Circular trees are also known as radial trees, with a subset known as “balloon trees.” In a circular
tree, the starting page is placed at the centre, and linked pages are branched out around it. As links are
discovered from these pages, they are branched out in turn, and so on. This creates an initially very
clear structure, where the links between each node are immediately apparent. The layers on a
horizontal/vertical tree become rings in this structure, and interlinking between them is also simple, so
the full link structure is also visible.
However, there are downsides to a circular tree structure. They generally take up a lot of space due to
the spacing between rings, so ease of navigation becomes an issue, as the user could easily lose their
orientation within a structure that is larger than the screen space available. Also, placing textual labels
at nodes inevitably leads to overlay, making this text difficult to read and reducing the diagram's
clarity.
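The ring geometry described above is straightforward to compute. The sketch below is a generic, hedged illustration (not code from any of the surveyed tools; the class and method names are my own): nodes at depth d are placed on a ring of radius d × spacing, evenly spaced by angle.

```java
// Illustrative radial-tree placement: each node sits on a ring whose radius is
// proportional to its depth, at an angle set by its position within that ring.
public class RadialPlacement {

    /** x/y position for node `index` of `ringSize` nodes on ring `depth`. */
    static double[] position(int depth, int index, int ringSize, double ringSpacing) {
        double radius = depth * ringSpacing;                 // ring for this depth
        double angle = 2.0 * Math.PI * index / ringSize;     // even angular spacing
        return new double[] { radius * Math.cos(angle), radius * Math.sin(angle) };
    }

    public static void main(String[] args) {
        // Place 4 pages evenly on the second ring, with 100px between rings.
        for (int i = 0; i < 4; i++) {
            double[] p = position(2, i, 4, 100.0);
            System.out.printf("node %d: (%.1f, %.1f)%n", i, p[0], p[1]);
        }
    }
}
```

This also makes the stated space problem visible: ring circumference grows only linearly with depth, so crowded outer rings and label overlap are inherent to the layout.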
1.8 3D Visualisations
1.8.1 3D “Molecular” Trees
These are a three-dimensional version of the Circular Tree structures discussed above. By utilising
depth to increase the available space, three-dimensional structures can display a larger amount of
data at once, and as such are often used for large websites where thousands of pages are being
spidered. Prime examples of this can be found in [3] and [4], where huge amounts of data are
visualised in a complex three-dimensional structure, allowing users to conceptually visualise
massive structures.
Diagram 5: A molecular graph from WebTracer2
Diagram 6: A screenshot of Astra SiteManager, showing the inset "Pan Window."
Page 11
However, this ability to display colossal data sizes comes at a cost. Navigation of these structures is
often difficult – a prime example being in WebTracer2, an excellent visualisation tool that is
handicapped by its frankly hideous control system – and a user's orientation with the greater structure
can be easily lost. Also, overlapping of nodes and text is inevitable, and with alpha blending used to
make the presence of a deeper structure apparent, these diagrams can quickly become confusing if not
properly implemented. Finally, 3D visualisations such as this, especially those making use of alpha
blending, require large amounts of processing power/hardware acceleration in order to function.
1.8.2 3D Cone Trees
Cone Trees are a direct mapping of hierarchical 2D tree structures into a 3D space. They were
designed to increase the number of pages that could be displayed within a layer, without changing the
overall design of the tree. They accomplish this well, and many more pages can be represented
onscreen at once. However, because one page will be behind another, data labels will be obscured,
leading to some loss of clarity. Also, as with the trees this structure is based on, links between pages
in different layers are not represented. This said, these cone trees are often easier to navigate than
molecular trees, as they have a tightly ordered structure.
1.9 Visualisation Techniques
1.9.1 Picture-in-Picture
During my research, I investigated Astra SiteManager. This program utilises a 2D Circular Tree
structure, as detailed above. However, in order to aid navigation in larger visualisations,
SiteManager features* a picture-in-picture display, where a small overview of the whole structure is
constantly displayed, along with the user's current viewpoint within it.
By utilising this method, the program effectively prevented users from losing their place in the overall
structure, and greatly improved navigation within a visualisation.
* Or rather, “featured” – Astra SiteManager was discontinued several years ago. I have acquired a copy to use in my project evaluation.
1.9.2 Fish-Eye Diagrams
Page 12
Fish-eye diagrams are (generally) circular tree visualisations that operate by enlarging the user's
current focus area, along with directly linked pages, in the centre of the screen. Other pages are
pushed to the outer edges. This allows the user to see their currently focused page with much greater
clarity, but might in turn reduce the clarity of page clusters that are not linked to the current focus.
1.10 Chosen Method
After reviewing the different types of visualisation available, I have decided upon a circular tree
structure for displaying my link structures, utilising a picture-in-picture navigational aid, as used in
Astra SiteManager. My initial preference was for a 3D molecular tree structure, as I thought it would
be effective for the large websites I will be displaying. However, after using tools such as
WebTracer2, I realised that navigating in these 3D spaces is often extremely difficult, and returning to
a previously selected node is a challenge in itself.
Other structures are simply unsuitable for the amount of data I will be displaying. With target
websites in excess of 30,000 pages, hierarchical trees would simply grow beyond the point where they
are still manageable, growing exponentially in one direction or the other, depending on their
orientation. This also applies to Cone Trees, though they would not suffer as badly.
Naturally, this structure will work much better at higher resolutions, but with the picture-in-picture
navigational aid, it is my hope that it will still be usable at lower resolutions.
Page 14
2. Implementation: Key Design Decisions
2.1 Requirements of System
Having now completed my background research, prior to building the software solution, it is
important to lay out several key requirements for the software to meet if it is to be considered a
success. As this is a visualisation solution, it makes sense to break these requirements down into two
sections: technical requirements, which deal with the software's efficiency and ability to provide a
suitable user experience under stress; and empirical requirements, which deal with the program's
visual aspects, such as whether nodes are overlapping onscreen, and so on.
2.1.1 Technical Requirements
I believe that, in order for the system to be considered usable, it must:
1. Be able to parse in data sets at a minimum of 10 000 records per second. Given the
specification of the workstation on which the software will be developed, I believe this to be
an achievable number. Parsing at 10 000 records per second ensures that even very large
graphs load quickly, with half a million nodes being loaded in under a minute. This is
important for user productivity.
2. The software must be able to produce a minimum frame rate of ten frames per second when
rendered.
3. The software must be able to cope with data sets containing half a million nodes. While a
small minority of websites have more nodes than this (a larger percentage than normal in the
education sector), the vast majority of websites will not, and I believe half a million nodes is
an achievable target within the scope of this project.
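The load-time claim in requirement 1 is simple arithmetic, and can be made explicit. This tiny check is purely illustrative (the class and method names are mine, not part of the system):

```java
// Illustrative check of the parsing requirement: at 10,000 records per second,
// a 500,000-node data set loads in 50 seconds, comfortably under a minute.
public class ParseBudget {
    static double secondsToParse(int records, int recordsPerSecond) {
        return (double) records / recordsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(secondsToParse(500000, 10000)); // prints 50.0
    }
}
```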
2.1.2 Empirical Requirements
While my own knowledge allows me to easily define technical requirements for the system, in order
to define empirical requirements I researched the work of others in the field, particularly the writings
of Edward Tufte [15] [16]. In [15] he describes several qualities required for what he calls “graphical
excellence,” which consists of “complex ideas communicated with clarity, precision and efficiency.” I
have utilised his deductions, where relevant, in the creation of my empirical requirements list.
Page 15
1. The display should show the loaded data set as a graph. This seems to be stating the obvious,
but it is an essential requirement of the software, and should not be overlooked.
2. The display should show many nodes onscreen (“many,” because the size of the screen used may
vary) without nodes overlapping or edges crossing, as this could lead to confusion on the part of
the user.
3. The display should allow the user to gain a conceptual overview of the website, including
node clusters/groups, etc.
2.2 Programming Language/Platform
The first and most important decision to be made is what programming language I should use to
develop the program. I have several options available to me from my skillset: PHP, C++ and Java.
Firstly, I have a great deal of experience with the PHP scripting language, having worked with it for 8
years. This would enable me to quickly create the application without needing to learn new
programming skills or techniques, especially if I were to use a pre-written visualisation library, such
as JpGraph [7]. Also, as PHP scripts execute once per request and then terminate, a visualisation
could be built and output quickly in a single pass. However, as PHP requires a configured PHP
environment or web server in order to run, this would limit its portability. Also, once displayed, the
script-based nature of PHP may limit user interactivity. It would take a large amount of AJAX coding
to make the visualisation as interactive as a desktop equivalent.
C++ is a second option. This language is portable (with a re-compile) across platforms, and as it is a
widely used programming language, pre-built visualisation libraries exist, such as VTK [8]. It would
allow for greater interaction than PHP, and is faster than Java once compiled, as it does not require the
use of a runtime. However, my experience in C++ is not as great as the other two languages, and I am
unsure that my abilities in this language are up to the task.
As a result, Java is the natural option for me when developing this program. It is highly portable by
design, it has many third-party libraries available to assist with development, and its memory
management abilities will help efficiently manage large visualisations whilst keeping performance
loss to a minimum. I also have a good deal of experience in the language, so I can ensure that the
design and structure of the program is appropriate and open to extension.
Page 16
To assist me in authoring the program, I will be using the IntelliJ IDEA Integrated Development
Environment (IDE). This IDE, I believe, is superior to rival IDEs in many ways, not least its
integrated support for Subversion (which I will also be using for version control), Ant and other
development tools.
All the software created will be developed and tested on my custom built workstation. For
comparative purposes, this machine runs on two dual-core Opteron 270s @ 2GHz per core, has 4GB
of DDR-400 RAM, and utilises two nVidia GeForce 7900 GTX graphics cards to power a 3840x1024
super-widescreen desktop.
2.3 Visualisation Methodologies
After a short amount of research, I have identified two possible ways of proceeding with the creation
of the main visualisation. These are manual creation, or utilisation of an existing visualisation
framework.
2.3.1 Manual Creation
Manually creating the code required to output a full visualisation would allow for the greatest level of
control over the final product. It would allow me to create a solution dedicated to a given type of
visualisation. This would likely result in a more efficient, dedicated set of visualisation algorithms
being used, as well as ensuring there are no limitations in the program's capability caused by the
capabilities of a library. However, it would also take a great deal of time and effort to implement such
a solution, and given that 3rd party libraries are likely to have been created over a period of time by a
team of developers, it is possible their implementations would be superior to my own. I have
experimented with several such libraries/frameworks, and have noted my conclusions here:
2.3.2 The Java Universal Network/Graph Framework
The Java Universal Network/Graph Framework (JUNG) [10] is a 3rd party Java framework for
creating graphs from pre-computed data sets. Output display objects are created from pre-written
classes, associated with one of several possible layout objects, each representing a graph type, and
then these layouts have nodes/edges added to them. This allows for a quick, easy creation process, and
additional controls can be added to the visualisation at will from the pool provided.
Page 17
However, as JUNG is an open source project, its documentation is somewhat lacking, and its user
controls are somewhat clunky, with demo applications requiring the user to switch from
“transforming” to “picking” mode in order to select a node for examination. Getting around the latter
of these two problems may be a greater challenge because of the existence of the former.
2.3.3 The Prefuse Visualisation Toolkit
Despite its name, Prefuse [9] is, like JUNG, a framework for quickly developing Java-based
visualisations of collected data. Like JUNG, it features pre-written classes to represent nodes and
edges, as well as implementations of various layout algorithms to create the output.
Where Prefuse differs significantly from JUNG is in its usability. User controls are built directly into
the visualisation engine and are very easy to use with minimal learning required – holding the left
mouse button and dragging moves the graph around, and holding the right mouse button zooms
the viewpoint in and out. This simplicity in both implementation and use is very appealing.
As well as being superior to JUNG in terms of usability, Prefuse has native support for applying
multiple viewpoints to a single map, which would be very useful for developing the “picture-in-
picture” overview in Iteration 2. It also appears to have far more detailed documentation than the
JUNG framework, which should help reduce the amount of time required to familiarise myself with
its inner workings.
I have decided to utilise the Prefuse framework within my project, to speed development and provide
a good balance between accuracy, speed and usability.
2.4 Design Methodology
Once I had decided on a software platform, frameworks and so forth, the final step was to decide on
the methodology I would be subscribing to in order to develop the software, and produce an
approximate schedule/task list.
The first methodology available to me, and one that I have used in the past, is the waterfall model.
However, the waterfall model is quite static. As noted in [18], “there is no formal way to make
changes to the project as requirements change” – you only have one chance to get the project right,
and if you fail, there is no going back.
Page 18
Another possible methodology I am familiar with is the Top-Down model, where functionality is built
into a project incrementally. The downside of this is that vital functionality, rather than peripheral
functionality, could be being built into the project right up to the end of its life. Also, there are no
distinct phases in each cycle for project management and assessment.
To organise my development of the program, I will be using an approximation of the iterative
waterfall development methodology, similar in spirit to the spiral model. This methodology allows for
multiple iterations of a solution to be developed, each more advanced than the last, until the software
is complete, much like the top-down model. However, each phase is organised as if it were a self-
contained (if somewhat abbreviated) waterfall model. Each iteration will consist of design,
implementation and evaluation phases, with each evaluation forming a basis for the design phase of
the next iteration. The aims of each iteration are shown below:
2.4.1 First Iteration – Prototype
This will be a basic working system, accepting data from a cropped Academic Web Link Database
file, parsing it into the appropriate data structure, and displaying a basic interconnected radial graph.
2.4.2 Second Iteration – Production
This will be a fully featured system, with all the available functionality outlined in the research phase.
It will be able to take an AWLD file of any size, parse it into a data structure, and display a full
interconnected graph. Parsing progress will be reported via the use of a status window, and this will
also allow for benchmarking of the software. An overview window will allow for quick navigation
across large structures.
2.4.3 Third Iteration - Extension
If time allows, I will attempt a third iteration of the software. This third iteration will expand on the
functionality of the production version by allowing users to “spider” their own link database of a
website, and then display that website onscreen.
This will be done by adding a multi-threaded spider program, which will index the pages on a given
website and store the link data using a custom XML schema. A new XML parser will be added to the
Page 19
display program, to read in these data files to the appropriate data structures. From there, the existing
program structure will be able to render the graph in exactly the same way as the AWLD files.
Page 20
3. First Iteration: Development of Prototype
3.1 Aim
The aim of this first iterative cycle is to develop a prototype of the visualisation program. It will
display a small graph based on data extracted from a larger sample, and will have only basic
functionality, featuring no overview window, status information, etc.
3.2 Key Design Elements
Rather than cluttering this report with a full class diagram, UML test cases and so forth, I have elected
to document here only the key design aspects of the system, to show the main considerations and
features of the software's development, as well as outline the division between my own creations and
the Prefuse framework.
3.2.1 Use of the Factory Pattern in FileParserFactory
One of the key design principles in modern Object Oriented programming is that the code written must
be flexible, and allow for extension. As noted by Freeman and Freeman [11], OO programmers
should “program to an interface, not an implementation,” in order to ease future re-use and extension.
As the program could (and will) be extended to accept data from sources other than the Academic
Web Link Database, it makes sense to create a FileParser interface for file parsers to adhere to, and
use a factory to select the correct parser at runtime based on the characteristics of the target file or
data source. This takes full advantage of Java's polymorphic abilities, allowing new parsers to be
added to the system without changing the rest of the code.
For example, if a plain text file were provided, with the *.txt extension and the text/plain MIME type,
the parser factory would select the WebLinkParser class, as the file is most likely from the Academic
Web Link Database. However, if an invalid file were provided, the factory would attempt to find an
appropriate parser, and throw a NoValidParserException if none were found. If a new file format
needs to be supported, a new parser class can be written, implementing the same FileParser interface,
and added to the FileParserFactory class. The rest of the project code will see the new parser as
simply a FileParser implementation, and will be able to utilise it without modification.By utilising the
factory pattern in this way, I can ensure that, if I have time to complete the third development
Page 21
iteration, I can add support for a new file format without altering the code or architecture of the
existing software. This also removes the need for many of the tests already completed in the first
iteration, and helps to make the software easier to maintain.
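The arrangement described above can be sketched in plain Java. This is a hedged illustration only: the type names follow the report's UML, but the method bodies and the extension-based selection are stand-ins (the real factory also consults the MIME type, and the real parse methods return a prefuse Tree rather than a String).

```java
import java.io.File;

// Sketch of the FileParser interface and the factory that selects a parser.
interface FileParser {
    String parse(File file); // stand-in: the real interface builds a link tree
}

class NoValidParserException extends RuntimeException {
    NoValidParserException(String message) { super(message); }
}

// Parser for Academic Web Link Database files.
class WebLinkParser implements FileParser {
    public String parse(File file) {
        return "parsed AWLD file: " + file.getName(); // stand-in for tree building
    }
}

class FileParserFactory {
    // Select a parser from the file's characteristics; throw if none matches.
    static FileParser getFileParser(File inputFile) {
        if (inputFile.getName().endsWith(".txt")) {
            return new WebLinkParser(); // text/plain => most likely AWLD data
        }
        throw new NoValidParserException("No parser for " + inputFile.getName());
    }
}

public class FactoryDemo {
    public static void main(String[] args) {
        // Calling code sees only the FileParser interface, never the concrete class.
        FileParser parser = FileParserFactory.getFileParser(new File("links.txt"));
        System.out.println(parser.parse(new File("links.txt")));
    }
}
```

Adding a new format then means writing one new FileParser implementation and registering it in the factory; callers are untouched.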
3.3 Utility Libraries
3.3.1 Log4J
To assist in logging user interaction with the program, as well as providing important debugging and
benchmarking data, I will be utilising the Apache Log4J library. This is
Diagram 7: A UML fragment showing the relationship between the FileParserFactory and its products.
<<FileParser>>
  +parse(input:String):Tree
  +parse(file:File):Tree
  +registerObserver(observer:ParseObserver):void
  +removeObserver(observer:ParseObserver):void
  +notifyObservers():void
WebLinkParser
  -logger:Logger
  -observers:ArrayList<ParseObserver>
  -currentFile:String
  -totalRecords:int
  -currentRecord:int
  -currentNode:String
AnotherParser
  -logger:Logger
  -observers:ArrayList<ParseObserver>
  -currentFile:String
  -totalRecords:int
  -currentRecord:int
  -currentNode:String
FileParserFactory
  +getFileParser(REQUIRED_PARSER:int):FileParser
  +getFileParser(inputFile:File):FileParser
Page 22
a logging library that is very easy to set up, utilising its own properties file. It also features several
pre-defined “levels” of logging, such as “DEBUG” and “WARN,” that allow me to
utilise the library for more advanced purposes. By switching the logger into debug mode, I can dump
timestamp information to the logs that will allow me to benchmark the software in the evaluation
phase.
(Log4J also requires Apache Commons Lang and Lucene to operate.)
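A minimal properties file of the kind described might look like the following. This is a hedged sketch: the appender name and log file name are illustrative, though the syntax shown is the standard Log4J 1.x properties format.

```properties
# Root logger at DEBUG level, writing to a single file appender named "file"
log4j.rootLogger=DEBUG, file

log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=visualiser.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
# ISO8601 timestamps make the log usable for benchmarking parse times
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```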
3.3.2 JUnit
As with most projects of this nature, I will be utilising the JUnit testing framework during the
software's construction to ensure that, on a coding level at least, it functions correctly.
3.3.3 jMimeMagic
I will be using the jMimeMagic library within the FileParserFactory to detect the MIME type of files
read into the software. Once this MIME type has been found, it can be used, among other factors, to
find the right parser to return.
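Putting these pieces together, the factory's selection logic might look like the following sketch. Since the report does not show jMimeMagic's API, the MIME detection step is stubbed out with a simple extension map, and the MIME type strings and the PlainTextParser class are hypothetical; only FileParserFactory.getFileParser comes from the UML fragment above.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the parser hierarchy described in Diagram 7.
interface FileParser {
    String describe();
}

class WebLinkParser implements FileParser {
    public String describe() { return "WebLinkParser"; }
}

class PlainTextParser implements FileParser { // hypothetical second format
    public String describe() { return "PlainTextParser"; }
}

// Factory: callers never name a concrete parser class, so support for a new
// file format can be added here without altering the rest of the program.
class FileParserFactory {
    // In the real system jMimeMagic would supply the MIME type; a simple
    // extension lookup stands in for it here.
    private static final Map<String, String> MIME_BY_EXTENSION = new HashMap<>();
    static {
        MIME_BY_EXTENSION.put("awld", "text/x-weblink"); // hypothetical MIME type
        MIME_BY_EXTENSION.put("txt", "text/plain");
    }

    public static FileParser getFileParser(File inputFile) {
        String name = inputFile.getName();
        String ext = name.substring(name.lastIndexOf('.') + 1).toLowerCase();
        String mime = MIME_BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
        switch (mime) {
            case "text/x-weblink": return new WebLinkParser();
            case "text/plain":     return new PlainTextParser();
            default: throw new IllegalArgumentException("No parser for " + mime);
        }
    }
}
```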
3.4 Evaluation
Having constructed a basic version of the software, I am encouraged by the results of the process. By
sticking to my initial design conventions and referencing the Prefuse documentation where necessary,
I was able to quickly construct a working prototype of the final software system.
To speed up the initial build and give myself time to become familiar with Prefuse, I initially used a
ForceDirectedLayout layout manager to set out the graph, as suggested in the Prefuse tutorial.
This worked well, and the program accepts a small link sample and displays it correctly. User
controls are as simple as expected, and the entire graph can be easily manipulated via the mouse.
Frame rates are high, and in terms of usability the system works exceedingly well.
However, the ForceDirectedLayout class uses weighted edges within the graph, and then uses a
ForceSimulator class to work out how the nodes repel each other, the spring forces present within the
graph, drag coefficients, and so on. These values are then used to animate the graph as the forces
move the nodes around.
This is an unnecessary step in my intended design, as all edges within my graph have the same
weight, making such calculations redundant. The ForceSimulator had minimal effect on the small
data set I used while constructing the software, but an initial test with a full-size data set of 3000
records rendered the visualisation unusable. The frame rate dropped so low that the program stopped
responding to user input, and I was forced to terminate it via the system process list.
To increase the efficiency of the system so that these problems are not encountered in the production
version, I have elected to move away from ForceDirectedLayout and opt for a BalloonTreeLayout
instead. This still provides the radial graph I want, with the same level of control as before, but does
not use any animation or force simulation. The unweighted graph should require far fewer
calculations per frame than previously, reducing the likelihood of the low frame rates encountered
when manipulating large graphs.
Note that this will not completely remove the problem of low frame rates, or "lag," when displaying
large graphs. It is always possible to overload such a system to the point where the available
processing power becomes insufficient to maintain a smooth frame rate. My aim is to make this as
unlikely as possible.
3.5 Tagging
The first iteration prototype has been tagged within the Subversion system as version 0.1. This tagged
copy will be used for comparison against later versions, and for archive purposes.
4. Second Iteration: Development of Production System
4.1 Aim
In the second iteration, I will attempt to create a fully featured, production version of the required
software, based on my prototype from Iteration One. In addition to the functionality of the first
version, this production version will feature an overview window for ease of navigation, status
reporting, benchmarking, and a more efficient graph rendering system utilising the
BalloonTreeLayout class described previously.
4.2 Key Design Changes
4.2.1 Use of BalloonTreeLayout
The change of layout manager from ForceDirectedLayout to BalloonTreeLayout has already been
documented, but it should be mentioned again here. The BalloonTreeLayout will remove many
unnecessary calculations from the rendering of a single frame, leading to a much higher frame rate.
One impact of this change is that the existing parser code will need to be modified slightly. Where
ForceDirectedLayout could accept any interconnected graph, a BalloonTreeLayout will only accept a
single Tree structure, so the return type of the parsers must be modified.
This has, incidentally, exposed a flaw in my previous parsing code. Due to a mistake in the prototype
code, the parsers were returning many small, separated graph structures instead of one large structure.
The ForceDirectedLayout had been compensating for this flaw, masking it from view. In the second
iteration this bug will be fixed, and the parsers will return a single Tree structure containing all
subtrees.
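The fix can be sketched as follows: every parsed subtree is attached to one synthetic root, so the layout receives a single Tree rather than a forest. The Node and TreeMerger classes here are simplified stand-ins for Prefuse's own structures, not the actual parser code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a node in Prefuse's tree structure.
class Node {
    final String name;
    final List<Node> children = new ArrayList<>();
    Node(String name) { this.name = name; }
    void addChild(Node child) { children.add(child); }
    int size() { // count this node plus all of its descendants
        int total = 1;
        for (Node c : children) total += c.size();
        return total;
    }
}

class TreeMerger {
    // Attach each disconnected subtree to a single synthetic root, so a
    // layout that accepts only one Tree (such as a balloon tree) can be used.
    static Node mergeSubtrees(List<Node> subtreeRoots) {
        Node root = new Node("site-root");
        for (Node subtree : subtreeRoots) root.addChild(subtree);
        return root;
    }
}
```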
4.2.2 Overview Window
As the Prefuse framework allows multiple visualisations of the same data set to be performed
simultaneously, and multiple displays of a visualisation to be rendered simultaneously, the creation of
an overview window is surprisingly simple. By creating a JDialog instance and using the existing
visualisation configuration methods from the prototype to create a new display on it, I can very
quickly create a dialog that contains a copy of the main window's contents at a different zoom level.
Of course, the JDialog instance will be extended to provide the different user controls expected in the
overview window.
4.2.3 Parse Status Observers/Multi-threading
One major change within the software structure implements the Observer pattern to provide feedback
on the progress of a file's parsing. Methods will be added to the FileParser interface, forcing parsers
that implement it to allow observers to be registered, and to notify those observers appropriately. A
new interface will then be defined for the observers to follow, though it will have only one method,
allowing observers to receive status updates from the FileParser implementations and act
appropriately.
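As a sketch, the additions might look like the following; the name and parameters of the observer interface's single method are assumptions, since the report does not list them.

```java
import java.util.ArrayList;
import java.util.List;

// Single-method observer interface: implementations receive progress updates.
interface ParseObserver {
    void parseUpdated(int recordsParsed, int totalRecords);
}

// A parser supporting observer registration and notification, as the
// extended FileParser interface requires.
class ObservableParser {
    private final List<ParseObserver> observers = new ArrayList<>();
    private int currentRecord = 0;
    private int totalRecords = 0;

    void registerObserver(ParseObserver o) { observers.add(o); }
    void removeObserver(ParseObserver o) { observers.remove(o); }

    void notifyObservers() {
        for (ParseObserver o : observers) {
            o.parseUpdated(currentRecord, totalRecords);
        }
    }

    // Simulated parse loop: in the real parser each record would be read
    // from the link file before observers are notified.
    void parse(int records) {
        totalRecords = records;
        for (currentRecord = 1; currentRecord <= records; currentRecord++) {
            notifyObservers();
        }
    }
}
```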
The main functional purpose of this implementation will be to provide visual feedback to the user on
the progression of a parsing process. A status dialog box, complete with JProgressBar instances, will
show the user the percentage of the process complete and how far there is to go. Visual feedback such
as this is important to the user to assure them that the program is functioning correctly – to simply
display a blank screen while the file is parsed in would inevitably lead to confusion.
There is a second, more covert purpose behind the observer pattern implementation. By utilising
Log4J within an observer, it becomes possible to dump timestamp information to the program's logs,
essentially creating benchmarking data without a great deal of effort. I will use this method to
benchmark the AWLD parser in the evaluation phase, and a similar methodology could be used in the
third iteration to determine the frames-per-second rendering speed of the visualisation under load.
However, creating the ParseObserver interface has led to a perhaps unexpected but necessary
offshoot. Running a large task within the main Swing event-handling thread, such as reading in and
parsing a file, blocks the thread from performing any other actions until the task is complete. This
prevents the GUI from responding to user requests, and impairs its ability to handle interrupts or
repaint windows (such as the dialog box containing the progress monitor). This is clearly not an
acceptable situation, so through necessity the parsers will now run within their own thread, reporting
back to status observers in the main thread as the parse progresses. Fortunately, because the thread
separation is used only to prevent blocking, and the threads share little state, the problems often
associated with multi-threading, such as deadlock, should not arise.
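A minimal sketch of that arrangement follows. In the real Swing application the callback body would be wrapped in SwingUtilities.invokeLater(...) so the GUI is only touched from the event dispatch thread; a plain callback is used here, and all names are illustrative.

```java
// Run a long parse on a worker thread so the calling thread stays responsive.
class BackgroundParse {
    interface DoneCallback { void finished(int recordsParsed); }

    static Thread start(final int records, final DoneCallback callback) {
        Thread worker = new Thread(() -> {
            int parsed = 0;
            for (int i = 0; i < records; i++) parsed++; // stand-in for real work
            // In a Swing app: SwingUtilities.invokeLater(() -> callback.finished(...));
            callback.finished(parsed);
        });
        worker.start(); // the caller returns immediately and remains free
        return worker;
    }
}
```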
4.3 Evaluation
One issue that must be immediately raised concerns the amount of time taken to complete the second
iteration. Unfortunately, this second implementation phase took a lot longer to complete than initially
anticipated. The reason behind this is both simple and frustrating. When researching the Prefuse
framework, and utilising it during the first development iteration, I had noted Prefuse's large
documentation base. It has since transpired that while the contents index and first chapter of the
documentation were complete, the rest of the documentation, when accessed, was listed as under
construction. As a result, I was forced to spend far more time than expected familiarising myself with
the more advanced aspects of the framework in order to complete the software. This in turn led to an
over-run on the project timetable that has removed the possibility of a third development iteration. I
am very disappointed by this, but in future I will check that all the documentation for a library or
project is actually present, so I regard the mistake as a learning experience.
Once I had overcome this hurdle, the software's development progressed as before, and the final
product works well. Moving to the static layout manager produced the expected increase in
efficiency. To test the software initially, I loaded the same 3000-node link file as in the first iteration.
Previously, the software had locked up and I had been forced to abort the process manually. I was
extremely pleased, therefore, to see that the second-iteration software was not only able to render the
graph, but maintained a usable frame rate of around 10 to 15 frames per second. This is excellent, and
gives me great confidence for later benchmarking.
The overview window was created with all the simplicity expected, and functions well. Now
wherever the user is in the main window, and at whatever zoom level, they can easily move to another
node by selecting it in the overview. The parser status monitor also works as expected, although I did
tweak its operation slightly. When I initially developed the observer implementation, it called the
notifyObservers() method after each node was parsed, and the observers were updated accordingly.
However, the overhead of updating every observer for every node made the parsing process
significantly slower than before. As a result, I modified the parser to notify observers only once every
five, then ten, and eventually one hundred nodes, rather than at every node. This freed processing
time for the parsing itself and massively increased parse speed under the observer implementation.
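The final arrangement can be sketched as a modulus check inside the parse loop, so observers hear about every hundredth record (and always the last one). The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

class BatchedParser {
    interface ProgressObserver { void progress(int done, int total); }

    static final int NOTIFY_INTERVAL = 100; // records between notifications
    private final List<ProgressObserver> observers = new ArrayList<>();

    void registerObserver(ProgressObserver o) { observers.add(o); }

    // Notify only every NOTIFY_INTERVAL records, plus once at the very end,
    // so observer overhead no longer dominates the parse itself.
    void parse(int totalRecords) {
        for (int done = 1; done <= totalRecords; done++) {
            if (done % NOTIFY_INTERVAL == 0 || done == totalRecords) {
                for (ProgressObserver o : observers) o.progress(done, totalRecords);
            }
        }
    }
}
```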
5. Evaluation
The aim of this section of the report is to accurately assess the effectiveness of the software created in
achieving the goals of the project. To make a full assessment, the evaluation will be split into several
sections:
1. Technical evaluation, where the software's ability to quickly read in, parse and render a
graph is tested, along with its ability to maintain a usable frame rate.
2. User/empirical evaluation, where the software's visualisation is judged against a set of
predetermined criteria to see if it would be useful to an end user.
3. Personal evaluation, where the success of the overall project is assessed.
To re-iterate from the research phase, all tests on the software will be carried out on a quad-core
Opteron system with 4GB RAM, two GeForce 7900GTX graphics cards and a 3840x1024 desktop
area.
5.1 Technical Evaluation
To test how the system copes with increasingly large data sets, I have selected several such sets from
the AWLD sample. I will load each of these sets into the software and observe how the system copes
with the load placed on it – specifically with regard to how quickly the data is parsed in, how much
RAM is used by the program while running, and what frame rate is achieved when navigating the
visualisation. I can then use this collected data to assess the technical merits of the system.
The data sets I have chosen for the assessment are:
Chichester College, at 3980 entries
Harper Adams University College, at 18860 entries
Liverpool Hope University, at 80664 entries
The University of Bournemouth, at 182467 entries
The University of Coventry, at 352143 entries
The University of Glamorgan, at 533613 entries
The University of Exeter, at 722513 entries
The University of Cardiff, at 964995 entries
The University of Edinburgh, at 2011756 entries
The University of Cambridge, at 3267262 entries.
These ten data sets should give me a good spread of data over the available range, and allow me to
make a statistical analysis of the program's effectiveness. I have run the initial benchmarking on each
data set, and the results are shown below:
Data Set        Entry Count  Parse Time (ms)  Parse Rate (records/s)  Memory Footprint (MB)  Frames per Second
Chichester             3980       250              15920                  63.82                 25
Harper Adams          18860       750              25147                  82.1                  20
Liverpool Hope        80664      1875              43021                 161.1                  21
Bournemouth          182467      3313              55076                 258.1                  15
Coventry             352143      6407              54962                 445.6                  14
Glamorgan            533613      9843              54212                 448.4                  14
Exeter               722513     13485              53579                 614.3                  10
Cardiff              964995     18140              53197                 875.5                   7
Edinburgh           2011756     45656              44063                1124.4                   3
Cambridge           3267262         -                  -                     -                   -
(Note that the memory footprint of the program when idle is 42MB, running inside the IDE.)
You will notice that no statistics are shown for the last data set. This is because the file proved simply
too big to load, and the Java heap overflowed. Even after increasing the heap size to the maximum
possible under my IDE, this continued to happen. This is a limitation of the software that should be
considered: Java installations with a standard heap size, or computers without large amounts of
available RAM, may not be able to display the larger visualisations. However, I do not consider this a
major flaw in the software solution, as it is rooted in the Java Virtual Machine, and users attempting
to parse graphs with over 3 million nodes are likely to understand that a more powerful machine is
required.
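For reference, the JVM's heap ceiling can be raised at launch with the standard -Xmx flag; the jar and data file names below are hypothetical:

```shell
# Allow the JVM a 2 GB heap instead of the default
java -Xmx2048m -jar visualiser.jar large-dataset.awld
```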
At first glance, the statistics appear to progress as expected, with parse time increasing with the
number of records entered. However, to ensure that no anomalies are missed, I have produced several
visual depictions of the data, shown overleaf.
Diagram 8: A chart showing the time required to parse a data set given its size (time taken in ms
against number of entries in file).
Diagram 9: A chart showing the memory footprint required for each data set (memory footprint in
MB against number of entries in file).
As you can see from the charts, the strain that the visualisation places on the host computer increases
almost linearly with the size of the data file parsed, precisely as one would expect, though a slight
increase in the gradient of diagrams 8 and 10 at the higher end of the scale might suggest that
additional pressures are encountered with very large data sets.
While it is only within the remit of this project for me to speculate on the reason for this, it is
possible that as the memory footprint of the program becomes very large, around 1GB or so, more
CPU time is required by the operating system to manage the data correctly. The impact of this would
be felt far more strongly on computers whose installed RAM is smaller than the Java heap requires.
In such cases, large amounts of virtual memory may be used, and data would be read back and forth
from virtual memory in order to render the visualisation. This would affect both parse time and FPS
rates, as virtual memory is far slower to access than physical RAM, and could lead to "disk
thrashing," in which data is constantly read from and written to the hard drive. (For an explanation of
disk thrashing, see [14].)
Diagram 10: A chart showing the achieved FPS rate from each data set test (frames per second
against entries in data file).
With regard to the technical requirements laid out at the beginning of the implementation phase, the
software is a success. The target parsing rate was 10000 records per second on a high-spec
workstation. This was comfortably exceeded, with parse rates hovering around the 50000-records-
per-second mark. (Very small data sets parse in under a second, so the additional file-handling
overhead makes their apparent rates lower; this is not a genuine slowdown.)
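To make the figure concrete, the parse rate in the table is simply records divided by elapsed seconds; taking the Bournemouth set as a worked example:

```latex
\text{parse rate} = \frac{\text{records}}{\text{parse time (s)}}
                  = \frac{182467}{3.313\,\text{s}} \approx 55076\ \text{records per second}
```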
Again, the minimum FPS rate for the software to be considered useful was set at 10 frames per
second. Thanks to Prefuse's built-in boundary culling, this frame rate was sustained until over seven
hundred thousand nodes were present in the display. Beyond that point the rendering could no longer
be called usable, but the software continued to display the graph, and user interactions still
functioned. This also demonstrates that the software passes the third technical requirement: that it
must be able to parse in and display at least half a million nodes. I am pleased that the software
exceeded the requirements, and it can therefore be considered a technical success.
On reflection, were I to run the project again with more time available to me, I might have chosen to
build the software with JOGL/OpenGL after all, to take advantage of its hardware acceleration
capabilities.
5.2 Empirical Evaluation
With regards to the empirical/aesthetic aspects of the visualisation, I regard it largely as a success,
with a few points for improvement. Firstly, the system does indeed display the visualisation data in
the expected way, with all the site nodes interconnected. The visualisation can be navigated, panned
and zoomed using the mouse controls.
Utilising the BalloonTreeLayout, the nodes of the graph are laid out appropriately, with enough
distance between them to make distinction of nodes easy. The way in which child nodes are laid out
around their parent allows for easy conceptualisation of node clusters, which relates in physical terms
to the underlying file structure of the website. This enables users to see how the files of the website
relate to each other in terms of a navigational structure, categories/content sections, etc, which would
not be possible with a simple file listing or web page resource.
However, there is a slight flaw in the visualisation as it stands, which means that in some cases it
does not meet the second requirement: that no nodes overlap and no edges cross. As the full file path
of a node is used as its display name, these paths are sometimes quite long, and if a large number of
them are arranged around a single parent node, the node labels overlap.
As it stands, however, I believe that overlaps of this type are fairly inevitable in graphs of this nature,
and that the only way to prevent them entirely in all circumstances would be to increase the distance
between child and parent nodes significantly. Such an increase would risk a visual disassociation of
the data, i.e. the relationship between a parent and its immediate children may not be as readily
apparent as before, which would break the third requirement of the visualisation. It would also
reduce the amount of data onscreen at a given time, in essence wasting screen real estate. This, I
believe, would be less acceptable than the occasional node overlap, and so the issue becomes one of
compromise for best results.
6. Bibliography and References
1. Bajaj, Chandrajit, Data Visualization Techniques, Sussex, John Wiley & Sons (Sept 1999).
2. Chen, Chaomei, Information Visualization: Beyond the Horizon (Second Edition), Springer-
Verlag London Ltd (2004).
3. Dodge, Martin, An Atlas of Cyberspace
(http://www.cybergeography.org/atlas/web_sites.html) (2004).
4. Dodge, Martin & Kitchin, Rob, Mapping Cyberspace, Routledge (Oct 2000)
5. Spence, Robert, Information Visualisation, New York, ACM Press Books (2001).
6. Statistical Cybermetrics Research Group, Academic Web Link Database Project.
Wolverhampton University, England, http://cybermetrics.wlv.ac.uk/database/index.html (17th
Nov 2007).
7. Aditus Consulting, What is JpGraph? (http://www.aditus.nu/jpgraph/) (2007)
8. Martin, Schroeder and Lorensen, What is VTK? (http://public.kitware.com/VTK/what-is-vtk.php)
(2007).
9. Heer, Card and Landay, prefuse: a toolkit for interactive information visualisation, University
of California/ACM Press (2005)
10. Madadhain, Fisher, Nelson, Overview, Java Universal Network/Graph Framework
(http://jung.sourceforge.net/index.html) (2007)
11. Freeman, Freeman, Head First Design Patterns, O'Reilly Media (2004)
12. Herman, Melancon, Marshall, Graph Visualisation and Navigation in Information
Visualisation, Centre of Mathematics and Computer Sciences, Amsterdam (1998)
13. Benelli, Caporali, Rizzo, Rubegni, Design concepts for learning spatial relationships, University
of Siena, Italy (date not recorded)
14. Markatos, Dramitinos, Using Remote Memory to avoid Disk Thrashing: A Simulation Study,
ICS, FORTH, Science and Technology Park of Crete Vassilika Vouton (1996)
15. Tufte, The Visual Display of Quantitative Information, 2nd Edition, Graphics Press,
Connecticut (2002)
16. Tufte, Visual Explanations, Graphics Press, Connecticut (1997)
17. Sorensen, A Comparison of Software Development Methodologies, Software Technology
Support Center (1995)
18. Purcell, Comparison of Software Development Lifecycle Methodologies, The SANS Institute
(2007)
Appendix A: Reflection
Over the course of my project, I have encountered several challenges and problems, which I have
attempted to overcome to the best of my ability. Some areas of the project progressed well, whereas
in others serious problems arose, which eventually culminated in the project being behind schedule.
Looking back on the experience, I can outline several key failings where improvements would have
helped alleviate some of the scheduling problems.
My key failure, and the one that led to the greatest delay, was in not correctly investigating the Prefuse
framework before attempting to utilise it within my code base. When comparing the various
frameworks that were available for my use, I read through their various websites, checked for the
existence and apparent completeness of documentation, and tried the sample applications available.
However, when later attempting to integrate with the framework, I discovered that the documentation
for Prefuse was fully indexed, but most of it was not present or complete, making the majority of it
useless. By this time, I had already integrated Prefuse to the point where I could not go back and
choose another framework. This led to considerable delay in the implementation phase, as I was
forced to devote time to reading the API documentation, and experimenting with small test
applications in order to complete development of the software.
In future, I would devote more time to the initial investigation, to ensure that such a mistake would
not be repeated. To do this, I could produce small test applications in each available library or
framework, both to familiarise myself with them, and to assess how easy they are to work with. This
would also have prevented more hidden problems, such as the one I encountered with documentation.
I am also displeased with my timekeeping abilities. Several times, due to procrastination or poor
timekeeping, I found myself working close to project deadlines, or overrunning the deadlines I had
set for myself. I believe that if I had kept to a strict working timetable earlier in the project, it would
have helped alleviate my problems in the implementation phase.
I was surprised at this change in myself, frankly. I have worked in the industry previously and I do
not normally have such poor timekeeping. In future, I should ensure that it does not happen again,
perhaps by setting out a set time in the week at which to work, rather than working in my spare time.
By creating a set working schedule for myself and ensuring I stick to it, I can improve my
timekeeping and reduce missed deadlines.
Overall, I am unsatisfied with my progress on the project. However, I am pleased with the solution
produced. I have learned how to apply recently acquired skills in Java, and I was satisfied with my
ability to look “under the bonnet” of Prefuse despite a lack of full user documentation. I am satisfied
that the code written is efficient, meets the requirements of the project, and is relatively robust,
generally only failing when placed under extreme strain.
In the future, I must learn to place as much emphasis on research and investigation as I do on
implementation. I believe that this character flaw stems from my past experience as a programmer,
which was largely self-taught as a hobbyist. It appears that – at least in part – I have not quite
outgrown my experimental philosophy of “steal it, break it, mend it,” referring to the way in which I
used to learn advanced coding techniques, by acquiring another programmer's code, often at random,
and modifying it until I understood it. While this is all well and good within the realms of the
hobbyist, in a professional project I must learn to have the patience to research a subject fully before
diving in.
With these lessons in mind, I believe that this project has made me a better programmer, if for
no other reason than it has made me appreciate the value of following established guidelines
when developing a software solution, and that patience is as important as productivity.