The School of Mathematics

Data Visualisation of Scottish Demographic Information

by

Graeme Taylor

Dissertation Presented for the Degree of

MSc in Operational Research

August 2014

Supervised by

Dr Belen Martin-Barragan and Dr Esther Roughsedge

Abstract

This project explores the use of interactive visualisations to augment the extensive data published by the National Records of Scotland. Good visualisation can illustrate key trends in statistical data, increasing impact and accessibility; great visualisation can go further, and enable us to identify and explore unexpected connections. Data visualisations can therefore support operational research, but we will see that producing them also entails solving problems of an OR flavour.

We survey the existing literature for principles of good design in presenting data visually; much of this is aimed at hand-produced imagery for print, so we examine how it can be best used in the new context of procedurally-generated, interactive visualisations for the web. In the first instance, we consider this for chart types which have proven popular or successful for static visualisations, particularly if already used by NRS.

This leads us to investigate more complicated data sets which can be interpreted as having a graph-theoretic structure. We will show how the constrained layout of networks of vertices with an associated size can be posed as an optimisation problem, and develop a visualisation that operates under such constraints. Further, we will consider the use of geographic clustering to represent migration flow, describing and implementing a novel ‘re-wiring’ algorithm to generate tree structures that produce better visualisations than standard agglomerative approaches.

Finally, we present a portfolio of visualisations created for NRS that follow the design principles identified and make use of the software tools developed during the project.

Own Work Declaration

I declare that this thesis was composed by myself and that the work contained therein is my own, except where explicitly stated otherwise in the text.

Edinburgh, August 18, 2014

Contents

1 Data Visualisation
1.1 Terminology
1.2 D3
1.3 Design principles
1.3.1 Graphical Integrity
1.3.2 Use of Colour
1.3.3 Transitions

2 Charts
2.1 Small Multiples
2.2 Tree Maps
2.3 Choropleth Maps
2.4 Frequency Plots

3 Graphs
3.1 Introduction
3.2 Graph Layout
3.3 Flow Maps
3.3.1 Flow graphs
3.3.2 Star graphs
3.3.3 Hierarchical clustering
3.3.4 Flow graph layout
3.3.5 Rewiring
3.4 Chord Diagrams

4 Portfolio
4.1 Migration Flow Map
4.2 The Cause of death explorer
4.3 Cause of Death Treemap
4.4 Distributions with cohort effects: Fertility Data

5 Conclusions

Bibliography

Appendices

A Guide to Electronic Appendices
A.1 Flow Map Construction
A.2 Cause of Death Explorer
A.3 Cause of Death Zoomable Treemap
A.4 Fertility Data (cohort effects)
A.5 Popular Names
A.6 Life Expectancy
A.7 Gender distribution by age (Frequency plot)
A.8 Migration within Scotland (Chord Diagram)

List of Figures

1.1 The 2011 State of the Union Address
1.2 If Bush Tax Cuts Expire
1.3 A misleading rainbow.

2.1 Air pollution in Southern California.
2.2 Slice-and-dice treemap of Scottish population data.
2.3 Squarified treemap of Scottish population data.
2.4 Example of tile placement in the squarified algorithm.
2.5 The Singing Mondrian.
2.6 Choropleth map of migration to Scotland.
2.7 Number of males and females per 100 centenarians, Scotland 2012.
2.8 Possible outcomes of breast cancer screening.

3.1 Four views of the Petersen Graph.
3.2 Non-planar and planar diagrams for the same graph.
3.3 Radius versus bounding box packing of circles.
3.4 Minard’s map of exports of French wine in 1864.
3.5 Minard’s 1861 visualisation of Napoleon’s Russian campaign of 1812.
3.6 Selecting branch point location to minimise chart ink.
3.7 Computer generated flow maps of migration from California 1995-2000
3.8 Star-graph flow map of migration to Scotland.
3.9 Bounding-box dendrogram for sources of migration to Scotland.
3.10 Algorithmically-generated flow map of migration to Scotland
3.11 User-adjusted flow map of migration to Scotland.
3.12 Re-assigning root for sources of migration to Scotland.
3.13 Visualizing information flow in science.
3.14 Chord diagram of migration within Scotland 2011-2012.
3.15 Chord diagram of migration between Scottish councils and the rest of the UK 2011-2012.
3.16 Chord diagrams with internal flow.

4.1 User-adjusted rewired flow map of migration to Scotland.
4.2 Cause of death data
4.3 The Cause of Death Explorer.
4.4 Treemaps of cause of death data.
4.5 Live births per 1,000 women, by age, selected years.
4.6 Interactive visualisation of fertility data.

Chapter 1

Data Visualisation

We live in an era of big data; in a typical minute, a hundred hours of footage are added to YouTube [44], whilst over 200 million emails are exchanged [37]. But data is not the same as information, and only after processing to make it meaningful to an intended audience can it qualify as the latter. Even then, much of it simply passes us by. The World Bank makes tens of thousands of its reports - often produced with the goal of influencing government policy or public debate - available online, but estimates that nearly a third have never been downloaded [10].

McCandless, in the introduction to Information is Beautiful [23], describes being “swamped by information [and] searching for a better way to see it all and understand it”. The key phrase, perhaps, is ‘to see’; as he did, we will consider ways in which visualisation of data can both direct our attention to interesting content, and help us understand it better.

We will focus on data produced and maintained by the National Records of Scotland (NRS). Formed out of the merger of the General Register Office (GRO) and the National Archives, the NRS “plays a central role in the cultural, social and economic life of Scotland, supporting several of the Scottish Government’s key National Outcomes and measuring its Population Purpose Target” [28]. In particular, it performs “the registration and statistical functions of the Registrar General of Scotland, including responsibility for demographic statistics and census”; it is this demographic data from the GRO that we will draw upon.

This report is structured as follows. In the remainder of this chapter we fix terminology for some key concepts; introduce and justify use of the Javascript library D3, which will be used for creating visualisations; and identify a series of guiding principles for good design. In Chapter 2 we consider a variety of popular ‘static’ visualisation techniques, and how they can be translated to an interactive context. This will already require us to consider problems of mathematical optimisation to satisfy our design goals. In Chapter 3, this interplay between design and mathematics is pushed further, as we examine how data sets can be interpreted as graphs. We will pose graph layout as an optimisation problem, and extend this to create ‘flow maps’ of tree structures. These are created from a clustering and ‘re-wiring’ process we developed and implemented for this visualisation task. Combining the ideas of earlier chapters, a selection of visualisations developed from NRS data sets is presented in Chapter 4, with further examples in the electronic appendices. We conclude with a summary of our methodology and the project outcomes in Chapter 5.

1.1 Terminology

Data visualisation is multi-disciplinary in nature, drawing upon (amongst others) mathematics, statistics, computer science, art and design. But this can lead to competing notation and terminology, so for convenience we will fix a few definitions here. In particular, by ‘graph’ we will mean a collection of vertices linked by edges, rather than plots such as time series; the latter we consider an example of a ‘chart’. Similarly, we will use ‘graphical’ to describe methods that use graphs to understand data, rather than visual depictions in general, and avoid reference to computer graphics entirely.

By a ‘data visualisation’ we will mean a representation of a data set as an image, generated automatically by algorithm. This is in contrast to images produced ‘by hand’, and thus requiring manual intervention by the designer to update for new data values (sometimes described as ‘infographics’). We will describe a data visualisation as ‘static’ if the end result is an image, and as ‘interactive’ if the end result is software, with features that allow the user to manipulate the image and explore the data further.

For a fixed data set, an interactive visualisation may effectively consist of a great many possible static visualisations of lower-dimensional subsets. The user selects from these by choosing certain control parameters; we will describe this as taking a ‘slice’ of the data set through the corresponding point in parameter space. We may also animate motion along a control dimension by iteratively presenting slices; in effect, time becomes one of the display dimensions of our data visualisation, and we can think of the individual static visualisations as frames of a movie. This is a powerful advantage of interactive visualisations over static ones; however, care must be taken to establish whether there might be a better static visualisation capable of capturing ‘the big picture’ all at once.
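To make the idea of a slice concrete, the following is a minimal Javascript sketch under stated assumptions: the record fields (year, region, value), the toy values and the helper names takeSlice and framesAlong are all hypothetical, chosen for illustration, and are not drawn from NRS data or from the project code.

```javascript
// Toy records (made-up values, for illustration only).
const records = [
  { year: 2011, region: "A", value: 1 },
  { year: 2011, region: "B", value: 2 },
  { year: 2012, region: "A", value: 3 },
  { year: 2012, region: "B", value: 4 }
];

// Take a slice: keep the rows that match every fixed control parameter.
function takeSlice(rows, fixed) {
  return rows.filter(row =>
    Object.entries(fixed).every(([key, value]) => row[key] === value));
}

// Build the 'frames of a movie' along one control dimension,
// optionally holding other parameters fixed.
function framesAlong(rows, dimension, fixed = {}) {
  const compare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
  const values = [...new Set(rows.map(row => row[dimension]))].sort(compare);
  return values.map(value => ({
    value,
    slice: takeSlice(rows, { ...fixed, [dimension]: value })
  }));
}

// One frame per year for the hypothetical region "A".
const frames = framesAlong(records, "year", { region: "A" });
console.log(frames.map(frame => frame.value));
```

Each frame here is a lower-dimensional subset that could be handed to a static chart; presenting the frames in order corresponds to animating along the chosen control dimension.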

1.2 D3

Data-Driven Documents, introduced in [4] and almost always referred to simply as D3, is a Javascript library for creating data visualisations (static or interactive) that can be accessed with a standard web browser. Moreover, it does so by building upon the existing frameworks for modern web content. A well-designed web page draws a distinction between content (tagged using HTML) and style (determined by CSS); interactivity is typically enabled through use of Javascript. These technologies and others can be interwoven thanks to a shared representation of the resulting page, the Document Object Model (DOM). So, for instance, an HTML fragment may specify that there should be a particular line of ‘header’ text; the CSS will specify what a header looks like; and Javascript can then monitor this object to trigger an action when the user clicks on it. D3 extends this by allowing existing objects to be bound to data values drawn from elsewhere, with new objects created or old ones destroyed as the data set changes. Object styling can then be determined by the associated values, which can be updated based on user actions.
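The binding of objects to data can be sketched in plain Javascript as a keyed join. This is an illustration of the idea only, not D3's implementation: the function joinByKey and the sample items are hypothetical, although the names enter, update and exit follow the terms D3 uses for the three resulting groups.

```javascript
// Compare the data currently on the page with an incoming data set,
// matching items by a key function.
function joinByKey(existing, incoming, key) {
  const oldByKey = new Map(existing.map(d => [key(d), d]));
  const newKeys = new Set(incoming.map(key));
  return {
    enter: incoming.filter(d => !oldByKey.has(key(d))), // objects to create
    update: incoming.filter(d => oldByKey.has(key(d))), // objects to restyle
    exit: existing.filter(d => !newKeys.has(key(d)))    // objects to destroy
  };
}

// Hypothetical example: item "a" disappears, "b" changes value, "c" is new.
const onPage = [{ id: "a", v: 1 }, { id: "b", v: 2 }];
const latest = [{ id: "b", v: 5 }, { id: "c", v: 7 }];
const joined = joinByKey(onPage, latest, d => d.id);
console.log(joined.enter, joined.update, joined.exit);
```

Matching on a key rather than on array position means an object keeps its identity when the data is reordered, which matters once changes are animated.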

By building D3 visualisations we ensure access to the existing audience for online content currently produced by NRS. Visualisations in earlier web-based systems such as Prefuse (2005) or Flare (2007) were developed in other languages, and required plug-ins to be rendered on the page (Java and Flash respectively). This requirement is detrimental to accessibility, particularly in the long term; plug-in support can erode due to security concerns (by default, a modern Java installation will block access to applets that would have been treated as safe just a few years ago), platform incompatibility (Flash is not supported by Apple iOS devices, for instance) or simply obsolescence (NRS data sets may remain of interest for decades, so data should not be locked up in proprietary formats).

A potential concern is that working in D3 is more difficult than working in systems such as Flare, which can offer a higher level of abstraction. We will not attempt to give a tutorial on D3 here, as the best approach will likely depend on background. Starting from programming experience but only passing familiarity with HTML/CSS and no experience of Javascript, the author found the guides [22] and [26] a useful introduction. From there, it was easy to modify or adapt to new data the wealth of examples available online¹ before finally constructing new work from scratch. We may hope for some of our own visualisations to be similarly instructive, and thus provide a useful basis for future work at NRS. There are also extensions to D3 - most notably C3 [36] - that provide common charts, thus avoiding the need to reinvent the wheel entirely for standard visualisation tasks.

Moreover, if a D3-based page is designed appropriately then any one of the data, other content, styling, interactive components or visualisation behaviour can be updated or modified by NRS staff even if they are not familiar with the technologies involved in the others. For instance, a designer can restyle the page without needing to learn D3; whereas a plug-in based visualisation (which is a single opaque object to the DOM) must be rewritten in the appropriate language. Further, a statistician can update to the latest figures without needing to know about any aspect of web development, by providing an appropriately-formatted new data file. In this way some outputs of the project may remain of use beyond its completion, by providing templates for visualising certain types of data rather than one-off instances for the 2014 figures.

¹ Particularly those by D3 author Mike Bostock at http://bl.ocks.org/mbostock; or for specific requests the community at http://stackoverflow.com.

1.3 Design principles

Any definition of ‘good’ visualisation will inevitably include some component of personal taste, as well as being dependent on the context it is presented in and the audience it is intended for. Nonetheless, we can attempt to outline some general design principles - and we can often identify features that make a visualisation ‘bad’!

The canonical references on this matter are Tufte’s two works [41] and [40]. Disdainful of the influence of designers in what he sees as a statistical field, Tufte takes an almost entirely practical view, with visual appeal a potential side-effect but never the goal: “Occasionally artfulness of design makes a graphic worthy of the Museum of Modern Art, but essentially statistical graphics are instruments to help people reason about quantitative information” [41]. In his foreword to Lima’s work [21], Manovich takes a broader view: “The space defined by the disciplines of science, design, or art [...] contain lots of possibilities. A given visualization project can be situated anywhere in this space, depending on what it privileges” and goes on to argue that the best examples “manage to combine all three” aspects of this space. Lima’s examples are drawn from the online collection Visual Complexity, and this name is telling in comparison to Tufte’s titles; many of the works presented are striking representations of almost overwhelming complexity, but as such can only offer high level insights rather than serving as tools for detailed quantitative analysis.

A point of agreement, however, is that aesthetic appeal should be derived from the data itself, not decoration. Tufte distinguishes between “data ink” - that which would reduce information content if deleted - and “chartjunk”, added for artistic reasons and superfluous to understanding the data. He argues that “data graphics [...] stand or fall on their content, gracefully displayed. Graphics do not become attractive and interesting through the addition of ornamental hatching and false perspective to a few bars. Chartjunk can turn bores into disasters, but it can never rescue a thin data set. The best designs [...] are intriguing and curiosity-provoking” [41]. Similarly, McCandless describes his projects in [23] as “A series of experiments in making information approachable and beautiful”, motivated by the question “can a book with the minimum of text, crammed with diagrams, maps and charts, still be exciting and readable?”

In line with these goals, a visualisation should make the viewer’s task in interpreting the data easier, not harder: Tufte warns against the creation of “puzzle graphics” that have to be decoded through a verbal train of thought rather than being visually self-evident. A sure sign of a bad visualisation is that it is harder to work with than the original tables of data!

1.3.1 Graphical Integrity

Worse, though, is a visualisation that is not difficult to interpret, but easy to misinterpret - that is, one that misrepresents the underlying data. In translating to visual form, the medical principle of ‘first, do no harm’ should be followed. Tufte devotes the second chapter of [41] to this issue of graphical integrity, identifying various general principles; we highlight two in particular.

• The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.

• Clear, detailed and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.

In many visualisations, a set of data items may be represented by simple two dimensional shapes or icons; when each item has an associated size, it is natural to scale these representations accordingly. However, a common mistake is to multiply both dimensions by the desired scale factor s, rather than a single one, in order to preserve the aspect ratio; the end result is that the area is rescaled by a factor of s². Deliberately or not, this simple error can be made at even the highest levels with straightforward data sets, as Figure 1.1 shows.
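The arithmetic behind this error is easily checked. In the sketch below (the function names and reference values are hypothetical, chosen for illustration) the correct radius grows with the square root of the data value, so that the area stays proportional to it, while the naive version reproduces the mistake described above.

```javascript
// Radius for a circle whose AREA is proportional to the data value,
// relative to a reference circle (refValue is drawn with radius refRadius).
function radiusForArea(value, refValue, refRadius) {
  return refRadius * Math.sqrt(value / refValue);
}

// The mistaken version: radius proportional to the value,
// so the area grows with the square of the scale factor.
function radiusNaive(value, refValue, refRadius) {
  return refRadius * (value / refValue);
}

// Ratio of a circle's area to that of the reference circle.
const areaRatio = (radius, refRadius) =>
  (radius * radius) / (refRadius * refRadius);

// A value four times the reference:
const correct = radiusForArea(4, 1, 10); // radius doubles, area ratio is 4
const naive = radiusNaive(4, 1, 10);     // radius quadruples, area ratio is 16
console.log(areaRatio(correct, 10), areaRatio(naive, 10));
```

The same square-root correction applies to any shape scaled in both dimensions while preserving its aspect ratio.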

This issue is further complicated by variations in perception: Tufte notes that “the perceived area of a circle probably grows somewhat more slowly than the actual (physical, measured) area [...]; perceptions change with experience; and perceptions are context-dependent.” The second principle above seeks to address this, arguing that visual depictions should be supplemented with numerical values where possible. We note that for web-based visualisations there is almost limitless potential for this through the use of tooltips: additional content that appears only when hovering the mouse over an element. These can be incorporated into otherwise static visualisations, and allow access to full data values without needing to make a trade-off against ease of reading due to clutter as in visualisations for print. However, labeling alone cannot rescue poor design, if the visual perception overwhelms the message of the numerical values (as illustrated in Figure 1.2).

Figure 1.1: The 2011 State of the Union Address

(Capture from video [43]; the official annotation in blue erroneously scales the radius of each circle in proportion to GDP. The green circles, which correctly scale the area, have been added by the author and tell a rather different story.)

Figure 1.2: If Bush Tax Cuts Expire

(Left: Fox Business, Cavuto via Media Matters [24], presents the change in top tax rate - from 35% to 39.6% - with accurate labelling but misleading relative areas due to a truncated axis. Right: a more conventional presentation by Media Matters, with axis starting at zero.)

1.3.2 Use of Colour

Tufte offers considerable guidance on the use of colour in Chapter 5 of [40], as usual contrasting its power when used well against the calamities that can occur when inexpertly deployed. He notes a number of “fundamental uses of color in information design: to label (color as noun), to measure (color as quantity), to represent or imitate reality (color as representation), and to enliven or decorate (color as beauty)”.

Labeling by colour is a useful way to indicate a categorical data dimension, avoiding the clutter or confusion of textual or numerical labels and providing a quick way to visually group related items. This can work particularly well if the colours chosen are also representative - Tufte gives examples from cartography, where there are natural interpretations of greens and blues - but this is also a hazard, if such associations would be misleading, or simply clash with the data key.

A further complication arises in interactive visualisations, as we may not know which elements are currently on-screen, or (when designing general templates) how many categories (and thus colours) there may even be. The approach in D3 is to once again divorce style from data; a selection of carefully curated palettes (of ten to twenty colours) is available as functions which map integer values to pleasant colour choices. Additional palettes can be supplied as simple lists² or functions of data values. Thus abstract categories can be specified at the data level, then styled as appropriate given the state of the visualisation (such as the depth of a hierarchy explored). For indicating that various data items are related but not identical, [12] suggests colouring each with a small perturbation of a base colour (such as that given by the palette); we implement Javascript code for doing so in the visualisation described in Section 4.3.
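As a rough illustration of both ideas, the sketch below assigns palette colours to categories in order of first appearance, and perturbs a base colour by shifting each RGB channel by a small offset. Both functions are hypothetical: the first is not D3's palette code, and the second is only one simple way to perturb a colour, not necessarily the scheme suggested in [12] or the one implemented for Section 4.3.

```javascript
// A palette supplied as a simple list of hex colours (illustrative choice).
const palette = ["#1f77b4", "#ff7f0e", "#2ca02c"];

// Returns a function mapping each new category to the next palette colour,
// cycling round if there are more categories than colours.
function makeOrdinalColour(colours) {
  const assigned = new Map();
  return function (category) {
    if (!assigned.has(category)) {
      assigned.set(category, colours[assigned.size % colours.length]);
    }
    return assigned.get(category);
  };
}

// Perturb a "#rrggbb" colour by adding an offset to every channel,
// clamped to the valid range 0..255.
function perturb(hex, offset) {
  const channels = [1, 3, 5].map(i => parseInt(hex.slice(i, i + 2), 16));
  const shifted = channels.map(c => Math.min(255, Math.max(0, c + offset)));
  return "#" + shifted.map(c => c.toString(16).padStart(2, "0")).join("");
}

const colourOf = makeOrdinalColour(palette);
const base = colourOf("category one"); // first palette entry
console.log(base, perturb(base, 16), perturb(base, -16));
```

Because the mapping is built as categories are encountered, the number of categories need not be known in advance, which suits the general templates described above.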

Care is also required when colour is used to measure ordinal data, particularly when the underlying quantity is continuous. Figure 1.3 demonstrates the risks: it seems to imply a sharp divide between the eastern and western halves of the US, with a boundary line running between the green and yellow regions. But closer inspection of the legend shows that these colours correspond to adjacent bands; the data values could vary smoothly from east to west, but the rainbow colour scheme does not. Such a scheme remains lamentably popular despite the many limitations outlined in [3]: “The rainbow color map confuses viewers through its lack of perceptual ordering, obscures data through its uncontrolled luminance variation, and actively misleads interpretation through the introduction of non-data-dependent gradients.” This last remark explains the failure of Figure 1.3; the abrupt changes between hues are far more noticeable than the variations within each band, and misleadingly suggest a discontinuity in the data, too. The authors of [3] note the gray scale as being perceptually ordered: “Increasing luminance from black to white is a strong perceptual cue that indicates values mapped to darker shades of gray are lower in value than values mapped to lighter shades of gray. This mapping is natural and intuitive.” We can combine this useful feature of the gray scale with a chosen colour (for aesthetic reasons or to indicate quantity simultaneously with category) in D3 by adjusting the alpha (or transparency) value.
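A minimal sketch of this idea, assuming a linear mapping from the data range onto alpha values between 0 and 1: the helper alphaColour is hypothetical, and it produces a standard CSS rgba() string rather than calling any D3 function.

```javascript
// Encode category as hue (the rgb triple) and quantity as opacity:
// values at `min` are fully transparent, values at `max` fully opaque.
function alphaColour(rgb, value, min, max) {
  const t = max === min ? 1 : (value - min) / (max - min);
  const alpha = Math.min(1, Math.max(0, t)); // clamp out-of-range values
  return `rgba(${rgb[0]}, ${rgb[1]}, ${rgb[2]}, ${alpha})`;
}

// A value halfway through the range is drawn at half opacity.
console.log(alphaColour([31, 119, 180], 5, 0, 10));
```

Which end of the data range should appear darker depends on the page background, so the direction of the mapping is a design choice; in practice a lower bound on alpha may also be wanted so that small values remain visible.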

² We note the usefulness of [5], particularly for generating palettes that are accessible for colourblind users.

Figure 1.3: A misleading rainbow.

(Figure 13, Estimated fraction of precipitation lost to evapotranspiration 1971-2000, of [34]).

1.3.3 Transitions

To a large extent, good design of interactive visualisations can be inferred from the principles for the static case. However, there is a further aspect to be considered, which is the transitions between states. Although not specific to visualisation, user interface guides such as [16] provide useful guidance. If we wish to associate the size or colour of an object with the magnitude of the data item it represents, then this association can be strengthened or weakened depending on how the object is portrayed when in motion. Larger, “heavier” objects should be less prone to disturbances, move more slowly, and both accelerate and decelerate at a lower rate than smaller ones. Even without such an association, objects should still move in physically authentic ways: from [16] we note that “animation with abrupt starts and stops or rapid changes in direction appears unnatural and can be an unexpected and unpleasant disruption for the user [...] transitioning between two visual states should be smooth, appear effortless, and above all, provide clarity to the user, not confusion”.

Ideally, we wish to minimize the disruption to the user’s “mental map” by ensuring that visual changes only occur to the extent required by changes in the underlying data. In particular, sudden appearance or disappearance is not physically realistic; [16] suggests “When an object enters the frame, ensure it’s moving at its peak velocity. [...] Similarly, when an object exits the frame, have it maintain its velocity, rather than slowing down as it exits the frame. Easing in when entering and slowing down when exiting draw the user’s attention to that motion.”

Much of the heavy lifting is handled in D3 by the inclusion of transitions: when the state of an object is to be updated, the timescale for this change to occur can be specified and properties such as size, colour and location will be smoothly interpolated from the start to end state. By automatically triggering such updates just as the previous transition ends, we can thus create a continuous animation along a control parameter dimension merely by specifying a discrete set of data slices.
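What such interpolation involves can be sketched in plain Javascript. This is an illustration only and not D3's transition code: the functions lerp, easeInOut and stateAt, the cubic easing curve and the sample slices are all assumptions made for the example.

```javascript
// Linear interpolation between two numbers, for t between 0 and 1.
function lerp(a, b, t) {
  return a + (b - a) * t;
}

// Cubic ease-in-out: gentle start, faster middle, gentle stop.
function easeInOut(t) {
  return t < 0.5 ? 4 * t * t * t : 1 - Math.pow(-2 * t + 2, 3) / 2;
}

// State of one object `elapsed` milliseconds into an animation that moves
// through a list of slices (at least two), spending `duration` on each step.
function stateAt(slices, duration, elapsed) {
  const last = slices.length - 1;
  const position = Math.min(Math.max(elapsed / duration, 0), last);
  const index = Math.min(Math.floor(position), last - 1);
  const t = easeInOut(position - index);
  const from = slices[index];
  const to = slices[index + 1];
  const state = {};
  for (const key of Object.keys(from)) {
    state[key] = lerp(from[key], to[key], t);
  }
  return state;
}

// Three hypothetical slices giving a circle's radius and position.
const slices = [{ r: 10, x: 0 }, { r: 20, x: 100 }, { r: 40, x: 100 }];
console.log(stateAt(slices, 1000, 500), stateAt(slices, 1000, 1500));
```

Starting each step exactly where the previous one ended is what turns a discrete set of slices into a continuous animation; the easing function avoids the abrupt starts and stops warned against in [16].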

Chapter 2

Charts

In this chapter, we will examine a variety of popular visualisation techniques. We do so with a view to identifying best practice for their use in line with the principles set out in the previous chapter, noting any particular strengths or weaknesses of each. We also consider how they might be adapted to the context of interactive visualisation on the web.

2.1 Small Multiples

Figure 2.1: Air pollution in Southern California.

From [41], work of G. McRae, California Institute of Technology via Los Angeles Times July 22, 1979.

As discussed in Section 1.1, we can ‘slice’ a high dimensional data set along a control dimension such as time, giving a lower dimensional set at each t value that is hopefully easier to visualise. By varying t a series of slices is obtained, which, like the frames of a movie, we can present sequentially to recover some insight into how the data evolves through time. Such a sequence of static images is a simple example of a ‘small multiple’ - a series of structurally similar charts allowing for easy comparison along the changing parameter. There is no reason why this parameter has to be time, as we can slice through any dimension of the data set; moreover, we can take a two-dimensional slice by fixing the values of two parameters and presenting the resulting visualisation as one cell of a grid. Figure 2.1 gives an example of this grid structure, with different columns corresponding to time values, and different rows to a categorical variable, the pollutant type. Each pair of time value and pollutant type specifies a data set for a particular cell, but the method of visualisation (including features such as the range of axes) is consistent across all cells.
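The grid structure can be sketched as follows. This is a hypothetical helper, not code from the project: smallMultiples and the field names in the usage example are assumptions, and the toy values are made up. The point to notice is that the value range is computed once over the whole data set and shared by every cell, so that axes are comparable across the grid.

```javascript
// Arrange rows of data into a grid of cells, one per (rowKey, colKey) pair,
// with a single value extent shared by all cells.
function smallMultiples(rows, rowKey, colKey, valueKey) {
  const distinct = key => [...new Set(rows.map(r => r[key]))];
  const rowValues = distinct(rowKey);
  const colValues = distinct(colKey);
  const values = rows.map(r => r[valueKey]);
  const extent = [Math.min(...values), Math.max(...values)];
  const cells = rowValues.map(rv => colValues.map(cv => ({
    row: rv,
    col: cv,
    extent, // the same axis range in every cell
    data: rows.filter(r => r[rowKey] === rv && r[colKey] === cv)
  })));
  return { rowValues, colValues, extent, cells };
}

// Toy example (made-up values): two pollutants observed at two times.
const grid = smallMultiples([
  { pollutant: "P1", hour: 6, level: 1 },
  { pollutant: "P1", hour: 12, level: 5 },
  { pollutant: "P2", hour: 6, level: 2 },
  { pollutant: "P2", hour: 12, level: 3 }
], "pollutant", "hour", "level");
console.log(grid.extent, grid.cells.length, grid.cells[0].length);
```

Each cell would then be drawn by the same charting routine; only the data passed to it changes.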

Tufte strongly endorses the small multiple, devoting entire chapters of both [41] and [40] to it; in the latter, he argues that “at the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.”

It is important to stress the difference between a small multiple and a series of full-scale versions of the

individual cells, arranged over multiple pages. Presentation in a grid ensures that “Information slices

are positioned within the eyespan, so that viewers make comparisons at a glance - uninterrupted visual

reasoning” [40]. As well as enabling this simultaneous view of the full data set we can, by restricting

our attention to the panels of a single row or column, assess how the data varies in that direction;

or, by comparing rows (or columns) to each other, how that variation itself varies with changes in the

other control parameter. Any sequential presentation naturally suppresses comparison along one of

the directions, and these higher level comparisons become impossible. Moreover, our ability to make

comparisons even along the favoured dimension is impaired by the lack of permanence - if we were

to present the twelve panels of Figure 2.1 over several pages, we might be able to compare consecutive

panels more effectively - but it is unlikely that we could recall the specifics of a panel from several pages

back.

These may seem like concerns that only apply in print - and an obvious criticism of small multiples is

that they are, well, small, even if we restrict our control parameters to taking just a handful of values.

Prizing data density, Tufte sees no problem with “illustrations of postage-stamp size” [40], but with

the luxury of limitless ‘pages’ in building web visualisations, it is tempting to scale up. We can treat

the control parameters in a more literal sense, and present a full-screen view of the current slice and

update it as the user alters the parameters, implicitly navigating around the larger grid. However, care

must be taken that we do not lose the benefits of comparison at a glance; animation can help with this.


2.2 Tree Maps

In a tree map, we seek to tile a fixed area so that the size of each tile is proportional to the data item

it represents. The original application by Shneiderman was to visualising hard disk usage across a

file system [18], with the name being chosen to convey “the notion of turning a tree into a planar

space-filling map” [35]. Hierarchical structures with cumulative sizes such as directory structures are

particularly suited to tree mapping due to the natural recursion - once a tile has been assigned to a

directory, it can itself be sub-tiled to analyse its contents. The appropriate depth to be explored in this

way depends largely on user experience; Shneiderman notes that whilst “we were impressed to examine

thousands of nodes at 5-7 levels at once on the screen, [...] novices did better seeing 20-50 nodes at 1-3

levels.” [35].

As for other visualisations we will consider, there are competing measures, both mathematical and

aesthetic, of the quality of a treemap. For datasets where the categories have a natural order, we may

wish to preserve this through adjacency of tiles. An easy way to achieve this is illustrated in Figure 2.2,

which shows estimated Scottish population counts in mid-2013 by age groups five years wide, older

groups being denoted by greater opacity (in line with our observations in Section 1.3.2).

This (extremely simple to produce) treemap also scores highly for stability: in an interactive visualisa-

tion where each treemap corresponds to a slice, we may wish to resize tiles as the control parameter

varies, but without substantial changes to their position (which would disrupt the user’s mental map).

For this presentation, we can smoothly resize each column whilst preserving its place in the ordering.

Figure 2.2: Slice-and-dice treemap of Scottish population data.
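
A layout of this kind can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the sample counts are hypothetical (they are not the NRS figures), and the aspect ratio of a tile is taken to be the larger of height/width and width/height.

```python
def slice_and_dice(values, width, height):
    """Return (x, y, w, h) tiles: full-height columns with width proportional to value."""
    total = sum(values)
    tiles, x = [], 0.0
    for v in values:
        w = width * v / total
        tiles.append((x, 0.0, w, float(height)))
        x += w  # the next column starts where this one ends, preserving the order
    return tiles


def aspect_ratio(tile):
    """Larger of height/width and width/height, so a square scores 1."""
    _, _, w, h = tile
    return max(h / w, w / h)


# Hypothetical counts for five ordered categories.
tiles = slice_and_dice([30, 25, 25, 15, 5], width=100, height=50)
ratios = [aspect_ratio(t) for t in tiles]
```

Changing one value only rescales the column widths, so the ordering is never disturbed; but the smallest category here already has an aspect ratio of 10.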

However, the slice-and-dice layout is a poor choice when considering the aspect ratio of the tiles - that is, $\max\left(\frac{\text{height}}{\text{width}}, \frac{\text{width}}{\text{height}}\right)$

- as the fixed height necessarily forces tiles for small data items to be very thin. Indeed, many would

not even recognise it as a treemap! This is not purely an aesthetic consideration, with various practi-

cal advantages of nearly-square tiles being identified in [6]. For instance, since squares minimize the

perimeter for a given area, display space is more efficiently used when tile borders are required; thin


rectangles can cause aliasing errors in print and are hard to select / label in interactive visualisations;

and comparison between rectangles is easier when they have similar aspect ratios. Conversely, such

‘squarified’ treemaps tend to have low stability, as well as disrupting ordered data sets - see Figure

2.3, which presents the same data set as Figure 2.2. Finding rectangular tessellations with aspect ratio

as close to 1 as possible is also an NP-hard problem, although [6] presents an approximate algorithm

which (empirically) behaves well in practice. This is available in D3, and was used to produce Figure

2.3. We will describe this algorithm by walking through their example (reproduced as Figure 2.4) in

greater detail.

Figure 2.3: Squarified treemap of Scottish population data.

Example 2.2.1. Suppose we wish to pack tiles of area 6,6,4,3,2,2,1 in a rectangle R of width 6 and height

4. We place the tiles with largest area first.

As we progress, we will finalise the placement of some tiles which we consider to be ‘locked’; the unlocked

remainder of R is our working space R′; this will always be a rectangle, and initially is all of R. Within

this, we will have chosen an ‘active side’, and will be forming a stack S of unlocked tiles along that side.

For instance, if the active side is vertical, then we will be stacking tiles into rows, and as more tiles are

added, the width of the stack must necessarily grow. This is the case in steps 1, 2 and 3 - we have chosen

the left vertical as active side, and the first tile placed (area 6) thus gives us a stack of width 1.5. Adding

a second area 6 tile to the stack gives it width 3; and the third tile, of area 4, takes it to width 4.

However, when adding each tile to the stack, we note the aspect ratio that arises as a result; by placing

successively smaller tiles, the worst in the stack is always that of the newly-placed tile. For instance, adding

the area 4 tile in step 3 gives it an aspect ratio of 4/1. Instead of placing it in the current stack, we could

instead start a new one in R′\S with this tile, as shown in step 4. This has a preferable aspect ratio of 9/4,

so the stack with two area 6 tiles is locked; R′\S becomes our new R′, and we start a new stack S with the

area 4 tile. The choice of tiling direction is determined by the aspect ratio of R′: we pick a vertical active

side if R′ is wider than it is tall.

Thus in steps 4 to 6, with a new working space R′ of width 3 and height 4, we place tiles along the


horizontal edge to form a stack of varying height. Introducing the area 3 tile gives it an acceptable aspect

ratio, so we do so (step 5), but continuing in this way with the area 2 tile in step 6 is rejected. Instead we

lock the area 4 and 3 tiles, shrink our working space, and start a new stack with the area 2 tile (step 7).

This we initially assign a vertical active side, but on placing the next tile (step 8) we find that the aspect

ratio is worse than locking the first area 2 tile and starting a new stack with the second one (step 9; note

that locking a single tile stack is effectively the same as swapping the orientation of the active side). Now

there is no choice in placing the final tile, and we are done (step 10).

Figure 2.4: Example of tile placement in the squarified algorithm. Originally Figure 4 of [6].
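
The walk-through above can be sketched in Python as follows. This is our own minimal reading of the greedy procedure, not the code of [6] or of D3: the names worst_ratio, squarify and lock are hypothetical, the tile areas are assumed to sum to the area of R, and drawing each stack against the left or top edge of the working space is an assumption of the sketch. The worst aspect ratio is taken over the whole stack rather than just the newest tile.

```python
def worst_ratio(stack, side):
    """Worst aspect ratio among tiles of the given areas stacked along a side."""
    total = sum(stack)
    return max(side ** 2 * max(stack) / total ** 2,
               total ** 2 / (side ** 2 * min(stack)))


def squarify(areas, width, height):
    """Greedy layout returning (x, y, w, h) tiles in order of placement."""
    x, y, w, h = 0.0, 0.0, float(width), float(height)
    tiles, stack = [], []

    def lock(locked):
        """Fix the tiles of a finished stack and shrink the working space."""
        nonlocal x, y, w, h
        total = sum(locked)
        if w >= h:  # vertical active side: the stack is a column on the left
            thickness, offset = total / h, y
            for a in locked:
                tiles.append((x, offset, thickness, a / thickness))
                offset += a / thickness
            x, w = x + thickness, w - thickness
        else:  # horizontal active side: the stack is a row along the top
            thickness, offset = total / w, x
            for a in locked:
                tiles.append((offset, y, a / thickness, thickness))
                offset += a / thickness
            y, h = y + thickness, h - thickness

    for a in sorted(areas, reverse=True):  # largest tiles are placed first
        side = min(w, h)  # length of the active side
        if stack and worst_ratio(stack + [a], side) > worst_ratio(stack, side):
            lock(stack)  # adding the tile would make things worse
            stack = []
        stack.append(a)
    if stack:
        lock(stack)
    return tiles


tiles = squarify([6, 6, 4, 3, 2, 2, 1], 6, 4)
```

For the tile areas of Example 2.2.1 this locks the stacks (6, 6), (4, 3), (2), (2) and (1) in turn, in agreement with the steps described.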

For static representations of unordered data sets (so neither stability nor ordering are a concern) squar-

ified tree maps are an effective way to communicate hierarchical data sets. Aesthetically, as Figure 2.5

shows, they can be potentially indistinguishable from art! They also offer practical advantages over

rival presentations, as discussed in [6] - with listings “it is hard to form a mental image of the overall

structure”, whilst tree diagrams (see Section 3.2) “use the display space inefficiently [as] most of the pix-

els are used as background”.

But from the very start [18] treemaps were intended to be used interactively, and this allows some of


the usability issues to be addressed. For instance, they can allow users to select a global tree depth they

are comfortable with. By offering the ability to zoom in to tiles, it becomes possible to apply a greater

level of detail to just the categories of interest, allowing for user-driven exploration of the data set. Due

to the stability issues mentioned, care may be required when using them as slices, and although flat (as

opposed to hierarchical) categorical data can be presented, as with simpler area-based representations

such as pie charts, negative quantities cannot be effectively handled. Further, whilst “treemaps are very

effective when size is the most important feature to be displayed” [6], they are less useful when it is the

tree structure we are interested in, rather than the values assigned to its leaves. We identify further

issues with tree maps in our investigations in Section 4.3.

Figure 2.5: The Singing Mondrian. From [20], a treemap visualisation of popular artists on the music-tracking website Last.fm, styled

after Piet Mondrian’s compositions (colour-coding denotes genre).

2.3 Choropleth Maps

A choropleth map presents the values of an ordinal statistic associated with geographic regions by

means of a shaded or colour-coded map. We have already seen an example - albeit with a poorly

chosen colour scheme - in Figure 1.3. An advantage of conveying magnitude through colour (rather

than area, as with tree maps and other representations we shall consider) is that with an appropriate

scale we can indicate negative quantities.


In partial analogy with small multiples¹, a very high density of local data can be presented whilst still

allowing easy at-a-glance comparisons within larger contexts; familiar geographic placement allowing

for easier navigation of the data than a grid or tabular representation. However, this familiarity being

based on existing geographic or political boundaries also presents a risk, in that the areas of regions

are naturally fixed. Whilst the colour scheme is meant to be the only data channel, it is inevitable

that size will influence our perception somewhat. Figure 2.6 gives an example; although appropriately

coloured, it is easy to interpret the migration from Canada as being more significant than that from

Germany, when in fact there are more than twice as many migrants from the latter.

Figure 2.6: Choropleth map of migration to Scotland.

One solution to this problem, as discussed in [40], is to use a mesh map, partitioning with an equally

spaced grid so that “arbitrary but statistically wise boundaries now cradle the micro-data”. However, if

the original statistical reporting has been aggregated to more conventional divisions (countries, coun-

cil regions, health wards etc.) then such a partition may not be possible to produce. We will consider an

alternative to area-based representation for migration flows from given regions in considerable detail

in Chapter 3.

¹ The term choropleth is derived from the Greek words for ‘area/region’ and ‘multitude’.


2.4 Frequency Plots

One opportunity presented by visualisation is to move away from numerical values entirely, and in-

stead represent quantities of interest pictorially. Further, demographic statistics are often concerned

with populations of a size that is hard to comprehend; whilst the arrival of around 32,000 migrants to

Scotland in 2012 might sound large, it is only 0.6% of the overall population of around 5.31 million.

This is made yet easier when phrased as a frequency - that if a thousand Scots were selected at ran-

dom that year, only six would be new migrants. There is even anthropological evidence that humans

are only capable of keeping track of social groups of moderate size - known as “Dunbar’s number”²,

this cognitive limit is typically taken to be 150. Thus we may benefit from presenting percentage val-

ues literally as a visual proportion of 100, such as in Figure 2.7, which illustrates the gender divide in

Scotland’s centenarian population (of which 85% are women).

Figure 2.7: Number of males and females per 100 centenarians, Scotland 2012.

Originally Figure 5 of [14], see Appendix A.7 for an interactive version.

² Or, in popular discussion, “the monkeysphere”.


We also note the particular success of natural frequency diagrams in understanding conditional prob-

ability when drawing from joint distributions (that is, overlapping populations) as in Figure 2.8. This

illustrates the outcomes of breast cancer screening for a population of 1000 women, given the follow-

ing:

• The probability of a woman having breast cancer is 1%

• If a woman has breast cancer, the probability of testing positive is 85% (sensitivity)

• If a woman does not have breast cancer, the probability of testing negative is 90% (specificity).

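From these, a direct calculation via Bayes’ theorem and the law of total probability gives us the

probability of actually having breast cancer, given a positive test result, as around 8%:

\[
\begin{aligned}
P(\text{Cancer} \mid +\text{ve}) &= \frac{P(\text{Cancer} \cap +\text{ve})}{P(+\text{ve})} \\
&= \frac{P(+\text{ve} \mid \text{Cancer})\,P(\text{Cancer})}{P(+\text{ve} \mid \text{Cancer})\,P(\text{Cancer}) + P(+\text{ve} \mid \text{No cancer})\,P(\text{No cancer})} \\
&= \frac{0.01 \times 0.85}{0.01 \times 0.85 + 0.1 \times 0.99} \\
&\approx 0.08
\end{aligned}
\]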
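
The same figure can be reached by counting people rather than multiplying probabilities, which is the idea behind a natural frequency presentation. The short Python sketch below does this for the 1000 women considered; the function and dictionary key names are our own and purely illustrative.

```python
def natural_frequencies(population, prevalence, sensitivity, specificity):
    """Split a population into the four screening outcomes, as expected counts."""
    with_cancer = population * prevalence
    without_cancer = population - with_cancer
    true_positives = with_cancer * sensitivity
    false_positives = without_cancer * (1 - specificity)
    return {
        "true_positives": true_positives,
        "false_negatives": with_cancer - true_positives,
        "false_positives": false_positives,
        "true_negatives": without_cancer - false_positives,
    }


counts = natural_frequencies(1000, prevalence=0.01, sensitivity=0.85, specificity=0.90)

# Of all women who test positive, what fraction actually have cancer?
positives = counts["true_positives"] + counts["false_positives"]
posterior = counts["true_positives"] / positives
```

Of the 1000 women, 10 are expected to have cancer, of whom 8.5 test positive; the remaining 990 yield 99 false alarms, so only 8.5 of the 107.5 positive results (about 8%) correspond to cancer.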

However, in a test where medical practitioners were given similar data and asked to determine this

probability, almost half erroneously assessed the cancer risk as being the sensitivity (in this example,

85%) [8]. This was despite the specificity being rephrased as the “false alarm rate” (10% of women

without breast cancer nonetheless testing positive); and multiple choice options being offered. Given

the controversial history of Bayes’ theorem, it is perhaps not surprising that such calculations prove

challenging even for specialists; the natural frequency presentation in Figure 2.8 makes the figures

much easier to grasp, which may lend support to the use of simple frequency charts such as Figure 2.7

too.


Figure 2.8: Possible outcomes of breast cancer screening. Capture from [42].


Chapter 3

Graphs

In this chapter, we turn our attention to the use of graph-theoretic structures - particularly trees - to

design visualisations, with the aim of overcoming some of the limitations identified in the previous

chapter. Of particular interest will be graph layout: firstly for given structures, such as hierarchical

data sets that we previously considered in the context of tree maps; then of structures computed from

the data. For the latter we will use clustering to generate trees from geographical data, enabling a

presentation as a flow map.

3.1 Introduction

To fix notation, by a graph we will mean a collection of objects - the vertex set V - with a linking structure

given by the edge set E ⊆V ×V . At times, it may be convenient to identify an edge e ∈ E by its end points

vi , v j ∈ V ; we will denote this by e = i ↔ j , and say that vi , v j are adjacent. Often we will think of the

vertex set V as simply a set of integers (so the i th vertex vi is identified simply as i ).

The vertices and edges of a graph lend themselves to a natural presentation as a diagram of points and

lines, and we often think of this diagram as being the graph. But it is important to realise that seemingly

different diagrams may just be different representations of the same abstract structure. For instance,

all of the diagrams in Figure 3.1 are - considering only adjacency - ‘the same graph’ (specifically, an

object known as the Petersen graph). It should be immediately apparent that there is no structural

difference between graphs (i) and (ii), since all that has changed are visual properties: colour, shape,

and labelling language (letters instead of numbers). Without the numbering of vertices, it might be

hard at first glance to verify that (i) and (iii) are the same, but by checking the neighbours of each, it

can be seen that the ten vertices have simply been repositioned in space. A similar analysis shows that


(iv) is just a repositioning (and fresh colour makeover) of the lettered vertices in (ii) - which was itself

equivalent to (i).

On the one hand, this level of abstraction makes graphs a powerful tool for visualisation. Many data

sets which do not have an obvious ‘vertices and edges’ nature may nonetheless have a graphical inter-

pretation which can give us insight into structure within the data. We can then present this with an

appropriate diagram, and we are free to use tools like colour and vertex shape/size to indicate further

aspects of the data. But this freedom also results in a number of challenges in producing - or even

defining - good diagrams for a given graph; some of these issues will be considered in Section 3.2.

[Figure 3.1 panels (i)-(iv): diagrams (i) and (iii) label the vertices 1-10; diagrams (ii) and (iv) label them A-J.]

Figure 3.1: Four views of the Petersen Graph.

As mentioned, our vertices may correspond to items in our data set with further attributes. But we

may also be interested in assigning attributes to the edges between them. Generally we treat edges

as unordered pairs {i , j } (so two vertices are either adjacent or not); for a directed graph we consider

ordered pairs, so we may have an edge e = i → j ∈ E without the corresponding j → i being in E . We

will call i → j an out-edge of i and an in-edge of j .

For both undirected and directed graphs, we may also be interested in assigning a weight w(e) to each

edge e ∈ E , usually to indicate strength of ties between vertices beyond a binary classification of adja-

cent or not. In this general setting (or indeed any of the simpler ones), we can associate an n-vertex

graph with an n ×n adjacency matrix, where the entry Ai j is the weight of the edge i → j (zero if no


such edge in E , or if i = j ; conventionally 1 for edges in an unweighted graph). For certain tables of

data we can therefore immediately construct a corresponding graph by interpreting the table as an

adjacency matrix.
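
As an illustration of this correspondence, the following Python sketch builds an adjacency matrix from a list of weighted edges, and conversely reads a square table as a weighted directed graph. Vertices are numbered from 0 for convenience, and the function names and the small table of flows are hypothetical.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (i, j, weight) triples."""
    A = [[0] * n for _ in range(n)]
    for i, j, w in edges:
        if i == j:
            continue  # the diagonal stays zero, as in the convention above
        A[i][j] = w
        if not directed:
            A[j][i] = w  # an undirected edge is recorded in both directions
    return A


def edges_from_table(table):
    """Read a square table as a directed graph: entry [i][j] is the weight of i -> j."""
    return [(i, j, w)
            for i, row in enumerate(table)
            for j, w in enumerate(row)
            if w != 0 and i != j]


# Unweighted, undirected path on three vertices: edges carry weight 1.
path = adjacency_matrix(3, [(0, 1, 1), (1, 2, 1)])

# Hypothetical table of flows between three regions, read as a directed graph.
table = [[0, 120, 30],
         [80, 0, 0],
         [45, 10, 0]]
flow_edges = edges_from_table(table)
```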

3.2 Graph Layout

Determining appropriate vertex placement (and possibly edge routing, if not simply straight lines) can

be formulated as an optimisation problem. To do so, an objective cost must be assigned to possible

vertex locations that captures the quality of the corresponding diagram. Care is required, however,

as innocent-looking criteria can give rise to intractable problems. For instance, we describe a graph

as planar if there is an embedding into the plane with no edge crossings. A graph with n edges can

be tested for planarity in O(n) time, and if a suitable embedding exists then this can be recovered for

the same complexity (see [7]). However, if a graph fails to be planar, then the question of how many

crossings are necessary to draw it is already NP-hard [13].

Worse, the concept of a ‘good’ diagram can depend on aesthetic judgments; with respect to crossing

number, both diagrams (i) and (iv) of Figure 3.1 are equally good; see also Figure 3.2, where the version

with crossings is likely preferable to the planar embedding. Many properties such as symmetry that are

useful for laying out examples in pure graph theory may not hold for graphs constructed from messier

real world data sets. Moreover, it is unlikely that the preferences of the designer will perfectly match

those of the viewers of a diagram or users of a visualisation incorporating one; in the latter case, user

requirements may vary as they explore a data set.

Figure 3.2: Non-planar and planar diagrams for the same graph.

For simple criteria, though, algorithmic optimisation of layout can work well, with the family of force-

directed placement algorithms - driven by an underlying ‘physical’ process - being successful exam-

ples. In [19], vertices are associated with particles linked by springs. By setting a desired (fixed) edge


length of L, the ideal geometric distance between two positions $p_i$, $p_j$ is taken to be $l_{ij} = L \times d_{ij}$ for

$d_{ij}$ the graph theoretic distance - that is, length of shortest path - between vertices i and j. If $p_i$, $p_j$ are

placed too far apart or too close together, this will require energy to either expand or compress the spring joining

them. Introducing a strength $k_{ij}$ for each spring - a value proportional to $1/d_{ij}^2$ is suggested - a total

energy of

\[
E = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{1}{2} k_{ij} \left( |p_i - p_j| - l_{ij} \right)^2 \tag{3.2.1}
\]

can be assigned to each choice of positions $p_1, \ldots, p_n$. This varies continuously with the $p_i$, and

minimizing E corresponds to reducing the discrepancy between ideal and actual spacing.
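
To make the energy concrete, the Python sketch below computes the graph theoretic distances by breadth-first search and then evaluates E for a given set of positions. It assumes a connected, unweighted graph with vertices numbered from 0; the function names and the constant of proportionality K in the spring strength are our own choices for illustration.

```python
from collections import deque
from math import hypot


def graph_distances(n, edges):
    """All-pairs shortest path lengths by breadth-first search (connected graph assumed)."""
    neighbours = [[] for _ in range(n)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    d = []
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        d.append([dist[v] for v in range(n)])
    return d


def spring_energy(positions, d, L=1.0, K=1.0):
    """Total spring energy over all pairs, with l_ij = L * d_ij and k_ij = K / d_ij^2."""
    n = len(positions)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            l_ij = L * d[i][j]
            k_ij = K / d[i][j] ** 2
            gap = hypot(positions[i][0] - positions[j][0],
                        positions[i][1] - positions[j][1])
            energy += 0.5 * k_ij * (gap - l_ij) ** 2
    return energy


# A path 0 - 1 - 2, drawn straight and then with a right-angled bend.
d = graph_distances(3, [(0, 1), (1, 2)])
straight = spring_energy([(0, 0), (1, 0), (2, 0)], d)
bent = spring_energy([(0, 0), (1, 0), (1, 1)], d)
```

The straight drawing realises every ideal distance and so has zero energy; the bent one is penalised only through the pair (0, 2), whose spring is compressed.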

As an analytic solution to the n-dimensional nonlinear equations which arise is not possible, the au-

thors of [19] propose a heuristic approach based on iterative refinement of the best solution found so

far. At each step, the particle with position pm = (xm , ym) of greatest discrepancy ∆m is identified, and

all others fixed. A local minimum with respect to this point (that is, the partial derivatives wrt. xm

and ym both being zero) can be attained to any desired precision, by iteratively solving a sequence of

2-dimensional linear sub-problems and relocating to some (xm+δx , ym+δy ) at each iteration. Another

step for some other m can then be taken if E is not yet sufficiently low. For interactive visualisations,

this threshold can be set by the user, either explicitly, or implicitly by taking further steps whenever

they relocate a particle (such as to resolve a visually-obvious local minimum). We also note that par-

ticles can be effectively fixed in position (again, by design or user selection) by excluding them from

consideration when identifying the particle of greatest discrepancy.
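
A single relocation of this kind can be sketched in Python as below. The derivative formulas are obtained by differentiating (3.2.1) with respect to the coordinates of the chosen particle and solving the resulting 2 × 2 linear system; this should be read as our own sketch of the step, under that assumption, rather than as the code of [19]. The function name and the matrices k and l (holding the spring strengths and ideal lengths) are illustrative.

```python
from math import hypot, sqrt


def newton_step(m, positions, k, l):
    """Relocate particle m by one Newton-Raphson step; return its discrepancy beforehand."""
    xm, ym = positions[m]
    Ex = Ey = Exx = Exy = Eyy = 0.0
    for i, (xi, yi) in enumerate(positions):
        if i == m:
            continue
        dx, dy = xm - xi, ym - yi
        dist = hypot(dx, dy)
        # First and second partial derivatives of E with respect to (xm, ym).
        Ex += k[m][i] * (dx - l[m][i] * dx / dist)
        Ey += k[m][i] * (dy - l[m][i] * dy / dist)
        Exx += k[m][i] * (1 - l[m][i] * dy * dy / dist ** 3)
        Exy += k[m][i] * (l[m][i] * dx * dy / dist ** 3)
        Eyy += k[m][i] * (1 - l[m][i] * dx * dx / dist ** 3)
    # Solve the linear system  [Exx Exy; Exy Eyy] (delta_x, delta_y) = -(Ex, Ey).
    det = Exx * Eyy - Exy * Exy
    delta_x = (-Ex * Eyy + Ey * Exy) / det
    delta_y = (-Ey * Exx + Ex * Exy) / det
    positions[m] = (xm + delta_x, ym + delta_y)
    return hypot(Ex, Ey)


# Triangle with unit ideal lengths: two particles held fixed, the third relocated.
ones = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
positions = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5)]
for _ in range(20):
    newton_step(2, positions, k=ones, l=ones)
apex_height = sqrt(3) / 2
```

In this small example the third particle settles at the apex of an equilateral triangle; the first step overshoots, which illustrates why several iterations are needed for each relocation.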

A limitation of this and similar approaches is that they treat the particles as infinitesimal points in

space, and only graph-distance plays a role in positioning. For visualisation purposes, it is likely that

further constraints will be desired. In [9] various examples are given, such as preventing overlap of

vertices, arranging groups of vertices in bands or clusters, or ensuring that directed edges have a con-

sistent orientation. The same paper describes how to extend force-directed placement to allow for sep-

aration constraints in each dimension, of the form u +d ≤ v where u and v are variables representing

horizontal or vertical position and d is a desired constant minimum separation. Although seemingly

limited, these separation constraints suffice for the examples discussed and several others, and the

linearity is convenient: by modifying the force-directed approach to respect the separation constraints

during the minimization of energy at each relocation step, the subproblem becomes a quadratic pro-

gram.

However, we note that this does not suffice to handle constraints in terms of Euclidean distance. For

instance, with circular vertices of radii ri located at pi = (xi , yi ), a componentwise enforcement of non-


overlapping requires constraints of the form

\[ x_i + r_i \le x_j - r_j \quad \text{for } 1 \le i \ne j \le n, \]

\[ y_i + r_i \le y_j - r_j \quad \text{for } 1 \le i \ne j \le n. \]

This effectively separates not just the circles, but their bounding boxes; Figure 3.3 demonstrates the

limitations of this, in comparison with a genuine radius based separation with constraints of the

(non-linear) form

\[ |p_i - p_j| \ge r_i + r_j \quad \text{for } 1 \le i \ne j \le n. \]
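
The difference between the two kinds of constraint can be checked directly. In the Python sketch below a circle is a hypothetical (x, y, r) triple, and we assume that the componentwise constraints are read as requiring a pair of bounding boxes to be separated along at least one axis; the function names are our own.

```python
from math import hypot


def circles_separated(c1, c2):
    """Radius based test: the centres are at least r1 + r2 apart."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return hypot(x1 - x2, y1 - y2) >= r1 + r2


def boxes_separated(c1, c2):
    """Componentwise test: the bounding boxes are disjoint along the x or the y axis."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return abs(x1 - x2) >= r1 + r2 or abs(y1 - y2) >= r1 + r2


a = (0.0, 0.0, 1.0)
b = (1.5, 1.5, 1.0)  # diagonal neighbour: circles clear, bounding boxes overlap
c = (2.0, 0.0, 1.0)  # horizontal neighbour: both tests are satisfied
```

The pair a, b is an acceptable placement under the radius based constraint but is rejected by the bounding box test, so the componentwise formulation forces a looser packing than necessary.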

Therefore in [11] one of the authors of [9] remedies the two major limitations of that work: the separate

treatment of the horizontal and vertical axes in specifying constraints, which is resolved by describing

how to handle constraints of the general form

\[ |p_i - p_j| \; (=, \le, \ge) \; d; \]

and difficulties in scaling to large graphs due to the quadratic time complexity of the constrained opti-

mization algorithm.

Figure 3.3: Radius versus bounding box packing of circles.

In tackling the second limitation, the author notes that a preoccupation with mathematical rigour is

partially to blame for the time complexity: the methods of the earlier paper can be proven to converge

to stable local minima, but “it is not clear that such rigor is necessary simply to obtain an aesthetically

appealing layout” [11]. Instead, inspiration is drawn from the field of computer game graphics and

animation, where ad-hoc methods are used without rigorous justification, yet “by a miracle routinely

attributed to either Jacobi or Gauss-Seidel the method usually converges to a stable state in very few


iterations” [11]. The author goes on to explain that the appeal to either Jacobi or Gauss-Seidel methods

is mis-placed due to the lack of formal proofs of convergence or even correctness once constraints

have been introduced. Nonetheless, this deficiency of the academic literature is not the same as a

proof of incorrectness or nonconvergence, and the practical results seem satisfactory to the animation

community. For the graph layout problem, the approach taken is to alternate between unconstrained

optimization steps in line with the chosen force-directed approach, and constraint-satisfaction steps

based on a “simple, naïve, and yet effective heuristic” for skeleton-based animation of computer game

characters. The result is that incremental layout of n node, m edge graphs subject to c Euclidean

constraints can be performed with a time complexity of O(n logn +m + c), a vast improvement on [9]

which makes handling large graphs in real time feasible.
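
The details of the constraint-satisfaction step are not reproduced here, but its flavour can be conveyed by a sketch. The Python fragment below shows one plausible form of such a heuristic, in which each violated minimum-separation constraint is repaired by moving both points apart along the line joining them, sweeping repeatedly over all pairs. This is an assumption made for illustration and is not taken from [11]; the function names and the number of sweeps are hypothetical.

```python
from math import hypot


def project_apart(p, q, min_dist):
    """If p and q are closer than min_dist, move each by half of the shortfall."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    dist = hypot(dx, dy)
    if dist >= min_dist:
        return p, q  # constraint already satisfied
    if dist == 0.0:
        ux, uy = 1.0, 0.0  # coincident points: pick an arbitrary direction
    else:
        ux, uy = dx / dist, dy / dist
    shift = (min_dist - dist) / 2
    return ((p[0] - ux * shift, p[1] - uy * shift),
            (q[0] + ux * shift, q[1] + uy * shift))


def satisfy_non_overlap(positions, radii, sweeps=10):
    """Repeatedly repair overlapping pairs; later repairs may disturb earlier ones."""
    pts = list(positions)
    for _ in range(sweeps):
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                pts[i], pts[j] = project_apart(pts[i], pts[j], radii[i] + radii[j])
    return pts


separated = satisfy_non_overlap([(0.0, 0.0), (1.0, 0.0)], radii=[1.0, 1.0])
```

Note that this naive version examines every pair on each sweep and offers no guarantee of convergence for more than two vertices, so it illustrates only the repair step and not the stated time complexity.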

A further advantage of supporting Euclidean constraints is this allows more sophisticated physical laws

to be applied to the interaction of particles, where the force is a function of distance - such as the

effects of gravity, or the attraction / repulsion of electrical charges. The force layout features of D3 offer

exactly this: a global gravity parameter causes attraction to the center of the visualisation, whilst each

particle carries a charge (which can be constant or a function of the data associated with the vertex)

and edges have a target length; friction can also be applied to dampen movement. We note however

that the forces are applied to single-pixel points; additional code is required to implement features

such as constrained display region or non-overlapping vertices, although this is entirely possible. We

note here a general formulation as an optimization problem (which, viewed abstractly, has much in

common with standard problems such as facility location); in Section 4.2 we present a visualisation

which applies D3’s force layout under such constraints.

Definition 3.2.1 (Graph Layout Constraints). For a set of n vertices with associated radii ri , and graph

structure given by adjacency matrix A, layout in a region of width w and height h with a target edge

length of L requires the selection of positions pi = (xi , yi ) such that

1. $|p_i - p_j| \ge r_i + r_j$ for $1 \le i \ne j \le n$ (non-overlap of vertices);

2. $r_i \le x_i \le w - r_i$ for $1 \le i \le n$ (vertices wholly contained in width of region);

3. $r_i \le y_i \le h - r_i$ for $1 \le i \le n$ (vertices wholly contained in height of region);

4. $|p_i - p_j| = L$ for $1 \le i \ne j \le n$ (edge length).

To turn this into an optimisation problem, we require an objective. Instead of the graph-distance based

formula 3.2.1, which considers all pairs of vertices, we can instead make use of the adjacency structure

to consider just the adjacent pairs, by relaxing condition 4 from Definition 3.2.1. In this way, we are


allowing for some discrepancy between the actual (Euclidean) distances and the target L; we have var-

ious options in how we score this, two of which we present below.

Definition 3.2.2 (Graph Layout optimization, average discrepancy version). For vertices $1, \ldots, n$ with

adjacency structure given by the matrix A and radii given by $r = (r_1, \ldots, r_n)$, plus a target edge length L

and region of dimensions $w \times h$, the average discrepancy version of the graph layout problem is to

\[ \text{minimize} \sum_{1 \le i < j \le n} A_{ij} \, \bigl|\, |p_i - p_j| - L \,\bigr| \]

for decision variables $p_i = (x_i, y_i)$, $1 \le i \le n$, subject to

1. $|p_i - p_j| \ge r_i + r_j$ for $1 \le i \ne j \le n$;

2. $r_i \le x_i \le w - r_i$ for $1 \le i \le n$;

3. $r_i \le y_i \le h - r_i$ for $1 \le i \le n$.

Definition 3.2.3 (Graph Layout optimization, greatest discrepancy version). For vertices $1, \ldots, n$ with

adjacency structure given by the matrix A and radii given by $r = (r_1, \ldots, r_n)$, plus a target edge length L

and region of dimensions $w \times h$, the greatest discrepancy version of the graph layout problem is to

\[ \text{minimize} \max_{1 \le i < j \le n} A_{ij} \, \bigl|\, |p_i - p_j| - L \,\bigr| \]

for decision variables $p_i = (x_i, y_i)$, $1 \le i \le n$, subject to

1. $|p_i - p_j| \ge r_i + r_j$ for $1 \le i \ne j \le n$;

2. $r_i \le x_i \le w - r_i$ for $1 \le i \le n$;

3. $r_i \le y_i \le h - r_i$ for $1 \le i \le n$.

We note that in both Definitions 3.2.2 and 3.2.3, the edge length between circular vertices i, j is equated

with the distance between their centres $p_i$, $p_j$. In either case, if we wish to treat the edge as only starting

from the boundary of the circles, we can simply replace L in each term of the objective with $L + r_i + r_j$.

Further, in a particular instance of either problem we may wish to specify fixed locations $p'_i$ for a subset

of the particles $I' \subset \{1, \ldots, n\}$; these can simply be added as constraints of the form $x_i = x'_i$, $y_i = y'_i$ for

all $i \in I'$, which allows us to preserve the objective function and adjacency matrix.
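
To illustrate how a candidate layout would be scored under these formulations, the Python sketch below checks the three constraints and evaluates both objectives. It only evaluates a given layout and does not solve either optimisation problem; the function names and the small example are hypothetical.

```python
from math import hypot


def is_feasible(positions, radii, w, h):
    """Check non-overlap of vertices and containment in the w x h region."""
    n = len(positions)
    for i, ((x, y), r) in enumerate(zip(positions, radii)):
        if not (r <= x <= w - r and r <= y <= h - r):
            return False
        for j in range(i + 1, n):
            if hypot(x - positions[j][0], y - positions[j][1]) < r + radii[j]:
                return False
    return True


def discrepancies(positions, A, L):
    """Weighted discrepancy from the target length L for each adjacent pair i < j."""
    n = len(positions)
    return [A[i][j] * abs(hypot(positions[i][0] - positions[j][0],
                                positions[i][1] - positions[j][1]) - L)
            for i in range(n) for j in range(i + 1, n) if A[i][j]]


def total_discrepancy(positions, A, L):
    """Objective summed over adjacent pairs."""
    return sum(discrepancies(positions, A, L))


def greatest_discrepancy(positions, A, L):
    """Objective taking the worst adjacent pair."""
    return max(discrepancies(positions, A, L), default=0.0)


# Path 0 - 1 - 2 with unit radii in a 6 x 6 region and target edge length 3.5.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
layout = [(1.0, 1.0), (4.0, 1.0), (4.0, 5.0)]
feasible = is_feasible(layout, [1.0, 1.0, 1.0], w=6, h=6)
total = total_discrepancy(layout, A, L=3.5)
greatest = greatest_discrepancy(layout, A, L=3.5)
```

Here the two edges have lengths 3 and 4, so each contributes a discrepancy of 0.5: the summed objective is 1.0 and the greatest discrepancy is 0.5.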


3.3 Flow Maps

In our discussion (Section 2.3) of choropleth maps for migration data we noted an inherent limitation

arising from the fixed dimensions of each country or region being considered. After all, it is not the

size of the location we are interested in, but of the flow of people to or from it. Cartographers have long

known how to resolve this issue, through the use of flow maps; see for example the work of Minard in

the 19th century visualising exports of French wine [25] (Figure 3.4; for a larger version see page 25 of

[41]) or Napoleon’s Russian campaign of 1812 (Figure 3.5, or Ibid. p.41).

Figure 3.4: Minard’s map of exports of French wine in 1864.

These early representations of flow were necessarily drawn by hand (although this allows for precise

geographic accuracy to yield to clarity, as in the repositioning of the UK in Figure 3.4); the paper of Phan

et al. [33] describes their process for generating such maps algorithmically from a list of locations and

the flow to each from a specified root. To do so they consider the geographic locations as vertices

of a graph, and use clustering to construct a tree from the desired root with the remaining vertices

as leaves. This abstract partitioning introduces further vertices as branch points; unlike the original

vertices, these can be placed arbitrarily on the map, but the quality of the visualisation will depend

strongly on the choices made for these.


Figure 3.5: Minard’s 1861 visualisation of Napoleon’s Russian campaign of 1812. Conveying six variables including both geographic position and size of the army in a readily

understood way, Tufte remarked that “it may well be the best statistical graphic ever drawn” [41].

3.3.1 Flow graphs

Suppose we are given a source location 0 and a set of n target locations each with a required flow fi ,

i = 1, . . . ,n; and that for each location we also have fixed position coordinates (xi , yi ).

Definition 3.3.1. We call T a directed flow tree if it contains: a nonempty set of leaf vertices corresponding

to each target i, with no out-edges and a single in-edge of weight equal to $f_i$; a single root vertex 0

with no in-edges and total weight over all out-edges equal to $\sum_{i=1}^{n} f_i$; and a (possibly empty) set of branch

vertices which satisfy the balance condition that total flow in equals total flow out.

From such a T we can infer an undirected flow graph F . This suffices to produce an image such as

Figure 3.4, as once we have determined suitable locations for the branch vertices we can render the

graph with lines of thickness proportional to edge weight.
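
The edge weights of a directed flow tree are determined by its shape and the leaf flows, as the following Python sketch illustrates. The tree is given by a hypothetical parent map from each non-root vertex to its parent; the function names, the branch label "b" and the example flows are our own.

```python
def edge_flows(parent, leaf_flow):
    """Weight of the in-edge of each non-root vertex: the total flow of the leaves below it."""
    flows = {v: 0.0 for v in parent}
    for leaf, f in leaf_flow.items():
        v = leaf
        while v in parent:  # walk up to the root, adding f to every edge on the way
            flows[v] += f
            v = parent[v]
    return flows


def outflow(vertex, parent, flows):
    """Total weight over the out-edges of a vertex."""
    return sum(flows[v] for v, u in parent.items() if u == vertex)


# Root 0 with one branch vertex "b" serving targets 1 and 2; target 3 hangs off the root.
parent = {"b": 0, 1: "b", 2: "b", 3: 0}
leaf_flow = {1: 5.0, 2: 3.0, 3: 4.0}
flows = edge_flows(parent, leaf_flow)
root_total = outflow(0, parent, flows)
branch_out = outflow("b", parent, flows)
```

In this example the branch vertex receives 8 units and passes on 5 + 3, satisfying the balance condition, whilst the root emits the full 12 units.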

Following [33] we will use clustering techniques to introduce the branch vertices. This will result in

an agglomeration tree H with leaves 0, . . . ,n; we will show that a flow graph F can be generated from

this anyway, but may have undesirable properties with respect to layout. We will see how to resolve

this somewhat by re-wiring certain edges and thus eliminating branch vertices (based on the break-

down and reclustering of H employed in [33], but in a simpler-to-implement manner that works on

the undirected graph F instead).


3.3.2 Star graphs

We note that we can immediately produce a flow graph for any source and set of n targets: use the star graph $S_n$ (that is, the complete bipartite graph $K_{1,n}$ with n + 1 vertices), where each of the n leaves is attached to the root 0 by an edge of weight $f_i$. These are entirely straightforward to render, and turn out to have two properties that would seem desirable for any flow map. Clearly, there will be no edges that cross each other. Moreover, we minimize the amount of chart ink:

Definition 3.3.2. For a flow graph F with edge set E, the chart ink for a given rendering is the quantity

$$\sum_{e \in E} l_e w_e,$$

where $l_e$ is the length of edge e in the rendering and $w_e$ is its weight. (Note that for e joining vertices i, j we have $l_e = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$, i.e., $l_e$ is determined by the chosen locations of each vertex in the rendering, not just the branch structure of F.)

Proposition 3.3.1. For a given directed flow tree T, the chart ink is minimized by placing all branch vertices at the location of the root, so the corresponding flow map is a star graph.

Proof. If there are no branch vertices then we are done. Otherwise, let a, b be leaf vertices that share a parent v, itself with a parent u (since we are considering the directed tree, this hierarchical interpretation of the vertices is possible). Then there are flows $f_a$, $f_b$ along edges $v \to a$, $v \to b$ respectively, and for balance a flow $f$ of at least¹ $f_a + f_b$ along $u \to v$.

The chart ink contributed by these edges is therefore

$$\begin{aligned}
f\,d(u,v) + f_a\,d(v,a) + f_b\,d(v,b) &\ge (f_a + f_b)\,d(u,v) + f_a\,d(v,a) + f_b\,d(v,b) \\
&= f_a\bigl(d(u,v) + d(v,a)\bigr) + f_b\bigl(d(u,v) + d(v,b)\bigr) \\
&\ge f_a\,d(u,a) + f_b\,d(u,b) \quad \text{by the triangle inequality.}
\end{aligned}$$

¹ There could be further children of v, introducing additional chart ink, so we give the general case; however, for the clustering methods we will employ there will only ever be two children.


Figure 3.6: Selecting branch point location to minimise chart ink.

But this lower bound is attained only when v lies both on the line segment from u to a (triangle inequality applied to u, v, a, as shown in Figure 3.6) and on the line segment from u to b (for the triangle u, v, b); in general this requires that v be precisely at u.

So the parent of every leaf should be located at its own parent. Iterating this process by pruning the

leaves and thus treating their parents as the leaves of a new smaller tree, we find that u (and hence v)

should be located precisely at its parent, and so on up to the root.
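The effect described by Proposition 3.3.1 can be checked numerically. The sketch below computes chart ink as the sum of edge length times edge weight, then compares two placements of a single branch vertex. The coordinates, weights and the name `chart_ink` are hypothetical, chosen only for illustration.

```python
import math


def chart_ink(positions, edges):
    """Sum of (edge length x edge weight) over all edges.

    positions: dict vertex -> (x, y)
    edges:     dict (u, v) -> weight
    """
    return sum(
        weight * math.dist(positions[u], positions[v])
        for (u, v), weight in edges.items()
    )


# Hypothetical example: root 0, one branch vertex "v", leaves 1 and 2.
edges = {(0, "v"): 5.0, ("v", 1): 2.0, ("v", 2): 3.0}
fixed = {0: (0.0, 0.0), 1: (4.0, 3.0), 2: (4.0, -3.0)}

# Branch vertex part-way towards the leaves, versus collapsed onto the root.
ink_branch_midway = chart_ink({**fixed, "v": (2.0, 0.0)}, edges)
ink_branch_at_root = chart_ink({**fixed, "v": (0.0, 0.0)}, edges)
```

With the branch vertex at the root the rendering is exactly the star graph, and its chart ink is the smaller of the two values.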

Reducing chart ink is beneficial in avoiding clutter and thus increasing legibility. However, we lose

the agglomerative nature of a flow map that features suitable branching and it can be hard to distin-

guish rays; consider Figure 3.8 or the comparison in Figure 3.7. Proposition 3.3.1 therefore implies

that we should not use minimization of chart ink as a metric for optimising the placement of branch

vertices if we want to present such a structure.

Figure 3.7: Computer-generated flow maps of migration from California, 1995-2000.

Tobler’s approach [38, 39] (left) vs Phan et al.’s from [33] (right).


Figure 3.8: Star-graph flow map of migration to Scotland.

3.3.3 Hierarchical clustering

Clustering is a fundamental topic in machine learning, and we will only pick out a few relevant details

here (a comprehensive overview can be found in [17]). Given a set of elements of some space with a

measure of ‘distance’ between them, the goal is to group ‘similar’ elements together into subsets (the

clusters). This is generally an example of unsupervised learning, in that we are attempting to infer

appropriate divisions from the data rather than with respect to some existing grouping.

An algorithm which starts with each element as its own cluster then successively merges clusters to-

gether is described as agglomerative; if instead it starts from a single cluster of all elements then pro-

ceeds by splitting clusters, then it is instead divisive. In either approach there is generally a stopping

criterion, at which point further merges or splits are deemed detrimental to the quality of the cluster-

ing. Determining a suitable criterion is not a simple task - and likely will need to vary with the data set

- but we can sidestep this issue by running the process to completion: a single cluster for agglomerative, or singleton clusters for divisive. The result in either case is a dendrogram: a tree structure with

the single cluster as root, and singleton clusters as leaves. We can then select a clustering of varying

coarseness by picking a height in the tree and partitioning with respect to the branches at that level.

Nearer the root we have fewer clusters of less similar elements; towards the leaves the elements of

clusters will be more similar to each other, but there may be (many) more clusters.

As an example, consider the 15 most common non-UK nationalities in Scotland. Assigning an x, y location on a 2D map of the world to each country, we can cluster by distance in the geographic sense².

Working agglomeratively, the comparison of singleton clusters {a}, {b} is straightforward: the Euclidean distance d(a, b). But once a cluster contains more than one country, there are again various options for defining distance. By keeping track of all elements in every cluster, we can consider the single-link distance between clusters A and B, $\min_{a \in A,\, b \in B} d(a, b)$; the complete-link distance instead takes the maximum over these possible pairs of elements with one from each cluster. Both of these concepts can

be expanded further by considering all points within each country as part of the cluster, rather than a

single representative.
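The two linkage rules can be sketched in a few lines of Python. This assumes each cluster is simply a list of (x, y) points with Euclidean distance between them; the function names are chosen for this example.

```python
import math
from itertools import product


def single_link(cluster_a, cluster_b):
    """Smallest distance over all pairs with one point from each cluster."""
    return min(math.dist(a, b) for a, b in product(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """Largest distance over all pairs with one point from each cluster."""
    return max(math.dist(a, b) for a, b in product(cluster_a, cluster_b))
```

Both functions examine every pair of elements, which is why they require all members of each cluster to be tracked.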

However, simpler data structures arise if we just synthesise an x, y value for each new cluster based on

those of the clusters being merged. Since in our application to migration data each location carries a

weight, we can consider a centre of mass; or even more simply just average the x and y coordinates

each time. In [33] another approach is taken - for each cluster we take its position to be the centre

of the bounding box of all its members. Following this approach, we find that clustering proceeds as

follows:

Merging Scotland with Republic of Ireland to create cluster 1

Merging Germany with France to create cluster 2

Merging Italy with cluster 2 to create cluster 3

Merging Poland with cluster 3 to create cluster 4

Merging Spain with cluster 4 to create cluster 5

Merging cluster 1 with cluster 5 to create cluster 6

Merging India with Pakistan to create cluster 7

Merging China with Hong Kong to create cluster 8

Merging USA with Canada to create cluster 9


Merging Nigeria with cluster 6 to create cluster 11

Merging South Africa with cluster 11 to create cluster 12

Merging Australia with cluster 10 to create cluster 13



The corresponding dendrogram is shown in Figure 3.9. From this we can deduce various clusterings;

for instance, two levels down from the root we get three clusters, comprising North America, Europe &

Africa, and Asia & Australasia.

² But it is important to realise that we can cluster on any notion of proximity or similarity.


For our purposes in constructing a flow graph we are nearly done - the dendrogram implies a set of

branch points, and we merely need to assign appropriate weight to the edges before turning to the

problem of layout. We note that each merger reduces the number of active clusters by one, and we are

done when there is only a single cluster; so in all n = 15 new clusters are created, each introducing a

branch vertex. Our flow graph therefore has 2n + 1 vertices in total; since 2 edges are created during each of the n merges, we have 2n edges in total. So |E| = |V| − 1; as the graph is connected (we can trace a path

from any vertex to the final branch vertex), it is therefore a tree as desired. The known vertex and edge

counts lead to convenient data structures; Algorithm 1 outlines a pseudocode description of how to

generate the flow graph during agglomerative clustering based on bounding box centre distances. A

Java implementation is given in Appendix A.1.
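For readers who prefer Python to the Java of Appendix A.1, the construction can be sketched as below. This is an illustrative re-implementation rather than the code used in the project: the function name `build_flow_graph`, the tuple-based input format and the tie-breaking rule (lowest indices win when two pairs are equally close) are all assumptions made here.

```python
def build_flow_graph(source_xy, targets):
    """Agglomerative clustering on bounding-box centres, building flow edges.

    source_xy: (x, y) of the source, which becomes vertex 0.
    targets:   list of (x, y, flow) tuples, which become vertices 1..n.
    Returns a dict mapping directed edges (tail, head) to weights.
    """
    n = len(targets)
    size = 2 * n + 1
    min_x, max_x = [0.0] * size, [0.0] * size
    min_y, max_y = [0.0] * size, [0.0] * size
    weight = [0.0] * size
    active = [False] * size

    # The source carries minus the total required flow; targets carry f_i.
    points = [(source_xy[0], source_xy[1], -sum(f for _, _, f in targets))]
    points += targets
    for i, (x, y, w) in enumerate(points):
        min_x[i] = max_x[i] = x
        min_y[i] = max_y[i] = y
        weight[i] = w
        active[i] = True

    def centre(i):
        return ((min_x[i] + max_x[i]) / 2, (min_y[i] + max_y[i]) / 2)

    def sq_dist(a, b):
        (ax, ay), (bx, by) = centre(a), centre(b)
        return (ax - bx) ** 2 + (ay - by) ** 2

    edges = {}
    for k in range(n + 1, size):
        # Closest pair of active clusters by squared centre distance.
        live = [i for i in range(k) if active[i]]
        _, i, j = min(
            (sq_dist(a, b), a, b)
            for idx, a in enumerate(live)
            for b in live[idx + 1:]
        )
        # Bounding box and weight of the merged cluster k.
        min_x[k], max_x[k] = min(min_x[i], min_x[j]), max(max_x[i], max_x[j])
        min_y[k], max_y[k] = min(min_y[i], min_y[j]), max(max_y[i], max_y[j])
        weight[k] = weight[i] + weight[j]
        # A negative weight marks the side holding the source: flow enters k
        # from that side and leaves k towards the other.
        for child in (i, j):
            if weight[child] < 0:
                edges[(child, k)] = -weight[child]
            else:
                edges[(k, child)] = weight[child]
        active[i] = active[j] = False
        active[k] = True
    return edges
```

Each pass of the loop performs one merge and adds two edges, so n targets always yield 2n edges.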

[Dendrogram. Leaves, left to right: ZA, NG, Scot., IE, DE, FR, IT, PL, ES, IN, PK, CN, HK, AU, US, CA. Vertical axis: distance.]

Figure 3.9: Bounding-box dendrogram for sources of migration to Scotland.

Proposition 3.3.2. The output of Algorithm 1 satisfies the conditions of Definition 3.3.1, i.e., T is a directed flow tree.

Proof. We have already seen that we obtain a (2n + 1)-vertex tree structure, namely the dendrogram

H arising from the agglomerative clustering. However, the leaves of H are singleton clusters corre-

sponding to each of the locations, including the source; its root is the (2n +1)st cluster (containing all

locations). So the tree structure of T cannot be that of H .

Consider first the leaves of H , the singleton clusters corresponding to locations i = 0, . . .n. For each of

these we have a vertex i in T ; there will only be a single edge in T incident at this vertex, with the other

end being some branch vertex k. For the target locations $i = 1, \dots, n$, we know $W[i] = f_i > 0$, so this edge is directed $k \to i$. Thus the target locations have no out-edges and a single in-edge of weight $f_i$; they are therefore the leaves of T.

Algorithm 1: Flow Graph construction

Input: Source location 0 with coordinates $(x_0, y_0)$.
Input: Target locations $i = 1, \dots, n$ with coordinates $(x_i, y_i)$ and required flow $f_i$.
Output: A directed flow tree T = (V, E) (with appropriately weighted edges).

Initialise:
    Construct vectors X, Y, maxX, minX, maxY, minY, W, inC of length 2n + 1 and all entries zero
    Set V = {0, ..., 2n} and E = ∅
    Set X[0] = maxX[0] = minX[0] = $x_0$
    Set Y[0] = maxY[0] = minY[0] = $y_0$
    Set W[0] = $-\sum_{i=1}^{n} f_i$
    Set inC[0] = 1
    for i = 1, ..., n do
        Set X[i] = maxX[i] = minX[i] = $x_i$
        Set Y[i] = maxY[i] = minY[i] = $y_i$
        Set W[i] = $f_i$
        Set inC[i] = 1
    Set k = n + 1
while k ≤ 2n do
    Get closest two active clusters to merge; this will give our k-th cluster / (k − n)-th branch vertex:
    Find 0 ≤ i < j ≤ 2n such that inC[i] = inC[j] = 1 and $(X[i] - X[j])^2 + (Y[i] - Y[j])^2$ is minimal over such i, j.
    Find bounding box, weight and centre of merged cluster:
    Set maxX[k] = max(maxX[i], maxX[j]), minX[k] = min(minX[i], minX[j])
    Set maxY[k] = max(maxY[i], maxY[j]), minY[k] = min(minY[i], minY[j])
    Set X[k] = $\tfrac{1}{2}$(minX[k] + maxX[k]), Y[k] = $\tfrac{1}{2}$(minY[k] + maxY[k])
    Set W[k] = W[i] + W[j]
    Update flow graph:
    if W[i] < 0 then
        Add edge i → k of weight |W[i]| to E
    else
        Add edge k → i of weight W[i] to E
    if W[j] < 0 then
        Add edge j → k of weight |W[j]| to E
    else
        Add edge k → j of weight W[j] to E
    Update active clusters:
    Set inC[i] = 0, inC[j] = 0, inC[k] = 1
    Update next cluster index:
    Set k = k + 1
return T = (V, E)

Conversely, at the source i = 0 we have $W[0] = -\sum_{i=1}^{n} f_i < 0$, so its incident edge is directed $i \to k$. So it is the root of T, with no in-edges and total weight over all out-edges (just the one!) equal to $|W[0]| = \sum_{i=1}^{n} f_i$ as required.

The remaining vertices $k = n+1, \dots, 2n$ are branch vertices. Note that any cluster k other than the final one (which contains every location, so that $W[2n] = 0$) will have negative weight $W[k]$ if and only if the source 0 is an element of the cluster. Thus the merging of two clusters $i, j \in \{0, \dots, 2n\}$ to give some cluster k will always give a balanced flow at branch vertex k. To see this, suppose $W[k] = W[i] + W[j] < 0$. Then as 0 can only be in one of the clusters i, j, wlog i, we have $W[i] < 0$, $W[j] \ge 0$, and thus an in-edge $i \to k$ of weight $|W[i]| = -W[i]$ and an out-edge $k \to j$ of weight $W[j]$. But when k is itself merged into some cluster $k'$ the edge created will be $k \to k'$ of weight $|W[k]| = -W[k] = -W[i] - W[j]$. So the total flow out is $W[j] + (-W[i] - W[j]) = -W[i]$ and total flow in is $-W[i]$, so flow is balanced at k. The analysis for $W[k] > 0$ is much simpler; in this case both $W[i]$ and $W[j]$ are positive so we have out-edges $k \to i$, $k \to j$ with weights $W[i]$, $W[j]$ respectively, for a total flow out of $W[k]$; when k comes to be merged into some $k'$ the edge created will run $k' \to k$ for an in-flow of $W[k]$, ensuring balance. Finally, the last cluster $k = 2n$ has $W[k] = 0$ and is never merged: its in-edge $i \to k$ of weight $-W[i]$ and its out-edge $k \to j$ of weight $W[j] = -W[i]$ balance directly. So all the conditions of Definition 3.3.1 are satisfied by T.

Remark 3.3.1. In practice (that is, the code given in Appendix A.1) the directionality of edges created is

always set from child cluster to parent; in effect, a weighted version of the dendrogram H. Suppressing

this directionality entirely still gives us a flow graph suitable for rendering, and tracking the hierarchy of

H is convenient for the re-wiring process described in Section 3.3.5.

We may now turn our attention to issues of layout.

3.3.4 Flow graph layout

For each of the 2n +1 clusters considered during Algorithm 1 we assigned a bounding box and central

coordinates. However, it is only for the original n+1 locations that coordinates are fixed; we may place

the n branch vertices wherever we wish. As noted earlier in Proposition 3.3.1 the obvious ‘optimal’

choice just collapses to a star graph; ‘good’ positioning is therefore largely an aesthetic judgment.

In light of this, we make use of a D3 force layout, fixing the nodes corresponding to location vertices and

allowing the branch vertices to settle on positions in accordance with the physics processes described

in Section 3.2. A typical algorithmically-generated positioning is given in Figure 3.10. We build upon

this automated process by adding the ability for a user to manipulate the branch vertices interactively,


thus fine-tuning a layout before capturing a static version for print. An example of this is given in Figure

3.11, in which edge crossings have been eliminated - at the cost of much more chart ink!

Figure 3.10: Algorithmically-generated flow map of migration to Scotland.

Parameters: gravity 0, friction 0.5, charge −20.

Figure 3.11: User-adjusted flow map of migration to Scotland.

However, we note from these examples an inherent limitation of the flow map generated from the hierarchy given in Figure 3.9. By driving the clustering to completion we force the merger of distant clusters: specifically the USA & Canada pair with the rest of the locations, which results in a branch

point that is hard to place without crossings. This problem was noted in [33]; it is a side effect of the

hierarchy root not being our desired source location. Thus in the repositioning shown in Figure 3.11 it

is the root of H - now located just off the coast of South Africa - that draws our attention, rather than

the massed flow at the source location in Scotland.

3.3.5 Rewiring

Modifying Algorithm 1 as per Remark 3.3.1, we have a flow graph F corresponding to the dendrogram H from the hierarchical clustering. We wish to adapt this to one with the source vertex 0 as root; inspired by [33] we do this by identifying the path from 0 to the root of H and the clusters that attach to it. We differ from their approach by collapsing the path into a single vertex, thus merging 0 with the original root and giving it multiple child clusters, rather than performing a second clustering subject to the new merging rule they detail. The pseudo-code description is given in Algorithm 2.

Algorithm 2: Flow Graph rewiring

Input: V = {source vertex 0, target vertices 1, ..., n, branch vertices n + 1, ..., 2n}.
Input: Weight function w and edge set E such that H = (V, E) is a weighted directed dendrogram: (i → j ∈ E) ⇒ j > i.
Output: A flow graph F = (V′, E′) rooted at the source vertex.

Initialise:
    Construct vector X of length 2n + 1 and all entries zero
    Set V′ = E′ = ∅
    Set i = 0
Identify vertices on source → root path for exclusion:
while i < 2n do
    Find j such that i → j is in E
    Set X[j] = 1
    Set i = j
Form new vertex set:
for i = 0, ..., 2n do
    if X[i] = 0 then
        Add i to V′
Form new edge set:
for e = i → j in E do
    if X[i] = 0 and i ≠ 0 then (the source's own edge lies on the collapsed path, so it is skipped)
        if X[j] = 1 then
            Rewire cluster i to source:
            Add edge i → 0 with weight w(e) to E′
        else
            Edge within a surviving cluster:
            Add edge i → j with weight w(e) to E′
return F = (V′, E′)


[Dendrogram. Leaves, left to right: ZA, NG, Scot., IE, DE, FR, IT, PL, ES, IN, PK, CN, HK, AU, US, CA.]

Figure 3.12: Re-assigning root for sources of migration to Scotland.

The red path from Scotland to the dendrogram root will be collapsed into a single vertex, with child

clusters {DE,FR,IT,PL,ES}, {IN,PK,CN,HK,AU}, {US,CA}, {ZA}, {NG} and {IE}, within which the

dendrogram branching will be maintained.

Applying this process to the Scottish migration data, we see (from Figure 3.12) that six child clusters are

created, ranging from single locations to larger groupings with recognisable continental structure - North American, Asian & Australasian, and European clusters. By anchoring each of these to Scotland, we avoid the need for intercontinental edges to difficult-to-place branch points as encountered in Figures 3.10 and 3.11. Effectively, we are interpolating between the star graph (guaranteed to have no crossings,

but with no aggregation of flow from branches) and the dendrogram from our initial clustering. Figure

4.1 in Chapter 4 illustrates the result.
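The rewiring step can be sketched in Python as follows. The sketch assumes the dendrogram is held as a dictionary of child → parent edges with the root at index 2n, in the spirit of Remark 3.3.1; the function name `rewire` and this representation are choices made for the example. It also skips the source's own edge, so that no self-loop at vertex 0 is created.

```python
def rewire(edges, n):
    """Collapse the source-to-root path of a dendrogram into the source.

    edges: dict (child, parent) -> weight, every edge directed from child
           to parent with parent index > child index; the root is vertex 2n
           and the source is vertex 0.
    Returns the rewired edge dict, again directed child -> parent.
    """
    parent = {child: par for child, par in edges}

    # Walk from the source up to the root, marking every vertex passed.
    on_path = set()
    v = 0
    while v != 2 * n:
        v = parent[v]
        on_path.add(v)

    rewired = {}
    for (child, par), w in edges.items():
        if child == 0 or child in on_path:
            continue  # edges along the collapsed path disappear
        # Clusters hanging off the path are re-attached to the source;
        # edges inside a surviving cluster are kept unchanged.
        target = 0 if par in on_path else par
        rewired[(child, target)] = w
    return rewired
```

Every cluster that was attached to the path becomes a direct child of the source, while the branching inside each such cluster is untouched.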

3.4 Chord Diagrams

In the previous Section we were concerned with presenting the flows between a single source location

and multiple targets. However, we may be interested in the flows between all possible pairs of loca-

tions; that is, visualising general weighted graphs. The layout processes described in Section 3.2 are

one approach, but for dense graphs (those with many edges) we may wish to prescribe a particular

positioning of the vertices. Despite maximising crossings, arranging the vertices (or at least a majority

of them) on a circle is a popular option that can produce striking images such as Figure 3.13; many


more examples can be found throughout [21], particularly in the sections on communication/social

networks from Chapter 4, or the ‘radial convergence’ sections of Chapter 5.

Figure 3.13: Visualizing information flow in science. (Adapted from [27]; or see p. 103 of [21].)

A further complication arises when the graph is directed as well as weighted, since now in principle we

may have two edges between each pair. However, through the use of a chord diagram the flow in each

direction can be accounted for in a single stroke. Let G be an n-vertex weighted directed graph with

non-negative adjacency matrix A, such that the diagonal entries Ai i are all zero (no flow from a vertex

to itself). Then we can construct a chord diagram by first partitioning a radius-r circle into n segments,

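with the length of the i-th segment proportional to the total flow out of vertex i; that is, segment i has length

$$2\pi r \, \frac{\sum_{j=1}^{n} A_{ij}}{S}, \qquad \text{where } S = \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} \text{ is the total flow.}$$

The chord between i and j is then drawn with width $2\pi r A_{ij} / S$ at segment i and width $2\pi r A_{ji} / S$ at segment j, so that the chord widths at each segment sum to its length. Thus the chord will taper in accordance with the net flow between the two. As implemented in D3 -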

such as the example given in Figure 3.14, lightly adapted from stock examples - this dominant direction


is also indicated by colouring the chord to match that of the end segment with greater out-flow.
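The geometry of the construction can be sketched in Python. This assumes chord widths are scaled by the same total-flow factor as the segments, so that the widths at a segment add up to its length; the function name `chord_geometry` is hypothetical and this is not the D3 implementation.

```python
import math


def chord_geometry(A, r=1.0):
    """Segment lengths and chord end widths for a chord diagram.

    A: square matrix (list of lists) of non-negative flows, zero diagonal.
    r: radius of the circle.
    Returns (segment_lengths, widths), where widths[i][j] is the width at
    segment i of the chord joining i and j.
    """
    total = sum(map(sum, A))
    scale = 2 * math.pi * r / total
    segment_lengths = [scale * sum(row) for row in A]
    widths = [[scale * a for a in row] for row in A]
    return segment_lengths, widths
```

The chord between i and j then has width `widths[i][j]` at one end and `widths[j][i]` at the other, which gives the taper described above.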

Figure 3.14: Chord diagram of migration within Scotland 2011-2012.

We note from Figure 3.14 that in print form chord diagrams are more useful for capturing high level

trends than conveying specific detail - even with an understanding of the colour and sizing conven-

tions and the addition of a scale, it would be difficult to read off precise values. The situation is im-

proved as an interactive visualisation: tool-tips can convey the value associated with a single segment

or chord; whilst a single location can be investigated more easily by suppressing the chords from other

segments on user selection of a particular one. Both of these properties are illustrated in Figure 3.15

(or see the interactive version detailed in Appendix A.8).


Figure 3.15: Chord diagram of migration between Scottish councils and the rest of the UK 2011-2012.

Treating each chord diagram as a data slice for a particular parameter (such as time), change along

this extra dimension can be easily indicated in a visualisation by dynamic resizing of the segments

(colour and order should be preserved for clarity). The recent work [1] gives an effective example of

this. Their approach, described in [2], also handles data sets with internal flow (that is, non-zero diagonal entries in the adjacency matrix, or a loop in the graph) by offsetting the start point of a chord, as in their Figure 3.16. This also shows how to summarise both in- and out-flows for

each region; the example is based on hypothetical data, however, and we note that this introduces

extra clutter whilst placing further interpretative burden on the user. Indeed, for their real data set as

visualised in [1] they suppress this feature.


Figure 3.16: Chord diagrams with internal flow.


Chapter 4

Portfolio

In this Chapter, we present a selection of the visualisations (not all successful!) developed from NRS

data sets. Since the visualisations presented are generally interactive, we encourage experimentation

with the ‘live’ versions; see the electronic appendices (Appendix A), which also contain further examples either of

mathematically simpler content or visualisations already discussed in earlier chapters.

4.1 Migration Flow Map

Figure 4.1 shows the final visualisation developed using the clustering and rewiring techniques dis-

cussed in Section 3.3.

4.2 The Cause of Death Explorer

As part of the vital events reference tables [15], the General Register Office publishes annual statistics

on causes of death. This visualisation is based upon Table 6.1, which tabulates the total male and female deaths from particular causes each year 2002-2012; Figure 4.2 shows a typical portion of the data file. From this, we see that rather than a flat list of every possible cause, the reporting of deaths collects them into increasingly broad categories, forming a hierarchy from individual causes of particular

interest up to all deaths. Thus one slice through the data is to fix a year and consider the distribution

of deaths for that year across the categories. The other direction is to fix a category, and see how the

number of deaths from those causes has varied through time.

The visualisation produced enables both of these slices to be explored simultaneously, using linked

charts; an example of it in action is given in Figure 4.3. The bulk of the area is devoted to a representation of the category counts for a given year; due to the hierarchical data structure, this is a tree which we can arrange using the graph layout techniques discussed in Section 3.2. Each circular data vertex has an area proportional to the deaths from the causes it represents, for males, females or both depending on user selection. However, if a data vertex represents a category for which sub-categories are reported on in Table 6.1 of [15], then it can be expanded to reveal child vertices for each sub-category. The original vertex is then re-sized and re-coloured to indicate its non-data status; the area of the original data vertex is distributed appropriately across its children. This improves graphical integrity, as otherwise the total area presented varies with tree depth explored, and a deep branch of the tree would appear disproportionately important. As a further aid to comparison, tooltips give the precise count for a category on hovering above its corresponding vertex.

Figure 4.1: User-adjusted rewired flow map of migration to Scotland.

Figure 4.2: Cause of death data.
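The area-proportional sizing described above amounts to taking the radius as the square root of the count. A minimal Python sketch, in which the function name and the `area_per_death` scale factor are assumptions made for illustration:

```python
import math


def radius_for_count(count, area_per_death=1.0):
    """Radius of a circle whose area is proportional to `count`."""
    return math.sqrt(count * area_per_death / math.pi)


# Hypothetical category of 400 deaths, split into sub-categories of 100 and 300.
parent_area = math.pi * radius_for_count(400) ** 2
child_area = sum(math.pi * radius_for_count(c) ** 2 for c in (100, 300))
```

When the sub-categories account for every death in the parent category, as in this made-up example, the child areas add up to the parent area, so expanding a vertex leaves the total area on screen unchanged. Note also that quadrupling a count only doubles the radius.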

Initially, only a single vertex - representing all deaths for the selected year - is presented. The user

can then explore the tree by expanding or contracting nodes, allowing the data displayed to be driven

by their level of interest; the fully expanded tree is a potentially overwhelming start point and makes it

difficult to locate particular causes. The basic layout algorithms in D3 have been built upon to be aware

of the vertex sizes (preventing overlap) and the bounding box; the user may also reposition and lock

vertices in place, with other unlocked vertices adjusting. For a consistent mental map, vertex positions

are preserved as much as possible when updating their size on selection of a different year or gender

category; successive expansion - contraction - expansion should also place child vertices in consistent

positions.

Vertex selection within the graph also drives the other two charts included, presenting the by-cause

slice through the data. The time series for deaths from the chosen cause for each gender across all

years in the data set is given by the line graph that occupies most of the bottom third. With both a year

and a cause specified, we can also assess the gender ratio, which is illustrated by the pie chart to the

left (and which updates with changes to the year).


To reduce clutter, shortened category names are used in the graph layout; these can be swapped for the

ICD10 category codes for users familiar with the data set, or turned off entirely. We therefore devote

the remaining portion of the visualisation to a text field, which gives the precise category description

as per Table 6.1 of [15]. In Figure 4.2, we see that some entries have footnotes which are important

to integrity. For instance, the selection shown in Figure 4.3 is for poisonings, which appear to show a

sudden increase in 2011; Table 6.1 of [15] notes that this is due to a change in the category definition,

and thus so should the visualisation. An additional text field for per-category commentary is therefore

included for this purpose.

The data structure upon which the visualisation is built recreates Table 6.1 of [15], but the visualisation

is not limited to precisely that data set. In particular, values for additional years can be added to the

data lines and the visualisation will adjust to include them; a different selection of categories, including

a deeper or shallower tree structure, can also be specified purely at the data level rather than being

hardcoded into the graph structure. The figures for both genders are also computed automatically

from the entries for males and females.

4.3 Cause of Death Treemap

An earlier attempt at visualising the cause of death data from [15] is the zoomable tree map illustrated

in Figure 4.4. This uses the squarified algorithm described in Section 2.2; due to the stability issues

discussed, only a single slice-by-year is presented (the most recent figures, from 2012, although the

visualisation will adapt to any suitably formatted data file). By treating the counts for males and fe-

males as the leaf nodes for each category, we can convey both the aggregate totals and the values by

gender. Interaction allows for user-driven exploration, making maximal use of the display space for the

currently chosen category. Continuity between levels is assisted through the colour scheme - each of

the top level categories is assigned a colour, and further down the tree categories are assigned a small

random variation on the colour of their parent (as the second image in Figure 4.4 shows).

However, this visualisation reveals two limitations of area-based representations of size. Firstly, there

are obvious difficulties with labelling - not helped by the more obscure causes of death having longer

names but claiming less area to fit the label in. Tooltips offer some assistance - and can be used to give

the precise count - but it is possible for a child category to have such a low count that it is assigned

essentially no area. For instance, influenza accounts for less than 0.3% of deaths related to diseases

of the respiratory system, and whilst with the cause of death explorer a minimal vertex size can be

enforced, here the data necessarily takes priority.

Figure 4.3: The Cause of Death Explorer.

Figure 4.4: Treemaps of cause of death data. Two different levels of the hierarchy are shown; selecting a portion of the tree map ‘zooms’ to a treemap of its children.

The fixed representation area also introduces issues of data integrity, as the meaning (in number of

deaths) of a given area will vary depending on the total count for all causes currently on screen. Thus

a 50 : 50 ratio of two causes will appear the same regardless of whether they both claimed 10 or 1000

lives.

For these reasons, plus some technical challenges in implementing the D3 treemap on certain browsers,

we recommend the collapsible graph layout used in the cause of death explorer over the zoomable tree

map when working with hierarchical data sets.

4.4 Distributions with cohort effects: Fertility Data

As part of the vital events reference tables [15], the General Register Office publishes annual statistics

on fertility rates. This visualisation is based upon Table 3.6, which tabulates the total number of live

births per 1000 female population from 1973 to 2012 (plus 1951 and 1964), by age of mother in years.

Figure 4.5 shows a typical static visualisation of this data from the Annual Review [29]. Although 42

years of data are available, only five are shown, and this is already causing clutter and requires reference

to a key. Moreover, this is only one slice through the data set; there is also a ‘cohort effect’, in that we

may wish to track the changing fertility through time of women who share a birth year (i.e., as they

age).

Figure 4.5: Live births per 1,000 women, by age, selected years. (Originally Figure 2.4 of [29].)

Our interactive visualisation, illustrated in Figure 4.6, resolves both of these concerns. By allowing user

selection of the year, we need only show a single distribution. However, comparison between years

is made possible by animated transition from one distribution to another on selection of a new year.

‘Movie’ playback of the distributions in time order (from any starting year) chains these transitions

together to create a single smooth animation. Moreover, the transitions take account of the cohort

effect; it is not strictly accurate to say that 25 year old women in 2011 had a higher fertility rate than


they did in 2010. Rather, it is the case that women born in 1985 had fewer children at age 25 (in 2010)

than those born in 1986 did (in 2011). Thus the transition is shown as progression of a birth year cohort,

with a horizontal component, rather than just a rise or fall within an age category. This cohort effect is

also indicated by the colour coding - each cohort has a consistent colour which can be easily tracked

as the year varies, with no two visible cohorts sharing a colour. Their birth year is indicated by the

tool-tip, which also gives the precise birth rate figures for the active year.
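The cohort bookkeeping behind these transitions is simple: a woman aged a in year y belongs to the birth cohort y − a, following the convention used in the example above. A small Python sketch, in which the function names and the rate values are hypothetical:

```python
def birth_cohort(year, age):
    """Birth year of the cohort observed at a given age in a given year."""
    return year - age


def cohort_series(rates, cohort):
    """Track one birth cohort through a table of rates.

    rates:  dict (year, age) -> births per 1,000 women.
    cohort: birth year to follow.
    Returns a list of (age, year, rate) tuples in order of increasing age.
    """
    return sorted(
        (age, year, rate)
        for (year, age), rate in rates.items()
        if birth_cohort(year, age) == cohort
    )
```

Moving from one year to the next shifts each cohort one age category to the right, which is the horizontal component of the animated transition.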

Table 3.6 of [15] also reports the average age of mother at childbirth for each year, which we indicate

by a moving line; this also updates dynamically, with the animation between years again serving as a

visual shorthand for significant changes (or the lack thereof).

The visualisation supports arbitrary sets of consecutive years, so can be easily modified to include

future fertility data.

Figure 4.6: Interactive visualisation of fertility data.


Chapter 5

Conclusions

Throughout this project, we have sought to identify best practice in visualisation of data, implementation of which has raised a number of mathematical challenges. These have been met through the creation of a variety of interactive visualisations based on NRS data, which will add to their existing online resources. To conclude, we summarise the key points identified and the results of the project.

We began with an examination of principles for effective visualisation of data, and considered how they should best be applied in interactive rather than static visualisations. In particular, we noted

• the use of data slices to reduce dimensionality

• the importance of being able to generate visualisations algorithmically, based on changing data; and for design, functionality and data to be separated so as to enable rapid re-purposing of the developed visualisations for new data sets

• potential risks to graphical integrity: avoiding conveying the wrong meaning through misleading or distorted visual cues such as area-based representation of quantity, inappropriate scales, lack of labelling, or inconsistency across representations

• relatedly, the appropriate use of colour for either categorical or ordinal data

• the opportunity, and risk, for animation of transitions between data sets to emphasise change.
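
The risk attached to area-based representation can be made concrete with a small sketch. If a quantity is drawn as a circle whose area is meant to be proportional to the value, the radius must grow with the square root of the value; scaling the radius linearly exaggerates differences. The function name and sample values below are illustrative assumptions and are not taken from the project code.

```python
import math


def radius_for_value(value, max_value, max_radius):
    """Radius such that the circle's area is proportional to `value`.

    `max_value` is drawn with radius `max_radius`; smaller values
    shrink with the square root, not linearly.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)


# Hypothetical values: a quantity of 25 compared against a maximum of 100.
r_full = radius_for_value(100, 100, 20.0)
r_quarter = radius_for_value(25, 100, 20.0)
area_ratio = (math.pi * r_quarter ** 2) / (math.pi * r_full ** 2)
```

Here a value one quarter of the maximum is drawn with half the radius, so that its area is one quarter of the largest circle. Had the radius been scaled linearly, the area would be only one sixteenth.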

From this foundation, we turned our attention to several popular (static) methods of visualising data, and considered how useful properties could be preserved or enhanced in moving to an interactive setting. The small multiple added to our understanding of dimension reduction. Treemaps gave our first example of how desirable properties from a design standpoint lead to mathematical optimisation problems, and how different objectives can radically alter the nature of the images produced. We identified that choropleth maps were at odds with some of our principles of graphical integrity when presenting migration data. Finally, we considered the psychology of frequency plots as illustration of the potential for visualisation to aid reasoning about mathematical concepts.
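
To illustrate how a treemap design goal becomes an optimisation objective, the sketch below lays out one level of a treemap by simple slicing and reports the worst aspect ratio of the resulting rectangles, the quantity that squarified layouts [6] aim to keep close to 1. It is a simplified illustration in Python with hypothetical names and values, assuming strictly positive data, and is not the layout code used in the visualisations.

```python
def slice_layout(values, x, y, width, height):
    """Split a rectangle into vertical strips with areas proportional to values.

    Returns a list of (x, y, width, height) tuples, one per value.
    Assumes every value is strictly positive.
    """
    total = float(sum(values))
    rects = []
    offset = x
    for v in values:
        w = width * v / total  # strip width proportional to the value
        rects.append((offset, y, w, height))
        offset += w
    return rects


def worst_aspect_ratio(rects):
    """Largest long-side / short-side ratio over all rectangles (1.0 is a square)."""
    return max(max(w, h) / min(w, h) for (_, _, w, h) in rects)


# Hypothetical values laid out in a 10 x 5 rectangle.
rects = slice_layout([6, 3, 1], 0.0, 0.0, 10.0, 5.0)
worst = worst_aspect_ratio(rects)
```

With these values the smallest item becomes a 1 x 5 sliver, so the worst aspect ratio is 5. A layout that instead minimises this ratio would produce a quite different picture from the same data, which is the sense in which the choice of objective shapes the image.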

These investigations prepared us for more sophisticated visualisation techniques, and correspondingly more difficult mathematical problems. The key to these was identifying that data sets could often be associated with a graph-theoretic structure. Given a graph with data values associated with its vertices, we demonstrated how the positioning of appropriately-sized vertices in a representation of the graph structure could be posed as an optimisation task, and produced visualisations that dynamically solve this problem. From the question of graph layout, we turned to the issue of ‘which graph?’. To remedy deficiencies in the choropleth map of migration data, we developed a clustering algorithm for the construction of rooted trees from weighted geographic data to allow automated generation of visually pleasing flow maps. Finally, we illustrated how general adjacency matrices of directed graphs could be visualised by chord diagrams, with application to bi-directional migration flows.
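
As a small illustration of the data behind a chord diagram, the sketch below takes a square matrix whose entry in row i, column j is the flow from area i to area j, and derives the totals per area and the net flow between a pair of areas. The row-as-origin convention, the function names and the numbers are assumptions made for illustration; they are not NRS data.

```python
def flow_totals(matrix):
    """Return (outflow, inflow) lists for a square origin-by-destination matrix."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    outflow = [sum(row) for row in matrix]                            # row sums
    inflow = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # column sums
    return outflow, inflow


def net_flow(matrix, i, j):
    """Net movement from area i to area j (negative if j sends more to i)."""
    return matrix[i][j] - matrix[j][i]


# Hypothetical flows between three areas.
flows = [
    [0, 30, 10],
    [20, 0, 5],
    [15, 25, 0],
]
outflow, inflow = flow_totals(flows)
```

Because the matrix is not symmetric in general, each pair of areas carries two numbers, one per direction; this is the bi-directional information that a chord diagram can show.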

Combining these two themes (design principles, and mathematical algorithms that enable them to be achieved), a variety of interactive, data-driven visualisations were built for the NRS. These are in the process of being made available on their website; in line with our design goals, they are constructed to be of continued use after the end of this project due to easy update mechanisms. A number of the more mathematically sophisticated visualisations, and the design issues they raise, were illustrated in the portfolio, and further examples (with applications to NRS data) are given in the appendices.

Bibliography

[1] G. Abel, R. Bauer and N. Sander The Global Flow of People http://www.global-migration.info, 2014.

[2] G. Abel, R. Bauer, N. Sander and J. Schmidt Visualising Migration Flow Data with Circular Plots

Vienna Institute of Demography Working Papers, 02/2014.

[3] D. Borland and R. Taylor Rainbow Color Map (Still) Considered Harmful IEEE Computer Graphics and Applications Vol. 27 Iss. 2 p14-17 2007.

[4] M. Bostock, J. Heer and V. Ogievetsky D3: Data-Driven Documents IEEE Transactions on Visualization and Computer Graphics, IEEE Press, October 2011.

[5] C. Brewer, M. Harrower ColorBrewer 2.0 http://colorbrewer2.org.

[6] M. Bruls, K. Huizing and J. Van Wijk Squarified treemaps Joint Eurographics and IEEE TCVG Symposium on Visualization, IEEE Computer Society, 33-42, 2000.

[7] N. Chiba, T. Nishizeki, S. Abe and T. Ozawa A Linear Algorithm for Embedding Planar Graphs Using PQ-trees, Journal of Computer and System Sciences 30(1): 54-76, 1985.

[8] W. Kremer Do doctors understand test results? http://www.bbc.co.uk/news/magazine-28166019 July 2014.

[9] T. Dwyer, Y. Koren and K. Marriott IPSep-CoLa: an Incremental Procedure for Separation Constraint Layout of Graphs, IEEE Transactions on Visualization and Computer Graphics 12, 5:821-828, 2006.

[10] D. Doemeland and J. Trevino Which World Bank reports are widely read? Policy Research Working Paper WPS6851, 2014.

[11] T. Dwyer Scalable, Versatile and Simple Constrained Graph Layout, Computer Graphics Forum

28:991-998, 2009.

[12] B. Fry Visualizing Data O’Reilly Media, Inc, Sebastopol, California, First Edition 2007.

[13] M. Garey and D. Johnson, Crossing Number is NP-Complete, SIAM J. on Algebraic and Discrete Methods, 4(3):312-316, 1983.

[14] General Register Office for Scotland Centenarians in Scotland, 2002 to 2012 including mid-year

population estimates for those aged 90 and over, 21 March 2014.

[15] General Register Office for Scotland Vital Events Reference Tables http://www.gro-scotland.gov.uk/statistics/theme/vital-events/general/ref-tables/index.html Latest edition (2012).

[16] Google Material Design http://www.google.com/design/spec/material-design.

[17] A. Jain, M. Murty and P. Flynn Data Clustering: A Review, ACM Comput. Surv., ACM Press, 31:264-

323, 1999.

[18] B. Johnson and B. Shneiderman Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures Proc. of ACM CHI’86, Conference on Human Factors in computing systems 16-23, 1986.

[19] T. Kamada and S. Kawai, An Algorithm for Drawing General Undirected Graphs, Information

Processing Letters 31:7-15, 1989.

[20] M. Kazi and B. Shneiderman The Treemap Art Project, http://treemapart.wordpress.com/ 2013.

[21] M. Lima Visual Complexity: Mapping Patterns of Information, Princeton Architectural Press,

New York, 1st ed. 2011.

[22] M. Maclean D3 Tips and Tricks https://leanpub.com/D3-Tips-and-Tricks.

[23] D. McCandless Information is Beautiful William Collins (Edition of 2012).

[24] Media Matters for America A History Of Dishonest Fox Charts http://mediamatters.org/research/2012/10/01/a-history-of-dishonest-fox-charts/190225 October 2012.

[25] C. J. Minard, Tableaux Graphiques et Cartes Figuratives de M. Minard, 1845-1869, a portfolio of

his work held by the Bibliothèque de l’École Nationale des Ponts et Chaussées, Paris.

[26] S. Murray Interactive Data Visualization for the Web O’Reilly Media, March 2013.

[27] M. Stefaner, Visualizing Information Flow in Science http://well-formed.eigenfactor.org, 2009.

[28] National Records of Scotland About Us http://www.nrscotland.gov.uk/about-us (retrieved August 2014).

[29] National Records of Scotland Scotland’s Population 2012: The Registrar General’s Annual Review of Demographic Trends 158th Edition SG/2013/208, 17 October 2013.

[30] National Records of Scotland Scottish life expectancy at its highest ever level http://www.nrscotland.gov.uk/news/2014/scottish-life-expectancy-at-its-highest-ever-level April 2014.

[31] National Records of Scotland Wide variation in life expectancy between areas in Scotland http://www.nrscotland.gov.uk/news/2014/wide-variation-in-life-expectancy-between-areas-in-scotland April 2014.

[32] Ohio State University Department of Political Science D3: Zoomable Treemap Explained https://secure.polisci.ohio-state.edu/faq/d3/zoomabletreemap_code.php (retrieved August 2014).

[33] D. Phan, L. Xiao, R. Yeh, P. Hanrahan, T. Winograd Flow Map Layout, IEEE Information Visualization (InfoVis), 219-224, 2005.

[34] W. Sanford and D. Selnick Estimation of Evapotranspiration Across the Conterminous United

States Using a Regression With Climate and Land-Cover Data JAWRA Journal of the American

Water Resources Association Vol. 49 Iss. 1 pages 217-230 2013.

[35] B. Shneiderman Treemaps for space-constrained visualization of hierarchies, http://cs.umd.edu/hcil/treemap-history/.

[36] M. Tanaka C3.js | D3-based reusable chart library c3js.org.

[37] K. Temple (Intel) What Happens in an Internet Minute? http://scoop.intel.com/what-happens-in-an-internet-minute/, March 2012.

[38] W. Tobler Experiments in Migration Mapping by Computer, American Cartographer, 1987.

[39] W. Tobler Movement Mapping. http://csiss.ncgia.ucsb.edu/clearinghouse/FlowMapper/ 2004.

[40] E. Tufte Envisioning Information Graphics Press, Cheshire, Connecticut, January 1990.

[41] E. Tufte The Visual Display of Quantitative Information, Graphics Press, Cheshire, Connecticut, Sixteenth printing, January 1998.

[42] Understanding Uncertainty Screening tests (Breast screening) http://understandinguncertainty.org/files/animations/BayesTheorem1/BayesTheorem.html.

[43] The White House The 2011 State of the Union Address: Enhanced Version http://youtu.be/kl2g40GoRxg.

[44] Youtube Press: Statistics https://www.youtube.com/yt/press/en-GB/statistics.html (retrieved August 2014).

Appendix A

Guide to Electronic Appendices

These appendices describe the files supplied on the attached CD; the file index.html should be opened in a web browser to navigate to any of the collections described below. The files are also mirrored online at http://maths.straylight.co.uk/mscapp/; depending on system configuration (in particular, for browsers other than Firefox), use of the latter may be required for access to the visualisations, and is therefore recommended. Alternatively, the appropriate source files can be copied to a webserver (for instance, to access over an intranet); this also allows for modification to data files to observe the effects.
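
One way to serve a local copy of the files over HTTP, sketched here with Python's standard library, is given below. This is an assumption about a convenient setup rather than part of the supplied files; the function name `make_server`, the directory name `mscapp` and the port are placeholders.

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Create an HTTP server for `directory`, reachable from this machine only.

    Passing port=0 lets the operating system choose a free port, which can
    then be read from the returned server's `server_port` attribute.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted), assuming the files were copied to ./mscapp:
#   server = make_server("mscapp")
#   server.serve_forever()
# and then open http://127.0.0.1:8000/index.html in a browser.
```

Serving the files in this way means the data files are requested over HTTP rather than read directly from disk, and an edited data file is picked up on the next page load.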

For all files, the link to ‘view’ is the original file, and ‘source’ is a typeset version that can be viewed in the browser. In this way, the source of any filetype can be viewed regardless of underlying system support. Note that for html files, following the ‘view’ link will load the corresponding webpage (typically the visualisation); if the working source rather than typeset version is required, then the original should be accessed using ‘save link as..’ on the ‘view’ link.

A.1 Flow Map Construction

The file buildWeightedTree.java provides an implementation in Java of Algorithm 1, with the modification described in Remark 3.3.1. This relies on the helper class Graph.java, which implements Algorithm 2. Geographic data should be supplied by the text file graph.in; output is written in JSON format to graph.json. The latter can then be used in conjunction with the D3 visualisation flowmap.html.

A.2 Cause of Death Explorer

The visualisation described in Section 4.2.

The D3 visualisation is given in CODE.html, drawing on data files initial.json and scotland-other.json.

A.3 Cause of Death Zoomable Treemap


The D3 visualisation is given in codzoom.html, drawing on data file codzoom.json. The treemap code is from [32], with the addition of code for random perturbation of colours specified as hex triples, plus tool tips.

A.4 Fertility Data (cohort effects)


The D3 visualisation is given in fertility.html, drawing on data file fertility.json.

A.5 Popular Names

A visualisation of popular baby names; this also demonstrates the potential for data-driven visualisations. Two data files, for names of boys (boys_top20_noclash.json) and girls (girls_top20_noclash.json), are given, and the two visualisations, bnames.html and gnames.html, differ in only a single line, where the data file to be used is specified.

A.6 Life Expectancy

An interactive version of an existing NRS visualisation [31], allowing any two council areas to be selected for comparison, and animating changes. The Scottish average is also provided throughout.

The D3 visualisation is given in lifeexp.html, drawing on data file lifeexp.json.

A.7 Gender distribution by age (Frequency plot)

An interactive version of an existing NRS visualisation [30], allowing any age category to be selected, and animating changes. A template for a two-variable version (illustrating how higher dimensional slices can be taken) has also been produced.

The D3 visualisation is given in frequency.html, drawing on data file frequency.csv. The template for two variables is frequency2.html, which uses the (synthetic) data in frequency2.json.

A.8 Migration within Scotland (Chord Diagram)

The implementation illustrated in Figures 3.14 and 3.15.

The D3 visualisation is given in chord.html, drawing on data files immig13.json and immig13-regions.csv.
